Low Complexity Hardware Oriented H.264/AVC Motion Estimation Algorithm and Related Low Power and Low Cost Architecture Design

HUANG, Yiqing

Graduate School of Information, Production and Systems

Waseda University

February 2010
# Abstract

The ever-increasing bit rates of network applications such as digital television broadcasting demand more storage capacity than ever before. The advent of Super Hi-Vision (SHV), with its very high resolution, further intensifies this situation. Because network bandwidth and disk storage are limited, video compression is becoming more important than before. As the latest video coding standard, H.264/AVC provides superior performance to previous standards. However, it also brings huge complexity. When an ASIC (Application Specific Integrated Circuit) based real-time hardware system is considered, the intensive complexity of H.264/AVC causes problems in hardware cost and power consumption. To solve this problem, this dissertation focuses on two key issues: a low complexity hardware oriented algorithm and its related architecture.

In an H.264/AVC based system, motion estimation (ME), which is the major part of inter prediction, is the most significant component. It consists of integer ME (IME) and fractional ME (FME) and occupies almost 90% of the computation, which makes it necessary to divide IME and FME into two separate stages in a real-time hardwired encoder. Besides motion estimation, the intra prediction engine is another time-consuming part because of its abundant prediction modes. Moreover, the rate distortion based mode decision, which makes the final judgment between inter and intra modes, also consumes a lot of computation in the final stage of the encoding system. Many software based fast algorithms have already been proposed to reduce the complexity of H.264/AVC based systems. However, most of these algorithms cannot be efficiently realized in hardware because of hardware design constraints. In hardware, factors such as predictable data flow, regular memory access and full hardware utilization are important to the performance of the whole system. If these factors are not considered, hardware cost and power consumption increase greatly and throughput degrades. So, a hardware oriented low complexity algorithm and the related low cost and low power hardware architecture are important issues for H.264/AVC based real-time encoder design.

Based on an analysis of existing works and current problems, this dissertation mainly targets a low cost and low power H.264/AVC real-time hardwired encoder. In detail, it focuses on IME, FME, intra prediction and mode decision, which are the four computation-intensive parts of an H.264/AVC based system. Firstly, low complexity algorithms that follow the hardware data flow are proposed. Secondly, based on the proposed algorithms, flexible and highly parallel architectures are presented. Moreover, architecture and circuit optimizations are proposed to further reduce hardware cost and power consumption.

The whole dissertation consists of 6 chapters as follows.

In the first chapter, an introduction to the video compression field is given. The development and features of video coding standards and the emphasis of this dissertation are described in detail.

In the second chapter, hardware oriented low complexity motion estimation algorithms are presented. Complexity reduction is achieved in the multiple reference frame (MRF) technique, the search range and the matching pattern of an H.264/AVC based system. Firstly, for the MRF technique, gradient and block matching information are used for fast MRF algorithms. The proposed algorithms reduce the MRF complexity according to macroblock (MB) features in the spatial and temporal domains. Secondly, statistical analysis shows that the motion feature is consistent across several frames, so the search range can be adaptively adjusted according to the motion feature of the MB. Two search range adjustment proposals are therefore given in this dissertation. For an MB with extremely small motion, the search range is restricted to 1/8 of its original value. For other MBs, the search range is adjusted recursively according to the motion feature of the MB in the previous frame. Thirdly, since pixel difference can reflect the spatial feature of the current MB, it is used to classify the matching pattern of the ME process. A pixel difference based adaptive sub-sampling scheme is proposed, which uses three hardware oriented patterns for MBs with different spatial features. By combining all the proposed schemes, the overall algorithm can achieve up to 95.72% complexity reduction with an average PSNR loss of 0.072 dB and a bit-rate increase of 0.902%, based on the hardware data flow.

In the third chapter, two flexible IME architectures for the adaptive sub-sampling algorithm, namely adaptive propagate partial SAD (APPSAD) and reconfigurable SAD Tree (RSADT), are proposed. By using configurable SAD, the proposed RSADT architecture achieves data organization at both the architectural and memory levels, which shortens processing time and saves power. For APPSAD, the original processing element (PE) is expanded into four different types. According to the matching pattern, only the related type of PE is enabled, and the power consumption of the other PE types is saved. Moreover, circuit optimization is applied to both APPSAD and RSADT. The propagation chain, the original PE and the adder trees are simplified, with redundant registers and adders removed. So, hardware cost and power consumption are further reduced. With the TSMC 0.18 µm CMOS library, it is shown that the proposed architectures can save 61.71% of processing cycles and reduce power by up to 39.8% compared with existing works.

In the fourth chapter, two low design effort SHV engines for FME and intra prediction are proposed. Firstly, for the FME engine, two algorithm level optimizations, namely inter mode pre-filtering and a one-pass algorithm, are proposed. Inter mode pre-filtering analyzes the motion cost of sub-blocks in the IME stage and focuses only on the two modes with the smallest cost. The one-pass algorithm first decides the sliding window based on the integer motion cost of neighboring positions. Then, only the half and quarter pixels within the sliding window are processed, and they are processed simultaneously, which saves hardware cost and processing time. At the hardware level, with the quarter sub-sampling technique in the FME stage, a 16-Pel interpolation structure is proposed, which is 4 times faster than the original 4-Pel design while keeping almost the same hardware amount. With an MB and frame level parallel processing flow, the proposed FME engine can accomplish real-time processing at only 145 MHz, compared with a representative design which requires 2.16 GHz for 4k×4k@60fps. For the intra engine, predictor generation is the most time-consuming part. Analysis of the data dependency in intra prediction shows that the maximum parallel processing scale is two sub-blocks instead of the original one sub-block. In this dissertation, a lossless two sub-block parallel data flow is proposed, which saves 37.5% of the processing time of the original one sub-block flow. Also, in the original intra predictor generation engine, a lot of repetitive computation exists among different modes. In the proposed fully utilized intra predictor generation architecture, no repetitive generation of predictors exists, and it is applicable to all intra prediction modes. With the proposed architecture, the whole predictor generation process can be finished within only 22.5% of the cycles of the original design. By combining the parallel data flow and the fully utilized architecture, the proposed intra predictor generation engine is capable of handling the 4k×2k@60fps specification.

In the fifth chapter, the high complexity problem of H.264/AVC mode decision is discussed. By utilizing spatial and temporal information, complexity reduction is achieved in two stages. Firstly, the gradients of the current MB and the motion vectors of encoded MBs in both the current and previous frames are utilized for a pre-stage skip mode check. Secondly, during the motion stage, it is observed that the motion vector predictor (MVP), the block overlapping status and the rate distortion cost can indicate the accuracy of the matching process. In detail, the MVP represents the accuracy of the predicted start point. The block overlapping status of different inter modes indicates the motion trend of the object. The rate distortion cost is an objective measurement of the matching result. Thus, this information is used for early decision in the encoding process in the proposed mode decision algorithm. Compared with existing works, the proposed algorithm achieves a speed-up ratio of up to 53.4% with negligible quality loss.

In the sixth chapter, the dissertation is concluded and future trends in the video compression field are briefly discussed. This dissertation focuses on IME, FME, intra prediction and mode decision, which are the four most important parts of an H.264/AVC real-time encoding system. Hardware oriented low complexity algorithms and low cost, low power hardware architectures are proposed. By combining the hardware oriented algorithms with the proposed architectures, about 90.68% of the power in the IME part can be saved compared with a recent 4-stage real-time encoder design. For the SHV targeted FME and intra engines, the estimated power reductions in hardware design are about 93.31% and 67.24%, respectively.
# Contents

1 Introduction  
   1.1 Background and purpose of this dissertation  
   1.2 Scope of this dissertation  

2 Hardware oriented fast H.264/AVC motion estimation algorithm  
   2.1 Introduction  
   2.2 Hardware oriented multiple reference frame elimination  
      2.2.1 Aliasing problem and impact of edge detection  
      2.2.2 Gradient based multiple reference frame elimination  
      2.2.3 Quantization parameter based threshold adjustment  
      2.2.4 Similarity-analysis based multiple reference frame elimination  
   2.3 Hardware oriented search range adjustment  
      2.3.1 Motion feature based search range adjustment  
      2.3.2 Recursive 6-ring search range adjustment  
   2.4 Pixel difference based adaptive sub-sampling  
   2.5 Experiments, comparison and analysis  
   2.6 Conclusion remarks  

3 Flexible integer motion estimation architecture  
   3.1 Introduction  
   3.2 Reconfigurable SAD tree architecture  
      3.2.1 System architecture  
      3.2.2 Architecture level data organization and circuit modification  
      3.2.3 Memory level pixel organization  
      3.2.4 Cross reuse structure for CSAD generation  
   3.3 Adaptive propagate partial SAD architecture  
      3.3.1 System architecture  
      3.3.2 Memory organization  
      3.3.3 Compressor tree in standard cell library  
      3.3.4 Circuit optimization for single processing element  
      3.3.5 Compressor tree based eight stage circuit optimization  
   3.4 Experiments, comparison and analysis  
   3.5 Conclusion remarks  

4 Low design effort VLSI engine for super high-vision application  
   4.1 Introduction  
   4.2 Low complexity fractional motion estimation algorithm  
      4.2.1 Mode reduction based mode pre-filtering scheme  
      4.2.2 Motion cost oriented directional one-pass scheme  
      4.2.3 Overall hybrid schemes  
   4.3 Architecture level parallel improved schemes  
      4.3.1 Parallel improved 16-Pel processing  
      4.3.2 MB-parallel schedule  
      4.3.3 Unified pixel block loading  
      4.3.4 Parity pixel organization for parallel processing  
   4.4 Low design effort architecture for H.264/AVC intra predictor generation  
      4.4.1 Parallel processing flow for intra predictor generation  
      4.4.2 Fully utilized parallel intra predictor generation architecture  
   4.5 Experimental result of low design effort engines  
   4.6 Conclusion remarks  

5 Analysis of macroblock feature to fast inter mode decision  
   5.1 Introduction  
   5.2 Pre-stage inter mode decision schemes  
      5.2.1 MV oriented spatial-temporal inter mode check  
      5.2.2 Edge gradient based inter mode filtering  
   5.3 Motion feature based fast inter mode decision schemes  
      5.3.1 MVP accuracy and block overlapping analysis  
      5.3.2 Smoothness of sum of absolute difference (SAD)  
      5.3.3 Rate distortion cost analysis on big inter modes  
   5.4 Overall algorithm and experiments  
   5.5 Conclusion remarks  

6 Conclusions and future work  

Acknowledgement  

References  

Publications  
# List of Figures

1.1 Overview of video coding standards
1.2 Block diagram of H.264/AVC video coding system
1.3 Overview of this dissertation

2.1 Complexity in H.264 motion estimation
2.2 4-stage pipeline based video coding system
2.3 Aliasing in Hybrid Video Coding
2.4 RD Curves of QCIF 'football' and 'mobile'
2.5 2-D Fourier Spectrum Amplitude of 'football_qcif' and 'mobile_qcif'
2.6 Convolution mask of Sobel operator
2.7 MB partition in VBSME algorithm
2.8 Edge gradient analysis flow chart
2.9 Tolerance graph of 'foreman_qcif'
2.10 Coding block sizes of QCIF sequences
2.11 Spiral search order
2.12 Number of MBs with BISP in MVP
2.13 Distribution of final best mode
2.14 Impact of search range to video quality
2.15 6-Ring search range adjustment
2.16 Impact of direct sub-sampling
2.17 Three sub-sampling patterns
2.18 Flow chart of adaptive sub-sampling
2.19 Comparison of QCIF and CIF RD Curves
2.20 Comparison of 720p RD Curves
2.21 PE idle ratio
2.22 Clock cycle saving ratio
2.23 4-Stage encoding system with proposed algorithm

3.1 Sub-sampling patterns and full pixel pattern
3.2 Data reuse problem in SAD Tree structure
3.3 Original SAD Tree structure
3.4 Proposed reconfigurable SAD tree architecture
3.5 Pixel data organization
3.6 4-Pel scaled CSAD
3.7 Modification in SU
3.8 Original reference shift array
3.9 Modified reference shift array
3.10 Memory level pixel organization
3.11 Cross reuse structure for CSAD generation
3.12 Adaptive propagate partial SAD architecture
3.13 8x8 PE array in PPSAD architecture
3.14 Pixel classification and memory organization
3.15 Memory separation and overlapping
3.16 Data flow of APPSAD architecture
3.17 Compressors in standard cell library
3.18 CMPR42X1 with Multiple-bits Wide Input
3.19 Optimization of processing element
3.20 Compressor tree structure for Stage_1
3.21 Compressor tree structure for Stage_2
3.22 Compressor tree structure for Stage_3, Stage_5 and Stage_7
3.23 Compressor tree structure for Stage_4, Stage_6 and Stage_8
3.24 Clock saving of HDTV sequences
3.25 IME block diagram with APPSAD architecture
3.26 Hardware cost saving of 8x8 PE array
3.27 Power dissipation of 8x8 PE array
3.28 Power consumption comparison

4.1 Spectrum comparison of HDTV1080p with SHV
4.2 Impact of mode reduction on SHV
4.3 Mode reduction based mode pre-filtering scheme
4.4 Motion Cost Oriented One-pass Scheme
4.5 Pseudo codes of FME algorithm
4.6 RD curve comparison
4.7 16-Pel interpolation process
4.8 MB parallel processing schedule
4.9 Unified pixel block loading scheme
4.10 Solution to memory access conflict
4.11 Original processing flow
4.12 Proposed processing flow
4.13 Proposed predictor generation engine
4.14 Proposed architecture for I4MB modes
4.15 Proposed architecture for I16MB plane mode
4.16 Proposed architecture configured for I16MB and I4MB DC Mode
4.17 4k×4k Super Hi-Vision FME architecture
4.18 Scheme for SHV FME engine
4.19 Pixel saving ratio of UPB scheme

5.1 Inter Block Modes in H.264/AVC
5.2 Spatial-temporal Skip Mode Check
5.3 Pseudo Codes of Pre-Stage Inter Mode Decision
5.4 Inter Mode Distributions
5.5 Gradient Distributions of 20th Frame
5.6 BIP Distribution of 16×16 Mode in 100 Frames
5.7 Overall Flow Chart of Proposed Algorithm
5.8 Comparison of RD Curves

6.1 Whole conclusion of dissertation
# List of Tables

<table>
<thead>
<tr>
<th>Table</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>2.1</td>
<td>Impact of $THRG$ on sequences</td>
</tr>
<tr>
<td>2.2</td>
<td>Simulation conditions for BISP on previous frame</td>
</tr>
<tr>
<td>2.3</td>
<td>Simulation conditions for BISP on five reference frames</td>
</tr>
<tr>
<td>2.4</td>
<td>BISP Distribution on 1st to 5th Reference Frame</td>
</tr>
<tr>
<td>2.5</td>
<td>Homo MB Ratio (%) for 1/4 Subsampling</td>
</tr>
<tr>
<td>2.6</td>
<td>Ratio (%) of MB with MRF Elimination</td>
</tr>
<tr>
<td>2.7</td>
<td>Ratio (%) of MB with Small Range Constraint</td>
</tr>
<tr>
<td>2.8</td>
<td>Quality Comparison with Full Search</td>
</tr>
<tr>
<td>2.9</td>
<td>ME Time Reduction Ratio with Full Search (%)</td>
</tr>
<tr>
<td>2.10</td>
<td>Quality Comparison with UMHexagon Search</td>
</tr>
<tr>
<td>2.11</td>
<td>Speed-up of UMHexagon Search</td>
</tr>
<tr>
<td>3.1</td>
<td>Quality analysis of adaptive sub-sampling</td>
</tr>
<tr>
<td>3.2</td>
<td>Comparison with Extended SAD Tree</td>
</tr>
<tr>
<td>3.3</td>
<td>Comparison of RSADT with Previous Designs</td>
</tr>
<tr>
<td>3.4</td>
<td>Comparison of APPSAD with Previous Designs</td>
</tr>
<tr>
<td>4.1</td>
<td>Predictors of I4MB modes in $4\times4$ sub-block</td>
</tr>
<tr>
<td>4.2</td>
<td>Output predictors of I4MB modes in $4\times4$ sub-block</td>
</tr>
<tr>
<td>4.3</td>
<td>Output predictors of I16MB plane mode</td>
</tr>
<tr>
<td>4.4</td>
<td>Hardware statistics ($1.62V,125^{\circ}C$)</td>
</tr>
<tr>
<td>4.5</td>
<td>Experimental result and comparison</td>
</tr>
<tr>
<td>4.6</td>
<td>Comparison of processing cycles for one $4\times4$ sub-block</td>
</tr>
<tr>
<td>5.1</td>
<td>Complexity Analysis based on $-\Delta MET$ (%)</td>
</tr>
<tr>
<td>5.2</td>
<td>Quality Analysis based on C1 and C2 (C1: $\Delta PSNR$ (dB); C2: $\Delta Bits$ (%))</td>
</tr>
</tbody>
</table>
# Acronyms

AMPD: advanced mode pre-decision
APPSAD: adaptive propagate partial SAD
B.MB: bottom macroblock
BDBR: Bjøntegaard delta bit-rate
BDPSNR: Bjøntegaard delta peak signal-to-noise ratio
BIP: best integer point
BISP: best integer search position
BitR: bit-rate
BL.MB: bottom-left macroblock
BMMB: big mode macroblock
BR.MB: bottom-right macroblock
CMO: cross mode overlapping
CMPR32: 3-2 compressor
CMPR42: 4-2 compressor
Co.MB: co-located macroblock
CRS: cross reuse structure
CSAD: configurable sum of absolute difference
Cur.MB: current macroblock
DB: de-blocking
dynamic_SR: dynamic search range scheme
EC: entropy coding
fm: full mode
FME: fractional motion estimation
HD: high definition
Homo: homogeneous
HW_utiliz: hardware utilization
I4MB: intra 4×4 prediction modes
I16MB: intra 16×16 prediction modes
IBO: inner block overlapping
ICI: immediate carry-in
ICO: immediate carry-out
IMC: integer motion cost
IME: integer motion estimation
IMV: integer motion vector
IP: intra prediction
L.MB: left macroblock
LU.MB: left-up macroblock
MAFD: mean of absolute frame difference
MB: macroblock
MCDOP: motion cost oriented one-pass
ME: motion estimation
MET: motion estimation time
Min_freq: minimum required frequency
mr: mode reduction
MP: matching pattern
MRF: multiple reference frame
MRMPF: mode reduction based mode pre-filtering
MSU: modified snake scan unit
MV: motion vector
MVP: motion vector predictor
NMB: normal macroblock
Non Homo: nonhomogeneous
PA: pixel assemble
PD: pixel difference
PDA: pixel difference analysis
PE: processing element
PE_CONV: conventional processing element
PPSAD: propagate partial SAD
pro_SR: proposed search range scheme
PSNR: peak signal-to-noise ratio
PU: processing unit
PUH: processing unit for half pixel refinement
PUQ: processing unit for quarter pixel refinement
QP: quantization parameter
R.MB: right macroblock
RD: rate distortion
RSA: reference shift array
RSADT: reconfigurable SAD Tree
RU.MB: right-up macroblock
SA: similarity analysis
SAD: sum of absolute difference
SAD8x8_BL: bottom-left 8x8 SAD
SAD8x8_BR: bottom-right 8x8 SAD
SAD8x8_LU: left-up 8x8 SAD
SAD8x8_RU: right-up 8x8 SAD
SHV: super hi-vision
SR: search range
SU: snake scan unit
U.MB: upper macroblock
UPB: unified pixel block
VBS: variable block size
Chapter 1

Introduction

1.1 Background and purpose of this dissertation

Sixteen years ago, the advent of the MPEG-2 standard enriched our lives with a worldwide digital television system. Since then, MPEG-2 has become a key technique, widely used in the transmission of High Definition (HD) TV signals over satellite and cable and in the storage of high-quality SD video signals on DVDs. However, the increasing demand for more services over the network, together with the desire for a vivid and impressive daily life, makes bit rates on the network soar dramatically. Nowadays, high bit rate connections are almost everywhere around us. The ever tougher situation in network transmission continuously pushes video compression techniques forward.

Currently, the latest video coding standard is H.264/AVC, which first came into existence in 2003 [1]. Compared with previous standards, the performance improvement of H.264/AVC is quite significant [2]. Figure 1.1 shows the development of video coding standards. Compared with MPEG-4 [3], H.263 [4], and MPEG-2 [5], the H.264/AVC standard can achieve 39%, 49% and 64% bit-rate reduction, respectively. In the near future, H.265 may come into existence, and the performance improvement of a new standard is always a hot topic. Figure 1.2 gives the whole block diagram of the H.264/AVC based hybrid encoding system. The labels in bold italic font on the diagram mark the new techniques introduced by the H.264/AVC standard.

Figure 1.1: Overview of video coding standards

For example, H.264/AVC adopts techniques such as variable block size (VBS), multiple reference frame (MRF), intra prediction (IP), context adaptive entropy coding (EC), in-loop de-blocking (DB) and so on. These techniques mainly fall into three categories. Firstly, H.264/AVC introduces techniques which target higher prediction accuracy. The ME and IP parts fully exploit temporal and spatial redundancy. Besides skip modes, there are seven inter modes with different block sizes in inter prediction. Together with the MRF technique, they condense temporal information very efficiently. As for IP modes, there are nine intra $4 \times 4$ modes and four intra $16 \times 16$ modes. All these inter and intra modes are involved in a rate distortion based encoding process, which ensures the best result within the available resources. Secondly, H.264/AVC introduces techniques which focus on image enhancement. To remove the visible artifacts of the block based hybrid compression scheme, it uses an adaptive in-loop de-blocking filter, where the strength of filtering is controlled by the values of several syntax elements. Also, the interpolation of half and quarter pixels for fractional motion estimation is an efficient way to compensate for the inevitable aliasing problem, which also leads to better image quality. Thirdly, H.264/AVC introduces a new mathematical model which greatly improves the compression capability. The powerful entropy coding method, namely CABAC, provides a good solution to the ever-increasing bit rates.

Although there are many appealing points in the H.264/AVC standard, its shortcoming is also quite obvious. Compared with previous standards, the complexity of H.264/AVC has become a 'hot potato', and many researchers have focused on this topic for several years. The computation complexity of each part is also marked on Fig. 1.2.

Figure 1.2: Block diagram of H.264/AVC video coding system

Among all parts, including IP, mode decision and interpolation, the ME part is the most significant one and occupies almost 90% of the computation. In order to reduce computation complexity while keeping video quality, many software algorithms have been proposed to speed up the ME process. However, when hardware is considered, the efficiency of software level algorithms is greatly decreased. The high throughput of the ME part makes pipeline stages a must, which deteriorates the efficiency of many fast algorithms. Also, the important issues in the hardware field are quite different from those in the software domain, where power and computation capability are abundant as long as the computer is powerful enough. In hardware fields such as ASIC design, issues like hardware cost, parallel processing, power dissipation, data reuse, memory size and hardware utilization are of great importance. Therefore, there exists a gap between software algorithms and hardware design. The purpose of this dissertation is to fill this gap and propose hardware friendly fast algorithms together with low hardware cost and low power architectures. The related research topics in this thesis are marked with broken lines in Fig. 1.2.
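To make the dominance of ME concrete, the following minimal Python sketch implements exhaustive (full search) block matching with the sum of absolute differences (SAD) and counts the absolute-difference operations it needs per macroblock. It is an illustration only and not the algorithm proposed in this dissertation: the function names, the frame representation as nested lists, and the search window convention of [-search_range, search_range) in each direction are all assumptions made for this example.

```python
def sad(cur, ref, bx, by, dx, dy, n=16):
    """SAD between the n x n block of `cur` at (bx, by) and the block of
    `ref` displaced by (dx, dy). Frames are lists of rows (assumption)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, search_range, n=16):
    """Test every integer displacement in [-search_range, search_range)
    that keeps the candidate block inside the reference frame.
    Returns (best_sad, dx, dy), or None if no candidate is valid."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range):
        for dx in range(-search_range, search_range):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue  # candidate block would leave the frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


def sad_operations(search_range, ref_frames, n=16):
    """Absolute differences per macroblock for an unconstrained full search
    (hypothetical cost model: candidates x pixels x reference frames)."""
    candidates = (2 * search_range) ** 2
    return candidates * n * n * ref_frames
```

Under these assumptions, a search range of 16 with one reference frame gives (2·16)² · 256 = 262,144 absolute differences per MB, and five reference frames multiply this figure by five.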
1.2 Scope of this dissertation

This dissertation focuses on a hardware friendly, low complexity fast motion estimation algorithm and the related low cost architecture. To attain this goal, this work focuses on three areas of research:
1. hardware friendly algorithm
2. low cost hardware architecture
3. fast mode decision scheme

To cover these three areas, the dissertation consists of six chapters as shown in Fig. 1.3.

Chapter 2 describes the origin of video quality loss in sampling based digital signal systems. Based on theoretical and statistical analysis, several hardware friendly complexity reduction schemes are proposed. The proposed algorithm is based on the hardware data flow and reduces the complexity of the MRF technique, redundant search points and the full pixel matching pattern. Experimental results show that the proposed hardware friendly algorithm can achieve up to 95.26% complexity reduction and is orthogonal to existing software oriented fast algorithms. Moreover, all the proposed schemes can be easily implemented in a pipeline stage based real-time encoding system.
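The saving obtained from a reduced matching pattern can be illustrated with a short sketch. The code below computes SAD on every `step`-th pixel in both directions, so `step=2` keeps one pixel in four. The regular 2:1 decimation used here is an assumption for illustration; the actual hardware oriented patterns of this dissertation are defined in chapter 2 and are not reproduced here. The function name and its return convention are likewise hypothetical.

```python
def sad_subsampled(cur_block, ref_block, step=1):
    """SAD over every `step`-th pixel of two equally sized square blocks.

    step=1 is the full pixel pattern; step=2 is a plain quarter
    sub-sampling (assumed pattern, not the thesis patterns).
    Returns (sad_value, number_of_pixels_used).
    """
    n = len(cur_block)
    total = 0
    used = 0
    for y in range(0, n, step):
        for x in range(0, n, step):
            total += abs(cur_block[y][x] - ref_block[y][x])
            used += 1
    return total, used
```

For a 16×16 block, `step=2` touches 64 of the 256 pixels, i.e. 75% fewer absolute differences, at the price of ignoring the skipped pixels during matching.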

In chapter 3, two HDTV targeted flexible hardware architectures are presented. The proposed structures adopt the adaptive sub-sampling algorithm, which cannot be efficiently realized on the existing SAD Tree and propagate partial SAD (PPSAD) architectures. In the proposed architectures, architectural level and memory level data organization is adopted, which enables full data reuse, full hardware utilization and lower power consumption for the adaptive sub-sampling algorithm. Compared with the original designs, the proposed reconfigurable SAD Tree and adaptive PPSAD architectures achieve 38.8% and 39.8% reduction in power dissipation, respectively.

In chapter 4, the dissertation focuses on the high throughput issue of Super Hi-Vision (SHV) applications. With the advent of the SHV concept, the hardware implementation of SHV based real-time encoding systems has become a hot topic. The analysis of existing works shows that simply extending them to SHV would cause high design effort, large hardware resources and redundant memory access. In the proposed architecture, algorithm level optimization and hardware level parallel processing are both adopted to satisfy the throughput requirement. With a working speed of only 145MHz, one SHV 4k×4k@60fps targeted fractional motion estimation engine is presented. As for intra prediction, the predictor generation part is the most significant component for high throughput applications. In this dissertation, one highly parallel intra predictor generation structure is presented. Based on a parallel processing flow and a dedicated full reuse structure, about 77.5% of the processing time is saved compared with the original design.

For H.264/AVC based real-time systems, mode decision is another important part considering the complexity of the whole encoding system. The trade-off among video quality, complexity reduction and image features is always a tough research topic. In chapter 5, one novel inter mode decision algorithm is introduced. The proposed scheme achieves complexity reduction in a multi-stage way, which makes it suitable for images with different motion features. Compared with existing works, the proposal is superior to other schemes across various types of sequences.

Chapter 6 summarizes the whole research and gives a brief view of future research directions, which will push the current research towards a higher level and wider application fields.
Chapter 2

Hardware oriented fast H.264/AVC motion estimation algorithm

2.1 Introduction

As mentioned in the previous chapter, the H.264/AVC standard is superior to previous ones in terms of image quality and compression capability. However, it is also computation intensive due to many dedicated techniques. Literature [6] gives the complexity distribution of each part. The motion estimation (ME) part, which occupies almost 90% of the computation, turns out to be the most significant part. As shown in Fig. 2.1, the overwhelming complexity in ME mainly comes from five aspects: search pattern, search range, sampling pattern, variable block size (VBS), and reference frame number. During ME of the current MB, the VBS technique divides one 16×16 block into 16×8, 8×16, and 8×8 modes. When the 8×8 mode is selected, it can be further divided into 8×4, 4×8 and 4×4 modes. Motion estimation is executed on each mode. For the sampling pattern, as shown in Fig. 2.1, when quarter sub-sampling is used, only 1/4 of the original pixels are used in the block matching process, so 75% of the block matching calculation is saved. However, direct sub-sampling causes quality degradation. The relationship between reference frame number and complexity is linear: when 5 reference frames are used, the complexity is 5 times that of 1 reference frame under the same conditions. The setting of the search range determines the number of candidate search points, which also affects complexity a lot. In terms of search pattern, many existing patterns such as diamond search [7] [8], four-step search [9], three-step search [10], predictive zonal search [11] [12] and hexagon pattern [13] have already been proposed to reduce search points. Fig. 2.1 shows an example of the hexagon search pattern.

Figure 2.1: Complexity in H.264 motion estimation
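As a rough illustration of the sampling-pattern factor described above, the sketch below computes the SAD of a 16×16 block with the full pattern and with quarter sub-sampling. The function name `sad`, the `step` parameter and the choice of keeping every second pixel in each direction are assumptions made for illustration only; the adaptive pattern proposed later in this chapter is not shown.

```python
def sad(cur, ref, step=1):
    """Sum of absolute differences between two equally sized blocks.

    step=1 uses every pixel (full pattern); step=2 keeps every second
    pixel in both directions, i.e. 1/4 of the pixels (an assumed
    quarter sub-sampling pattern).
    Returns (sad_value, number_of_pixel_differences_computed).
    """
    total = 0
    ops = 0
    for y in range(0, len(cur), step):
        for x in range(0, len(cur[0]), step):
            total += abs(cur[y][x] - ref[y][x])
            ops += 1
    return total, ops


# Toy 16x16 blocks: a ramp, and the same ramp with a constant offset of 3.
cur = [[x + y for x in range(16)] for y in range(16)]
ref = [[x + y + 3 for x in range(16)] for y in range(16)]

full_sad, full_ops = sad(cur, ref, step=1)
sub_sad, sub_ops = sad(cur, ref, step=2)
saving = 1 - sub_ops / full_ops  # fraction of matching operations saved
print(full_ops, sub_ops, saving)
```

The operation count drops to a quarter, which is the 75% saving quoted in the text; the SAD value itself is no longer computed from all pixels, which is where the quality degradation of direct sub-sampling originates.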

To reduce computation complexity while keeping video quality, many works have been done [6, 13, 14, 15, 16]. Literature [14] proposes a fast motion estimation algorithm based on analysis of motion vectors (MVs) in previous frames. Literature [15] uses the MVs of previous frames and up-layer blocks to reduce the computation complexity of search points and reference frames. In the case of [16], the proposed algorithm first builds three error surfaces by using the initial 3 block modes (16×16, 8×8, 4×4). The decision of whether to test other modes or finer sub-block partitions is based on the error surface analysis. The work of literature [6] uses four heuristic criteria to terminate the ME process early. These algorithms can achieve 30% to 90% reduction in ME time. As for search pattern based fast motion estimation, the UMHexagon search [13] can achieve ME time reduction of up to 90%.

In the hardware field, as mentioned by many works [17, 18, 19], it is a must to divide the motion estimation engine into two stages due to the huge throughput required in every clock cycle. As shown in Fig. 2.2, the integer motion estimation (IME) engine is arranged in the first stage while fractional motion estimation (FME) is in the second stage. Therefore, early termination at the FME stage as in [6] does not work, because the IME, which occupies 52% of the computation, has already finished its work before handing the best MVs to the FME stage. As for the motion vector based fast algorithms [14][15], they are not favorable for hardware because the storage of all MVs of previous frames is a great burden on the system's hardware cost. For instance, with a 24×24 search window size, 10 bits are required to store one MB's MV in [14]. When 5 reference frames are adopted, even in the CIF format, the extra SRAM will be 19.8k bits. With the increase of image size (HDTV for example), the related extra memory will become a serious burden on the system. For [16], since the rate distortion cost is only available in the last stage of the hardwired video coding system, it is impossible to apply this algorithm in a real-time encoding process. In terms of the search pattern based fast algorithm [13], the irregular memory access and unpredictable data flow make this algorithm difficult to implement in hardware. So, the existing software oriented algorithms are either impractical or inefficient for hardware design. For hardwired video encoding systems, the widely adopted search scheme is the full search algorithm, which has the best video quality, regular memory access and fixed processing control [20].
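The 19.8k-bit figure above can be reproduced with a short calculation. The sketch below assumes one 10-bit MV record per MB per reference frame, which matches the quoted number for CIF; the function name `mv_storage_bits` and the 1920×1080 example are illustrative assumptions, not values taken from [14].

```python
def mv_storage_bits(width, height, ref_frames, bits_per_mb_mv=10, mb_size=16):
    """Extra storage (bits) for one MV record per MB for every reference frame.

    Assumes one record of `bits_per_mb_mv` bits per 16x16 MB, as in the
    CIF example of the text; frame sizes are rounded up to whole MBs.
    """
    mbs_x = (width + mb_size - 1) // mb_size
    mbs_y = (height + mb_size - 1) // mb_size
    return mbs_x * mbs_y * ref_frames * bits_per_mb_mv


cif_bits = mv_storage_bits(352, 288, ref_frames=5)    # 396 MBs per frame
hd_bits = mv_storage_bits(1920, 1080, ref_frames=5)   # 8160 MBs per frame
print(cif_bits, hd_bits)
```

Under these assumptions the storage grows in proportion to the number of MBs, so a 1080-line HDTV frame needs roughly 20 times the extra memory of a CIF frame.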

In this chapter, several hardware friendly fast motion estimation schemes are presented, which achieve complexity reduction while keeping the full search data flow unchanged. Firstly, for the MRF technique, two low complexity schemes are introduced. Based on mathematical analysis, the aliasing problem in the image processing field is discussed. An image with high frequency features is regarded as aliasing sensitive, and the MRF technique is applied to such an image. In this dissertation, I use the Sobel edge detector to classify MBs with different frequency features. Also, simulation shows that MRF can be eliminated for images that consist of abundant stationary parts. In this dissertation, similarity analysis is executed on the central nine positions during block matching on the first frame. The MRF technique is disabled on stationary MBs to achieve further reduction of complexity. Secondly, in terms of search range, two adaptive search range adjustment schemes are presented. For small motion MBs, the search range is restricted to a local central field and redundant search points are removed consequently. For ordinary motion MBs, one recursive 6-ring search range adjustment scheme is introduced to achieve complexity reduction. Furthermore, in the aspect of matching pattern, one adaptive sub-sampling scheme is presented to reduce complexity and compensate for the quality loss of the direct sub-sampling technique. The details of each scheme are given in the remaining parts of this chapter.

Figure 2.2: 4-stage pipeline based video coding system

2.2 Hardware oriented multiple reference frame elimination

In this section, the aliasing problem in conventional video encoding systems is analyzed. After that, two complexity reduction schemes for the MRF technique are presented.

2.2.1 Aliasing problem and impact of edge detection

In [21], it has already been proved that aliasing is the main cause of video quality deterioration. The adoption of MRF and sub-pel interpolation in H.264 actually compensates for the aliasing problem. Here, I analyze the aliasing problem in the spatial and frequency domains and then show the influence of the edge gradient on the frequency spectrum.

To ease the analysis, only a one-dimensional signal is analyzed and the spatial sampling interval is assumed to be $I = 1$. Let $l(x)$ be a spatially continuous signal, and let $l_t(x)$ and $l_{t-1}(x)$ be the signals at time instances $t$ and $t-1$. Their spatial Fourier transforms are related as shown in Eq. 2.1, where $d_x$ is the displacement between $l_t(x)$ and $l_{t-1}(x)$. It is shown that $L_{t-1}(j\omega_x)$ and $L_t(j\omega_x)$ are the same except for a phase difference.

Let $s_t(x_n)$ and $s_{t-1}(x_n)$ be the sampled versions of the spatially continuous signals $l_t(x)$ and $l_{t-1}(x)$; their Fourier transforms are given in Eq. 2.2. Equation 2.2 shows that the aliasing problem can be avoided if Eq. 2.3, which represents the band-limiting low pass filter in the image sensor system, is satisfied.

$$
l_t(x) \Leftrightarrow L_t(j\omega_x) = L_{t-1}(j\omega_x) \cdot e^{-jd_x\omega_x} \tag{2.1}
$$

$$
S_t(j\omega_x) = \sum_{k=-\infty}^{+\infty} L_t(j\omega_x - jk2\pi) = \sum_{k=-\infty}^{+\infty} L_{t-1}(j\omega_x - jk2\pi) \cdot e^{-jd_x(\omega_x-k2\pi)} \tag{2.2}
$$

$$
L_{t-1}(j\omega_x) = 0, \quad |\omega_x| \geq \pi \tag{2.3}
$$

However, since an ideal low pass filter does not exist, the aliasing problem inevitably occurs in video coding systems, as shown in Fig. 2.3. Another important result derived from Eq. 2.2 and Eq. 2.3 is that an image rich in high frequency signals is vulnerable to the aliasing problem.
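A minimal numerical sketch of the band-limit condition in Eq. 2.3 follows. It samples two cosines at the integer positions implied by $I = 1$: one at $\omega_x = 1.5\pi$, which violates $|\omega_x| < \pi$, and one at $0.5\pi$. The function name `sample` and the chosen frequencies are assumptions made for illustration.

```python
import math


def sample(omega, n_samples=16):
    """Sample cos(omega * x) at integer positions x = 0, 1, ... (interval I = 1)."""
    return [math.cos(omega * x) for x in range(n_samples)]


high = sample(1.5 * math.pi)  # violates the condition |omega| < pi
low = sample(0.5 * math.pi)   # its alias, since 2*pi - 1.5*pi = 0.5*pi

# After sampling, the two signals cannot be told apart.
max_diff = max(abs(a - b) for a, b in zip(high, low))
print(max_diff)
```

The maximum difference is at the level of floating point rounding, so the high frequency component appears after sampling as a low frequency one; this is the spectral overlap between the shifted copies in Eq. 2.2.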

Figure 2.4 shows the rate distortion (RD) curves of two QCIF sequences. It is shown that ‘mobile_qcif’ is more sensitive to MRF than ‘football_qcif’. The quality difference of ‘mobile_qcif’ between 1 and 5 reference frames is up to 1.5 dB, which is unacceptable for a video coding system. In fact, the features of the sequences show that ‘mobile_qcif’ contains many textures, and sharp edges in the spatial domain generate rich high frequency signals after the Fourier transform. The abundant high frequency content in ‘mobile_qcif’ is the main reason for the occurrence of aliasing.

Figure 2.3: Aliasing in Hybrid Video Coding

Figure 2.4: RD Curves of QCIF ‘football’ and ‘mobile’

Figure 2.5 shows the 2-D Fourier spectrum amplitude of the two sequences. A Hamming window is used to compensate for spectrum leakage. The spectrum analysis clearly shows that high frequency signals are much more abundant in ‘mobile_qcif’ than in ‘football_qcif’. Thus, the above theoretical analysis indicates that aliasing is the main reason for video quality degradation and that the Fourier spectrum can reflect the importance of MRF for a video sequence.

The intuitive way of adjusting the reference frame number is through analysis of the Fourier spectrum. However, such a decision criterion is impractical because the computation complexity will increase dramatically. In fact, a signal's frequency spectrum is closely related to its gradient amplitude. Edge information in an MB reflects the spread of the frequency spectrum in that MB, so gradient analysis is feasible as a decision criterion. In the edge detection based reference frame elimination scheme, I use the gradient amplitude of each MB to restrict the number of reference frames.

Figure 2.5: 2-D Fourier Spectrum Amplitude of 'football_qcif' and 'mobile_qcif'

$$
G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}, \qquad
G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}
$$

Figure 2.6: Convolution mask of Sobel operator

2.2.2 Gradient based multiple reference frame elimination

In the edge detection field, there exist many operators. Among them, the Sobel operator is widely used to obtain the 2-D spatial gradient by emphasizing the edges, which represent high spatial frequency. So, I use the Sobel operator in the proposed fast algorithm. In fact, the Sobel operator is already applied in many mode decision algorithms [22][23][24] and its merit is proved by these algorithms. The convolution mask of the Sobel edge detector is described in Fig. 2.6. In a luminance picture, if \( P(m, n) \) denotes the pixel value at position \((m, n)\), its gradients in the x-direction and y-direction are \( G_x(m, n) \) and \( G_y(m, n) \), as shown in Eq. 2.4 and Eq. 2.5. \( G(m, n) \), the gradient of \( P(m, n) \), is calculated as the sum of \( G_x(m, n) \) and \( G_y(m, n) \), as shown in Eq. 2.6.

\[
G_x(m, n) = |P(m-1, n-1) + 2P(m-1, n) + P(m-1, n+1) - P(m+1, n-1) - 2P(m+1, n) - P(m+1, n+1)| \tag{2.4}
\]

\[
G_y(m, n) = |P(m-1, n-1) + 2P(m, n-1) + P(m+1, n-1) - P(m-1, n+1) - 2P(m, n+1) - P(m+1, n+1)| \tag{2.5}
\]

\[
G(m, n) = G_x(m, n) + G_y(m, n) \tag{2.6}
\]

\[
\begin{cases}
  G(m, n) < \text{THR}_G, & \text{Homo} \\
  \text{otherwise,} & \text{Non Homo}
\end{cases} \tag{2.7}
\]

Figure 2.7 shows the MB partition in the H.264 VBSME algorithm. The edge detection criterion is applied based on the partitions enclosed in dashed lines. The minimum block size in gradient analysis is 8 × 8. A block is regarded as homogeneous if the analysis result is within a certain threshold. Figure 2.8 is the flow chart of the proposed edge gradient analysis procedure. Firstly, gradient analysis is executed on each 8 × 8 block in the MB. One 8 × 8 block is judged as a homogeneous (Homo) block if it satisfies Eq. 2.7. Based on the results of the four 8 × 8 blocks, one 16 × 8 block is judged as a homo block if its 8 × 8 sub-blocks are all homo blocks. For example, \( B16{\times}8_{0} \) is a homo block if \( B8{\times}8_{00} \) and \( B8{\times}8_{01} \) are both homo blocks. The 16 × 16 block is regarded as a homo block only if all four of its 8 × 8 sub-blocks are homo blocks. Otherwise, it is treated as a nonhomogeneous (Non Homo) 16 × 16 block. Here, the setting of \( \text{THR}_G \) is a critical factor that affects both computation complexity and video quality. If \( \text{THR}_G \) is set too high, the video quality will degrade greatly although complexity reduction can be achieved to some extent. On the other hand, too low a \( \text{THR}_G \) cannot relieve the intensive computation of the MRF algorithm. In the following part, I analyze the setting of \( \text{THR}_G \) in detail through experimental results. The edge gradient analysis is executed while the pixels of the current MB are being loaded. It finishes before IME starts and decides the reference frame number for the following block matching process.
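The classification flow described above can be sketched as follows. This is an illustrative sketch rather than the implemented hardware flow: it assumes that an 8 × 8 block is homogeneous when every evaluated pixel in it satisfies Eq. 2.7, and that pixels on the MB border (whose 3 × 3 neighbourhood leaves the MB) are skipped. Both are assumptions, as are the names `sobel_gradient` and `classify_mb`. The threshold in the usage lines follows the $THR_G = 10QP$ setting derived in Section 2.2.3.

```python
def sobel_gradient(P, m, n):
    """G(m, n) = Gx(m, n) + Gy(m, n), following Eq. 2.4 to Eq. 2.6."""
    gx = abs(P[m-1][n-1] + 2*P[m-1][n] + P[m-1][n+1]
             - P[m+1][n-1] - 2*P[m+1][n] - P[m+1][n+1])
    gy = abs(P[m-1][n-1] + 2*P[m][n-1] + P[m+1][n-1]
             - P[m-1][n+1] - 2*P[m][n+1] - P[m+1][n+1])
    return gx + gy


def classify_mb(P, thr_g):
    """Classify a 16x16 MB given as P[m][n].

    Returns (homo8, homo16x8, homo16x16), where homo8[i][j] is the
    result for sub-block B8x8_ij and homo16x8[i] the one for B16x8_i.
    Assumption: a block is Homo when all evaluated pixels pass Eq. 2.7.
    """
    homo8 = [[True, True], [True, True]]
    for i in range(2):
        for j in range(2):
            # Skip MB border pixels (assumed border handling).
            for m in range(max(1, 8 * i), min(15, 8 * i + 8)):
                for n in range(max(1, 8 * j), min(15, 8 * j + 8)):
                    if sobel_gradient(P, m, n) >= thr_g:
                        homo8[i][j] = False
    homo16x8 = [homo8[i][0] and homo8[i][1] for i in range(2)]
    homo16x16 = all(homo8[i][j] for i in range(2) for j in range(2))
    return homo8, homo16x8, homo16x16


qp = 20
flat_mb = [[128] * 16 for _ in range(16)]
edge_mb = [[0 if n < 4 else 255 for n in range(16)] for _ in range(16)]
flat_result = classify_mb(flat_mb, thr_g=10 * qp)
edge_result = classify_mb(edge_mb, thr_g=10 * qp)
print(flat_result[2], edge_result[0])
```

A flat MB is classified as homogeneous at every level, while the step edge makes the two sub-blocks that contain it, and therefore both 16 × 8 blocks and the whole MB, non-homogeneous.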

2.2.3 Quantization parameter based threshold adjustment

From the theoretical point of view, the threshold setting is always a trade-off between quality and complexity. The prediction error $e$ in the block matching process can be assumed to be a jointly Gaussian source with zero mean and variance $\sigma^2$. According to [25], the quantization distortion $D$ is approximated as $QP^2/3$, where QP is the quantization parameter. So, the rate distortion function [26] can be represented as Eq. 2.8, where $R(D)$ is the transmission bit-rate for distortion $D$. The variance $\sigma^2$ represents the maximum distortion under the Gaussian model. When the distortion $D$ equals zero, the original signal is reconstructed without any loss of image detail. All the information in the image (including textures and noise) is exactly the same as in the original source image, and the maximum transmission bit-rate is required to keep this information. In fact, this is a limiting case that never happens in a real video encoding system such as H.264/AVC, because the transform and quantization cause some loss of image detail, which makes distortion between the original source image and the reconstructed one inevitable. On the other hand, when $D$ is larger than $\sigma^2$, the transmission bit-rate for $D$ becomes zero. This conclusion is in accordance with the QP setting in the H.264 encoding system. With the increase of QP, the smoothness of reconstructed frames increases, which results in a decline of image detail. The related residue value also decreases. It means that the quality degradation for edge abundant images is quite obvious under a big QP. In the extreme case, all the details are removed by one very large QP and the residue information vanishes, which indicates that no transmission bit-rate is required. Thus, from the theoretical analysis of [25] and [26], the threshold can simply be regarded as having a linear relationship with the QP value.

Figure 2.8: Edge gradient analysis flow chart

\[
R(D) = \begin{cases} 
\frac{1}{2} \log_2 \frac{\sigma^2}{D}, & 0 \leq D \leq \sigma^2 \\
0, & D > \sigma^2 
\end{cases} \tag{2.8}
\]
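Eq. 2.8 and the $D \approx QP^2/3$ approximation can be evaluated numerically with the sketch below. The function names and the variance value $\sigma^2 = 400$ are assumptions chosen only to make the trend visible; they are not measured values.

```python
import math


def quantization_distortion(qp):
    """Approximation D = QP^2 / 3 quoted in the text from [25]."""
    return qp * qp / 3.0


def rate(sigma2, d):
    """Rate distortion function of Eq. 2.8 for a Gaussian source."""
    if d <= 0:
        return math.inf  # lossless limit: unbounded rate
    if d > sigma2:
        return 0.0       # distortion exceeds the source variance
    return 0.5 * math.log2(sigma2 / d)


sigma2 = 400.0  # assumed prediction error variance, for illustration only
rates = [rate(sigma2, quantization_distortion(qp)) for qp in (20, 24, 28, 32, 36)]
print(rates)
```

With this assumed variance the rate falls monotonically as QP grows and reaches zero once $QP^2/3$ exceeds $\sigma^2$, which is the behaviour the paragraph above describes.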

From the statistical point of view, exhaustive experiments are executed to obtain the optimum threshold value. I apply the edge gradient based reference frame elimination scheme to typical sequences. Since the setting of the QP value affects video quality, which is represented by PSNR and bit-rate (BitR) variation, I define \( \Delta PSNR \) and \( \Delta BitR \) as two tolerance constraints under different QPs, as shown in Eq. 2.9. $PSNR_{pro}$ and $BitR_{pro}$ are the results of the proposed algorithm, while $PSNR_{jm}$ and $BitR_{jm}$ are the results of the original JM full search algorithm. Equation 2.9 clearly shows the PSNR and BitR difference of each point on the RD curves.

$$
\begin{aligned}
\Delta PSNR &= |PSNR_{pro} - PSNR_{jm}| \\
\Delta BitR &= |10\log_{10}BitR_{pro} - 10\log_{10}BitR_{jm}|
\end{aligned} \tag{2.9}
$$

Several $THR_G$ values are applied to typical MRF sensitive sequences to test the impact of $THR_G$. The sequences used are ‘foreman_qcif’ and ‘mobile_qcif’, which are both MRF sensitive. Five reference frames are enabled and 200 frames are encoded under the baseline profile. Table 2.1 shows the experimental results with $THR_G$ ranging from 160 to 360. On the whole, for the same QP, the video quality degrades with the increase of $THR_G$. Specifically, for ‘foreman_qcif’, if 0.067 dB is set as the maximum tolerance constraint of PSNR loss and 0.025 dB as the maximum tolerance constraint of bit-rate gain, then a large $THR_G$ is only suitable for big QP values. Entries marked with an asterisk violate the constraint. Figure 2.9 is the tolerance graph of ‘foreman_qcif’ based on Table 2.1. It depicts the relationship among $THR_G$, $\Delta PSNR$ and $\Delta BitR$. Each black circle on an axis represents the $\Delta PSNR$ under a certain QP. Each white square represents the corresponding $\Delta BitR$. The solid circular line is the PSNR tolerance constraint while the broken circular line is the BitR constraint. Based on Table 2.1 and Fig. 2.9, when $THR_G$ is set linearly with QP, maximum ME time reduction can be achieved while the video quality loss stays under the constraint. Different sequences have different degrees of tolerance. However, the linear relationship between $THR_G$ and QP is the same for MRF sensitive sequences. For example, as shown in Table 2.1, if I set 0.03 dB as the PSNR constraint and 0.015 dB as the BitR constraint for ‘mobile_qcif’, it is also possible to obtain a linear relationship between QP and its $THR_G$. In fact, the increase of QP means that the reference frame becomes smoother, so the ratio of homo blocks increases, which makes it reasonable to change $THR_G$ according to QP.
Therefore, I set the $THR_G$ of the edge gradient based reference frame number adjustment scheme to $10QP$ to achieve a large ME time reduction while keeping good video quality.
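The tolerance check built on Eq. 2.9 can be sketched as below. The 0.067 dB and 0.025 dB limits are the ‘foreman_qcif’ constraints stated above; the PSNR and bit-rate inputs in the usage lines are hypothetical numbers chosen for illustration, not measurements from Table 2.1, and the function names are assumptions.

```python
import math


def delta_psnr(psnr_pro, psnr_jm):
    """Delta PSNR of Eq. 2.9, in dB."""
    return abs(psnr_pro - psnr_jm)


def delta_bitr(bitr_pro, bitr_jm):
    """Delta BitR of Eq. 2.9: bit-rate difference on a dB scale."""
    return abs(10 * math.log10(bitr_pro) - 10 * math.log10(bitr_jm))


def within_tolerance(psnr_pro, psnr_jm, bitr_pro, bitr_jm, max_dpsnr, max_dbitr):
    """True when both tolerance constraints are met."""
    return (delta_psnr(psnr_pro, psnr_jm) <= max_dpsnr
            and delta_bitr(bitr_pro, bitr_jm) <= max_dbitr)


# Hypothetical operating points (PSNR in dB, bit-rate in kbit/s).
ok = within_tolerance(36.45, 36.50, 100.5, 100.0, 0.067, 0.025)
violated = within_tolerance(36.45, 36.50, 101.0, 100.0, 0.067, 0.025)
print(ok, violated)
```

On this logarithmic scale a 0.5% bit-rate increase corresponds to about 0.022 dB and passes the 0.025 dB limit, whereas a 1% increase (about 0.043 dB) does not.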

Table 2.1: Impact of $THR_G$ on sequences

<table>
<thead>
<tr>
<th rowspan="2">$THR_G$</th>
<th rowspan="2">QP</th>
<th colspan="2">$\Delta PSNR$ ($\times 10^{-2}$ dB)</th>
<th colspan="2">$\Delta BitR$ ($\times 10^{-2}$ dB)</th>
<th rowspan="2">$MET_R$ (%)</th>
</tr>
<tr>
<th>S1</th>
<th>S2</th>
<th>S1</th>
<th>S2</th>
</tr>
</thead>
<tbody>
<tr>
<td>160</td>
<td>20</td>
<td>5.5</td>
<td>1.9</td>
<td>1.5</td>
<td>0.6</td>
<td>56.10</td>
</tr>
<tr>
<td></td>
<td>24</td>
<td>3.5</td>
<td>0.8</td>
<td>0.5</td>
<td>0.2</td>
<td>52.25</td>
</tr>
<tr>
<td></td>
<td>28</td>
<td>3.5</td>
<td>1.2</td>
<td>2.5</td>
<td>0.6</td>
<td>48.17</td>
</tr>
<tr>
<td></td>
<td>32</td>
<td>1.5</td>
<td>1.6</td>
<td>0.3</td>
<td>0.4</td>
<td>44.41</td>
</tr>
<tr>
<td>200</td>
<td>20</td>
<td>6.3</td>
<td>1.8</td>
<td>2.5</td>
<td>1.4</td>
<td>57.32</td>
</tr>
<tr>
<td></td>
<td>24</td>
<td>4.6</td>
<td>2.4</td>
<td>1.1</td>
<td>0.7</td>
<td>54.36</td>
</tr>
<tr>
<td></td>
<td>28</td>
<td>3.5</td>
<td>1.7</td>
<td>2.5</td>
<td>0.4</td>
<td>50.45</td>
</tr>
<tr>
<td></td>
<td>32</td>
<td>2.0</td>
<td>0.8</td>
<td>0.9</td>
<td>1.9</td>
<td>45.83</td>
</tr>
<tr>
<td>240</td>
<td>20</td>
<td>*7.0</td>
<td>*3.2</td>
<td>*2.6</td>
<td>*1.6</td>
<td>60.18</td>
</tr>
<tr>
<td></td>
<td>24</td>
<td>5.2</td>
<td>2.7</td>
<td>0.2</td>
<td>0.7</td>
<td>56.89</td>
</tr>
<tr>
<td></td>
<td>28</td>
<td>5.0</td>
<td>1.9</td>
<td>1.4</td>
<td>0.5</td>
<td>53.15</td>
</tr>
<tr>
<td></td>
<td>32</td>
<td>2.8</td>
<td>0.6</td>
<td>2.0</td>
<td>0.3</td>
<td>49.44</td>
</tr>
<tr>
<td>280</td>
<td>20</td>
<td>*7.4</td>
<td>*3.8</td>
<td>*4.7</td>
<td>*3.1</td>
<td>62.62</td>
</tr>
<tr>
<td></td>
<td>24</td>
<td>*6.8</td>
<td>*3.8</td>
<td>*2.6</td>
<td>1.2</td>
<td>59.97</td>
</tr>
<tr>
<td></td>
<td>28</td>
<td>6.7</td>
<td>2.4</td>
<td>0.5</td>
<td>0.4</td>
<td>56.03</td>
</tr>
<tr>
<td></td>
<td>32</td>
<td>1.1</td>
<td>0.1</td>
<td>2.0</td>
<td>0.1</td>
<td>52.74</td>
</tr>
<tr>
<td>320</td>
<td>20</td>
<td>*8.4</td>
<td>*4.5</td>
<td>*4.8</td>
<td>*6.7</td>
<td>64.39</td>
</tr>
<tr>
<td></td>
<td>24</td>
<td>*7.4</td>
<td>*5.5</td>
<td>2.4</td>
<td>*4.8</td>
<td>62.03</td>
</tr>
<tr>
<td></td>
<td>28</td>
<td>*6.8</td>
<td>*3.8</td>
<td>0.6</td>
<td>*3.5</td>
<td>58.43</td>
</tr>
<tr>
<td></td>
<td>32</td>
<td>2.0</td>
<td>1.6</td>
<td>2.2</td>
<td>1.4</td>
<td>54.81</td>
</tr>
<tr>
<td>360</td>
<td>20</td>
<td>*8.7</td>
<td>*6.1</td>
<td>*6.3</td>
<td>0.117</td>
<td>65.62</td>
</tr>
<tr>
<td></td>
<td>24</td>
<td>*7.8</td>
<td>*7.7</td>
<td>*3.8</td>
<td>0.109</td>
<td>63.54</td>
</tr>
<tr>
<td></td>
<td>28</td>
<td>*9.0</td>
<td>*6.9</td>
<td>1.7</td>
<td>*6.6</td>
<td>59.96</td>
</tr>
<tr>
<td></td>
<td>32</td>
<td>*7.5</td>
<td>*6.6</td>
<td>1.1</td>
<td>1.2</td>
<td>56.19</td>
</tr>
</tbody>
</table>

S1: foreman_qcif, S2: mobile_qcif
Figure 2.9: Tolerance graph of ‘foreman’

2.2.4 Similarity-analysis based multiple reference frame elimination

In H.264/AVC based real-time encoding systems, the widely adopted ME algorithm is the full search algorithm, which provides regular access to memory, predictable control, and the optimal video quality [20]. In the full search algorithm, the sum of absolute differences (SAD) is used as the criterion to determine the best position on the reference frame plane. Considerable computational resources are clearly wasted, because only the MV with the minimum cost is stored while the other MVs are discarded at the end of the search process. This waste becomes more significant when the MRF algorithm is introduced. In fact, since many static parts exist in each sequence, the computation of all search positions is not always necessary. In this section, a statistical analysis of typical sequences is given and one similarity analysis (SA) based multiple reference frame elimination scheme is proposed.

To simplify the statistical analysis, I select ‘foreman_qcif’, ‘news_qcif’, ‘grandma_qcif’, and ‘container_qcif’ as four typical sequences and extract the final coding mode for a certain frame. Figure 2.10 shows the tracing result. The different sizes of black and white boxes overlaid on the images represent the different block modes chosen after rate distortion (RD) optimization. It is shown that, if a large region has a similar trend of motion, it is more likely to be coded with a large block size. In detail, for sequences such as ‘container_qcif’ and ‘grandma_qcif’, there are many temporally stationary background parts, which are mostly coded with large blocks. Rapidly moving parts such as the dancer in ‘news_qcif’ and the facial expression in ‘grandma_qcif’ are coded with small blocks. Although the lady’s suit in ‘news_qcif’ contains a large amount of edge information, it is also coded with large blocks because it is treated as stationary background. In the case of ‘foreman_qcif’, even though many background MBs exist in the sequence, many MBs are still coded with small blocks because of the facial expression and the dithering of the vidicon.

In JM software, the hardware-friendly full search algorithm is executed on different search positions. It adopts the spiral searching method, which searches from the center to the outside positions. Figure 2.11 shows an example of the spiral-order graph for the first 49 positions. The number in each circle represents the searching order. Position 0 is the motion vector predictor (MVP) point, which is calculated on the basis of neighboring blocks. The block matching process of ME starts from this position. The position that has the minimum cost (MV cost + SAD) is regarded as the best integer search position (BISP) and its corresponding MV is stored. From the previous analysis, it is known that for sequences with a large stationary part, the probability of selecting a big coding mode is very high. Therefore, if an MB with a stationary feature can be detected at an early stage, the ME computation can be reduced because splitting the MB into small modes and the MRF technique are both unnecessary for such an MB.

Figure 2.10: Coding block sizes of QCIF sequences

Figure 2.11: Spiral search order
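A center-outward search order of this kind can be generated ring by ring, as in the sketch below. The ordering inside each ring is an assumption made for illustration and may differ from the exact numbering in Fig. 2.11; the property that matters for the similarity analysis is that the first 9 positions form the central 3×3 area and the first 49 form the central 7×7 area. The function name `spiral_order` is hypothetical.

```python
def spiral_order(search_range):
    """Search offsets (dx, dy) from the centre outwards, ring by ring.

    Position 0 is the centre (the MVP point). The order within each
    ring is an assumed one and need not match Fig. 2.11 exactly.
    """
    order = [(0, 0)]
    for ring in range(1, search_range + 1):
        # Top and bottom edges of the ring, without the corners.
        for i in range(-ring + 1, ring):
            order.append((i, -ring))
            order.append((i, ring))
        # Left and right edges of the ring, including the corners.
        for i in range(-ring, ring + 1):
            order.append((-ring, i))
            order.append((ring, i))
    return order


first_49 = spiral_order(3)
central_9 = set(first_49[:9])
print(len(first_49), first_49[0], sorted(central_9))
```

Because every ring is completed before the next one starts, restricting the search to the first 9 or first 25 entries of the list is equivalent to shrinking the search window around the MVP.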

Figure 2.12 shows the ratio of MBs whose BISP falls on the MVP position (such MBs are called MVP_MBs). Here, I only show histograms for the P16×16 mode and for the first block (block_0) of the P16×8 and P8×16 modes. Since there is considerable similarity among the MRFs, only the distribution of MVP_MBs in the previous reference frame is given. The simulation conditions are listed in Table 2.2. First, note that the distribution in Fig. 2.12(a) is very similar to those in Fig. 2.12(b) and Fig. 2.12(c). This similarity also occurs among Fig. 2.12(d), Fig. 2.12(e), and Fig. 2.12(f), which means that the features of a sequence are similar among the P16×16, P16×8, and P8×16 modes. To reduce the computation complexity, I only focus on the P16×16 mode of the first reference frame in my algorithm. Second, for sequences such as ‘container_qcif/cif’, ‘grandma_qcif’, and ‘news_qcif/cif’, many MBs have their best position at the MVP point, which means that the initial MVP is of high accuracy. On the other hand, for sequences such as ‘football_qcif/cif’ and ‘canoa_cif’, the percentage of MVP_MBs is low. Thus, the accuracy of the MVP can reflect the characteristics of MBs in different sequences.

Figure 2.12: Number of MBs with BISP in MVP

Table 2.2: Simulation conditions for BISP on previous frame

<table>
<tbody>
<tr>
<td>Sequences</td>
<td>QCIF &amp; CIF</td>
<td>QP</td>
<td>20</td>
</tr>
<tr>
<td>Search Range</td>
<td>±16 &amp; ±24</td>
<td>Frames Encoded</td>
<td>200</td>
</tr>
<tr>
<td>etc</td>
<td colspan="3">no B Slice, CAVLC, 5 Reference Frames, RDO is ON, GOP is IPPP</td>
</tr>
</tbody>
</table>

Moreover, even though the current MB selects position 0 as the BISP on the previous reference frame, the final best mode may vary among the SKIP, P16×16, P16×8, P8×16, and P8×8 modes due to more accurate matching under other modes. Here, the inter search modes below 8×8 are also counted as the P8×8 mode when determining the final best mode in the H.264/AVC standard. Therefore, for MVP_MBs, the final best macroblock mode is traced for different quantization parameters (QPs). The experimental result is shown in Fig. 2.13. Here, I list the results of 3 QCIF and 3 CIF sequences as an example. The x axis represents the SAD value range at the MVP position, while the y axis is the percentage of MBs. Specifically, the histogram reflects the percentage of MVP_MBs under different QP and SAD values. For sequences with many stationary parts such as ‘container_qcif/cif’ and ‘news_qcif/cif’, many MBs select the MVP as the best search position when QP is small. With increasing SAD value at position 0, the ratio of MVP_MBs decreases; on the other hand, for a big QP, this ratio increases rapidly with the SAD value. In the case of sequences with a large amount of motion such as ‘football_qcif/cif’, the initial MVP is inaccurate and most MVP_MBs have a large SAD value. The curves overlaid on the histogram represent the ratio of MBs whose final coding mode is a big mode (SKIP, P16×16, P16×8, or P8×16), which means that the MBs are coded in a big mode with less MB splitting. Fig. 2.13 shows that the ratio of MVP_MBs whose final mode is a big mode decreases rapidly in the case of a small QP such as 16 or 20. In the case of a big QP, this ratio decreases slowly. In fact, for a big QP, after the quantization and reconstruction of reference frames, the reference pixels become more homogeneous with a considerable loss of high-frequency components, which leads to big coding modes after RD optimization.

\[
SA_{on\ ref_1}(SP_0 \text{ to } SP_8,\ P16 \times 16 \text{ mode}) = \begin{cases}
\text{BMMB}, & BISP = 0 \text{ and } SAD_{8\times8} \leq THR_{SAD} \\
\text{NMB}, & \text{otherwise}
\end{cases} \tag{2.10}
\]

On the basis of the above analysis, the ME and mode decision process can be sped up for sequences with many stationary parts. I use a threshold \(THR_{SAD}\) to indicate the degree of similarity of IME in the first reference frame \(\text{ref}_1\) and use it to guide the result of mode decision. To reduce the extra computation introduced into the ME process, I only focus on the P16×16 mode in my algorithm. The SA-based big-mode MB (BMMB) detection scheme is shown in Eq. 2.10. It means that during the IME process, the SA is performed on the 9 central positions of \(\text{ref}_1\) (the gray circles in Fig. 2.11). The MB is defined as a BMMB if its BISP among these 9 positions is position 0 and all four of its 8×8 sized SAD (\(SAD_{8\times8}\)) values are within \(THR_{SAD}\); otherwise it is treated as a normal MB (NMB). For a BMMB, the IME process is terminated early after IME of the P16×16, P16×8, and P8×16 modes for the 9 central positions of the previous frame, and only big modes are enabled during the mode decision stage. On the basis of experimental results, the threshold is defined according to the QP value. In detail, when QP is less than 24, \(THR_{SAD}\) is set to \(6 \times QP\); otherwise it is set to \(7 \times QP\).
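The detection rule of Eq. 2.10 can be sketched in a few lines. The sketch assumes that the costs (MV cost + SAD) of the P16×16 mode at the 9 central positions and the four 8×8 SADs at position 0 are already available from the IME engine; the function names and the input values in the usage lines are hypothetical. Ties in cost are resolved in favor of the lower position index, which is also an assumption.

```python
def thr_sad(qp):
    """Threshold from the text: 6*QP when QP < 24, otherwise 7*QP."""
    return 6 * qp if qp < 24 else 7 * qp


def similarity_analysis(costs_9, sad8x8_at_0, qp):
    """Classify an MB as big-mode MB ('BMMB', as in Eq. 2.10) or 'NMB'.

    costs_9:     cost (MV cost + SAD) of P16x16 at central positions 0..8
                 on the first reference frame.
    sad8x8_at_0: the four 8x8 SAD values at position 0.
    """
    # min() returns the first minimum, so position 0 wins ties (assumed).
    bisp = min(range(9), key=lambda k: costs_9[k])
    if bisp == 0 and all(s <= thr_sad(qp) for s in sad8x8_at_0):
        return "BMMB"
    return "NMB"


costs = [400, 520, 515, 530, 540, 600, 610, 590, 605]  # hypothetical costs
stationary = similarity_analysis(costs, [100, 90, 110, 100], qp=20)
textured = similarity_analysis(costs, [100, 90, 130, 100], qp=20)
moving = similarity_analysis([400, 520, 515, 390, 540, 600, 610, 590, 605],
                             [100, 90, 110, 100], qp=20)
print(stationary, textured, moving)
```

Only the first case satisfies both conditions: in the second one an 8×8 SAD exceeds the threshold of 120, and in the third the best position is not position 0.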

### 2.3 Hardware oriented search range adjustment

In H.264/AVC motion estimation, the search range is another important factor that greatly influences the computation complexity. For example, once the search range (SR) is decided, the number of search points can be calculated by Eq. 2.11, where $SP_{num}$ is the number of search points and $SR$ is the search range. So, when $SR$ equals 24, $SP_{num}$ becomes 2401, which is quite a large number for a hardware engine. Therefore, a hardware oriented search range adjustment scheme is needed.

$$SP_{num} = (2SR + 1) \times (2SR + 1) \quad (2.11)$$
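
As a quick numerical check of Eq. 2.11, the following sketch evaluates $SP_{num}$ for a few search ranges. The function name and the list of ranges are chosen here only for illustration.

```python
def search_points(sr: int) -> int:
    """Eq. 2.11: number of integer search points for a search range of +/- sr."""
    if sr < 0:
        raise ValueError("search range must be non-negative")
    return (2 * sr + 1) ** 2


for sr in (4, 8, 16, 24):
    print(f"SR = {sr:2d} -> SP_num = {search_points(sr)}")
```

For $SR = 16$ the count is 1089, which matches the position indices 0 to 1088 used in Table 2.4.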

#### 2.3.1 Motion feature based search range adjustment

In H.264/AVC based encoding system, different sequences have different features; a large SR is not necessary for all sequences. Figure 2.14 shows two RD curves under different SR. It is shown that changing the SR does not cause significant video quality loss in ‘foreman QCIF’. On the other hand, the quality degradation in the case of ‘football QCIF’ is very obvious, which means that a big SR is necessary for ‘football QCIF’.

For MBs with different motion (small or large), complexity reduction can be achieved through motion feature analysis. Since different types of sequences may have different best integer search point (BISP) distributions, I trace the BISP result on each reference frame under the $16 \times 16$ mode, as shown in Table 2.4. The simulation conditions are shown in Table 2.3.

Firstly, the BISP distribution of the same sequence shows the same motion feature across different reference frames. For example, in ‘container QCIF’, the BISP distribution in the first reference frame shows that many BISPs are located within the central 25 positions, and the BISP distributions in the other four reference frames are almost the same. So the BISP distribution in the first reference frame can represent the motion feature of the MB, and I only consider the first reference frame in my search range adjustment scheme.

Secondly, Table 2.4 shows that the BISP distribution of ‘football_qcif’ differs from that of the other four sequences. Far more of its BISPs are located between the 169th and 1088th positions, which shows its large motion trend. It also implies that the initial motion vector predictor (MVP) is far from accurate for ‘football_qcif’.

Thirdly, for all sequences except ‘football_qcif’, a large proportion of BISPs are located within the inner 25 positions, which shows a small motion trend. Compared with ‘container_qcif’ and ‘news_qcif’, the proportion of BISPs located at position 0 is much smaller in ‘foreman_qcif’ and ‘carphone_qcif’. It means that there are many static background MBs in ‘container_qcif’ and ‘news_qcif’, and that the MVP is highly accurate for motion estimation in these sequences.

Therefore, the statistical analysis of typical sequences shows that the BISP location can reflect the motion feature of the MB. For an MB with big motion, a large search range is necessary to keep the overall best search point within the available search range. On the other hand, for an MB which shows static or small motion, many redundant

Figure 2.14: Impact of search range on video quality

Table 2.3: Simulation conditions for BISP on five reference frames

<table>
<thead>
<tr>
<th>Condition</th>
<th>Setting</th>
</tr>
</thead>
<tbody>
<tr>
<td>QP</td>
<td>24</td>
</tr>
<tr>
<td>Search Range</td>
<td>± 16</td>
</tr>
<tr>
<td>Frames Encoded</td>
<td>100</td>
</tr>
<tr>
<td>Sequences</td>
<td>QCIF, etc</td>
</tr>
<tr>
<td>Others</td>
<td>no B Slice, CAVLC, 5 Reference Frames, RDO is ON, GOP is IPPP</td>
</tr>
</tbody>
</table>

#### Table 2.4: BISP Distribution on 1st to 5th Reference Frame

<table>
<thead>
<tr>
<th>BISP</th>
<th>IME on 1st Reference Frame</th>
<th>IME on 2nd Reference Frame</th>
<th>IME on 3rd Reference Frame</th>
<th>IME on 4th Reference Frame</th>
<th>IME on 5th Reference Frame</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Seq 1</td>
<td>Seq 2</td>
<td>Seq 3</td>
<td>Seq 4</td>
<td>Seq 5</td>
</tr>
<tr>
<td>0</td>
<td>4961</td>
<td>5556</td>
<td>9434</td>
<td>9263</td>
<td>2040</td>
</tr>
<tr>
<td>1~8</td>
<td>3749</td>
<td>3095</td>
<td>159</td>
<td>433</td>
<td>2996</td>
</tr>
<tr>
<td>9~24</td>
<td>564</td>
<td>436</td>
<td>38</td>
<td>34</td>
<td>1090</td>
</tr>
<tr>
<td>25~48</td>
<td>204</td>
<td>235</td>
<td>64</td>
<td>17</td>
<td>706</td>
</tr>
<tr>
<td>49~80</td>
<td>100</td>
<td>107</td>
<td>7</td>
<td>10</td>
<td>454</td>
</tr>
<tr>
<td>81~120</td>
<td>53</td>
<td>74</td>
<td>7</td>
<td>6</td>
<td>388</td>
</tr>
<tr>
<td>121~168</td>
<td>43</td>
<td>71</td>
<td>5</td>
<td>4</td>
<td>321</td>
</tr>
<tr>
<td>169~1088</td>
<td>127</td>
<td>227</td>
<td>87</td>
<td>34</td>
<td>1806</td>
</tr>
</tbody>
</table>

Seq 1: foreman, Seq 2: carphone, Seq 3: container
Seq 4: news, Seq 5: football

is given in Eq. 2.12. After IME on the \( m \)th reference frame (\( ref_m \)), the BISP of this frame (\( BISP(m) \)) in the 16×16 mode is analyzed. If it lies between the \( SP_{num} \) values of \( SR_i \) and \( SR_{i+1} \), the \( SR \) for the \((m+1)\)th reference frame (\( SR(m+1) \)) is changed to \( SR_{i+2} \). If \( BISP(m) \) exceeds the \( SP_{num} \) of \( SR_5 \) in Fig. 2.15, the original JM SR (\( SR_{jm} \)) is used for the next ME process. The proposed scheme thus adaptively shrinks the \( SR \) for small-motion MBs, while a large \( SR \) remains available for normal or big-motion MBs to keep the best MV.

\[
SR(m+1) = \begin{cases}
SR_{i+2}, & (2SR_i + 1)^2 \leq BISP(m) < (2SR_{i+1} + 1)^2, \; i \in [0, 4] \\
SR_{jm}, & BISP(m) > (2SR_5 + 1)^2
\end{cases} \quad (2.12)
\]
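
The recursive rule of Eq. 2.12 can be illustrated with the sketch below. Several things are assumptions made only for this example: the candidate ranges $SR_0$ to $SR_6$ are defined in Fig. 2.15 and are replaced here by a hypothetical ladder, the function name is invented, and any case that Eq. 2.12 leaves open falls back to the JM search range.

```python
from typing import Sequence


def next_search_range(bisp: int, ladder: Sequence[int], sr_jm: int) -> int:
    """Search range for reference frame m+1, given BISP(m), following Eq. 2.12.

    ladder holds SR_0 .. SR_6; its values are hypothetical placeholders.
    """
    if len(ladder) != 7:
        raise ValueError("ladder must hold SR_0 .. SR_6")
    for i in range(5):  # i in [0, 4]
        low = (2 * ladder[i] + 1) ** 2
        high = (2 * ladder[i + 1] + 1) ** 2
        if low <= bisp < high:
            return ladder[i + 2]
    # BISP(m) beyond the SP_num of SR_5, or a case Eq. 2.12 does not cover:
    # keep the original JM search range (conservative assumption).
    return sr_jm


LADDER = [0, 1, 2, 3, 4, 5, 6]  # hypothetical stand-in for Fig. 2.15

for bisp in (5, 30, 100, 500):
    print(f"BISP(m) = {bisp:3d} -> SR(m+1) = {next_search_range(bisp, LADDER, 16)}")
```

With this ladder, a BISP index that falls inside the $SP_{num}$ of a small range selects a range two steps larger for the next reference frame, which leaves some margin around the position found so far.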

### 2.4 Pixel difference based adaptive sub-sampling

In hardware applications, sub-sampling is widely used to reduce computation complexity and achieve a compact hardware architecture. The concept of sub-sampling in ME is to use part of the pixels to represent the whole MB so that computation reduction can be
achieved. The design in [27] also adopts a direct half sub-sampling technique and saves 50% of the computation. However, sub-sampling also introduces video quality loss, because further sampling of the pixels intensifies the aliasing problem [21] caused by the video sensor.

When sub-sampling is applied to \( s_t(x_n) \), its Fourier transform \( S_t(j\omega_x) \) becomes \( Y_t(j\omega_x) \), as shown in Eq. 2.13. It means that the original ideal cut-off frequency of the band-limiting low-pass filter is extended, which results in further overlapping of frequency components, so the inevitable aliasing problem becomes even worse. Direct sub-sampling in both the horizontal and vertical directions (quarter sub-sampling) is therefore a very risky decision. Another conclusion from Eq. 2.13 is that the degree of aliasing may vary greatly with \( \omega_x \). Since the IME engine handles the image MB by MB, the frequency feature of each MB determines the result of direct sub-sampling.

\[
Y_t(j\omega_x) = \frac{1}{2} \left[ S_t\left(j\frac{\omega_x}{2}\right) + S_t\left(j\left(\frac{\omega_x}{2} - \pi\right)\right) \right] \quad (2.13)
\]
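
The folding described by Eq. 2.13 can be checked numerically on a short pixel row. The sketch below uses the discrete form of the relation, in which bin $k$ of the sub-sampled spectrum is the half-sum of bins $k$ and $k + N/2$ of the original spectrum. The test signal, its length and all names are assumptions chosen only for illustration.

```python
import cmath
import math


def dft(x):
    """Plain O(N^2) discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


N = 16
# Hypothetical pixel row: one tone at bin 3 and a weaker tone at bin 6.
s = [math.cos(2 * math.pi * 3 * t / N) + 0.5 * math.cos(2 * math.pi * 6 * t / N)
     for t in range(N)]
y = s[::2]  # direct half sub-sampling: keep every second pixel

S = dft(s)
Y = dft(y)

# Discrete form of Eq. 2.13: bin k of Y mixes bins k and k + N/2 of S.
predicted = [0.5 * (S[k] + S[k + N // 2]) for k in range(N // 2)]
max_error = max(abs(a - b) for a, b in zip(Y, predicted))

print(f"max deviation from Eq. 2.13: {max_error:.2e}")
print(f"|S[2]| = {abs(S[2]):.3f}, |Y[2]| = {abs(Y[2]):.3f}")
```

The original row has no energy at bin 2, but the sub-sampled row does: the tone at bin 6 lies above the new Nyquist limit and folds back onto a lower frequency, which is the aliasing that makes direct sub-sampling risky for MBs rich in high frequencies.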

The difference among the pixels is a direct reflection of the spread of the frequency spectrum. With the frequency feature of the current MB, the block matching process, which is usually based on SAD (sum of absolute differences) calculation in hardware [28], can be simplified. For example, if the pixel values within one MB are close to each other, then this MB is homogeneous and half or quarter sub-sampling can be adopted to reduce computation. On the other hand, if big pixel differences occur within one MB, then the MB contains many high-frequency components, so the full pixel pattern has to be used for block matching to ensure precise estimation.

Figure 2.16 compares video quality based on the JM software [29]. It shows that direct quarter sub-sampling (ds) on ‘foreman_qcif’ causes about 0.3 dB of video quality loss on average compared with the hardware friendly full search algorithm. In the case of ‘container_qcif’, the quality degradation is negligible. Observing these two sequences, it is also obvious that the MBs in ‘container_qcif’ are much more homogeneous. In detail, the pixels in the MBs of ‘container_qcif’ are so similar to each other that sub-sampling does not greatly influence the block matching process

![Graph showing PSNR (dB) vs Bit Rate (kbps) for different sequences: foreman_qcif_jm, foreman_qcif_ds, container_qcif_jm, container_qcif_ds.]

Figure 2.16: Impact of direct sub-sampling

of these MBs. Thus, classifying MBs into sub-sampling sensitive MBs and sub-sampling insensitive MBs is of great importance.

In this dissertation, I use pixel difference analysis to obtain the feature of an MB in a hardware friendly way. Figure 2.17 shows three hardware friendly sub-sampling patterns. Pattern 1 is the quarter sub-sampling pattern, which uses one pixel (black point) to represent its three neighboring pixels. Pattern 2 and pattern 3 are the horizontal and vertical half sub-sampling patterns; each pixel in these two patterns represents its horizontal or vertical neighboring pixel respectively. The pixel difference analysis method is shown in Eq. 2.14 and Eq. 2.15, where $P(i,j)$ is the pixel value at position $(i,j)$. While the current MB is loaded, its horizontal pixel difference ($PD_h(i,j)$) and vertical pixel difference ($PD_v(i,j)$) are examined. Since only the horizontal, vertical, and diagonal neighbors of a pixel are used to obtain the horizontal and vertical pixel differences, the extra computation is small. Figure 2.18 is the flow chart of the proposed adaptive sub-sampling method. If $PD_h(i,j)$ at every position within one MB is smaller than the threshold $THR_{PD}$, then horizontal half sub-sampling (pattern 2 in Fig. 2.17) is applied to this MB. In this way, pixel information in the vertical direction is preserved. On the other

![Pattern 1 Pattern 2 Pattern 3](image)

Figure 2.17: Three sub-sampling patterns

hand, horizontal pixel information is kept by using vertical half sub-sampling (pattern 3 in Fig. 2.17) if each \( PD_v(i,j) \) is within the threshold \( THR_{PD} \). When both \( PD_h(i,j) \) and \( PD_v(i,j) \) at every position are within \( THR_{PD} \), quarter sub-sampling (pattern 1 in Fig. 2.17) is applied to this MB. After exhaustive experiments, I set \( THR_{PD} \) to \( 4 \times QP \), where QP is the quantization parameter, so that the computation reduction adapts to different QP values.

\[
PD_h(i,j) = \left| P(i,j) + P(i,j+1) - P(i+1,j) - P(i+1,j+1) \right| \quad (2.14)
\]

\[
PD_v(i,j) = \left| P(i,j) + P(i+1,j) - P(i,j+1) - P(i+1,j+1) \right| \quad (2.15)
\]
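
The pattern selection built on Eq. 2.14 and Eq. 2.15 can be sketched as follows. This is an illustration under assumptions rather than the exact flow of Fig. 2.18: $P(i,j)$ is read as column $i$ and row $j$ of the MB, the differences are evaluated at every position that has a right and a lower neighbor, values equal to the threshold count as within it, and the function and pattern names are hypothetical.

```python
from typing import Sequence

Block = Sequence[Sequence[int]]


def pd_h(p: Block, i: int, j: int) -> int:
    """Eq. 2.14, with P(i, j) read as p[j][i] (column i, row j)."""
    return abs(p[j][i] + p[j + 1][i] - p[j][i + 1] - p[j + 1][i + 1])


def pd_v(p: Block, i: int, j: int) -> int:
    """Eq. 2.15, with the same indexing convention."""
    return abs(p[j][i] + p[j][i + 1] - p[j + 1][i] - p[j + 1][i + 1])


def select_pattern(mb: Block, qp: int) -> str:
    """Choose the sub-sampling pattern of one MB with THR_PD = 4 * QP."""
    thr_pd = 4 * qp
    size = len(mb)
    positions = [(i, j) for j in range(size - 1) for i in range(size - 1)]
    h_ok = all(pd_h(mb, i, j) <= thr_pd for i, j in positions)
    v_ok = all(pd_v(mb, i, j) <= thr_pd for i, j in positions)
    if h_ok and v_ok:
        return "quarter"          # pattern 1
    if h_ok:
        return "horizontal_half"  # pattern 2
    if v_ok:
        return "vertical_half"    # pattern 3
    return "full"                 # no sub-sampling


# Hypothetical 16x16 test blocks.
flat = [[128] * 16 for _ in range(16)]
column_stripes = [[200 * (i % 2) for i in range(16)] for _ in range(16)]
row_stripes = [[200 * (j % 2)] * 16 for j in range(16)]

for name, mb in (("flat", flat), ("column stripes", column_stripes),
                 ("row stripes", row_stripes)):
    print(f"{name}: {select_pattern(mb, qp=28)}")
```

The block with strong column-to-column variation keeps every column and drops rows, while the block with strong row-to-row variation does the opposite, which agrees with the description of patterns 2 and 3 above.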

### 2.5 Experiments, comparison and analysis

To verify the effectiveness of the proposed fast ME algorithm, I combine all the schemes and apply the algorithm to 8 QCIF, 8 CIF and 4 HDTV 720p sequences using the JM 11.0 software. The QP values are 20, 24, 28, and 32. Since my algorithm targets complexity reduction for hardware, the comparison is first made against the algorithm adopted in existing hardware engines. For an H.264/AVC hardwired engine, factors such as data reuse, hardware utilization and predictable control are critical to the design. The most widely adopted algorithm for motion estimation engines is the full search algorithm [17][18]. So, I first implement my schemes in the JM full search algorithm. The simulation conditions are shown as follows.

In my proposed algorithm, the ME sub-sampling pattern is determined by pixel difference analysis. Table 2.5 shows the ratio of MBs that are classified as homogeneous MBs, to which adaptive sub-sampling is applied. As an example, I only give the pixel difference analysis result for MBs whose \( PD_v \) and \( PD_h \) are both within the threshold, which means that quarter sub-sampling is applied to these MBs. Table 2.5 shows that for sequences rich in high frequencies, such as ‘mobile_cif’ and ‘tempete_qcif’, the homogeneous MB ratio is not high (10.50% on average). However, for sequences like ‘grandma_qcif’, ‘container_cif’, ‘city_720p’ and ‘crew_720p’, many homogeneous MBs exist (74.98% on average), so the adaptive sub-sampling scheme contributes much to these sequences.

Secondly, the MRF elimination ratio is shown in Table 2.6. It is shown that for

Table 2.5: Homo MB Ratio (%) for 1/4 Subsampling

<table>
<thead>
<tr>
<th>QP</th>
<th>20</th>
<th>24</th>
<th>28</th>
<th>32</th>
<th>Average</th>
</tr>
</thead>
<tbody>
<tr>
<td>foreman qcif</td>
<td>31.28</td>
<td>36.99</td>
<td>48.38</td>
<td>66.86</td>
<td>45.88</td>
</tr>
<tr>
<td>mobile qcif</td>
<td>5.58</td>
<td>8.97</td>
<td>16.33</td>
<td>25.36</td>
<td>11.72</td>
</tr>
<tr>
<td>container qcif</td>
<td>38.80</td>
<td>40.23</td>
<td>41.42</td>
<td>42.30</td>
<td>40.69</td>
</tr>
<tr>
<td>grandma qcif</td>
<td>64.42</td>
<td>70.84</td>
<td>76.89</td>
<td>82.56</td>
<td>56.38</td>
</tr>
<tr>
<td>news qcif</td>
<td>33.35</td>
<td>38.82</td>
<td>44.06</td>
<td>48.28</td>
<td>41.13</td>
</tr>
<tr>
<td>tempete qcif</td>
<td>6.65</td>
<td>9.38</td>
<td>13.40</td>
<td>18.97</td>
<td>12.10</td>
</tr>
<tr>
<td>coastguard qcif</td>
<td>29.02</td>
<td>38.42</td>
<td>45.18</td>
<td>51.44</td>
<td>41.02</td>
</tr>
<tr>
<td>carphone qcif</td>
<td>44.53</td>
<td>49.63</td>
<td>54.02</td>
<td>62.06</td>
<td>52.56</td>
</tr>
<tr>
<td>stefan cif</td>
<td>25.15</td>
<td>28.75</td>
<td>33.22</td>
<td>39.43</td>
<td>31.64</td>
</tr>
<tr>
<td>mobile cif</td>
<td>5.63</td>
<td>7.55</td>
<td>10.07</td>
<td>12.40</td>
<td>8.91</td>
</tr>
<tr>
<td>football cif</td>
<td>63.94</td>
<td>69.02</td>
<td>74.68</td>
<td>82.55</td>
<td>72.55</td>
</tr>
<tr>
<td>container cif</td>
<td>52.94</td>
<td>55.67</td>
<td>56.49</td>
<td>57.77</td>
<td>55.72</td>
</tr>
<tr>
<td>news cif</td>
<td>57.99</td>
<td>63.47</td>
<td>68.87</td>
<td>73.12</td>
<td>65.86</td>
</tr>
<tr>
<td>tempete cif</td>
<td>22.20</td>
<td>29.77</td>
<td>38.80</td>
<td>49.40</td>
<td>35.04</td>
</tr>
<tr>
<td>coastguard cif</td>
<td>38.68</td>
<td>49.36</td>
<td>58.75</td>
<td>66.03</td>
<td>53.20</td>
</tr>
<tr>
<td>paris cif</td>
<td>23.92</td>
<td>26.14</td>
<td>28.71</td>
<td>31.34</td>
<td>27.53</td>
</tr>
<tr>
<td>parkrun 720p</td>
<td>24.24</td>
<td>32.93</td>
<td>39.37</td>
<td>44.68</td>
<td>49.66</td>
</tr>
<tr>
<td>mobcal 720p</td>
<td>35.98</td>
<td>44.59</td>
<td>54.27</td>
<td>63.32</td>
<td>49.54</td>
</tr>
<tr>
<td>city 720p</td>
<td>61.89</td>
<td>73.85</td>
<td>82.04</td>
<td>87.90</td>
<td>76.42</td>
</tr>
<tr>
<td>harbor 720p</td>
<td>50.58</td>
<td>64.33</td>
<td>76.16</td>
<td>85.42</td>
<td>69.12</td>
</tr>
</tbody>
</table>

sequences with a large proportion of static content, such as ‘container_qcif/cif’, ‘grandma_qcif’ and ‘news_qcif/cif’ (55.08% on average), much complexity can be eliminated by our MRF elimination algorithm. In the case of ‘mobile_qcif/cif’ and ‘tempete_qcif/cif’, the ratio is very small (7.30% on average) because motion on the edge-rich background greatly degrades our algorithm.

Thirdly, I also test the search range reduction scheme on different sequences individually. The experimental results are shown in Table 2.7. Here, I only list the ratio of small

<table>
<thead>
<tr>
<th>QP</th>
<th>20</th>
<th>24</th>
<th>28</th>
<th>32</th>
<th>Average</th>
</tr>
</thead>
<tbody>
<tr>
<td>foreman_qcif</td>
<td>12.25</td>
<td>16.63</td>
<td>18.95</td>
<td>23.26</td>
<td>17.77</td>
</tr>
<tr>
<td>mobile_qcif</td>
<td>3.13</td>
<td>4.66</td>
<td>5.64</td>
<td>6.43</td>
<td>4.97</td>
</tr>
<tr>
<td>container_qcif</td>
<td>50.01</td>
<td>51.14</td>
<td>50.91</td>
<td>50.53</td>
<td>50.65</td>
</tr>
<tr>
<td>grandma_qcif</td>
<td>57.20</td>
<td>61.07</td>
<td>56.15</td>
<td>58.74</td>
<td>58.29</td>
</tr>
<tr>
<td>news_qcif</td>
<td>54.67</td>
<td>56.57</td>
<td>42.62</td>
<td>35.41</td>
<td>47.32</td>
</tr>
<tr>
<td>tempete_qcif</td>
<td>4.62</td>
<td>5.42</td>
<td>5.64</td>
<td>6.51</td>
<td>5.55</td>
</tr>
<tr>
<td>coastguard_qcif</td>
<td>14.43</td>
<td>21.62</td>
<td>28.53</td>
<td>33.53</td>
<td>24.53</td>
</tr>
<tr>
<td>carphone_qcif</td>
<td>33.43</td>
<td>38.85</td>
<td>40.04</td>
<td>40.71</td>
<td>38.26</td>
</tr>
<tr>
<td>stefan_cif</td>
<td>20.63</td>
<td>22.03</td>
<td>22.58</td>
<td>23.67</td>
<td>22.23</td>
</tr>
<tr>
<td>mobile_cif</td>
<td>3.35</td>
<td>3.94</td>
<td>4.30</td>
<td>4.90</td>
<td>4.12</td>
</tr>
<tr>
<td>football_cif</td>
<td>40.88</td>
<td>47.77</td>
<td>52.56</td>
<td>56.28</td>
<td>49.37</td>
</tr>
<tr>
<td>container_cif</td>
<td>55.93</td>
<td>58.25</td>
<td>56.78</td>
<td>56.45</td>
<td>56.85</td>
</tr>
<tr>
<td>news_cif</td>
<td>64.54</td>
<td>67.05</td>
<td>60.89</td>
<td>56.80</td>
<td>62.32</td>
</tr>
<tr>
<td>tempete_cif</td>
<td>11.59</td>
<td>13.48</td>
<td>15.37</td>
<td>17.92</td>
<td>14.59</td>
</tr>
<tr>
<td>coastguard_cif</td>
<td>14.43</td>
<td>21.62</td>
<td>28.53</td>
<td>33.53</td>
<td>24.53</td>
</tr>
<tr>
<td>paris_cif</td>
<td>34.57</td>
<td>34.31</td>
<td>27.91</td>
<td>26.48</td>
<td>30.82</td>
</tr>
<tr>
<td>parkrun_720p</td>
<td>11.02</td>
<td>17.64</td>
<td>23.86</td>
<td>29.15</td>
<td>20.42</td>
</tr>
<tr>
<td>mobcal_720p</td>
<td>21.93</td>
<td>24.94</td>
<td>28.47</td>
<td>31.60</td>
<td>26.74</td>
</tr>
<tr>
<td>city_720p</td>
<td>30.73</td>
<td>38.38</td>
<td>45.57</td>
<td>58.90</td>
<td>43.40</td>
</tr>
<tr>
<td>harbor_720p</td>
<td>11.76</td>
<td>17.64</td>
<td>24.83</td>
<td>32.79</td>
<td>21.76</td>
</tr>
</tbody>
</table>

Table 2.6: Ratio (%) of MB with MRF Elimination

motion MBs, which means that MBs adopting the recursive search range adjustment are not included. It is shown that for most small motion sequences, such as ‘news_qcif/cif’, ‘mobile_qcif’, ‘coastguard_qcif/cif’, ‘paris_cif’ and ‘harbor_720p’, about 97.71% of MBs adopt search range adjustment through our motion feature analysis. For sequences such as ‘foreman_qcif’ and ‘carphone_qcif’, the ratio decreases slightly (92.11% on average), because the motion in these sequences is a little more severe than in the former sequences. In the case of ‘football_cif’ and ‘stefan_cif’, since there are many large motion MBs, our motion

Table 2.7: Ratio (%) of MB with Small Range Constraint

<table>
<thead>
<tr>
<th>QP</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>20</td>
<td>94.51</td>
<td>99.41</td>
<td>98.21</td>
<td>96.73</td>
<td>98.49</td>
<td>96.33</td>
<td>99.17</td>
<td>91.22</td>
<td>80.46</td>
<td>96.18</td>
<td>67.24</td>
<td>98.32</td>
<td>98.49</td>
<td>96.33</td>
<td>99.17</td>
<td>98.36</td>
<td>95.36</td>
<td>93.69</td>
<td>97.03</td>
<td>97.92</td>
</tr>
<tr>
<td>24</td>
<td>93.92</td>
<td>99.27</td>
<td>96.02</td>
<td>97.09</td>
<td>96.96</td>
<td>96.10</td>
<td>98.77</td>
<td>91.89</td>
<td>82.80</td>
<td>96.19</td>
<td>67.93</td>
<td>96.78</td>
<td>96.96</td>
<td>96.11</td>
<td>97.19</td>
<td>97.89</td>
<td>96.23</td>
<td>95.78</td>
<td>97.05</td>
<td>97.95</td>
</tr>
<tr>
<td>28</td>
<td>92.94</td>
<td>99.26</td>
<td>93.55</td>
<td>96.67</td>
<td>97.87</td>
<td>96.11</td>
<td>97.19</td>
<td>91.37</td>
<td>84.39</td>
<td>96.25</td>
<td>64.75</td>
<td>95.04</td>
<td>97.87</td>
<td>95.93</td>
<td>93.95</td>
<td>97.19</td>
<td>97.07</td>
<td>96.58</td>
<td>97.67</td>
<td>97.85</td>
</tr>
<tr>
<td>32</td>
<td>91.31</td>
<td>98.44</td>
<td>91.48</td>
<td>96.19</td>
<td>96.42</td>
<td>95.93</td>
<td>93.95</td>
<td>89.66</td>
<td>85.37</td>
<td>96.44</td>
<td>64.75</td>
<td>94.01</td>
<td>96.42</td>
<td>95.93</td>
<td>93.95</td>
<td>96.61</td>
<td>97.67</td>
<td>96.70</td>
<td>96.06</td>
<td>97.54</td>
</tr>
<tr>
<td>Average</td>
<td>93.17</td>
<td>99.10</td>
<td>94.82</td>
<td>96.67</td>
<td>97.44</td>
<td>96.12</td>
<td>97.27</td>
<td>91.04</td>
<td>83.26</td>
<td>96.27</td>
<td>66.80</td>
<td>96.04</td>
<td>97.44</td>
<td>96.12</td>
<td>97.27</td>
<td>97.66</td>
<td>96.58</td>
<td>95.69</td>
<td>96.73</td>
<td>97.82</td>
</tr>
</tbody>
</table>

feature analysis scheme, which is based on the accuracy of the MVP, reflects the necessity of a large search range for these sequences. So the average ratio of MBs that adopt search range adjustment in these sequences is only about 75.03%.

The overall fast ME algorithm, which combines the adaptive sub-sampling, MRF elimination and search range adjustment schemes, is then tested. Figure 2.19 and Fig. 2.20 compare the rate distortion (RD) curves of the proposed algorithm and the JM full search algorithm, which has the best video quality. Since the difference between the two RD curves is very small, I use BDBR (Bjøntegaard Delta Bit Rate) and BDPSNR (Bjøntegaard Delta PSNR) [30], which are respectively the average differences in bit rate and PSNR between the curves of the original and proposed algorithms, to evaluate video quality. A (+) sign in BDBR represents a bit-rate increase, and a (−) sign in BDPSNR indicates quality degradation. The BDBR and BDPSNR of each sequence are listed in Table 2.8. The maximum BDBR and BDPSNR differences among all sequences appear in ‘stefan_cif’, at about +1.561% BDBR and −0.224 dB BDPSNR. On average, the quality degradation and bit-rate increase are very small compared with the original full search algorithm.

The ME time reduction ($MET_R$) under each QP is calculated with Eq. 2.16, where $MET_{JM}$ and $MET_{pro}$ represent the ME time of the original JM full search algorithm and the proposed algorithm respectively. The experimental results are shown in Table 2.9. With the proposed fast ME algorithm, the ME time is reduced by 83.69% to 95.72% compared with the full search algorithm. On average, the proposed hardware oriented algorithm achieves an 88.53% reduction of ME time over all these sequences.

\[ MET_R = \frac{MET_{JM} - MET_{pro}}{MET_{JM}} \times 100 \quad (2.16) \]
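
Eq. 2.16 is a relative saving expressed in percent. A minimal sketch is given below; the function name and the timing values are hypothetical and are not measurements from this work.

```python
def me_time_reduction(met_jm: float, met_pro: float) -> float:
    """Eq. 2.16: ME time reduction in percent relative to JM full search."""
    if met_jm <= 0:
        raise ValueError("MET_JM must be positive")
    return (met_jm - met_pro) / met_jm * 100.0


# Hypothetical ME times in seconds, for illustration only.
print(f"{me_time_reduction(100.0, 12.5):.2f}%")
```

A reduction of 87.5% corresponds to an ME process that runs eight times faster than the reference, so the percentages in Table 2.9 can also be read as speedup factors.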

Furthermore, the proposed schemes are orthogonal to other software oriented fast algorithms and can be combined with them to achieve further complexity reduction. Instead of making comparisons with various software oriented algorithms, which are either impractical or inefficient for a hardware flow, I only focus on UMHexagon search [13], which is well known among software oriented algorithms and already adopted by the JM software. The UMHexagon method [13] is superior in speeding up the ME process and can achieve almost the same video quality as the full search algorithm. Here, the proposed algorithm is embedded into UMHexagon search to show its impact. The pixel difference analysis determines the number of block matching pixels for each search. The MRF elimination algorithm works together with UMHexagon’s early termination. For the search range adjustment scheme, I combine my algorithm with the dynamic search range algorithm in UMHexagon search, as shown in Eq. 2.17. Specifically, after IME on the first reference frame, if the x and y coordinates of the 16×16 motion vector are both within

Figure 2.19: Comparison of QCIF and CIF RD Curves

±1/8 $SR_{JM}$, then the current MB is defined as a small motion MB and the search range ($pro_{SR}$) adjustment scheme is applied to the following ME process (the SR of the remaining ME process is set to ±1/8 $SR_{JM}$); otherwise, the original dynamic search range ($dynamic_{SR}$) scheme

Figure 2.20: Comparison of 720p RD Curves

is used. Since a dynamic search range already exists in UMHexagon search, the recursive search range adjustment scheme is disabled. Under the same simulation conditions described above, the video quality comparison between the proposed algorithm and UMHexagon is given in Table 2.10. The speedup ratio $\gamma$ is defined as $MET_{UMHS}/MET_{pro}$, where $MET_{UMHS}$ is the ME time consumed by UMHexagon search. Table 2.11 lists the speedup ratio under the four QPs. The proposed algorithm keeps almost the same video quality as UMHexagon search (the worst-case BDBR and BDPSNR are $+1.554\%$ in football_cif and $-0.114$ dB in carphone_qcif) while achieving a speedup ratio of up to 2.73 over this fast algorithm among all these sequences.

$$SR = \begin{cases} 
  pro_{SR}, & \text{small motion MB} \\
  dynamic_{SR}, & \text{otherwise} 
\end{cases} \quad (2.17)$$
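
The selection in Eq. 2.17 can be sketched as below. The assumptions are that the motion vector is given in integer-pixel units, that $SR_{JM}/8$ is taken with integer division, and that the dynamic search range computed by UMHexagon search is passed in as a plain number; all names are hypothetical.

```python
def select_search_range(mv_x: int, mv_y: int, sr_jm: int, dynamic_sr: int) -> int:
    """Eq. 2.17: use pro_SR for a small motion MB, otherwise dynamic_SR.

    mv_x, mv_y: 16x16 motion vector found on the first reference frame.
    """
    pro_sr = sr_jm // 8  # +/- 1/8 of the JM search range
    small_motion = abs(mv_x) <= pro_sr and abs(mv_y) <= pro_sr
    return pro_sr if small_motion else dynamic_sr


# Hypothetical motion vectors with SR_JM = 16 and a dynamic range of 12.
print(select_search_range(1, -2, sr_jm=16, dynamic_sr=12))  # small motion MB
print(select_search_range(3, 0, sr_jm=16, dynamic_sr=12))   # falls back to dynamic_SR
```

Because the decision uses only the motion vector of the first reference frame, it needs no information from neighboring MBs or from the remaining reference frames.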

For a hardwired video coding system, the pixel difference analysis on the current MB acts only as a pre-process before IME, so the adaptive sub-sampling scheme is a hardware friendly proposal which helps to save clock cycles and power at the architecture level. The MRF elimination and search range adjustment schemes do not rely on the relationship of MVs in former reference frames, so they are also hardware oriented schemes, which target reducing clock cycles and saving power at the system level. In fact, the proposed fast ME algorithm keeps the original full search data flow, and existing architectures such as propagate partial SAD [17] [28] and SAD Tree [17] can realize my algorithm with some

Table 2.8: Quality Comparison with Full Search

<table>
<thead>
<tr>
<th>Video</th>
<th>Format</th>
<th>BDBR (%)</th>
<th>BDPSNR (dB)</th>
</tr>
</thead>
<tbody>
<tr>
<td>foreman</td>
<td>QCIF</td>
<td>+1.023</td>
<td>-0.092</td>
</tr>
<tr>
<td>mobile</td>
<td>QCIF</td>
<td>+0.369</td>
<td>-0.020</td>
</tr>
<tr>
<td>grandma</td>
<td>QCIF</td>
<td>+0.115</td>
<td>-0.004</td>
</tr>
<tr>
<td>container</td>
<td>QCIF</td>
<td>+0.669</td>
<td>-0.025</td>
</tr>
<tr>
<td>news</td>
<td>QCIF</td>
<td>+0.134</td>
<td>-0.068</td>
</tr>
<tr>
<td>tempete</td>
<td>QCIF</td>
<td>+0.855</td>
<td>-0.048</td>
</tr>
<tr>
<td>coastguard</td>
<td>QCIF</td>
<td>+1.024</td>
<td>-0.080</td>
</tr>
<tr>
<td>carphone</td>
<td>QCIF</td>
<td>+1.001</td>
<td>-0.143</td>
</tr>
<tr>
<td>stefan</td>
<td>CIF</td>
<td>+1.561</td>
<td>-0.224</td>
</tr>
<tr>
<td>mobile</td>
<td>CIF</td>
<td>+0.577</td>
<td>-0.032</td>
</tr>
<tr>
<td>football</td>
<td>CIF</td>
<td>+1.276</td>
<td>-0.121</td>
</tr>
<tr>
<td>container</td>
<td>CIF</td>
<td>+0.623</td>
<td>-0.021</td>
</tr>
<tr>
<td>news</td>
<td>CIF</td>
<td>+1.264</td>
<td>-0.108</td>
</tr>
<tr>
<td>tempete</td>
<td>CIF</td>
<td>+1.118</td>
<td>-0.087</td>
</tr>
<tr>
<td>coastguard</td>
<td>CIF</td>
<td>+1.195</td>
<td>-0.102</td>
</tr>
<tr>
<td>paris</td>
<td>CIF</td>
<td>+1.011</td>
<td>-0.065</td>
</tr>
<tr>
<td>parkrun</td>
<td>720p</td>
<td>+0.438</td>
<td>-0.023</td>
</tr>
<tr>
<td>mobcal</td>
<td>720p</td>
<td>+1.235</td>
<td>-0.060</td>
</tr>
<tr>
<td>city</td>
<td>720p</td>
<td>+1.437</td>
<td>-0.006</td>
</tr>
<tr>
<td>harbor</td>
<td>720p</td>
<td>+1.134</td>
<td>-0.117</td>
</tr>
</tbody>
</table>

optimization in the control module.

\[
PE_{idle\_ratio} = 1 - 0.5 \times R(hss + vss) - 0.25 \times R(qss) - 1.0 \times R(nss) \tag{2.18}
\]

\[
clk_{sav\_MRF} = MB_{num} \times R(MRF_{skip}) \times (Ref_{num} - 1) \times SP_{num} \tag{2.19}
\]

Table 2.9: ME Time Reduction Ratio with Full Search(%) 

<table>
<thead>
<tr>
<th>QP</th>
<th>20</th>
<th>24</th>
<th>28</th>
<th>32</th>
<th>Average</th>
</tr>
</thead>
<tbody>
<tr>
<td>foreman_qcif</td>
<td>87.37</td>
<td>87.74</td>
<td>87.52</td>
<td>87.52</td>
<td>87.49</td>
</tr>
<tr>
<td>mobile_qcif</td>
<td>87.48</td>
<td>87.32</td>
<td>87.21</td>
<td>86.68</td>
<td>87.23</td>
</tr>
<tr>
<td>grandma_qcif</td>
<td>92.65</td>
<td>92.32</td>
<td>91.97</td>
<td>91.59</td>
<td>92.36</td>
</tr>
<tr>
<td>container_qcif</td>
<td>92.65</td>
<td>92.31</td>
<td>91.97</td>
<td>91.59</td>
<td>92.36</td>
</tr>
<tr>
<td>news_qcif</td>
<td>90.56</td>
<td>90.37</td>
<td>88.82</td>
<td>87.64</td>
<td>89.35</td>
</tr>
<tr>
<td>tempete_qcif</td>
<td>87.37</td>
<td>87.04</td>
<td>86.94</td>
<td>86.76</td>
<td>87.03</td>
</tr>
<tr>
<td>coastguard_qcif</td>
<td>91.31</td>
<td>91.70</td>
<td>91.37</td>
<td>90.92</td>
<td>91.33</td>
</tr>
<tr>
<td>carphone_qcif</td>
<td>88.00</td>
<td>88.19</td>
<td>87.56</td>
<td>87.47</td>
<td>87.81</td>
</tr>
<tr>
<td>stefan_cif</td>
<td>85.22</td>
<td>85.57</td>
<td>85.78</td>
<td>86.37</td>
<td>85.62</td>
</tr>
<tr>
<td>mobile_cif</td>
<td>90.40</td>
<td>90.23</td>
<td>89.93</td>
<td>89.59</td>
<td>90.12</td>
</tr>
<tr>
<td>football_cif</td>
<td>83.69</td>
<td>85.39</td>
<td>86.55</td>
<td>87.30</td>
<td>84.85</td>
</tr>
<tr>
<td>container_cif</td>
<td>95.59</td>
<td>95.06</td>
<td>94.19</td>
<td>93.15</td>
<td>94.80</td>
</tr>
<tr>
<td>news_cif</td>
<td>94.01</td>
<td>93.73</td>
<td>92.84</td>
<td>92.05</td>
<td>93.16</td>
</tr>
<tr>
<td>tempete_cif</td>
<td>90.18</td>
<td>90.45</td>
<td>90.68</td>
<td>90.76</td>
<td>90.52</td>
</tr>
<tr>
<td>coastguard_cif</td>
<td>93.77</td>
<td>94.06</td>
<td>94.09</td>
<td>93.69</td>
<td>93.90</td>
</tr>
<tr>
<td>paris_cif</td>
<td>91.71</td>
<td>91.50</td>
<td>90.91</td>
<td>90.51</td>
<td>91.16</td>
</tr>
<tr>
<td>parkrun_720p</td>
<td>93.08</td>
<td>93.73</td>
<td>94.23</td>
<td>94.54</td>
<td>93.89</td>
</tr>
<tr>
<td>mobcal_720p</td>
<td>93.00</td>
<td>93.75</td>
<td>94.34</td>
<td>94.56</td>
<td>93.91</td>
</tr>
<tr>
<td>city_720p</td>
<td>94.81</td>
<td>95.16</td>
<td>95.43</td>
<td>95.56</td>
<td>95.24</td>
</tr>
<tr>
<td>harbor_720p</td>
<td>95.18</td>
<td>95.49</td>
<td>95.66</td>
<td>95.72</td>
<td>95.51</td>
</tr>
</tbody>
</table>

\[ clk_{sav\_SR} = MB_{num} \times R(SR_{adj}) \times (Ref_{num} - 1) \times \left[ SP_{num} - \left( \frac{SR_{JM}}{4} + 1 \right)^2 \right] \]

\[ clk_{sav\_rat} = \frac{clk_{sav\_MRF} + clk_{sav\_SR}}{clk_{ori}} \]
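
The four expressions above (Eq. 2.18, Eq. 2.19 and the two clock-saving formulas that follow them) can be evaluated directly, as in the sketch below. The function names and all input values are hypothetical, the meaning of each ratio is inferred from its symbol name where the text does not define it, and $clk_{ori}$ is treated as a given input.

```python
def pe_idle_ratio(r_half: float, r_quarter: float, r_none: float) -> float:
    """Eq. 2.18 as printed; r_half is R(hss + vss), r_quarter is R(qss), r_none is R(nss)."""
    return 1.0 - 0.5 * r_half - 0.25 * r_quarter - 1.0 * r_none


def clk_saved_mrf(mb_num: int, r_mrf_skip: float, ref_num: int, sp_num: int) -> float:
    """Eq. 2.19: clock cycles saved by MRF elimination."""
    return mb_num * r_mrf_skip * (ref_num - 1) * sp_num


def clk_saved_sr(mb_num: int, r_sr_adj: float, ref_num: int,
                 sp_num: int, sr_jm: int) -> float:
    """Clock cycles saved by search range adjustment."""
    reduced_points = (sr_jm / 4 + 1) ** 2
    return mb_num * r_sr_adj * (ref_num - 1) * (sp_num - reduced_points)


def clk_saving_ratio(saved_mrf: float, saved_sr: float, clk_ori: float) -> float:
    """Combined saving relative to the original clock count."""
    if clk_ori <= 0:
        raise ValueError("clk_ori must be positive")
    return (saved_mrf + saved_sr) / clk_ori


# Hypothetical inputs: one QCIF frame (99 MBs), 5 reference frames, SR_JM = 16.
sp_num = (2 * 16 + 1) ** 2
print(pe_idle_ratio(r_half=0.2, r_quarter=0.4, r_none=0.4))
print(clk_saved_mrf(99, 0.5, 5, sp_num))
print(clk_saved_sr(99, 0.5, 5, sp_num, 16))
```

The term $(SR_{JM}/4 + 1)^2$ is the number of search points left when the range shrinks to ±1/8 $SR_{JM}$, since $(2 \times SR_{JM}/8 + 1)^2$ reduces to the same expression.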

Here, I pick SAD Tree architecture as a case study. Firstly, when adaptive sub-

Table 2.10: Quality Comparison with UMHexagon Search

<table>
<thead>
<tr>
<th>Video</th>
<th>Format</th>
<th>BDBR (%)</th>
<th>BDPSNR (dB)</th>
</tr>
</thead>
<tbody>
<tr>
<td>foreman</td>
<td>qcif</td>
<td>+1.046</td>
<td>-0.093</td>
</tr>
<tr>
<td>mobile</td>
<td>qcif</td>
<td>+0.415</td>
<td>-0.022</td>
</tr>
<tr>
<td>grandma</td>
<td>qcif</td>
<td>+0.948</td>
<td>-0.039</td>
</tr>
<tr>
<td>container</td>
<td>qcif</td>
<td>+0.855</td>
<td>-0.030</td>
</tr>
<tr>
<td>news</td>
<td>qcif</td>
<td>+0.911</td>
<td>-0.068</td>
</tr>
<tr>
<td>tempete</td>
<td>qcif</td>
<td>+0.762</td>
<td>-0.041</td>
</tr>
<tr>
<td>coastguard</td>
<td>qcif</td>
<td>+1.004</td>
<td>-0.075</td>
</tr>
<tr>
<td>carphone</td>
<td>qcif</td>
<td>+1.335</td>
<td>-0.114</td>
</tr>
<tr>
<td>stefan</td>
<td>cif</td>
<td>+1.531</td>
<td>-0.080</td>
</tr>
<tr>
<td>mobile</td>
<td>cif</td>
<td>+0.902</td>
<td>-0.052</td>
</tr>
<tr>
<td>football</td>
<td>cif</td>
<td>+1.554</td>
<td>-0.096</td>
</tr>
<tr>
<td>container</td>
<td>cif</td>
<td>+0.554</td>
<td>-0.080</td>
</tr>
<tr>
<td>news</td>
<td>cif</td>
<td>+0.976</td>
<td>-0.099</td>
</tr>
<tr>
<td>tempete</td>
<td>cif</td>
<td>+1.141</td>
<td>-0.088</td>
</tr>
<tr>
<td>coastguard</td>
<td>cif</td>
<td>+1.261</td>
<td>-0.076</td>
</tr>
<tr>
<td>paris</td>
<td>cif</td>
<td>+1.134</td>
<td>-0.068</td>
</tr>
<tr>
<td>parkrun</td>
<td>720p</td>
<td>+0.438</td>
<td>-0.024</td>
</tr>
<tr>
<td>mobcal</td>
<td>720p</td>
<td>+1.210</td>
<td>-0.061</td>
</tr>
<tr>
<td>city</td>
<td>720p</td>
<td>+1.209</td>
<td>-0.103</td>
</tr>
<tr>
<td>harbor</td>
<td>720p</td>
<td>+1.088</td>
<td>-0.045</td>
</tr>
</tbody>
</table>

sampling is applied to the SAD Tree architecture, the original data flow of SAD Tree can be kept unchanged, with modification only in the control module. The processing elements (PEs) can therefore be set idle in the different sub-sampling cases. In the original data flow, all 256 PEs in the architecture are busy in every clock cycle. In my case, the PE idle ratio ($PE_{idle\_ratio}$) within each frame can be calculated based on Eq. 2.18, where $R(hss + vss)$ represents the sum of the horizontal-only sub-sampled MB ratio and the vertical-only sub-sampled MB ratio; $R(qss)$ is the quarter sub-sampled MB ratio; $R(nss)$ is the ratio of MBs with

Table 2.11: Speed-up of UMHexagon Search

<table>
<thead>
<tr>
<th>QP</th>
<th>20</th>
<th>24</th>
<th>28</th>
<th>32</th>
<th>Average</th>
</tr>
</thead>
<tbody>
<tr>
<td>foreman_qcif</td>
<td>1.60</td>
<td>1.72</td>
<td>1.82</td>
<td>1.92</td>
<td>1.77</td>
</tr>
<tr>
<td>mobile_qcif</td>
<td>1.48</td>
<td>1.52</td>
<td>1.62</td>
<td>1.65</td>
<td>1.57</td>
</tr>
<tr>
<td>grandma_qcif</td>
<td>2.13</td>
<td>2.43</td>
<td>2.38</td>
<td>2.51</td>
<td>2.36</td>
</tr>
<tr>
<td>container_qcif</td>
<td>2.04</td>
<td>2.06</td>
<td>1.91</td>
<td>2.06</td>
<td>2.02</td>
</tr>
<tr>
<td>news_qcif</td>
<td>1.92</td>
<td>1.95</td>
<td>1.80</td>
<td>1.83</td>
<td>1.87</td>
</tr>
<tr>
<td>tempete_qcif</td>
<td>1.47</td>
<td>1.58</td>
<td>1.62</td>
<td>1.65</td>
<td>1.58</td>
</tr>
<tr>
<td>coastguard_qcif</td>
<td>1.80</td>
<td>2.04</td>
<td>2.16</td>
<td>2.12</td>
<td>2.03</td>
</tr>
<tr>
<td>carphone_qcif</td>
<td>1.72</td>
<td>1.87</td>
<td>1.99</td>
<td>2.13</td>
<td>1.93</td>
</tr>
<tr>
<td>stefan_cif</td>
<td>1.33</td>
<td>1.39</td>
<td>1.42</td>
<td>1.46</td>
<td>1.40</td>
</tr>
<tr>
<td>mobile_cif</td>
<td>1.54</td>
<td>1.62</td>
<td>1.67</td>
<td>1.67</td>
<td>1.63</td>
</tr>
<tr>
<td>football_cif</td>
<td>2.11</td>
<td>2.32</td>
<td>2.48</td>
<td>2.73</td>
<td>2.41</td>
</tr>
<tr>
<td>container_cif</td>
<td>2.37</td>
<td>2.34</td>
<td>2.27</td>
<td>2.19</td>
<td>2.29</td>
</tr>
<tr>
<td>news_cif</td>
<td>2.42</td>
<td>2.47</td>
<td>2.40</td>
<td>2.31</td>
<td>2.40</td>
</tr>
<tr>
<td>tempete_cif</td>
<td>1.53</td>
<td>1.64</td>
<td>1.74</td>
<td>1.87</td>
<td>1.69</td>
</tr>
<tr>
<td>coastguard_cif</td>
<td>1.72</td>
<td>1.97</td>
<td>2.18</td>
<td>2.35</td>
<td>2.06</td>
</tr>
<tr>
<td>paris_cif</td>
<td>1.68</td>
<td>1.73</td>
<td>1.70</td>
<td>1.72</td>
<td>1.71</td>
</tr>
<tr>
<td>parkrun_720p</td>
<td>1.50</td>
<td>1.62</td>
<td>1.75</td>
<td>1.88</td>
<td>1.69</td>
</tr>
<tr>
<td>mobcal_720p</td>
<td>1.36</td>
<td>1.42</td>
<td>1.51</td>
<td>1.61</td>
<td>1.48</td>
</tr>
<tr>
<td>city_720p</td>
<td>1.55</td>
<td>1.72</td>
<td>1.90</td>
<td>2.14</td>
<td>1.83</td>
</tr>
<tr>
<td>harbor_720p</td>
<td>1.41</td>
<td>1.50</td>
<td>1.62</td>
<td>1.80</td>
<td>1.58</td>
</tr>
</tbody>
</table>

no sub-sampling. For horizontal or vertical sub-sampling, 50% of the PEs can be set idle. In the case of quarter sub-sampling, only 25% of the PEs are kept active. Therefore, many PEs can be set idle during the ME process. Figure 2.21 is an example for ‘container_qcif’ over 100 encoded frames, assuming that QP is 28 and the search range is fixed at 16. Much of the calculation in the PEs can be saved at the architecture level: on average, 46.02% of the PEs are idle during the encoding of the 100 frames. Secondly, for the MRF elimination and SR adjustment schemes, the control module can set the whole IME engine to the idle state

Figure 2.21: PE idle ratio

Figure 2.22: Clock cycle saving ratio

when early termination occurs, or shorten the processing clock cycles for the other reference frames based on motion feature analysis. The clock cycle savings of the MRF elimination ($clk_{sav\_MRF}$) and SR adjustment ($clk_{sav\_SR}$) schemes are expressed in Eq. 2.19 and Eq. 2.20, where $R(MRF_{skip})$ and $R(SR_{adj})$ are the MRF-skipped MB ratio and the search-range-adjusted MB ratio, respectively. $MB_{num}$ and $SP_{num}$ represent the number of MBs within one frame and the number of search points within the search window. $Ref_{num}$ is the number of reference frames and $SR_{JM}$ is the original JM search range (16/24 for QCIF/CIF, 64 for HDTV720p). Figure 2.22 gives the percentage of clock cycles saved ($clk_{sav\_rat}$)

Figure 2.23: 4-Stage encoding system with proposed algorithm

based on Eq. 2.21. To simplify the situation, MBs with recursive search range adjustment are not included. \( clk_{ori} \) represents the original number of clock cycles required by the SAD Tree architecture. As a case study, I use 5 reference frames and a 16×16 search window, and encode 100 frames at QP 28. On average, 72.32% of the clock cycles can be saved by the proposed schemes for these QCIF sequences.
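
The PE idle ratio and the clock cycle saving discussed above can be sketched numerically. The Python sketch below is illustrative only. The function names and the example inputs are hypothetical. Eq. 2.18 is not reproduced here, so the idle ratio is approximated from the stated idle fractions (50% for horizontal or vertical sub-sampling, 75% for quarter sub-sampling). The MRF saving of Eq. 2.19 and the original cycle count are taken as inputs.

```python
def pe_idle_ratio(r_hss_vss, r_qss):
    """Approximate PE idle ratio within a frame (assumed form of Eq. 2.18)."""
    # Half of the PEs idle for horizontal-only or vertical-only sub-sampled MBs,
    # three quarters idle for quarter sub-sampled MBs, none for full-pixel MBs.
    return 0.5 * r_hss_vss + 0.75 * r_qss


def clk_sav_sr(mb_num, r_sr_adj, ref_num, sp_num, sr_jm):
    """Clock cycles saved by search range adjustment, following Eq. 2.20."""
    reduced_points = (sr_jm / 4 + 1) ** 2  # search points left after adjustment
    return mb_num * r_sr_adj * (ref_num - 1) * (sp_num - reduced_points)


def clk_sav_rat(clk_sav_mrf, clk_sav_sr_value, clk_ori):
    """Clock cycle saving ratio, following Eq. 2.21."""
    return (clk_sav_mrf + clk_sav_sr_value) / clk_ori


# Hypothetical example: 99 MBs per QCIF frame, 5 reference frames,
# half of the MBs adjusted, 1024 search points assumed per window.
saved = clk_sav_sr(mb_num=99, r_sr_adj=0.5, ref_num=5, sp_num=1024, sr_jm=16)
```

With $SR_{JM} = 16$, the adjusted search covers $(16/4 + 1)^2 = 25$ points, so each adjusted MB skips the remaining search points on every reference frame other than the first.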

For memory access, since the proposed algorithm does not disturb the data flow of the original full search algorithm, the memory access scheme of the existing IME engine is kept unchanged. A further merit is that my schemes also help to reduce memory access. For example, the 1-set SAD Tree architecture [17] loads 17 pixels (16 pixels for block matching and 1 pixel for column shift in the snake scan method [17]) in each clock cycle. With the clock cycles saved by the proposed algorithm, the corresponding memory accesses are saved as well. In this dissertation, I focus on the hardware oriented algorithm and do not implement the proposed algorithm directly in existing structures. The reason is that my schemes only require some optimization of the control logic and one pixel analysis module, so the extra modification and hardware added to existing efficient engines such as PPSAD and SAD Tree [17] are trivial. In detail, for each processing element (PE) in the IME architecture, the three sub-sampling patterns only require one extra ‘enable/disable’ signal, which is managed by the system control. The block overlapping analysis only checks the integer motion vectors (IMVs) of the 16×16 to 8×8 modes and determines whether to end the whole IME process for the current MB. These IMVs are easily obtained at the end of IME on the 1st frame, so the complexity of block matching on the following frames can be saved without a complicated decision procedure. The search range adjustment also depends on the IMV of the 16×16 mode at the end of the 1st frame’s search. The system control only has to set an early stop for block matching on the search window based on the motion feature analysis result.

Figure 2.23 shows the optimized 4-stage real-time hardwired encoder, in which my schemes are marked in italic font. The IME engine is arranged in a stage of its own and the FME part is in another stage. The intra prediction (IP), entropy coding (EC) and deblocking filter (DB) engines are arranged in the 3rd and 4th stages. Given this pipeline, any fast algorithm that uses information from the 2nd to 4th stages (such as the rate distortion cost [16] or information available after FME [6] [14] [15]) is impractical, because the IME has already finished all its work by the time such information is available. My schemes all work in the IME stage, which is compatible with the existing pipeline based design.

2.6 Conclusion remarks

In this chapter, a hardware oriented fast motion estimation algorithm is proposed. The algorithm targets complexity reduction in three aspects. Firstly, the aliasing problem, which is the main reason for video quality degradation, is analyzed. By adopting an edge detection technique, the complexity incurred by the MRF technique is reduced. A similarity analysis based MRF elimination scheme is also introduced to further reduce complexity for MBs with stationary features. Secondly, the motion feature of the current MB is extracted during the block matching process. Redundant search points for small-motion MBs are eliminated by restricting their search area to a small central region. Moreover, a recursive search range adjustment scheme is employed for MBs with different motion features. Thirdly, by executing pixel difference analysis, which is arranged before the IME engine, an adaptive sub-sampling scheme is introduced to reduce the complexity of the full pixel pattern. Altogether, by combining all these schemes, the proposed algorithm can achieve 83.69% to 95.72% ME time reduction with trivial video quality loss compared with the full search algorithm. On average, about 88.53% of the ME time is saved across the different sequences. Furthermore, the proposed fast ME algorithm is orthogonal to existing software oriented fast motion estimation algorithms and can speed up the conventional UMHexagon search by a factor of up to 2.73. Since the proposed algorithm operates in a hardware friendly way, it can be easily implemented in the 4-stage pipeline based real-time video encoding system.
Chapter 3

Flexible integer motion estimation architecture

3.1 Introduction

In the previous chapter, several schemes for hardware oriented algorithms were introduced. With these schemes, complexity reduction can be achieved based on the hardware data flow. The clock cycle saving ratio of the MRF elimination and search range adjustment schemes was also analyzed. With some control modules, the proposed MRF and search range schemes can be easily applied to existing architectures, such as the SAD Tree and propagate partial SAD architectures. The control part of these schemes belongs to system level adjustment.

Adaptive sub-sampling, however, cannot be efficiently applied to existing fixed architectures. Here, Eq. (3.1) is introduced for MB classification. In detail, if every \( PD_h(i, j) \) of the current MB is within a pre-defined threshold \( THR_{PD} \), the MB is called a horizontal homogeneous MB \( (H_{homo}) \). \( V_{homo} \) is defined analogously. An MB whose \( PD_h(i, j) \) and \( PD_v(i, j) \) are all within \( THR_{PD} \) is called a strong homogeneous MB \( (S_{homo}) \). Otherwise, it is a non-homogeneous MB \( (N_{homo}) \). Three hardware friendly sub-sampling patterns (pattern 1 to 3), as shown in Fig. 3.1, are used to reduce the complexity of IME in the \( H_{homo}, V_{homo} \) and \( S_{homo} \) cases. Pattern 4 is the full pixel pattern, which is used for \( N_{homo} \) MBs. The pixel difference analysis (PDA) is executed during loading of the current MB. \( THR_{PD} \) is set to \( 4 \times QP \) (QP: quantization parameter) based on empirical and exhaustive experiments. As shown in Table 3.1, the PDA based adaptive sub-sampling gives better video quality than the direct sub-sampling scheme. Another merit of adaptive sub-sampling is that it suits power aware systems serving different customer demands. In fact, direct half sub-sampling [31] and quarter sub-sampling [19] are sub-classes of the adaptive algorithm.

Figure 3.1: Sub-sampling patterns and full pixel pattern (panels: Pattern 1, Pattern 2, Pattern 3, Pattern 4)

$$
\begin{aligned}
H_{homo} &: PD_h(i,j) < THR_{PD} \\
V_{homo} &: PD_v(i,j) < THR_{PD} \\
S_{homo} &: (PD_h(i,j) < THR_{PD}) \;\&\; (PD_v(i,j) < THR_{PD}) \\
N_{homo} &: \text{otherwise} \\
& i \in [1,15]; \; j \in [1,15]
\end{aligned}
$$

(3.1)
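
The classification of Eq. (3.1) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the PDA hardware itself: the function name `classify_mb` is hypothetical, and $PD_h(i,j)$ and $PD_v(i,j)$ are assumed to be the absolute differences between a pixel and its left and upper neighbour, since their definitions are not repeated in this section. The strong homogeneous case is tested first because it satisfies both single-direction conditions.

```python
def classify_mb(mb, qp):
    """Classify a 16x16 MB (list of 16 rows of 16 pixel values) per Eq. (3.1)."""
    thr = 4 * qp          # THR_PD = 4 x QP, as stated in the text
    idx = range(1, 16)    # i, j in [1, 15]
    # Assumed definitions: difference to the left neighbour (horizontal)
    # and to the upper neighbour (vertical).
    h_ok = all(abs(mb[i][j] - mb[i][j - 1]) < thr for i in idx for j in idx)
    v_ok = all(abs(mb[i][j] - mb[i - 1][j]) < thr for i in idx for j in idx)
    if h_ok and v_ok:
        return "S_homo"   # quarter sub-sampling (pattern 3)
    if h_ok:
        return "H_homo"   # horizontal sub-sampling (pattern 1)
    if v_ok:
        return "V_homo"   # vertical sub-sampling (pattern 2)
    return "N_homo"       # full pixel pattern (pattern 4)


flat = [[128] * 16 for _ in range(16)]
row_stripes = [[255 * (i % 2)] * 16 for i in range(16)]
print(classify_mb(flat, 28), classify_mb(row_stripes, 28))
```

A flat block is classified as strong homogeneous, while a block of alternating bright and dark rows is homogeneous only in the horizontal direction.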

However, direct application of the adaptive algorithm to an existing fixed architecture causes poor data reuse and hardware utilization. Moreover, repeated loading of pixels from SRAM degrades the efficiency of the adaptive sub-sampling scheme, especially for large image sizes such as in HDTV applications. Fig. 3.2 demonstrates the data reuse problem. Assume that the current MB is an $H_{homo}$ MB and pattern 1 is adopted for its IME. The original SAD Tree structure loads 16 pixels in each cycle for SAD calculation and 1 extra pixel for column shift in the snake scan method [17]. With the original data flow, only 50% of the pixels are useful for SAD calculation. As for hardware utilization, 50% of the processing elements (PEs) are likewise not utilized. In the $S_{homo}$ case, the waste of hardware resources rises to 75%. So, flexible architectures are required for the adaptive scheme.

Table 3.1: Quality analysis of adaptive sub-sampling

<table>
<thead>
<tr>
<th>direct_ss</th>
<th>BDBR (%)</th>
<th>BDPSNR (dB)</th>
</tr>
</thead>
<tbody>
<tr>
<td>crew_720p</td>
<td>+0.65</td>
<td>-0.025</td>
</tr>
<tr>
<td>city_720p</td>
<td>+1.23</td>
<td>-0.057</td>
</tr>
<tr>
<td>stockholm_720p</td>
<td>+2.19</td>
<td>-0.208</td>
</tr>
<tr>
<td>knightshields_720p</td>
<td>+2.38</td>
<td>-0.773</td>
</tr>
<tr>
<td>harbour_720p</td>
<td>+1.35</td>
<td>-0.077</td>
</tr>
<tr>
<td>parkrun_720p</td>
<td>+2.78</td>
<td>-0.770</td>
</tr>
<tr>
<th>adapt_ss</th>
<th>BDBR (%)</th>
<th>BDPSNR (dB)</th>
</tr>
<tr>
<td>crew_720p</td>
<td>+0.22</td>
<td>-0.011</td>
</tr>
<tr>
<td>city_720p</td>
<td>+0.12</td>
<td>-0.006</td>
</tr>
<tr>
<td>stockholm_720p</td>
<td>+0.25</td>
<td>-0.014</td>
</tr>
<tr>
<td>knightshields_720p</td>
<td>+0.16</td>
<td>-0.008</td>
</tr>
<tr>
<td>harbour_720p</td>
<td>+0.14</td>
<td>-0.010</td>
</tr>
<tr>
<td>parkrun_720p</td>
<td>+0.39</td>
<td>-0.025</td>
</tr>
</tbody>
</table>

direct_ss: direct quarter sub-sampling
adapt_ss: PDA based adaptive sub-sampling

Figure 3.2: Data reuse problem in SAD Tree structure

The proposed flexible IME architectures are based on the original SAD Tree and propagate partial SAD (PPSAD) structures. Firstly, with memory level and architecture level pixel organization, the problems in data reuse and hardware utilization are solved. Secondly, with configurable SAD and an interactive data loading scheme, the processing cycles and power dissipation of previous designs are greatly reduced. Moreover, circuit level optimization is applied in the proposed architecture, which further saves hardware cost and power dissipation. The details are given in the following sections.

Figure 3.3: Original SAD Tree structure

3.2 Reconfigurable SAD tree architecture

3.2.1 System architecture

The proposed reconfigurable SAD Tree (RSADT) architecture is shown in Fig. 3.4. The upper-left part is the PDA part, which provides the pattern selection signal for the proposed structure. During loading of the current MB pixels (Pels), $V_{PD}$ and $H_{PD}$ are obtained with 4 shift registers (shift_reg), 4 absolute difference operations (abs_diff_opt) and 2 adders. The sub-sampling pattern is therefore decided before IME starts to work. The extra calculation introduced to the system does not degrade system performance, because loading of the current MB pixels occurs only once during the whole IME process.

In the proposed architecture, three major modifications are applied compared with SADT [17], which is shown in Fig. 3.3. Firstly, instead of pipelining at the partial 4×4 or 8×8 SAD scale, the proposed structure pipelines at the Pel scale, that is, the 4-Pel scale and the 16-Pel scale. The purpose is to achieve full data reuse for the adaptive algorithm. Secondly, two pipeline stages are inserted in the RSADT (4-Pel and 16-Pel). Compared with the single pipeline stage in SADT, the system clock speed is enhanced. Thirdly, based on the 4-Pel SADs, an architecture level pixel organization scheme is introduced to form 4-Pel scaled configurable SADs (CSADs), which solves the data reuse and hardware utilization problems. Based on these 4-Pel CSADs and memory level pixel organization, the processing cycles can be shortened for MBs with different homogeneity. Furthermore, a cross reuse structure for 16-Pel scaled CSAD generation is proposed to realize the adaptive scheme efficiently. The details are described in the following sections.

3.2.2 Architecture level data organization and circuit modification

In architecture design, data organization is always critical to the performance of the whole system. The organization can take place at the memory level or at the architecture level. In the proposed architecture, I apply data organization at both levels.

Firstly, an architecture level data organization is proposed for the RSADT architecture. Figure 3.5(a) shows one 8×8 residue block. The original SADT pipelines at the 4×4 SAD (\(SAD_{4\times4}\)) and generates the 8×8 SAD by accumulating four \(SAD_{4\times4}\). Equation (3.2) and Eq. (3.3) give the upper-left \(SAD_{4\times4}\) and the whole \(SAD_{8\times8}\) of Fig. 3.5(a), where \(G\) is the set of indices that selects the pixels belonging to one \(SAD_{4\times4}\). For the three sub-sampling patterns, \(SAD_{4\times4}\) changes to Eq. (3.4) accordingly, and the related 8×8 SAD can be derived as Eq. (3.5). From Fig. 3.5(a) and Eq. (3.2) to Eq. (3.5), it follows that, when pattern 1 is adopted, the accumulated SAD of positions \(B_k\) and \(D_k\) (\(k = 1, \ldots, 16\)) is the \(SAD_{8\times8-H}\) of the neighboring search point (SP2). By the same principle, \(SAD_{8\times8-V}\) at SP1 and SP3 are available simultaneously when pattern 2 is used. In the case of pattern 3, the \(SAD_{8\times8-S}\) at SP1 to SP4 can be obtained at one time. So, I reorganize the SAD values as in Fig. 3.5(b). Every two rows represent one \(SAD_{4\times4}\) at SP1. When these SADs are accumulated vertically, they form three types of outputs: four \(SAD_{8\times8-S}\) at SP1 to SP4, two \(SAD_{8\times8-H}\) at SP1 and SP2, or two \(SAD_{8\times8-V}\) at SP1 and SP3. Then, one pipeline stage is inserted and sixteen 4-Pel scaled configurable SADs (CSADs) are formed, as shown in Fig. 3.6. Equation (3.6) gives the formation of all 4-Pel scaled CSADs.

\[
SAD_{4\times4} = \sum_{k \in G} (A_k + B_k + C_k + D_k), \quad G = \{1, 2, 5, 6\}
\] (3.2)

\[
SAD_{8\times8} = \sum_{k=1}^{16} (A_k + B_k + C_k + D_k)
\] (3.3)

\[
\begin{aligned}
H_{homo} &: SAD_{4\times4-H} = \sum_{k \in G} (A_k + C_k) \\
V_{homo} &: SAD_{4\times4-V} = \sum_{k \in G} (A_k + B_k) \\
S_{homo} &: SAD_{4\times4-S} = \sum_{k \in G} A_k
\end{aligned}
\] (3.4)

\[
\begin{aligned}
H_{homo} &: SAD_{8\times8-H} = \sum_{k=1}^{16} (A_k + C_k) \\
V_{homo} &: SAD_{8\times8-V} = \sum_{k=1}^{16} (A_k + B_k) \\
S_{homo} &: SAD_{8\times8-S} = \sum_{k=1}^{16} A_k
\end{aligned}
\] (3.5)

As shown in Fig. 3.6, when four horizontal CSAD values are added together, they form one 16-Pel scaled \(SAD_{4\times4}\) at SP1. Similarly, four 16-Pel scaled \(SAD_{8\times8-S}\) can be obtained vertically. The decision on SAD generation is based on PDA. As shown in Fig. 3.4, all sixteen 16-Pel scaled CSADs are pipelined. Based on these 16-Pel CSADs, an adaptive output is available, that is, \(SAD_{4\times4}\) at one SP, \(SAD_{8\times8-H}\) or \(SAD_{8\times8-V}\) at two SPs, or \(SAD_{8\times8-S}\) at four SPs. So, the processing capability is doubled or quadrupled for MBs with different homogeneity.

Secondly, since the adaptive scheme is applied in hardware, the original two-dimensional reference shift array (RSA) must be modified. Figure 3.8 shows the original RSA structure, which contains 272 SUs in 16 rows and 17 columns. The basic module of the RSA is the snake scan unit (SU), as shown in Fig. 3.7. Assume that the current SU is in the \(i\)th row and \(j\)th column (SU[\(i,j\)]). To enable snake scan, the related upper, lower and right data inputs are given in Eq. (3.7), where SU[\(i,j\)]\(_O\) is the output of SU[\(i,j\)]. The modified SU (MSU) is also shown in Fig. 3.7. The number of data inputs is doubled, and the relation to other MSUs is given in Eq. (3.8). With the MSU module, the original RSA structure is changed to that of Fig. 3.9.

Figure 3.5: Pixel data organization

\[
\begin{aligned}
CSAD_1 &= A_1 + A_2 + A_5 + A_6 \\
CSAD_2 &= B_1 + B_2 + B_5 + B_6 \\
CSAD_3 &= C_1 + C_2 + C_5 + C_6 \\
CSAD_4 &= D_1 + D_2 + D_5 + D_6 \\
CSAD_5 &= A_3 + A_4 + A_7 + A_8 \\
CSAD_6 &= B_3 + B_4 + B_7 + B_8 \\
CSAD_7 &= C_3 + C_4 + C_7 + C_8 \\
CSAD_8 &= D_3 + D_4 + D_7 + D_8 \\
CSAD_9 &= A_9 + A_{10} + A_{13} + A_{14} \\
CSAD_{10} &= B_9 + B_{10} + B_{13} + B_{14} \\
CSAD_{11} &= C_9 + C_{10} + C_{13} + C_{14} \\
CSAD_{12} &= D_9 + D_{10} + D_{13} + D_{14} \\
CSAD_{13} &= A_{11} + A_{12} + A_{15} + A_{16} \\
CSAD_{14} &= B_{11} + B_{12} + B_{15} + B_{16} \\
CSAD_{15} &= C_{11} + C_{12} + C_{15} + C_{16} \\
CSAD_{16} &= D_{11} + D_{12} + D_{15} + D_{16}
\end{aligned}
\] (3.6)
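
The way the sixteen 4-Pel CSADs combine into the adaptive outputs can be illustrated with a short Python sketch. It follows Eq. (3.5) and Eq. (3.6) directly; the function names are hypothetical. Two points are assumptions made for illustration: in the $S_{homo}$ case the B, C and D position types are mapped to SP2, SP3 and SP4 in that order, and in the $N_{homo}$ case the full $SAD_{8\times8}$ at SP1 is returned.

```python
# Index groups of the four 4x4 sub-blocks, taken from Eq. (3.6).
GROUPS = [(1, 2, 5, 6), (3, 4, 7, 8), (9, 10, 13, 14), (11, 12, 15, 16)]


def four_pel_csads(A, B, C, D):
    """Return [CSAD1, ..., CSAD16]; A..D hold residues at indices 1..16."""
    csads = []
    for group in GROUPS:
        for residues in (A, B, C, D):
            csads.append(sum(residues[k] for k in group))
    return csads


def adaptive_sads(csads, case):
    """Accumulate CSADs vertically and combine them for one MB class."""
    # csads[t::4] collects the four CSADs of one position type (A, B, C or D).
    a, b, c, d = (sum(csads[t::4]) for t in range(4))
    if case == "H_homo":
        return {"SP1": a + c, "SP2": b + d}
    if case == "V_homo":
        return {"SP1": a + b, "SP3": c + d}
    if case == "S_homo":
        return {"SP1": a, "SP2": b, "SP3": c, "SP4": d}  # SP order assumed
    return {"SP1": a + b + c + d}  # N_homo: full pixel pattern


# Index 0 is unused so that the lists can be addressed as A[1]..A[16].
A, B, C, D = ([0] + [v] * 16 for v in (1, 2, 3, 4))
print(adaptive_sads(four_pel_csads(A, B, C, D), "H_homo"))
```

The same sixteen partial sums therefore serve one, two or four search points, which is the source of the doubled or quadrupled processing capability.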

Figure 3.6: 4-Pel scaled CSAD

Figure 3.7: Modification in SU

$$\begin{aligned}
Upper &= SU[i - 1, j]_O \\
Lower &= SU[i + 1, j]_O \\
Right &= SU[i, j + 1]_O
\end{aligned}$$

(3.7)

$$\begin{aligned}
Upper' &= MSU[i - 1, j]_O, \quad Lower' = MSU[i + 2, j]_O \\
Right' &= MSU[i, j + 2]_O
\end{aligned}$$

(3.8)

3.2.3 Memory level pixel organization

The memory pixel organization also has to be modified to enable full data reuse. The original memory organization of the SADT architecture is shown in Fig. 3.10(a). Here, I only show a search window of $32 \times 32$ as an example (the last 15 columns of data are added for the block matching of positions on the 32nd column). For the SADT architecture, by using the memory mapping algorithm [28], the data loaded from the search window memory is fully utilized. In each clock cycle, 17 pixels are loaded from the memory and transferred to the RSA. As shown in Fig. 3.8, 16 pixels are used for the block matching calculation and 1 pixel is prepared for column shift in the snake scan method.

Figure 3.8: Original reference shift array (figure label: For Column Shift Use)

Figure 3.9: Modified reference shift array

Since the adaptive algorithm is applied on the proposed RSADT architecture, the original memory pixel organization should be optimized to keep full data reuse. Figure 3.10(b) demonstrates the proposed scheme. The reference pixels are classified into columns with odd rows and columns with even rows. Then, they are arranged into two memory groups (A and B), which output pixel rows to the modified RSA. The group division is adopted mainly for the adaptive patterns. For example, in the case of pattern 2 and pattern 3, two succeeding pixel rows are required by the modified RSA. To enable data reuse in pattern 1, one extra pixel column (the 18th column) is added for column shift in the MSU. Finally, the memory overlapping algorithm [28] is used for both memory groups.
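
The benefit of the group division can be seen in a small Python sketch. It is only an illustration: the function names are hypothetical, and the assignment of odd rows to group A and even rows to group B is an assumption, since the text does not state which parity goes to which group. The point is that two succeeding rows always sit in different groups, so both can be read in the same cycle.

```python
def split_rows_into_groups(window):
    """Split reference rows by parity (rows counted from 1)."""
    group_a = window[0::2]  # rows 1, 3, 5, ... (assumed to go to group A)
    group_b = window[1::2]  # rows 2, 4, 6, ... (assumed to go to group B)
    return group_a, group_b


def fetch_two_rows(group_a, group_b, n):
    """Return the n-th pair of succeeding rows, one from each group."""
    return group_a[n], group_b[n]


# Toy window of 6 rows, each row filled with its own 1-based row number.
window = [[r] * 4 for r in range(1, 7)]
group_a, group_b = split_rows_into_groups(window)
upper, lower = fetch_two_rows(group_a, group_b, 1)
```

In this toy window the second pair consists of rows 3 and 4, one fetched from each group.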

3.2.4 Cross reuse structure for CSAD generation

Fig. 3.10: Memory level pixel organization

Thirdly, the 4-Pel scaled CSAD is fully utilized and one cross reuse structure (CRS) for 16-Pel scaled CSAD generation is proposed. Figure 3.11 shows the proposed CRS; parts (a) and (b) are the same circuit in different configurations. In the intuitive implementation of Fig. 3.6, 8 adders and 4 large multiplexors are required to generate the four 16-Pel scaled CSADs of one 8×8 block. For HDTV applications, when 8 parallel IME engines are adopted, there will be 256 adders and 128 multiplexors. As the synthesis clock speed increases, the hardware cost of these adders and multiplexors grows greatly. In my design, I fully utilize the 4-Pel scaled CSADs, and only four adders are needed to generate all the 16-Pel scaled CSADs. As shown in Fig. 3.11, based on the control signal (Ctrl) from the PDA module, the CRS can be used to obtain non-sub-sampled 16-Pel scaled CSADs as in Fig. 3.11(a) or sub-sampled CSADs as in Fig. 3.11(b). Thus, the generation of 16-Pel CSADs from 4-Pel CSADs is accomplished efficiently with the cross reuse structure.
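
The adder and multiplexor totals quoted above can be reproduced with a short calculation. The factor of four 8×8 blocks per 16×16 MB in each engine is an assumption that makes the stated totals consistent. The CRS total further assumes that the four adders are counted per 8×8 block, which the text does not state explicitly.

\[
\begin{aligned}
N_{add}^{intuitive} &= 8 \times 4 \times 8 = 256 \\
N_{mux}^{intuitive} &= 4 \times 4 \times 8 = 128 \\
N_{add}^{CRS} &= 4 \times 4 \times 8 = 128
\end{aligned}
\]

Here the first factor is the count per 8×8 block, the second is the assumed number of 8×8 blocks per MB, and the third is the number of parallel IME engines. Under these assumptions the CRS would halve the adder count relative to the intuitive implementation.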
3.3 Adaptive propagate partial SAD architecture

Based on the same adaptive algorithm, an adaptive propagate partial SAD (APPSAD) architecture is also proposed, as shown in Fig. 3.12. Compared with the fixed PPSAD architecture in [17], three major optimizations are applied.

Firstly, since the proposed architecture targets HDTV applications, the contribution of small inter modes is trivial. So, I use a mode reduction technique in the APPSAD architecture, which means that inter modes below 8×8 are discarded. As a result, the hardware cost related to small inter modes is removed.

Secondly, in the previous PPSAD architecture, 64 PEs are grouped together and used to accumulate one 8×8 SAD. In the APPSAD architecture, the original fixed structure is modified for the adaptive algorithm. Figure 3.13(a) shows the intuitive implementation of the adaptive algorithm on the previous architecture. PEs drawn in black are activated for all patterns, just like conventional PEs (named $PE_{CONV}$), while PEs drawn as triangles, squares or grey circles are pattern dependent and are activated or deactivated according to the sub-sampling pattern. Besides, since the number of partial SADs varies with the sub-sampling pattern, some multiplexors are added to the architecture to enable the adaptive feature. For example, when vertical sub-sampling is adopted, all the PEs on the even lines of Fig. 3.13(a) are bypassed by configuring all the multiplexors. In this intuitive approach, many multiplexors are required, which complicates the control logic and increases the hardware size. In the APPSAD structure, as shown in Fig. 3.12, I group all the PEs according to their types. For example, all the $PE_{CONV}$ within one 8×8 block are grouped together and only one multiplexor is used to realize the control logic of the adaptive algorithm. Thus, 75% of the multiplexors are removed.

Figure 3.12: Adaptive propagate partial SAD architecture

Figure 3.13: 8x8 PE array in PPSAD architecture. (a) Intuitive implementation. (b) Detailed adder tree circuit.

Thirdly, besides the conventional PE ($PE_{CONV}$), three further PE types exist in the design, namely $PE_{DHS}$, $PE_{DVS}$ and $PE_{FPS}$, which are represented by squares, triangles and grey circles in Fig. 3.13(a), respectively. Each of these elements is activated or deactivated under different sub-sampling patterns. In detail, $PE_{DHS}$ is disabled in the horizontal sub-sampling case and $PE_{DVS}$ is deactivated in the vertical sub-sampling case. $PE_{FPS}$ is only enabled for full pixel block matching (the $N_{homo}$ case). $PE_{CONV}$ is always ‘ON’, whichever sampling pattern the system selects. In the detailed design, I simply send a disable signal to each PE to be deactivated, and the output of such a PE becomes constant zero. So, under different sub-sampling patterns, the architecture can enable the corresponding PEs for the block matching process. Many absolute difference calculations are saved, and power dissipation is consequently reduced at the architecture level.
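
The enable rules for the four PE types can be written down as a small Python sketch. It is a model of the control behaviour described above, not the circuit itself. The function names are hypothetical. Two assumptions are made: quarter sub-sampling is treated as horizontal and vertical sub-sampling combined, and each PE type is assumed to occupy 16 of the 64 PEs in an 8×8 array, which is consistent with the 50% and 25% activity figures quoted for the sub-sampling cases.

```python
def enabled_pe_types(horizontal_ss, vertical_ss):
    """Return the PE types that stay enabled for a sub-sampling choice."""
    enabled = {"PE_CONV"}                  # always ON
    if not horizontal_ss:
        enabled.add("PE_DHS")              # disabled in horizontal sub-sampling
    if not vertical_ss:
        enabled.add("PE_DVS")              # disabled in vertical sub-sampling
    if not horizontal_ss and not vertical_ss:
        enabled.add("PE_FPS")              # full pixel (N_homo) case only
    return enabled


def active_fraction(horizontal_ss, vertical_ss, pes_per_type=16, total=64):
    """Fraction of active PEs, assuming equally sized PE type groups."""
    return len(enabled_pe_types(horizontal_ss, vertical_ss)) * pes_per_type / total


# H_homo: horizontal sub-sampling only, so PE_CONV and PE_DVS remain active.
print(sorted(enabled_pe_types(True, False)), active_fraction(True, False))
```

A disabled PE outputs constant zero, so it drops out of the accumulated SAD without any change to the adder tree.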

3.3.2 Memory organization

At the memory level, the original memory structure also needs to be modified to improve data reuse. In [20], a memory overlapping algorithm is used to fully utilize the pixel data loaded from memory. In my design, I classify memory pixels by row and column parity into four types, as shown in the left part of Fig. 3.14. The square pixels in Fig. 3.14 are odd-row-odd-column pixels ($P_{oo}$); the triangles represent even-row-even-column pixels ($P_{ee}$); the circle and diamond symbols are odd-row-even-column pixels ($P_{oe}$) and even-row-odd-column pixels ($P_{eo}$), respectively.

In the second step, all pixels are grouped according to their types. Since there are four patterns (including the full pixel pattern) in my design, two memory groups are needed to store them. For instance, in the $N_{homo}$ and $V_{homo}$ cases, the number of pixels required by the APPSAD architecture (16 pixels per clock) is twice that of the $H_{homo}$ and $S_{homo}$ cases (8 pixels per clock). So, as shown in Fig. 3.14, two memory groups, namely Mem_GA and Mem_GB, are used. Each group contains several one-pixel-wide memory bars.

Figure 3.14: Pixel classification and memory organization

Figure 3.15: Memory separation and overlapping

All the $P_{oo}$ (Part_OO) and $P_{eo}$ (Part_EO) pixels are stored in Mem_GA, while the other two pixel types (Part_OE and Part_EE) are stored in Mem_GB. To improve the IO bandwidth utilization and remove the bubble clock cycles of the PPSAD based architecture [17], I further separate each group into 2 sub-groups, namely Mem_GA_1, Mem_GA_2, Mem_GB_1 and Mem_GB_2, and apply the memory mapping algorithm [20]. Fig. 3.15 gives an example. Assume that the search range is 48 in width (W=48) and 32 in height (H=32). The last fifteen rows and columns are added for block matching on the boundary parts, and one more row and column are added for hardware implementation. So, the search window size is (W+16)×(H+16). Based on the pixel classification, the size of each part in Fig. 3.14, for example Part_OO, is 32×24. As shown in Fig. 3.15, I separate the last 8 rows of each part and apply the memory overlapping algorithm [20] to both Mem_GA_1 and Mem_GA_2. So, the clock bubble of the PPSAD based architecture is removed. Each memory group contains eight memory bars, which gives 100% IO bandwidth utilization for the different sub-sampling patterns. In this dissertation, I only focus on the IME’s on-chip memory and do not deal with the pixel organization of the off-chip frame memory. The original Level C or Level D [32] off-chip to on-chip data reuse schemes and their corresponding scan orders can still be used for the whole encoder system, so the required off-chip to on-chip memory bandwidth is the same as with the Level C or Level D scheme. The proposed pixel classification can be done at the encoder’s system level and is not part of the IME engine’s job.
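
The pixel classification and the part size in the example above can be checked with a short Python sketch. It is illustrative only: the function name is hypothetical, and rows and columns are assumed to be counted from 1, so that the first row and first column are "odd".

```python
def classify_pixels(window):
    """Split a search window (list of rows) into the four parity parts."""
    odd_rows, even_rows = window[0::2], window[1::2]   # 1-based parity
    return {
        "Part_OO": [row[0::2] for row in odd_rows],    # odd row, odd column
        "Part_OE": [row[1::2] for row in odd_rows],    # odd row, even column
        "Part_EO": [row[0::2] for row in even_rows],   # even row, odd column
        "Part_EE": [row[1::2] for row in even_rows],   # even row, even column
    }


# Example from the text: W = 48, H = 32, so the window is (W+16) x (H+16).
W, H = 48, 32
window = [[(r, c) for c in range(W + 16)] for r in range(H + 16)]
parts = classify_pixels(window)
part_width = len(parts["Part_OO"][0])
part_height = len(parts["Part_OO"])
```

For this window each part is 32 pixels wide and 24 rows high, which matches the 32×24 size given for Part_OO.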

Thirdly, the data flow of the architecture differs from that of the previous design. Figure 3.16 shows the memory data loading flow for the four types of patterns. To simplify the explanation, Mem_GA_1 and Mem_GA_2, as well as Mem_GB_1 and Mem_GB_2, are treated as merged in the description.

Figure 3.16: Data flow of APPSAD architecture

As shown in Fig. 3.16, there are two stages in the $H_{homo}$ case. In the 1st Stage, the Part_Sel signal chooses data from Mem_GA, which means that only Part_OO and Part_EO are candidate parts. The pixel data are loaded alternately from these two parts and Mem_GB is set to the idle state, which saves the power of the Mem_GB part. With this data organization, the memory address control is also simplified: the difference between two succeeding addresses is just the height of Part_OO. For example, assume that there are $h$ addresses in each bar of Part_OO. In the $2n$th cycle, one pixel row at address $m$ of Part_OO is loaded; in the next cycle (the $(2n+1)$th cycle), another pixel row from Part_EO is required by the APPSAD structure based on pattern 2 of Fig. 2.17, and the address of this pixel row is $(m + h)$. The address generation of the $N_{homo}$ case and of the 2nd Stage of $H_{homo}$ can be derived by analogy. When all the pixels in Part_OO and Part_EO have been loaded in the $H_{homo}$ case, it turns to the 2nd Stage, during which only Part_EE and Part_OE are candidate parts and the power dissipation of Mem_GA can be saved.

For the $V_{homo}$ case, there are also two stages. In each clock cycle, both memory groups are activated because the number of pixels required by the APPSAD structure is doubled (16 pixels) according to pattern 3 of Fig. 2.17. Specifically, in the 1st Stage, only Part$_{OO}$ and Part$_{OE}$ are candidate parts. Two pixel rows are loaded simultaneously, from low address to high address, cycle by cycle. When all these pixels are loaded, the flow turns to the 2nd Stage, which only requires pixels from Part$_{EO}$ and Part$_{EE}$. The pixel assemble module (PA Module) combines the two rows and outputs the assembled 16 pixels to the APPSAD architecture.

For the $S_{homo}$ case, since sub-sampling is adopted both horizontally and vertically, the pixels of the different types are loaded part by part, so there are four stages in all. In each stage, the pixel rows of one specific part are loaded from low address to high address, cycle by cycle.

As for the $N_{homo}$ case, only one stage exists, based on the full pixel pattern in Fig. 2.17. As shown in Fig. 3.16, in the $2n$th clock cycle, the pixel rows from Part$_{OO}$ and Part$_{OE}$ are loaded simultaneously. In the succeeding $(2n+1)$th cycle, two rows from Part$_{EO}$ and Part$_{EE}$ are loaded. The whole process continues until all the pixels in the memory are loaded.

Furthermore, for the $H_{homo}$ and $S_{homo}$ cases, the required number of pixels in each clock cycle is 8, which is half of that in the $V_{homo}$ and $N_{homo}$ cases. So, only one memory group is enabled within each stage and the power consumption of the other group can be saved. Therefore, the proposed pixel organization keeps high data reuse while achieving lower memory power dissipation.
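The loading order described above can be summarized in a small behavioural model. This is only a sketch under stated assumptions: the names `GROUP`, `STAGES`, `schedule` and `active_groups` are illustrative, the order of the two parts inside the 2nd Stage of the $H_{homo}$ case and the order of the four $S_{homo}$ stages are assumed (the text does not fix them), and addresses are not modelled.

```python
# Memory group that stores each part (Mem_GA holds OO/EO, Mem_GB holds EE/OE).
GROUP = {"OO": "GA", "EO": "GA", "EE": "GB", "OE": "GB"}

# For every MB type: a list of stages. Each stage is the per-cycle read pattern
# that repeats for every pixel row; a tuple names the parts read in one cycle.
STAGES = {
    "H": [[("OO",), ("EO",)], [("EE",), ("OE",)]],       # order in stage 2 assumed
    "V": [[("OO", "OE")], [("EO", "EE")]],
    "S": [[("OO",)], [("EO",)], [("EE",)], [("OE",)]],   # stage order assumed
    "N": [[("OO", "OE"), ("EO", "EE")]],
}

def schedule(mb_type, rows):
    """Return the list of per-cycle reads for `rows` pixel rows per part."""
    cycles = []
    for pattern in STAGES[mb_type]:
        for _ in range(rows):
            cycles.extend(pattern)
    return cycles

def active_groups(cycle):
    """Memory groups that must be enabled in the given cycle."""
    return {GROUP[part] for part in cycle}
```

In this model every cycle of the `"H"` and `"S"` schedules enables a single memory group, while every cycle of the `"V"` and `"N"` schedules enables both, which is the property the power argument relies on.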

3.3.3 Compressor tree in standard cell library

The proposed APPSAD architecture realizes the adaptive sub-sampling algorithm by introducing some multiplexers and optimizing the previous PE array. As a result, the hardware size is larger than that of the original PPSAD structure. In this dissertation, circuit optimization of the APPSAD structure is accomplished by using 4-2 and 3-2 compressors to manually build a compact architecture.

In a conventional standard cell library such as TSMC 0.18um, compressor trees are widely used to achieve an optimum result during the compiling stage. The criterion for selecting a compressor tree is flexible. The synthesis tool follows constraints such as timing and area criteria and generates a net-list which is close to the user's requirements. So, redundant adders inevitably exist in the final net-list. These adders inflate the hardware cost, and the overhead grows as the synthesis frequency increases. In my proposal, I use compressor trees to manually build the PE array in the APPSAD structure, which removes all the unnecessary adders. Figure 3.17 shows the two kinds of compressors used in the APPSAD structure. The left one is the 3-2 compressor (CMPR32) and the right one is the 4-2 compressor (CMPR42). The ICI and ICO signals of the CMPR42 compressor are the immediate carry-in flag (ICI) from the previous compressor and the immediate carry-out flag (ICO) to the next one. The logic equations of CMPR42 and CMPR32 are shown in Eq. 3.9 and Eq. 3.10. Figure 3.17 is an example of a 1-bit-wide library cell; both compressors can be extended to multiple-bit-wide ones by combining 1-bit-wide cells. An example of compressing 4-bit-wide input data by connecting four 1-bit-wide CMPR42 cells is shown in Fig. 3.18.

\[
\begin{aligned}
IS &= In_1 \oplus In_2 \oplus In_3 \\
ICO &= (In_1 \cdot In_2) + (In_1 \cdot In_3) + (In_2 \cdot In_3) \\
Out_1 &= IS \oplus In_4 \oplus ICI \\
Out_2 &= (IS \cdot In_4) + (IS \cdot ICI) + (In_4 \cdot ICI)
\end{aligned}
\tag{3.9}
\]

\[
\begin{aligned}
Out_1 &= In_1 \oplus In_2 \oplus In_3 \\
Out_2 &= (In_1 \cdot In_2) + (In_1 \cdot In_3) + (In_2 \cdot In_3)
\end{aligned}
\tag{3.10}
\]
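The two logic equations can be checked with a short bit-level model. The sketch below is illustrative only: the function names are assumptions, and `cmpr42_word` models the chaining of 1-bit CMPR42 cells in the spirit of Fig. 3.18 (ICO of bit $i$ feeding ICI of bit $i+1$), not the exact library cell.

```python
def cmpr32(in1, in2, in3):
    """3-2 compressor (Eq. 3.10): in1 + in2 + in3 == out1 + 2 * out2."""
    out1 = in1 ^ in2 ^ in3
    out2 = (in1 & in2) | (in1 & in3) | (in2 & in3)
    return out1, out2

def cmpr42(in1, in2, in3, in4, ici):
    """4-2 compressor (Eq. 3.9), built from two majority/XOR steps."""
    is_, ico = cmpr32(in1, in2, in3)      # IS and ICO
    out1, out2 = cmpr32(is_, in4, ici)    # Out1 and Out2
    return out1, out2, ico

def cmpr42_word(a, b, c, d, width):
    """Chain `width` 1-bit CMPR42 cells over four unsigned words.

    Returns (sum_word, carry_word, last_ico) such that
    a + b + c + d == sum_word + carry_word + (last_ico << width).
    """
    sum_word = carry_word = 0
    ici = 0
    for i in range(width):
        bits = [(v >> i) & 1 for v in (a, b, c, d)]
        out1, out2, ico = cmpr42(*bits, ici)
        sum_word |= out1 << i
        carry_word |= out2 << (i + 1)     # Out2 has twice the weight of Out1
        ici = ico                         # immediate carry to the next cell
    return sum_word, carry_word, ici
```

For a single cell the invariant is $In_1 + In_2 + In_3 + In_4 + ICI = Out_1 + 2\,(Out_2 + ICO)$, so no carry ripples through more than one neighbouring cell.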

3.3.4 Circuit optimization for single processing element

The processing elements (PEs) in the APPSAD architecture execute the absolute difference (abd) operation between current pixels and reference ones. Each PE is responsible for one pixel location, that is, one abd operation between two 8-bit-wide inputs. The intuitive PE circuit is shown in Fig. 3.19(a). One adder is required to generate the final abd result by adding the MSB (most significant bit) to the difference value. Since there are 256 PEs in one APPSAD architecture, the hardware cost of these adders is not negligible, especially when parallel processing and high speed requirements are considered (for example, in HDTV applications).

Figure 3.18: CMPR42X1 with Multiple-bits Wide Input

Figure 3.19: Optimization of processing element

(a) Intuitive implementation

(b) Optimized PE circuit

In the optimized circuit, as shown in Fig. 3.19(b), the MSB and the difference value are not added up. Thus, the dedicated adder within each PE is removed. For the APPSAD architecture, the 1-pixel partial SAD value is not a required output, which means that discarding the adder in each PE and propagating the temporary result to the next stage does not disturb the data flow of the APPSAD structure. The temporary results of the 8 PEs in one row of APPSAD are accumulated together. Since the $8 \times 1$ to $8 \times 7$ partial SADs are not required outputs either, I use one compressor tree structure to compress these SADs. The details are shown in the next section.
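Why the per-PE adder can be dropped is easiest to see in a small arithmetic model. The sketch below assumes, as one common realization that Fig. 3.19 may or may not match exactly, a 9-bit two's complement subtraction whose sign bit is the MSB, followed by a conditional bitwise inversion; the names `pe`, `dpe`, `m` and `row_sad` are illustrative.

```python
def pe(cur, ref):
    """Adder-free PE model: returns (dpe, m) with |cur - ref| == dpe + m."""
    diff = (cur - ref) & 0x1FF            # 9-bit two's complement difference
    m = diff >> 8                         # MSB, 1 when cur < ref
    dpe = (diff & 0xFF) ^ (0xFF * m)      # invert the low 8 bits if negative
    return dpe, m

def row_sad(cur_row, ref_row):
    """SAD of one pixel row; the '+ m' corrections are summed only once."""
    pairs = [pe(c, r) for c, r in zip(cur_row, ref_row)]
    return sum(d for d, _ in pairs) + sum(m for _, m in pairs)
```

Because addition is associative, the deferred `m` bits can be fed into a later accumulation step as extra 1-bit operands instead of being added inside each PE.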

3.3.5 Compressor tree based eight stage circuit optimization

As mentioned in the previous section, not only the adder in each PE unit but also the adder for each PE row is discarded in the APPSAD architecture. Figures 3.20 to 3.23 show the proposed eight-stage compressor tree structures. Each stage corresponds to one line of the APPSAD structure. The detailed description is as follows.

\[
\begin{cases}
x = 3,\ P = 10: & \text{Stage}_3 \text{ structure} \\
x = 5,\ P = 11: & \text{Stage}_5 \text{ structure} \\
x = 7,\ P = 12: & \text{Stage}_7 \text{ structure}
\end{cases}
\tag{3.11}
\]

\[
\begin{cases}
t = 4,\ Q = 11: & \text{Stage}_4 \text{ structure} \\
t = 6,\ Q = 12: & \text{Stage}_6 \text{ structure} \\
t = 8,\ Q = 13: & \text{Stage}_8 \text{ structure}
\end{cases}
\tag{3.12}
\]

For the 1st stage (Stage$_1$), since no temporary results are propagated from an upper stage, the structure is simple compared with the remaining stages. As shown in Fig. 3.20, three CMPR42 cells and one CMPR32 cell are used to generate two temporary results, namely $S_1.L[9:0]$ and $S_1.R[10:0]$. Here, $dpe_{1y}$ ($y \in [1,8]$) represents the difference result of the PE in the $y$th column of the 1st stage. For example, $dpe_{11}$ to $dpe_{18}$ are the difference values from the eight PEs of Stage$_1$, and $m_{1y}$ is the related MSB of Stage$_1$. The meaning of the input data of the other seven stages can be traced by analogy. The square dot in Fig. 3.20 represents a bit inserted at the head, while the diamond dot indicates a bit inserted at the tail. For instance, in Layer 1, combining $m_{13}$ with CMPR42[8:1] forms the 9-bit-wide result [8:0]. Similarly, when ico is added to CMPR42[7:0], the result becomes 9 bits wide, with ico located at the top bit.

The structure from the 2nd stage onward is different from Stage$_1$ because both the temporary results of the current stage and the results propagated from the upper stage have to be compressed. Figure 3.21 shows the designed structure. Besides three CMPR42 cells, one extra combo module which consists of CMPR42 and CMPR22 cells exists in the structure. The Asmb module is introduced to assemble the compressed results for output. The reason for introducing the combo module is that, after Layer 2, the bit-widths of the input data for Layer 3 are not uniform. As shown in Fig. 3.21, there are three 10-bit-wide, one 11-bit-wide and one 1-bit-wide data items to be compressed. In the proposed solution, $S_1.R[10:0]$ is split into the $S_1.R[9:0]$ and $S_1.R[10]$ parts. The detailed structure of the combo module is drawn in broken lines.

The compressor tree architecture of Stage 3 to Stage 8 can be realized with two similar structures, shown in Fig. 3.22 and Fig. 3.23. In detail, Fig. 3.22 is used for the compressing procedure of Stage 3, Stage 5 and Stage 7, while Fig. 3.23 represents the process of Stage 4, Stage 6 and Stage 8. The settings of the parameters $x$, $t$, $P$ and $Q$ are given in Eq. 3.11 and Eq. 3.12. For example, when $x$ is set to 5 and $P$ to 11, Fig. 3.22 is the structure of the Stage 5 compressing process. The compressed results from the upper stage are $S_4.L[11:0]$ and $S_4.R[12:0]$, which are the results of the Stage 4 compressing structure obtained by setting $t$ to 4 and $Q$ to 11 in Fig. 3.23. Therefore, the architectures in Fig. 3.22 and Fig. 3.23 are correlated with each other. With the eight-stage compressor tree architecture shown in Fig. 3.20 to Fig. 3.23, all the temporary adders that existed in each stage are removed. Only one adder is used to generate the final $8 \times 8$ SAD value.
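The idea of propagating two compressed words per stage and adding them only once at the end can be illustrated with a word-level carry-save model. This is a simplified sketch, not the structures of Fig. 3.20 to Fig. 3.23: it uses only generic 3-2 compression on whole words, ignores the specific CMPR42/CMPR32 arrangement and bit insertions, and all names (`csa`, `compress`, `sad_8x8`) are assumptions.

```python
def csa(a, b, c):
    """Word-level 3-2 compression: a + b + c == s + k, no carry propagation."""
    s = a ^ b ^ c
    k = ((a & b) | (a & c) | (b & c)) << 1
    return s, k

def compress(operands):
    """Reduce a list of non-negative integers to two words with csa steps."""
    ops = list(operands)
    while len(ops) > 2:
        s, k = csa(ops[0], ops[1], ops[2])
        ops = ops[3:] + [s, k]
    while len(ops) < 2:
        ops.append(0)
    return ops[0], ops[1]

def sad_8x8(cur, ref):
    """8x8 SAD with one carry-propagate addition at the very end."""
    left = right = 0                       # the two words passed between stages
    for cur_row, ref_row in zip(cur, ref):
        operands = [left, right]
        for c, r in zip(cur_row, ref_row):
            diff = (c - r) & 0x1FF         # 9-bit two's complement difference
            m = diff >> 8                  # sign bit, deferred '+1'
            operands += [(diff & 0xFF) ^ (0xFF * m), m]
        left, right = compress(operands)   # one stage of the tree
    return left + right                    # the single final adder
```

Each stage takes the two words from the stage above plus the eight difference values and eight MSBs of its own row, which mirrors the role of the two propagated results per stage in the text.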

### 3.4 Experiments, comparison and analysis

In this section, experiments, comparisons and analysis are carried out on the two proposed flexible architectures. The two structures are discussed as follows.

Firstly, for RSADT architecture, the target specification is set as HDTV 720p@30fps, with IPPP structure. The maximum search range is [-64,+63) in width and [-32,+31) in height with 1 reference frame. Eight parallel RSADT structures are used.

Figure 3.24 shows the clock cycle comparison between the proposed structure and existing ones. Six HDTV 720p format sequences are used and I encode 100 frames with QP = 24 (quantization parameter). The required clock cycles ($req\_clk\_cyc$) for handling each frame are given by Eq. (3.13), where $MB\_num$ represents the number of MBs within one frame (3600 in our case). $H_{homo\_cyc}$, $V_{homo\_cyc}$, $S_{homo\_cyc}$ and $N_{homo\_cyc}$ are the $req\_clk\_cyc$ for handling one $H_{homo}$, $V_{homo}$, $S_{homo}$ and $N_{homo}$ MB respectively. In this dissertation, the clock cycles for loading the reference pixel data of each MB based on the search window reuse algorithm [33] are omitted. So, $H_{homo\_cyc}$, $V_{homo\_cyc}$, $S_{homo\_cyc}$ and $N_{homo\_cyc}$ are 512, 512, 256 and 1024 for the 8-parallel RSADT structure. $H_{homo\_rat}$, $V_{homo\_rat}$, $S_{homo\_rat}$ and $N_{homo\_rat}$ are the ratios of the different MB types within each frame. In the SADT structure [17], $N_{homo\_rat}$ is 1 and all other ratios are 0. The clock cycle saving ($clk\_cyc\_sav$) of each frame is given by Eq. (3.14), where $clk\_cyc\_ori$ is the $req\_clk\_cyc$ of the SADT structure while $clk\_cyc\_our$ is the $req\_clk\_cyc$ of the RSADT architecture. The proposed structure saves 72.75% of the clock cycles on average for a sequence with abundant homogeneous MBs such as crew_720p. In the case of knight-shield_720p and stockholm_720p, the clock saving decreases to 62.78% and 62.46% on average because of the increase of texture MBs in the image. For parkrun_720p, since the images in this sequence contain many high frequency MBs, the ratio of homogeneous MBs decreases considerably, which results in a 42% clock cycle saving. Altogether, our RSADT architecture saves 61.71% of the clock cycles on average while keeping video quality, maintaining data reuse and fully utilizing the hardware.

\[
\text{req\_clk\_cyc} = MB\_num \times (H_{homo\_rat} \times H_{homo\_cyc} + V_{homo\_rat} \times V_{homo\_cyc} + S_{homo\_rat} \times S_{homo\_cyc} + N_{homo\_rat} \times N_{homo\_cyc}) \tag{3.13}
\]

\[
\text{clk\_cyc\_sav} = \frac{\text{clk\_cyc\_ori} - \text{clk\_cyc\_our}}{\text{clk\_cyc\_ori}} \times 100\% \tag{3.14}
\]
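Eq. (3.13) and Eq. (3.14) can be evaluated directly. The following sketch uses the per-MB cycle counts quoted in the text (512, 512, 256 and 1024); the MB-type ratios in the usage note are purely hypothetical and are not measured data from the test sequences.

```python
# Cycles per MB for the 8-parallel RSADT structure, as quoted in the text.
CYCLES = {"H": 512, "V": 512, "S": 256, "N": 1024}

def req_clk_cyc(mb_num, ratios, cycles=CYCLES):
    """Eq. (3.13): required clock cycles per frame."""
    return mb_num * sum(ratios[t] * cycles[t] for t in cycles)

def clk_cyc_sav(clk_cyc_ori, clk_cyc_our):
    """Eq. (3.14): clock cycle saving in percent."""
    return (clk_cyc_ori - clk_cyc_our) / clk_cyc_ori * 100.0

# SADT [17] treats every MB as N_homo.
SADT_RATIOS = {"H": 0.0, "V": 0.0, "S": 0.0, "N": 1.0}
```

For example, with a hypothetical frame in which each of the four MB types accounts for 25% of the 3600 MBs, `req_clk_cyc(3600, {"H": 0.25, "V": 0.25, "S": 0.25, "N": 0.25})` gives 2,073,600 cycles against 3,686,400 for SADT, a saving of 43.75%.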

Additionally, by introducing some control logic for memory data loading and control signals for the PEs, the previous SADT structure [17] can also be modified into an extended version (call it [17]') for the adaptive algorithm. Assume that each set of structure handles 1/8 of the search points within the search window. Table 3.2 compares the extended SADT with the proposed one. The extended SADT can also handle the three search patterns of the adaptive sub-sampling algorithm. However, its pixel data reuse ($pel\_reuse$) and hardware utilization ($HW\_utiliz$) cannot always reach 100%. The $req\_clk\_cyc$ of the extended version can be shortened (from 1024 to 512) when the MB type is $H_{homo}$ or $S_{homo}$. This is because, for these two types, it is possible to add one extra column to the RSA and apply a two-column shift (the 3rd to 18th columns are shifted left to the 1st to 16th columns) directly when a column shift occurs in the snake scan method, so the search point on the right side of the current one can be processed simultaneously. However, in the case of a $V_{homo}$ MB, the $req\_clk\_cyc$ is still 1024, and the extended SADT cannot reduce $req\_clk\_cyc$ to 1/4 of the original cycles in the $S_{homo}$ case. The reason is that the upper and lower pixel rows cannot be skipped under the original memory organization scheme. In my RSADT architecture, since data organization is applied at both the memory level and the architecture level, the $req\_clk\_cyc$ is half of that of [17]' in both the $V_{homo}$ and $S_{homo}$ cases. As for the number of search points within one clock cycle ($SP\_per\_cyc$), the extended SADT can only accomplish block matching point by point for a $V_{homo}$ MB. In the proposed architecture, for example, in the case of an $S_{homo}$ MB, 4 search points are accomplished in one clock cycle, which speeds up the IME process by 4 times. The reduction of the clock cycles is also meaningful for a power aware system. With the MB feature obtained before IME starts, the processing time of the IME engine can be shortened, or the whole engine can be set to idle after finishing its work. In this way, much power is saved for the whole system.

Figure 3.24: Clock saving of HDTV sequences

Table 3.2: Comparison between the extended SADT ([17]') and the proposed RSADT

<table>
<thead>
<tr><th>MB Type</th><th colspan="2">$H_{homo}$</th><th colspan="2">$V_{homo}$</th><th colspan="2">$S_{homo}$</th></tr>
<tr><th>Architecture</th><th>[17]'</th><th>Ours</th><th>[17]'</th><th>Ours</th><th>[17]'</th><th>Ours</th></tr>
</thead>
<tbody>
<tr><td>HW_utiliz</td><td>100%</td><td>100%</td><td>50%</td><td>100%</td><td>50%</td><td>100%</td></tr>
<tr><td>req_clk_cyc</td><td>512</td><td>512</td><td>1024</td><td>512</td><td>512</td><td>256</td></tr>
<tr><td>SP_per_cyc</td><td>2</td><td>2</td><td>1</td><td>2</td><td>2</td><td>4</td></tr>
</tbody>
</table>

As for the synthesis result, I pick 110.5MHz and 200MHz and compare the hardware data with existing works, as shown in Table 3.3. Here, the hardware data is the sum of one set of the architecture and the current MB module (the pixel difference calculation module is included). The hardware cost of my design is smaller than that of the 2-D structure [34] and a little higher than that of the existing SADT or PPSAD architectures. However, the PPSAD architecture is not suitable for large format images because of its poor parallelism. Compared with the SADT and 1-D [35] architectures, since one more pipeline stage is inserted, our design can achieve higher speed (200MHz) than the previous architectures, which results in a higher $PHR$ [20] value. The maximum work frequency is 208MHz under the worst case. Moreover, the proposed architecture is flexible, which distinguishes it from the other, fixed architectures. The RSADT structure can be configured for three sub-sampling patterns and still achieves full data reuse and 100% hardware utilization. In the case of quarter sub-sampling, the processing capability is quadrupled. As for power consumption, the proposed RSADT has the same number of PEs as the original SADT or PPSAD architectures. However, the processing time is greatly shortened in the proposed RSADT structure. So, with normalized processing time, the final power consumption of the proposed architecture is less than that of previous designs such as [35], [27] and [20]. About 38.84% power reduction can be achieved compared with [20] at 200MHz.

Table 3.3: Comparison of RSADT with Previous Designs

<table>
<thead>
<tr><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>
</thead>
<tbody>
<tr><td>Clock (MHz)</td><td>294</td><td>100</td><td>66.7</td><td>110.8</td><td>261</td><td>110.5</td><td>200</td></tr>
<tr><td>Technology (um)</td><td>0.13</td><td>0.18</td><td>0.35</td><td>0.18</td><td>0.18</td><td>0.18</td><td>0.18</td></tr>
<tr><td>PE Number</td><td>16</td><td>256</td><td>256</td><td>256</td><td>256</td><td>256</td><td>256</td></tr>
<tr><td>Area (gates)</td><td>61k</td><td>154k</td><td>79k</td><td>88.6k</td><td>151k</td><td>93.6k</td><td>104.7k</td></tr>
<tr><td>Flexibility</td><td>No</td><td>No</td><td>No</td><td>No</td><td>No</td><td>Yes</td><td>Yes</td></tr>
<tr><td>Power (mW)</td><td>573</td><td>–</td><td>737</td><td>–</td><td>484 @200MHz</td><td>187</td><td>296</td></tr>
</tbody>
</table>
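As a quick check of the quoted power reduction, and assuming that the 484 mW (@200MHz) entry of Table 3.3 belongs to [20] and the 296 mW entry to the proposed RSADT at 200MHz (the column headers of the table are not available here), the figure follows from:

\[
\frac{484 - 296}{484} \times 100\% = \frac{188}{484} \times 100\% \approx 38.84\%
\]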

Secondly, the impact of the parallel APPSAD structure on the IME system is analyzed. The specification is the same as for the RSADT structure. Fifteen bottom pixel rows are added for block matching of the search points on the last row, and one extra pixel row is included for hardware design. The final search window size is 144×80. Figure 3.25 is the system block diagram of the IME engine based on the APPSAD architecture. Eight parallel APPSAD structures are used and only one reference frame is adopted. In fact, for an RSADT based system, it is only necessary to replace the eight APPSAD structures in Fig. 3.25 with eight parallel RSADT structures. 110.5MHz and 150MHz are picked as the two synthesis frequency points and the results are given in Table 3.4. Compared with the MRPPSAD architecture, the hardware cost of APPSAD is increased by 2.47% for a single PE array with the Cur.MB part and by 5.37% for the whole IME engine. Compared with the full mode PPSAD and SAD Tree architectures, our design is still superior in hardware cost because of the mode reduction method. Here, the compressor tree based circuit level optimization is not adopted. As for the whole IME engine at a 110.5MHz work frequency, our design incurs 25k additional gates, mainly because of the pixel assemble module and the extra control logic.

**Figure 3.25: IME block diagram with APPSAD architecture**

**Table 3.4: Comparison of APPSAD with Previous Designs**

<table>
<thead>
<tr><th></th><th></th><th></th><th></th><th></th><th></th></tr>
</thead>
<tbody>
<tr><td>Technology</td><td>0.18um</td><td>0.18um</td><td>0.18um</td><td>0.18um</td><td>0.18um</td></tr>
<tr><td>Frequency</td><td>110.8MHz</td><td>110.8MHz</td><td>110.5MHz</td><td>110.5MHz</td><td>150MHz</td></tr>
<tr><td>PE Array &amp; Cur.MB</td><td>88.6k</td><td>81.5k</td><td>68.7k</td><td>70.4k</td><td>73.3k</td></tr>
<tr><td>Whole Engine</td><td>-</td><td>-</td><td>465k</td><td>490k</td><td>509k</td></tr>
<tr><td>Optimized</td><td>-</td><td>-</td><td>-</td><td>481k</td><td>498k</td></tr>
<tr><td>Flexibility</td><td>No</td><td>No</td><td>No</td><td>Yes</td><td>Yes</td></tr>
</tbody>
</table>

Thirdly, I apply the compressor tree based circuit optimization to the APPSAD architecture. For one single 8×8 PE array, I synthesize it at several frequency points. As shown in Fig. 3.26, by adopting mode reduction in the original design ([17]+MR), when the frequency is less than 160MHz, about 24.4% of the hardware of one 8×8 PE array in [17]'s PPSAD is saved. However, the saving decreases greatly as the frequency increases, for example to 180MHz and 200MHz. By further applying circuit optimization ([17]+MR+Opt), the proposed structure is superior to [17] even at high frequency points. On average, about 26.9% of the hardware can be saved. The optimized result of the whole engine, which consists of 8 sets of the APPSAD architecture, is shown in the second last line of Table 3.4. About 9k and 11k gates can be saved for the 8 parallel APPSAD architectures at the 110.5MHz and 150MHz frequency points respectively. So the overall hardware increase of the whole IME engine is reduced to only 3.44% compared with the mode reduction based PPSAD architecture (MRPPSAD).

Figure 3.26: Hardware cost saving of 8x8 PE array
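The overhead percentages quoted for Table 3.4 can be reproduced from the gate counts at 110.5MHz, assuming that the 68.7k / 465k column is the MRPPSAD design and the 70.4k / 490k / 481k column is the proposed APPSAD (the column headers of the table are not available here):

\[
\begin{aligned}
\text{PE array overhead} &= \frac{70.4 - 68.7}{68.7} \times 100\% \approx 2.47\% \\
\text{Whole engine overhead} &= \frac{490 - 465}{465} \times 100\% \approx 5.38\% \\
\text{Whole engine overhead after optimization} &= \frac{481 - 465}{465} \times 100\% \approx 3.44\%
\end{aligned}
\]

The second value differs from the quoted 5.37% only in the last digit, which is presumably due to rounding of the gate counts listed in the table.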

Fourthly, the power dissipation of the proposed structure and the original design is analyzed. The power of one 8×8 PE array is given in Fig. 3.27. Since the mode reduction technique removes many redundant registers and the proposed circuit optimization discards all unnecessary adders, the whole 8×8 PE array achieves an 11.7% power saving on average. Additionally, I pick two typical HDTV 720p format sequences to test the gate level power consumption of the whole system. Figure 3.28 shows the power consumption comparison between the proposed work and the previous design. In order to make a clear comparison, no speed-up algorithm such as coarse-to-fine search is adopted. The power consumption of the SRAM is shown individually besides the power dissipation of the whole IME. Since the reference pixel data is rearranged into two memory groups and only one memory group is enabled in the $H_{homo}$ and $S_{homo}$ cases, the overall memory power consumption is lower than that of the previous design, which uses all the memory bars. About 11.6% and 24.9% of the power consumption in the memory part can be saved for stockholm_720p and crew_720p. Apart from memory, the adaptive architecture can also adjust itself to MBs with different homogeneous features, which reduces power consumption at the architecture level. Overall, 25.4% and 39.8% of the power dissipation is saved for the stockholm_720p and crew_720p sequences.

Figure 3.27: Power dissipation of 8x8 PE array

Figure 3.28: Power consumption comparison

Finally, in the proposed architectures, adaptive sub-sampling patterns are used for MBs with different homogeneity, so the complexity reduction is at the matching pattern level. Thus, the proposed scheme and architectures can be combined with other low complexity schemes. For example, the proposed hardware oriented algorithm is orthogonal to hardware algorithms such as the coarse-to-fine search in [36] and the frame-parallel scheme in [37], and it can be combined with all zero block and skip mode early detection schemes [38] to further reduce complexity. In exhaustive experiments on sequences with different formats, the proposed flexible architectures achieve a 53.8% reduction in power dissipation on average.

3.5 Conclusion remarks

In this chapter, one PDA algorithm is proposed and three sub-sampling patterns are used adaptively for different MB types. To efficiently realize the adaptive algorithm, two related reconfigurable structures, namely RSADT and APPSAD, are proposed. Because the data flows of the SAD Tree and PPSAD architectures differ, the proposed structures are optimized in different ways. Firstly, for the RSADT structure, with structure level and memory level organization, the proposed architecture saves 61.71% of the processing time on average with full data reuse and hardware utilization. Under normalized processing time, compared with the previous efficient SAD Tree design, the proposed RSADT structure achieves up to 38.84% reduction in power at 200MHz. Secondly, for APPSAD, four different processing elements are introduced for the control of the adaptive sub-sampling schemes. The interactive data loading scheme keeps full data reuse and achieves 11.6% and 24.9% reduction in memory power consumption. Moreover, one eight-stage circuit optimization is proposed for the APPSAD structure, which further reduces hardware cost and power consumption. When eight parallel APPSAD structures are applied in the IME engine, with circuit optimization, the overall power saving for typical HDTV720p sequences is up to 39.8%. On average, about 53.8% power reduction can be achieved among different sequences.
Chapter 4

Low design effort VLSI engine for super high-vision application

4.1 Introduction

With the increasing demand for high video quality and large image size, the throughput issue of realizing a real-time encoding process in ASIC design is greatly intensified. For an H.264/AVC based high complexity system, besides the IME engine, the FME and intra engines are two further important parts, which occupy two separate pipeline stages. In this chapter, solutions for the FME and intra parts for large image sizes such as 4k×4k and 4k×2k are presented. A brief introduction of the FME part, the intra engine and the impact of image size is given first.

Fractional motion estimation (FME), which is implemented with half and quarter pixel refinements, contributes a lot to the video quality. As analyzed by [39], discarding FME causes a 2-6 dB PSNR loss. With the FME part, the inevitable aliasing problem [21] is greatly compensated. However, the new technique also brings about a complexity problem which makes it unfavorable for hardware design. As analyzed by [6], the FME part occupies almost 40% of the computation, which is the second largest share.

In the hardware field, the complexity problem is directly related to the throughput issue, which makes pipeline stages a must for real-time processing. In [18], a 4-stage real-time encoder is presented, which arranges the FME engine in a single stage [40] [18]. The maximum specification of [18] is the HDTV720p format. In [19] [41], one 3-stage HDTV 1080p encoder is designed and FME also occupies one single stage. To reduce the complexity of the FME engine, one fast FME engine is proposed in [42], which saves 40% of the hardware cost and 14% of the searching time. However, the video quality loss of [42] is larger than that of previous designs because of its very few searching points, which means that the aliasing problem cannot be compensated well in [42]. Moreover, even though the searching points and processing units are reduced in [42], it still obeys the 'first-half, then-quarter' pixel refinement procedure, which is the basic processing flow in the H.264/AVC standard [2]. So, it still has long processing cycles, which is unfavorable for higher design specifications such as HDTV1080p.

Figure 4.1: Spectrum comparison of HDTV1080p with SHV

In 2006, the Japanese broadcaster NHK put forward the concept of Super Hi-Vision (SHV). A real SHV image is captured by a special camera which can provide features such as 7680×4320@60 fps and a 4:4:4 luminance to chrominance ratio. With a high sensitivity video sensor, the noise generated during image capture is increased, which further intensifies the aliasing problem and increases the importance of the FME process. Figure 4.1 shows the spectrum comparison with conventional HDTV1080p sequences. Here, I use the SHV test sequence 'Sakura_tree', which is provided by NHK. The high frequency components are much more abundant in SHV clips than in the HDTV1080p case.

Under current processing technology, it is impossible to handle a raw SHV image with a single encoder. It is possible to divide one 8k×4k image into 2k×1k images and use 16 HDTV1080p encoders [19] to achieve real-time processing. However, this approach causes boundary effects in the reconstructed image because of the 3 horizontal and 3 vertical boundaries among the 16 HDTV1080p blocks. Moreover, [19] targets the baseline profile, where only forward prediction of P frames is involved. When it is extended to the main profile for higher compression capability, the advent of B frames, which involve forward and backward prediction, doubles the processing cycles. So, the design effort is greatly increased. Here, the design effort is defined as the minimum required frequency ($Min\_Freq$) of the engine, as shown in Eq. 4.1. $cyc\_per\_MB$ is the number of processing clock cycles required for one MB and $fps$ is the number of frames to be encoded per second. $frm\_width$ and $frm\_height$ are the width and height of each frame. In fact, Eq. 4.1 is a direct reflection of the throughput issue in hardware. Thus, when handling main profile 4k×4k@60fps, the $cyc\_per\_MB$ will be three times that of the baseline profile (forward, backward and 1 iteration of bi-prediction) and the final $Min\_Freq$ for [19] on the SHV specification will be 9.33GHz. In existing works [18] [19], the work speed of such designs is always restricted to within 200MHz. This is because a higher frequency not only causes higher power dissipation but also makes it difficult for synthesis tools to generate the net-list. In [43], one frame-parallel main profile encoder is proposed. The design effort is greatly decreased when handling the HDTV720p@30fps case. However, when this scheme is directly extended to 4k×4k@60fps, even if the AMPD2 algorithm [40] is adopted and only forward and backward prediction are considered, the proposed FME engine in [43] will still need a 5.18GHz work frequency to fulfill the throughput requirement.

$$Min\_Freq = cyc\_per\_MB \times fps \times \frac{frm\_width \times frm\_height}{256} \tag{4.1}$$

Another important part is the low design effort intra prediction engine. For bit rate reduction, temporal prediction has a strong impact on the final bit stream and much work has been done to reduce the complexity of motion estimation. As for image quality, intra frames play a more important role. With more intra frames in the encoding structure, the video quality is obviously improved. Thus, many researchers still focus on the refinement of intra prediction in both software [22] [44] [45] and hardware [46] [47]. In [22], the edge gradient is utilized to filter out unpromising modes and about 60% of the intra frame encoding time can be saved. Literature [44] uses both entropy and edge information for further reduction of candidate modes. The improvement to [45] is about 8% on average. In [46], a fast intra prediction algorithm is achieved by analyzing the dominant edge strength and one dedicated VLSI engine is designed. Literature [47] presents a whole intra engine which supports all prediction modes.

In [31], one four-stage real-time encoder is designed and the intra prediction (IP) engine is arranged separately in one single stage. For the whole IP engine, the most significant part is the intra predictor generation. As listed in [47], for one 4×4 sub-block, a total of 30 cycles is required for generating the predictors of all intra 4×4 prediction modes (I4MB) and 10 cycles for the intra 16×16 modes (I16MB). Since sixteen 4×4 sub-blocks exist in one MB, the total is around 640 cycles. Although fast algorithms such as [22] [44] can reduce the candidate intra modes to some extent, full support of all modes in hardware is a must to keep the video quality. In the worst case, all the prediction modes are required by the system. In [46], the fast algorithm is implemented in hardware and serves as a pre-process for the IP engine. However, no optimization of intra predictor generation is mentioned. For example, when the remaining candidate 4×4 modes are DC, mode4, mode5 and mode6, and the candidate 16×16 modes are mode3 and DC, 384 cycles are still required for generating all these intra predictors within one MB. Moreover, the minimum required frequency (Req. Freq) for predictor generation determines the design effort of the whole engine. According to Eq. 4.1, when the specification is extended to Full HD (1080p) or 4k×2k@60fps, the existing sequential generation method in [47] causes an extremely high design effort (1.24GHz), which is impossible to accomplish.
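The design effort figures in this section follow directly from Eq. 4.1. The sketch below is a minimal evaluation of that formula; the function name `min_freq` is illustrative, and the frame size 3840×2160 used for "4k×2k" is an assumption, chosen because it reproduces the 1.24GHz figure quoted above from 640 cycles per MB.

```python
def min_freq(cyc_per_mb, fps, frm_width, frm_height):
    """Eq. 4.1: minimum required frequency in Hz (one MB is 16x16 = 256 pixels)."""
    return cyc_per_mb * fps * frm_width * frm_height / 256

# Sequential intra predictor generation: sixteen 4x4 sub-blocks per MB,
# each needing 30 cycles (I4MB) plus 10 cycles (I16MB).
INTRA_CYC_PER_MB = 16 * (30 + 10)   # 640 cycles

# Assumed 4k x 2k frame size of 3840 x 2160 at 60 fps.
intra_4k2k = min_freq(INTRA_CYC_PER_MB, 60, 3840, 2160)   # 1.24416e9 Hz
```

Under the same formula, a design limited to roughly 200MHz at this frame size and rate would have to process each MB in about 100 cycles, which indicates how far the sequential method is from the target.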

In this chapter, solutions for low design effort FME and intra engines are presented. For the FME engine, I fully utilize the existing techniques and contribute one main profile FME engine for SHV 4k×4k@60fps. Firstly, based on the existing works, two algorithms, namely the mode reduction based mode pre-filtering scheme and the motion cost oriented directional one-pass scheme, are proposed to reduce design effort and hardware cost. Secondly, two parallel schemes, called 16-pixel (16-Pel) based processing and the MB-parallel scheme, are proposed to enhance the performance. Thirdly, to save memory access, a unified pixel block loading scheme is proposed and a memory organization is applied to the MB-parallel scheme. As for the intra engine, one low design effort intra predictor generation engine is presented. Firstly, by analyzing the data dependency among 4×4 sub-blocks, one 2-block parallel processing flow is proposed. Compared with the original 1-block sequential flow, about 37.5% of the processing time can be saved. Secondly, for the predictor generation structure, one dedicated fully utilized hardware architecture is proposed, which simultaneously generates the predictors of all the I4MB and I16MB modes. So, the number of processing cycles for each 4×4 sub-block is further reduced. The details of these two parts are described in the following sections.
4.2 Low complexity fractional motion estimation algorithm

4.2.1 Mode reduction based mode pre-filtering scheme

The conventional FME engine consists of two major steps, that is, interpolation and SATD (sum of absolute transformed differences) calculation. Reference [40] first presents a high data reuse FME architecture. Based on the 1-D 6-tap FIR and the 4×4 based processing unit (PU), the interpolation and SATD operations are executed simultaneously, and the FME process can be finished in about 1600 cycles. Since the interpolation is executed twice because of half and quarter pixel refinement, the total number of cycles for processing one MB is quite large. Moreover, the exhaustive operation among all the inter modes also deteriorates its performance when extended to large image sizes. To solve the problem, an optimized advanced mode pre-decision (AMPD2) algorithm is presented in [40]. However, this optimization still has a throughput problem for SHV applications.

In the SHV case, the impact of small inter modes is very limited. In my work, I adopt the mode reduction technique and discard refinement of inter modes below 8×8. Although removing small inter modes causes some quality loss, the computation saving is significant: about 51.92% of the interpolation cycles can be saved when only modes of 8×8 and above are considered. The reduction in processing cycles also lowers the design effort. For example, when no extra proposals and algorithms are adopted, a full mode FME engine based on [40] results in 19.40GHz design effort, while introducing the mode reduction scheme into the SHV FME engine reduces the design effort to 9.33GHz. The quality comparison of encoding with full modes and with modes of 8×8 and above is shown in Fig.4.2. It shows that the mode reduction (mr) technique causes negligible quality loss compared with the full mode (fm) case. Thus, the mode reduction based mode pre-filtering (MRMPF) algorithm is proposed, as shown in Fig.4.3. After IME on the 16×16, 16×8, 8×16 and 8×8 modes, the 9 integer motion vectors (IMVs) are grouped into four MB-level modes and the integer motion cost (IMC) of each mode is checked. The IMCs of the modes below 16×16 are given in Eq.4.2 to Eq.4.4. During the FME stage, I only focus on the first two modes whose IMCs are smaller than

Figure 4.2: Impact of mode reduction on SHV

those of the other two modes (for example, the 16×8 and 8×8 modes of Fig.4.3 based on Eq.4.5). So, in the worst case, a 48% clock cycle saving can be achieved compared with the AMPD2 algorithm in [40].

\[ IMC_{m2} = IMC_{m2,blk0} + IMC_{m2,blk1} \] (4.2)

\[ IMC_{m3} = IMC_{m3,blk0} + IMC_{m3,blk1} \] (4.3)

\[ IMC_{m4} = \sum_{i=0}^{3} IMC_{m4,blk_i} \] (4.4)

\[ IMC_{m2} < IMC_{m4} < IMC_{m1} < IMC_{m3} \] (4.5)
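The pre-filtering step can be sketched in a few lines. This is a minimal illustration, assuming that m1 to m4 denote the 16×16, 16×8, 8×16 and 8×8 modes (as suggested by the 16×8 and 8×8 example in the text); the function names and the sample cost values are hypothetical.

```python
def mode_costs(imc_16x16, imc_16x8, imc_8x16, imc_8x8):
    """Merge per-block integer motion costs into one cost per mode."""
    return {
        "16x16": imc_16x16,      # m1: single block
        "16x8": sum(imc_16x8),   # m2: two blocks
        "8x16": sum(imc_8x16),   # m3: two blocks
        "8x8": sum(imc_8x8),     # m4: four blocks
    }


def mrmpf_select(costs, keep=2):
    """Keep the `keep` modes with the smallest integer motion cost."""
    return sorted(costs, key=costs.get)[:keep]


# Number of IMVs that each mode contributes to the FME stage.
IMVS_PER_MODE = {"16x16": 1, "16x8": 2, "8x16": 2, "8x8": 4}

# Made-up costs chosen so that m2 < m4 < m1 < m3.
costs = mode_costs(1000, (400, 450), (600, 500), (200, 250, 220, 230))
selected = mrmpf_select(costs)
imv_count = sum(IMVS_PER_MODE[m] for m in selected)

print(selected, imv_count)
```

With two surviving modes, the number of IMVs passed to FME ranges from 3 (16×16 plus one two-block mode) to 6 (8×8 plus one two-block mode).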

4.2.2 Motion cost oriented directional one-pass scheme

Although the MRMPF scheme shortens the clock cycles of the FME process, the optimization is far from enough considering the specification of SHV. Both the AMPD2 and MRMPF algorithms follow the ‘first half, then quarter refinement’ flow, which applies interpolation twice for one MB.

Reference [48] presents a one-pass algorithm which handles half pixel and quarter pixel interpolation simultaneously, so 50% of the processing time is saved. However, the number of PUs in one set of FME engine is increased from 9 to 25, which results in a surge of hardware cost. For the SHV case, the adoption of multiple sets of engines is a must because of the high throughput requirement. In the case of 4 parallel sets, the required PU number will be 100 based on the algorithm in [48]. In this dissertation, I fully exploit the information of neighboring integer pixels and propose a motion cost oriented directional one-pass scheme (MCDOP).

Figure 4.4: Motion Cost Oriented One-pass Scheme

The proposed scheme is shown in Fig. 4.4. Before the interpolation is executed on the best integer point (BIP), I analyze the IMCs of the BIP’s neighbors (IMC_1 to IMC_8). As shown in Eq. 4.6 to Eq. 4.9, I calculate the sum of three IMCs for each of the four corners and obtain the left-up IMC (IMC_LU), right-up IMC (IMC_RU), bottom-left IMC (IMC_BL) and bottom-right IMC (IMC_BR), respectively. The moving window, which consists of the candidate search points, is selected based on this motion cost analysis. For example, as shown in Fig. 4.4, when IMC_RU is the minimum one, the search points within the red broken lines become the candidate points. In the case of IMC_BR, the points within the black solid lines are the candidate ones. Since I use the integer motion cost to decide the moving window, and the half and quarter pixel refinements are handled simultaneously, the proposed algorithm is a motion cost based directional one-pass scheme. Moreover, compared with the original scheme in [48], which always focuses on the central 25 search points, the required number of processing units is reduced from 25 to 16 based on the motion cost feature. So, the hardware cost of one set of engine is reduced by 36%.

\[ IMC_{LU} = IMC_1 + IMC_2 + IMC_4 \] (4.6)

\[ IMC_{RU} = IMC_2 + IMC_3 + IMC_5 \] (4.7)

\[ IMC_{BL} = IMC_4 + IMC_6 + IMC_7 \] (4.8)

\[ IMC_{BR} = IMC_5 + IMC_7 + IMC_8 \] (4.9)
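Eq. 4.6 to Eq. 4.9 and the window decision can be sketched as follows. The neighbor numbering (1, 2, 3 on the top row, 4 and 5 to the left and right of the BIP, 6, 7, 8 on the bottom row) is an assumption inferred from the equations, and the function names and sample costs are hypothetical. The exact fractional search points inside each window are defined by Fig. 4.4 and are not modelled here.

```python
def corner_costs(imc):
    """Corner sums of Eq. 4.6 to Eq. 4.9.

    `imc` maps neighbor index 1..8 to its integer motion cost, with the
    assumed layout
        1 2 3
        4 B 5      (B = best integer point)
        6 7 8
    """
    return {
        "LU": imc[1] + imc[2] + imc[4],
        "RU": imc[2] + imc[3] + imc[5],
        "BL": imc[4] + imc[6] + imc[7],
        "BR": imc[5] + imc[7] + imc[8],
    }


def select_window(imc):
    """Return the corner whose summed cost is minimal."""
    costs = corner_costs(imc)
    return min(costs, key=costs.get)


# Made-up costs: motion is cheapest toward the upper right.
imc = {1: 50, 2: 40, 3: 30, 4: 60, 5: 35, 6: 70, 7: 65, 8: 55}
direction = select_window(imc)

# PU saving of a 16-point window against the 25 points of [48].
pu_saving = 1 - 16 / 25

print(corner_costs(imc), direction, pu_saving)
```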

Moreover, to further reduce the hardware, a 1/4 sub-sampling technique is introduced in the proposed hardware design. With quarter sub-sampling, SATD generation is executed with an interval of 1 pixel both horizontally and vertically. So, for each processing unit (PU), the hardware cost is reduced by 75%. The detailed description is given in section 4.3.1.

4.2.3 Overall hybrid schemes

The pseudo codes of the JM FME algorithm and my proposed low complexity algorithm are shown in Fig. 4.5. I use JM version 11.0, and the modifications to the JM algorithm are marked in italic font. The parts with broken lines represent the MRMPF and MCDOP schemes. Firstly, the MRMPF scheme reduces the number of IMVs for the FME process. Instead of looping over all 41 MVs, the proposed scheme keeps the IMV number between 3 and 6. Secondly, the original two-step refinement is replaced with the MCDOP scheme, so the half pixel and quarter pixel are constructed simultaneously. The two-step refinement becomes a one-pass flow, which saves 50% of the clock cycles. Moreover, the cost oriented adaptive window selection reduces the number of PUs by 36%, and the 1/4 sub-sampling scheme further achieves a 75% hardware cost reduction for each PU. The quality comparison of my algorithm with the original JM one is given in Fig. 4.6. It shows that the modification to the original JM full mode (fm) algorithm causes 0.2 dB quality loss for the SHV Sakura tree clip. In the case of the SHV clip of Bees, the quality loss is negligible. The merit of the proposed algorithm is that it reduces the original long processing time [40] from 1664 to 224 cycles. When no hardware parallel schemes are adopted, the proposed algorithm decreases the design effort from 19.40GHz to 2.61GHz. Although complexity reduction can be achieved in algorithm
4.3 Architecture level parallel improved schemes

4.3.1 Parallel improved 16-Pel processing

From the previous sections, the design effort of the SHV FME engine is reduced to 2.61GHz. In order to achieve a hardware engine with reasonable design effort, parallel processing is required at the architecture level. In the previous designs, all the processing units and the interpolation engine are 4×4 based. In this dissertation, I propose a 16-Pel parallel interpolation and SATD calculation, as shown in Fig. 4.7. The 16×8 and 16×16 cases are just an extension of the previous 4×4 interpolation process. In each clock cycle, one row containing 22 pixels is loaded. Altogether, the required pixels for the 16×8 and 16×16 modes are 22×14 and 22×22, respectively. For the 8×8 and 8×16 cases, I handle the interpolation of two 8-Pel-width blocks simultaneously. Although the interpolation process is the same as for the other two modes, more pixels are required for these two modes. The reason is that the motion vectors of the two blocks are discontinuous. In the worst case, if no pixel overlapping exists between the two sub-blocks, the maximum number of pixels required for parallel interpolation of 8×8_blk0 and 8×8_blk1 is 28×14. For parallel processing of 8×16_blk0 and 8×16_blk1, 28×22 pixels are required in the worst case.
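The window sizes above follow one simple rule, sketched below. The 6-pixel margin per dimension is an assumption inferred from the quoted figures (16 → 22, 8 → 14); the function name is hypothetical.

```python
FILTER_MARGIN = 6  # extra pixels per dimension, inferred from 16 -> 22 and 8 -> 14


def load_window(block_w, block_h, parallel_blocks=1):
    """Worst-case reference pixels (width, height) loaded for interpolation.

    `parallel_blocks` > 1 models side-by-side blocks whose motion vectors
    are discontinuous, so their windows do not overlap at all.
    """
    return (parallel_blocks * (block_w + FILTER_MARGIN), block_h + FILTER_MARGIN)


windows = {
    "16x8": load_window(16, 8),
    "16x16": load_window(16, 16),
    "8x8 x2": load_window(8, 8, parallel_blocks=2),
    "8x16 x2": load_window(8, 16, parallel_blocks=2),
}
print(windows)
```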

Once pixels are loaded from SRAM row by row, the required pixels are distributed to two 8×8 based parallel interpolation engines. For example, as shown in Fig. 4.7(a) and (c), the 14 pixels with diagonal lines are distributed to one 8×8 based parallel engine and the other 14 pixels in grey are for the second engine. As for Fig. 4.7(b) and (d), since the two MVs in each sub-block are continuous, the 22 pixels are divided into two 14-Pel parts that overlap in the central 6 pixels (the 1st to 14th pixels are for the 1st engine and the 9th to 22nd pixels are for the 2nd engine). So, the overlapped pixels are reused in the case of Fig. 4.7(b) and (d). In fact, for Fig. 4.7(a) and (c), I also propose a unified pixel block loading scheme to enable pixel reuse if the two 8×8 MVs are close to each other. The detailed description is given in section 4.3.3.
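The distribution of one 22-pixel row to the two engines in the continuous-MV case can be written out directly. This is only a sketch of the index arithmetic; the function name is hypothetical.

```python
def split_row(row):
    """Split a 22-pixel row into two 14-pixel parts with a 6-pixel overlap.

    Using 1-based positions as in the text: pixels 1..14 go to the first
    engine and pixels 9..22 go to the second engine.
    """
    if len(row) != 22:
        raise ValueError("expected a 22-pixel row")
    first = row[0:14]    # 1st to 14th pixel
    second = row[8:22]   # 9th to 22nd pixel
    shared = row[8:14]   # 9th to 14th pixel, reused by both engines
    return first, second, shared


row = list(range(1, 23))  # positions 1..22 stand in for pixel values
first, second, shared = split_row(row)
print(len(first), len(second), shared)
```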

In the next stage, the interpolated pixels are propagated to the PU groups for SATD calculation. Since the interpolated pixels are 16 pixels wide, the SATD calculation should also be 16 pixels wide for parallel processing. In that case, two 8×8 sized PUs are required for handling the SATD calculation of the two parallel engines in Fig. 4.7. Considering the MCDOP algorithm, the total hardware cost of the PU group would increase significantly. Thus, as shown in Fig. 4.7(e), I apply the quarter (1/4) sub-sampling technique to each 8×8

sub-block. In detail, one pixel within the $8 \times 8$ sub-block is used to represent itself and its three neighboring pixels. Compared with the original $8 \times 8$ sized PU, the hardware cost of the sub-sampled PU is reduced by 75%. In all, by introducing 16-Pel parallel interpolation and SATD calculation, the minimum required frequency is reduced to 290MHz.
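A software model of the sub-sampled SATD may make the 75% figure concrete. The sketch below assumes that the retained pixel is the top-left one of each 2×2 group and that a 4×4 Hadamard transform is applied to the retained samples without normalization; neither detail is stated in the text, and the function names are hypothetical.

```python
# 4x4 Hadamard matrix (symmetric, so H equals its transpose).
H4 = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def matmul4(a, b):
    """Plain 4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def subsample(diff8x8):
    """Keep one pixel out of every 2x2 group (assumed phase: top-left)."""
    return [[diff8x8[2 * y][2 * x] for x in range(4)] for y in range(4)]


def satd4x4(diff4x4):
    """Sum of absolute Hadamard-transformed differences, unnormalized."""
    t = matmul4(matmul4(H4, diff4x4), H4)
    return sum(abs(v) for row in t for v in row)


def subsampled_satd(diff8x8):
    return satd4x4(subsample(diff8x8))


flat = [[1] * 8 for _ in range(8)]   # constant residual of 1
zero = [[0] * 8 for _ in range(8)]
pixel_saving = 1 - 16 / 64           # 16 of 64 residuals are processed

print(subsampled_satd(flat), subsampled_satd(zero), pixel_saving)
```

For a constant residual only the DC coefficient survives, which gives a quick sanity check of the transform.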

4.3.2 MB-parallel schedule

In [43], a frame-parallel scheme is introduced and the same reference frame is reused for interpolation. In my design, the IBBP encoding structure is also adopted and I further improve the parallelism by applying MB-parallel processing. As shown in Fig. 4.8, after loading the Ref_0 pixels, I apply MRMPF to obtain the candidate modes for the current MB on three frames (B0, B1, and P frame). Also, to save memory access, I apply the unified pixel block (UPB) loading scheme to analyze the IMVs of the candidate modes. The detailed discussion of the UPB scheme is in the next section. After that, the same reference SRAM is used for parallel interpolation of the current MB on the B0, B1, and P frames simultaneously. So, the processing time is greatly shortened compared with the original data flow, which executes each interpolation sequentially. In my design, no bi-directional prediction is used and the final required frequency is only 145MHz for Super Hi-Vision 4k×4k@60fps in main profile. The detailed pixel assignment for the three interpolation (IP) modules is shown in the bottom right part of Fig. 4.8. The required pixels for the P frame IP module are loaded from the reference SRAM directly. As for the IP modules of the B0 and B1 frames, their required pixels are loaded together from a unified pixel block (UPB), which is proposed to save redundant memory access. Moreover, since the pixels for the UPB and for the P frame interpolation are loaded from the same SRAM, memory access conflicts may occur. So, a parity pixel organization scheme is proposed in this dissertation to solve this problem. The details are discussed in the following two sections.

4.3.3 Unified pixel block loading

In section 4.3.1, it is clear that 28 different pixels are needed in each cycle under worst case for $8 \times 8$ and $8 \times 16$ modes. In fact, the worst case rarely happens due to the continuity

Figure 4.8: MB parallel processing schedule

of motion. In order to avoid multiple access to the same pixels, I propose a unified pixel block (UPB) loading scheme for current MBs’ processing on two B frames.

Figure 4.9 shows the proposed UPB scheme. There are two types of situations, called inner block overlapping (IBO) and cross mode overlapping (CMO). In the IBO case, the redundant pixels only exist between the required pixels of two blocks, as shown with red broken lines on the left side of Fig. 4.9(a). For the CMO case, I assume that the first candidate modes for B0 and B1 are $8 \times 8$ and $16 \times 8$, respectively. So, one $16 \times 8$ block on the B1 frame and two $8 \times 8$ blocks on the B0 frame are interpolated simultaneously based on the 16-Pel interpolation and MB-parallel schemes in sections 4.3.1 and 4.3.2. As shown in Fig. 4.9(b), many pixels are overlapped among $8 \times 8_{\text{blk0}}$, $8 \times 8_{\text{blk1}}$ and $16 \times 8_{\text{blk0}}$. In fact, the overlapping situation can occur among all kinds of modes (from $16 \times 16$ modes to $8 \times 8$ modes) on both P and B frames. In order to simplify the implementation, only the candidate IMVs (the remaining IMVs after MRMPF) for the B0 and B1 frames after loading of the ref$_0$ SRAM are analyzed. Then the UPB is generated and pixel rows are propagated to two parallel 16-Pel scale engines for processing of the B0 and B1 frames. The blue broken lines on the right side of Fig. 4.9 are an example of the final UPB region. It is true that the scheme loads some redundant pixels, as shown by the white part within the blue broken lines in Fig. 4.9(a) and Fig. 4.9(b). However, from the experimental results, the proportion of


(a) Inner block overlapping


(b) Cross mode overlapping

Figure 4.9: Unified pixel block loading scheme

extra access is trivial compared with the memory access saved.
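The trade-off between redundant and saved accesses can be illustrated with a small model. It assumes that the UPB is the bounding rectangle of all required pixel windows, which matches the description of white (unneeded) pixels inside the UPB region but is not stated explicitly; the rectangle format, function names and sample positions are hypothetical.

```python
def bounding_block(rects):
    """Bounding rectangle (x, y, w, h) of rectangles given as (x, y, w, h)."""
    x0 = min(x for x, _, _, _ in rects)
    y0 = min(y for _, y, _, _ in rects)
    x1 = max(x + w for x, _, w, _ in rects)
    y1 = max(y + h for _, y, _, h in rects)
    return (x0, y0, x1 - x0, y1 - y0)


def access_counts(rects):
    """Pixel loads when fetching separately, via the UPB, and strictly needed."""
    separate = sum(w * h for _, _, w, h in rects)
    needed = len({(x + dx, y + dy)
                  for x, y, w, h in rects
                  for dx in range(w) for dy in range(h)})
    _, _, bw, bh = bounding_block(rects)
    return {"separate": separate, "upb": bw * bh, "needed": needed}


# Two 14x14 windows of neighboring 8x8 blocks whose MVs are close.
windows = [(0, 0, 14, 14), (10, 1, 14, 14)]
upb = bounding_block(windows)
counts = access_counts(windows)
print(upb, counts)
```

In this made-up case the UPB loads 20 pixels that neither block needs, yet still loads fewer pixels than fetching the two windows separately.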

4.3.4 Parity pixel organization for parallel processing

In section 4.3.2, I propose an MB-parallel schedule to shorten the overall processing cycles. However, the simultaneous access of three engines inevitably causes memory access conflicts.

Since interpolations on B0 and B1 frames obtain the pixel row from the UPB, the access problem only happens between interpolations of B frame and P frame. Assume that the reference pixel memory contains $n$ 1-Pel width memory bars. When the interpolations
4.4 Low design effort architecture for H.264/AVC intra predictor generation

4.4.1 Parallel processing flow for intra predictor generation

The generation of intra predictors in hardware incurs long processing cycles for the whole system, which is the main reason for performance degradation. In [47], the whole intra predictor generation is based on the $4\times4$ sub-block scale. As shown in Fig.4.11,

one 16×16 MB is separated into sixteen 4×4 sub-blocks. The processing flow follows the raster scan order because of the data dependency problem. For example, when handling 4×4 blk1 in Fig. 4.11, all the required pixels (M, I, J, K, L, A, B, C, D, E, F, G, H) are already available. All nine intra 4×4 modes (m0 to m8) use these pixels to generate their corresponding sixteen predictors of 4×4 blk1. The generation process is sequential and each mode costs 4 clock cycles. When the processing of blk1 is finished, the whole system turns to blk2 as the next 4×4 sub-block. The required pixels for blk2 (D, I1, J1, K1, L1, E, F, G, H, E1, F1, G1 and H1 in Fig. 4.11) are then available because the best intra 4×4 mode for blk1 has already been decided and the vertical pixels (I1, J1, K1, L1) are the pixels reconstructed with that best mode. There are some bubbles between the handling of blk1 and blk2 due to the decision of the best mode and the reconstruction of pixels. In [47], the bubble period is fully utilized by inserting the predictor generation of the intra 16×16 modes. As shown in the left part of Fig. 4.11, the predictor generation of the I16MB modes is also organized on the scale of 4×4 sub-blocks, which means that the predictors of each intra 16×16 mode are obtained in 16 separate stages. I use circles marked with numbers to indicate each stage. Each stage of the I16MB modes is arranged between the already processed 4×4 sub-block and the next unprocessed one. The whole processing flow follows the sequential raster scan order, which is from left to right and top to bottom. The predictors of the four chrominance (chroma) 8×8 modes are generated after the luminance (luma) modes, and the process is similar to that of the luma I16MB modes. In fact, this processing order is not a must, and a parallel scheme can be achieved losslessly.

Figure 4.12 shows the proposed processing flow. Firstly, for the current MB in process, the original ‘16-stage’ flow is optimized into a ‘10-stage’ one, so about 37.5% of the processing time is saved. From Fig. 4.12, it is also clear that my proposal is a lossless optimization of the original raster scan order. In the first MB, 4×4_blk1 and 4×4_blk2 are processed individually in two stages. In the following part, the predictor generation is on a 2-block scale, which means that two 4×4 sub-blocks are handled simultaneously by two parallel engines. For example, in stage 3 of Fig. 4.11, 4×4_blk3 is the sub-block in process. Since 4×4_blk1 and 4×4_blk2 are the two sub-blocks already processed, there is no data dependency problem for 4×4_blk5. So the predictor generation

Figure 4.11: Original processing flow

of $4 \times 4 \text{blk}5$ can be executed together with that of $4 \times 4 \text{blk}3$ with no quality loss. Secondly, the last two stages of the current MB are handled together with the first two stages of the next MB, as shown at the top of Fig. 4.12. Therefore, full hardware utilization can be achieved during the whole intra predictor generation process.
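The 10-stage flow can be reproduced with a simple stage formula. The sketch below assumes raster numbering blk1 to blk16 (four sub-blocks per row, as in the text) and that a sub-block depends on its left, upper-left, upper and upper-right neighbors; the formula stage = x + 2y + 1 is an illustration derived under these assumptions, not taken from Fig. 4.12.

```python
def stage_of(x, y):
    """Earliest stage of the sub-block in column x, row y (both 0..3)."""
    return x + 2 * y + 1


def build_schedule():
    """Map each stage to the raster-numbered sub-blocks processed in it."""
    stages = {}
    for y in range(4):
        for x in range(4):
            stages.setdefault(stage_of(x, y), []).append(4 * y + x + 1)
    return stages


def dependencies_respected():
    """Check that every assumed neighbor is finished in an earlier stage."""
    for y in range(4):
        for x in range(4):
            for dx, dy in ((-1, 0), (-1, -1), (0, -1), (1, -1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < 4 and 0 <= ny < 4:
                    if stage_of(nx, ny) >= stage_of(x, y):
                        return False
    return True


schedule = build_schedule()
saving = 1 - len(schedule) / 16

for stage in sorted(schedule):
    print(stage, schedule[stage])
print(saving, dependencies_respected())
```

Stages 1, 2, 9 and 10 contain a single sub-block, which is where the overlap with the neighboring MBs described above keeps both engines busy.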

4.4.2 Fully utilized parallel intra predictor generation architecture

In the above section, one 2-block based parallel processing flow is proposed and 37.5% of the processing time is saved. However, this alone is still not enough to achieve a low design effort engine because of the long processing cycles for handling all the I4MB and I16MB modes within one $4 \times 4$ sub-block.

In [47], except for the horizontal and vertical modes in I4MB and I16MB (for which 1 cycle is enough), the required processing time for each of the remaining I4MB or I16MB modes is 4 clock cycles, which means that, for one specific mode (a mode of I4MB or a mode of I16MB), the 16 predictors of one $4 \times 4$ sub-block are obtained after 4 clock cycles. So, the overall number of processing cycles for generating all luminance predictors of the I4MB and I16MB modes is 640, which occupies a large proportion of the computation time. When this structure is extended to Full HD or 4k×2k@60fps, the design effort

Figure 4.12: Proposed processing flow

Table 4.1: Predictors of I4MB modes in 4×4 sub-block

<table>
<thead>
<tr>
<th>Pred(y,x)</th>
<th>V</th>
<th>H</th>
<th>DC</th>
<th>DDL</th>
<th>DDR</th>
<th>VR</th>
<th>HD</th>
<th>VL</th>
<th>HU</th>
</tr>
</thead>
<tbody>
<tr>
<td>Pred(0,0)</td>
<td>A</td>
<td>I</td>
<td>Z</td>
<td>A+2B+C</td>
<td>I+2M+A</td>
<td>M+A</td>
<td>I+M</td>
<td>A+B</td>
<td>J+I</td>
</tr>
<tr>
<td>Pred(0,1)</td>
<td>B</td>
<td>I</td>
<td>Z</td>
<td>B+2C+D</td>
<td>M+2A+B</td>
<td>A+B</td>
<td>I+2M+A</td>
<td>B+C</td>
<td>K+2J+I</td>
</tr>
<tr>
<td>Pred(0,2)</td>
<td>C</td>
<td>I</td>
<td>Z</td>
<td>C+2D+E</td>
<td>A+2B+C</td>
<td>B+C</td>
<td>M+2A+B</td>
<td>C+D</td>
<td>K+J</td>
</tr>
<tr>
<td>Pred(0,3)</td>
<td>D</td>
<td>I</td>
<td>Z</td>
<td>D+2E+F</td>
<td>B+2C+D</td>
<td>C+D</td>
<td>A+2B+C</td>
<td>D+E</td>
<td>L+2K+J</td>
</tr>
<tr>
<td>Pred(1,0)</td>
<td>A</td>
<td>J</td>
<td>Z</td>
<td>B+2C+D</td>
<td>J+2I+M</td>
<td>I+2M+A</td>
<td>J+I</td>
<td>A+2B+C</td>
<td>K+J</td>
</tr>
<tr>
<td>Pred(1,1)</td>
<td>B</td>
<td>J</td>
<td>Z</td>
<td>C+2D+E</td>
<td>I+2M+A</td>
<td>M+2A+B</td>
<td>J+2I+M</td>
<td>B+2C+D</td>
<td>L+2K+J</td>
</tr>
<tr>
<td>Pred(1,2)</td>
<td>C</td>
<td>J</td>
<td>Z</td>
<td>D+2E+F</td>
<td>M+2A+B</td>
<td>A+2B+C</td>
<td>I+M</td>
<td>C+2D+E</td>
<td>L+K</td>
</tr>
<tr>
<td>Pred(1,3)</td>
<td>D</td>
<td>J</td>
<td>Z</td>
<td>E+2F+G</td>
<td>A+2B+C</td>
<td>B+2C+D</td>
<td>I+2M+A</td>
<td>D+2E+F</td>
<td>3L+K</td>
</tr>
<tr>
<td>Pred(2,0)</td>
<td>A</td>
<td>K</td>
<td>Z</td>
<td>C+2D+E</td>
<td>K+2J+I</td>
<td>J+2I+M</td>
<td>K+J</td>
<td>B+C</td>
<td>L+K</td>
</tr>
<tr>
<td>Pred(2,1)</td>
<td>B</td>
<td>K</td>
<td>Z</td>
<td>D+2E+F</td>
<td>J+2I+M</td>
<td>M+A</td>
<td>K+2J+I</td>
<td>C+D</td>
<td>3L+K</td>
</tr>
<tr>
<td>Pred(2,2)</td>
<td>C</td>
<td>K</td>
<td>Z</td>
<td>E+2F+G</td>
<td>I+2M+A</td>
<td>A+B</td>
<td>J+I</td>
<td>D+E</td>
<td>L</td>
</tr>
<tr>
<td>Pred(2,3)</td>
<td>D</td>
<td>K</td>
<td>Z</td>
<td>F+2G+H</td>
<td>M+2A+B</td>
<td>B+C</td>
<td>J+2I+M</td>
<td>E+F</td>
<td>L</td>
</tr>
<tr>
<td>Pred(3,0)</td>
<td>A</td>
<td>L</td>
<td>Z</td>
<td>D+2E+F</td>
<td>L+2K+J</td>
<td>K+2J+I</td>
<td>L+K</td>
<td>B+2C+D</td>
<td>L</td>
</tr>
<tr>
<td>Pred(3,1)</td>
<td>B</td>
<td>L</td>
<td>Z</td>
<td>E+2F+G</td>
<td>K+2J+I</td>
<td>I+2M+A</td>
<td>L+2K+J</td>
<td>C+2D+E</td>
<td>L</td>
</tr>
<tr>
<td>Pred(3,2)</td>
<td>C</td>
<td>L</td>
<td>Z</td>
<td>F+2G+H</td>
<td>J+2I+M</td>
<td>M+2A+B</td>
<td>K+J</td>
<td>D+2E+F</td>
<td>L</td>
</tr>
<tr>
<td>Pred(3,3)</td>
<td>D</td>
<td>L</td>
<td>Z</td>
<td>G+3H</td>
<td>I+2M+A</td>
<td>A+2B+C</td>
<td>K+2J+I</td>
<td>E+2F+G</td>
<td>L</td>
</tr>
</tbody>
</table>


will be increased to 157MHz and 1.24GHz, respectively, which is beyond the maximum work frequency (55MHz). In fact, for one 4×4 sub-block, data reuse can be achieved among the nine I4MB modes. Table 4.1 lists the calculation of all predictors under each I4MB mode. To simplify the description, the shift operations for generating the final result are omitted. Many predictors within the same I4MB mode or across different I4MB modes have the same value. For example, the predictor at (0,0) (called Pred(0,0)) in DDL mode is the same as Pred(0,2) in DDR mode. I use bold font to mark all the predictors with value (A+2B+C); five I4MB modes contain this value. Predictors with other values (for example, (B+2C+D) and (C+2D+E)) also occur several times within one mode or across different modes. Thus, with sequential generation, many operations are wasted in generating predictors of the same value. If all the repetitive operations can be saved, the processing cycles will be greatly decreased.
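The amount of repetition can be counted symbolically for two of the modes. The sketch below covers only the DDL and DDR columns of Table 4.1, builds each predictor as a multiset of pixel names (rounding and shifts omitted, as in the table), and counts the distinct expressions. The function names are hypothetical, and the pixel ordering L, K, J, I, M, A, ..., H along the block border is an assumption consistent with the table entries.

```python
TOP = list("ABCDEFGH")           # upper and upper-right pixels
BORDER = list("LKJIM") + TOP     # left column (bottom to top), corner M, top row


def tap3(p, q, r):
    """Symbolic p + 2q + r, stored as a sorted tuple of pixel names."""
    return tuple(sorted([p, q, q, r]))


def ddl(y, x):
    """Diagonal down-left predictor expression at row y, column x."""
    if x == 3 and y == 3:
        return tap3("G", "H", "H")          # G + 3H
    k = x + y
    return tap3(TOP[k], TOP[k + 1], TOP[k + 2])


def ddr(y, x):
    """Diagonal down-right predictor expression at row y, column x."""
    c = 4 + (x - y)                         # index of the centre tap; M is at 4
    return tap3(BORDER[c - 1], BORDER[c], BORDER[c + 1])


positions = [(y, x) for y in range(4) for x in range(4)]
ddl_values = {ddl(y, x) for y, x in positions}
ddr_values = {ddr(y, x) for y, x in positions}
shared = ddl_values & ddr_values
distinct = ddl_values | ddr_values

print(len(positions) * 2, len(distinct), sorted(shared))
```

Under these assumptions the 32 predictors of the two modes reduce to 12 distinct expressions, two of which ((A+2B+C) and (B+2C+D)) are shared between the modes.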

Figure 4.13: Proposed predictor generation engine

In our design, I fully enable the data reuse among all I4MB modes and propose one fully utilized parallel intra predictor generation engine.

Figure 4.13 shows the proposed parallel engine. Compared with the original design [47], my work has four features. Firstly, after all the input parameters are ready, the predictors are generated within two cycles. Two pipeline stages are inserted to output these results. Secondly, the original design contains many large multiplexors to control the type of input data and decide the candidate ones, which incurs complex control logic and a long critical path. In the proposed design, I use several small multiplexors which have only two candidate inputs, so the architecture becomes more compact and easier to control. Also, the proposed architecture does not contain any loop-back operations since no temporary result is required for generating the final predictor. Thirdly, instead of generating I4MB predictors sequentially (one mode after another), the proposed architecture works in parallel. The predictors of all the I4MB modes are obtained after two clock cycles. Fourthly, the proposed architecture is also compatible with predictor generation for the I16MB modes. Apart from the DC mode, the predictors of the horizontal, vertical and plane modes in I16MB are also obtained within 2 clock cycles. Moreover, for the DC mode of I4MB and I16MB, I use a compressor tree to facilitate the generation process. The detailed descriptions are given in the following paragraphs.

For I4MB modes (except DC mode), the predictors within one 4×4 sub-block can be


Figure 4.14: Proposed architecture for I4MB modes

obtained by configuring the structure of Fig.4.13 into that of Fig.4.14. The bold blue arrows mark the selected path. The input data of Fig.4.14 are the left, up and up-right pixels of the current sub-block (for example, A to H, I to L, and M for 4×4_blk0 in Fig 4.11). The output results after two clock cycles are listed in Table 4.2. From Fig.4.14, the predictors O1 to O8 are equal to the input values; these values are output at the 1st pipeline stage together with O9 to O20 in Fig.4.14. The remaining predictors (O21 to O33) are output and stored at the 2nd stage.

For I16MB modes, the horizontal and vertical modes can be easily implemented by our architecture in Fig.4.13 because the 16 predictors of one 4×4 sub-block can be directly obtained from the input data.

As for the I16MB plane mode, the architecture can also generate all 16 predictors of one 4×4 sub-block in 2 processing cycles. As defined in the standard, Eq.4.10 is the calculation of the plane predictor at each position (Pred(y,x)), where a, b and c are constant values for one MB and are calculated by Eq.4.11 to Eq.4.13. Pel(-1,15) and Pel(15,-1) are pixels from previous MBs. UR and LC are the sums of weighted differences of the upper row and the left column, respectively. So, I change Eq.4.10 into Eq.4.14 to realize the plane mode in the architecture. Sd is the seed value, which depends on the location of the 4×4 sub-block. There are four seed values, namely Sd1 to Sd4, listed in Eq.4.15 to Eq.4.18. Each Sd is for one column of 4×4 sub-blocks. For example, Sd1 is used for blk1, blk5, blk9 and blk13; Sd3 is used for blk3, blk7, blk11 and blk15. Since I16MB is also processed on a 4×4 block scale, the difference between blk5 and blk1 is only 4c, and this value can be added during the shift operation. The configuration of I16MB mode is shown in...

Table 4.2: Output predictors of I4MB modes in 4×4 sub-block

<table>
<thead>
<tr>
<th></th>
<th>O1</th>
<th>O2</th>
<th>O3</th>
<th>O4</th>
<th>O5</th>
<th>O6</th>
<th>O7</th>
<th>O8</th>
<th>O9</th>
<th>O10</th>
</tr>
</thead>
<tbody>
<tr>
<td>HU(2,2), H(3,0)</td>
<td>H(2,0)</td>
<td>H(1,0)</td>
<td>H(0,0)</td>
<td>V(0,0)</td>
<td>V(0,1)</td>
<td>V(0,2)</td>
<td>V(0,3)</td>
<td>HU(1,2)</td>
<td>HU(0,2)</td>
<td></td>
</tr>
<tr>
<td>HU(2,3), H(3,1)</td>
<td>H(2,1)</td>
<td>H(1,1)</td>
<td>H(0,1)</td>
<td>V(1,0)</td>
<td>V(1,1)</td>
<td>V(1,2)</td>
<td>V(1,3)</td>
<td>HU(2,0)</td>
<td>HU(1,0)</td>
<td></td>
</tr>
<tr>
<td>HU(3,0), H(3,2)</td>
<td>H(2,2)</td>
<td>H(1,2)</td>
<td>H(0,2)</td>
<td>V(2,0)</td>
<td>V(2,1)</td>
<td>V(2,2)</td>
<td>V(2,3)</td>
<td>HD(3,0)</td>
<td>HD(2,0)</td>
<td></td>
</tr>
<tr>
<td>HU(3,1), H(3,3)</td>
<td>H(2,3)</td>
<td>H(1,3)</td>
<td>H(0,3)</td>
<td>V(3,0)</td>
<td>V(3,1)</td>
<td>V(3,2)</td>
<td>V(3,3)</td>
<td>HD(3,2)</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th></th>
<th>O11</th>
<th>O12</th>
<th>O13</th>
<th>O14</th>
<th>O15</th>
<th>O16</th>
<th>O17</th>
<th>O18</th>
<th>O21</th>
<th>O22</th>
</tr>
</thead>
<tbody>
<tr>
<td>HU(0,0)</td>
<td>HD(0,0)</td>
<td>VR(0,0)</td>
<td>VL(0,0)</td>
<td>VL(0,1)</td>
<td>VL(0,2)</td>
<td>VL(2,2)</td>
<td>VL(2,3)</td>
<td>HU(1,3)</td>
<td>DDR(3,0)</td>
<td></td>
</tr>
<tr>
<td>HD(1,0)</td>
<td>HD(1,2)</td>
<td>VR(2,1)</td>
<td>VR(0,1)</td>
<td>VL(2,0)</td>
<td>VL(2,1)</td>
<td>VL(0,3)</td>
<td>HU(2,1)</td>
<td>HD(3,1)</td>
<td>HU(1,1)</td>
<td></td>
</tr>
<tr>
<td>HD(2,2)</td>
<td>VR(2,2)</td>
<td>VR(0,2)</td>
<td>VR(0,3)</td>
<td>VR(2,3)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th></th>
<th>O23</th>
<th>O24</th>
<th>O25</th>
<th>O26</th>
</tr>
</thead>
<tbody>
<tr>
<td>DDR(2,0), DDR(3,1)</td>
<td>DDR(1,0), VR(2,0)</td>
<td>DDR(0,0), VR(1,0)</td>
<td>DDR(0,1), VR(1,1)</td>
<td></td>
</tr>
<tr>
<td>HD(2,1), HD(3,3)</td>
<td>DDR(2,1), HD(1,1)</td>
<td>DDR(1,1), VR(3,1)</td>
<td>DDR(1,2), VR(3,2)</td>
<td></td>
</tr>
<tr>
<td>VR(3,0), HU(0,1)</td>
<td>DDR(3,2), HD(2,3)</td>
<td>DDR(2,2), HD(0,1)</td>
<td>DDR(2,3), HD(0,2)</td>
<td></td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th></th>
<th>O27</th>
<th>O28</th>
<th>O29</th>
<th>O30</th>
</tr>
</thead>
<tbody>
<tr>
<td>DDR(0,0), VR(3,3)</td>
<td>DDR(1,0), VR(1,3)</td>
<td>DDR(0,2), VL(1,2)</td>
<td>DDR(0,3), DDR(3,0)</td>
<td></td>
</tr>
<tr>
<td>DDR(1,3), HD(0,3)</td>
<td>DDR(1,0), VL(1,1)</td>
<td>DDR(0,2), VL(3,1)</td>
<td>DDR(1,2), VL(1,3)</td>
<td></td>
</tr>
<tr>
<td>DDR(0,2), VL(1,0)</td>
<td>DDR(0,3), VL(3,0)</td>
<td>DDR(1,1),</td>
<td>DDR(2,1), VL(3,2)</td>
<td></td>
</tr>
<tr>
<td>VR(1,2),</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th></th>
<th>O31</th>
<th>O32</th>
<th>O33</th>
</tr>
</thead>
<tbody>
<tr>
<td>DDR(1,3), DDR(3,1)</td>
<td>DDR(2,3)</td>
<td>DDR(3,3)</td>
<td></td>
</tr>
<tr>
<td>DDR(2,2), VL(3,3)</td>
<td>DDR(3,2)</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Fig.4.15. The input data and output result can be traced in Fig.4.15 and Table.4.3.

\[
Pred(y, x) = (a + b \times (x - 7) + c \times (y - 7) + 16) >> 5 \tag{4.10}
\]

\[
a = 16 \times Pel(-1, 15) + 16 \times Pel(15, -1) \tag{4.11}
\]

\[
b = (5 \times UR_w + 32) >> 6 \tag{4.12}
\]


Figure 4.15: Proposed architecture for I16MB plane mode

Table 4.3: Output predictors of I16MB plane mode

<table>
<thead>
<tr>
<th>O5</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sd</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>O20</th>
<th>O22</th>
<th>O23</th>
<th>O25</th>
<th>O29</th>
<th>O31</th>
<th>O32</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sd+3b+2c</td>
<td>(Sd+3b)+c</td>
<td>(Sd+2b)+c</td>
<td>(Sd+b)+c</td>
<td>Sd+b+3c</td>
<td>Sd+2b+3c</td>
<td>Sd+3b+3c</td>
</tr>
</tbody>
</table>

\[
c = (5 \times LC_w + 32) >> 6 \quad (4.13)
\]

\[
Pred(y, x) = ((Sd + x \times b) + y \times c) >> 5 \quad (4.14)
\]

\[
Sd1 = a + b \times (-7) + c \times (-7) + 16 \quad (4.15)
\]

\[
Sd2 = a + b \times (-3) + c \times (-7) + 16 \quad (4.16)
\]

\[
Sd3 = a + b + c \times (-7) + 16 \quad (4.17)
\]

\[
Sd4 = a + b \times 5 + c \times (-7) + 16 \quad (4.18)
\]
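
As a cross-check of Eq. 4.10 and Eq. 4.14 to Eq. 4.18, the following Python sketch evaluates the plane predictor both directly and incrementally. It assumes (this reading is an interpretation, not something the equations state) that Sd1 to Sd4 are the row-0 start values of the four 4-pixel column groups beginning at x = 0, 4, 8 and 12, and that x and y in Eq. 4.14 are offsets from that start. The function names and the test values of a, b and c are hypothetical, and clipping to the pixel range is omitted.

```python
def pred_direct(a, b, c, y, x):
    """Eq. 4.10: direct evaluation of the plane predictor."""
    return (a + b * (x - 7) + c * (y - 7) + 16) >> 5


def seeds(a, b, c):
    """Eq. 4.15 to Eq. 4.18: start values Sd1..Sd4 for columns 0, 4, 8, 12."""
    return [a + b * k + c * (-7) + 16 for k in (-7, -3, 1, 5)]


def pred_incremental(a, b, c, y, x):
    """Eq. 4.14 applied to the seed of the column group that contains x."""
    sd = seeds(a, b, c)[x // 4]
    return ((sd + (x % 4) * b) + y * c) >> 5


# Arbitrary test inputs (not taken from the dissertation).
a, b, c = 16 * (120 + 90), 3, -2
same = all(pred_direct(a, b, c, y, x) == pred_incremental(a, b, c, y, x)
           for y in range(16) for x in range(16))
print(same)
```

Under this reading, each predictor needs only additions of b and c on top of a seed, which matches the accumulated forms such as Sd+3b+2c listed in Table 4.3.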

\[
Pred(y, x) = \left( \sum_{0}^{15} U_i + \sum_{0}^{15} L_j \right) >> 5, \ y \in [0, 15], \ x \in [0, 15] \quad (4.19)
\]

\[ \text{Pred}(y, x) = (A + B + C + D + I + J + K + L) \gg 3, \quad y \in [0, 3], x \in [0, 3] \] (4.20)

The DC mode in H.264/AVC predicts a block as the average value of its upper and left neighboring pixels. Specifically, for the I16MB DC mode, the upper pixels are the last row of the upper MB, and the left pixels are the rightmost column of the left MB. I use U0 to U15 to represent the upper pixels and L0 to L15 for the left pixels, so the DC predictor is calculated as in Eq.4.19. For the I4MB DC mode, the predictor is the average of the 4 upper and 4 left pixels taken from the previously processed 4×4 sub-blocks. For example, for blk1 in Fig.4.11, the DC predictor is calculated as in Eq.4.20. For implementation, the configured structure is shown in Fig.4.16. All 32 input pixels of the I16MB DC mode are annotated on the architecture. Eight temporary output values are marked with red arrows, and a compressor tree structure (shown in the lower part of Fig.4.16) generates the final result. For the I4MB DC mode, the compressor tree structure alone is enough to generate the final result. With this structure, the four chroma 8×8 modes can be realized by analogy.
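
The DC calculations of Eq.4.19 and Eq.4.20 can be sketched in Python as follows. This is a minimal software model, not the hardware structure of Fig.4.16: the function names are hypothetical, the equations are followed exactly as printed (no rounding offset is added), and the grouping of the 32 inputs into eight partial sums of four pixels is an assumption made only to mimic the "eight temporary values followed by a tree" organization.

```python
def dc_i16mb(upper, left):
    """Eq. 4.19: sum of 16 upper and 16 left pixels, shifted right by 5."""
    assert len(upper) == 16 and len(left) == 16
    return (sum(upper) + sum(left)) >> 5


def dc_i4mb(upper, left):
    """Eq. 4.20: sum of 4 upper (A..D) and 4 left (I..L) pixels, shifted by 3."""
    assert len(upper) == 4 and len(left) == 4
    return (sum(upper) + sum(left)) >> 3


def adder_tree(values):
    """Pairwise reduction, a software stand-in for the compressor tree."""
    values = list(values)
    while len(values) > 1:
        pairs = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            pairs.append(values[-1])
        values = pairs
    return values[0]


def dc_i16mb_tree(upper, left):
    """Same result as dc_i16mb, via eight partial sums and a reduction tree."""
    pixels = list(upper) + list(left)
    partial = [sum(pixels[i:i + 4]) for i in range(0, 32, 4)]  # assumed grouping
    return adder_tree(partial) >> 5


print(dc_i16mb([100] * 16, [100] * 16), dc_i4mb([10, 20, 30, 40], [10, 20, 30, 40]))
```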
4.5 Experimental result of low design effort engines

The proposed FME architecture is implemented with TSMC 0.18μm technology. Figure 4.17 shows the system architecture of the design. The search window is set to 128×64 for implementation. Based on the parity pixel organization, each SRAM part in Fig. 4.17 consists of 64 rows and 64 columns of pixels, which is 32k bits. The basic pixel width in the system is 8 bits. The input buffer of the current MB is 128 bits wide, which means that the 16 pixels of one row are loaded in one clock cycle. The IMV buffer, which stores the integer motion vectors from the IME engine, is 15 bits wide. I assume that the maximum IMV difference is within 16 pixels and that the SRAM bandwidth for loading pixels is 44 pixels. According to the cases in Fig. 4.7(a) and Fig. 4.7(c), 28 input pixels are required, so the input bit width of the three IP modules is 224 bits (28 pixels × 8 bits). As for the PUH and PUQ modules, since the quarter sub-sampling technique is used for SADT calculation, the bit width of these modules is 64 bits.
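
The bit widths quoted above follow directly from the 8-bit pixel width. A small Python check of that arithmetic (variable names are hypothetical):

```python
PIXEL_BITS = 8  # basic pixel width stated in the text

sram_bits = 64 * 64 * PIXEL_BITS      # one SRAM part: 64 rows x 64 columns of pixels
cur_mb_buffer_bits = 16 * PIXEL_BITS  # one 16-pixel row of the current MB per cycle
ip_input_bits = 28 * PIXEL_BITS       # 28 input pixels per IP module

print(sram_bits // 1024, cur_mb_buffer_bits, ip_input_bits)
```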

First, the IMV information from the IME engine is analyzed in the MRMPF, UPB Analysis and MCDOP modules, which determine the candidate modes, the UPB for B frame interpolation and the search points of the one-pass algorithm. Second, there are two memory groups in the design, and each group is divided based on the parity pixel organization. The pixel distributor module assembles the pixel rows loaded from memory and propagates them to the IP modules. Third, three IP modules are employed to handle the interpolation of the current MBs on the B0, B1 and P frames simultaneously. The processing capability could be further enhanced by doubling the number of engines, so that forward and backward interpolations execute at the same time. However, this would greatly increase the hardware cost. So, in this dissertation, I use only 3 parallel 16-Pel IP modules and apply forward and backward interpolation sequentially. Fourth, the interpolated pixels are fed to the PU groups. Since the MCDOP scheme is introduced, the PUs are classified as PUs for half pixel refinement (PUH) and PUs for quarter pixel refinement (PUQ). Finally, the best modes of the current blocks on the B0, B1 and P frames are output simultaneously. So, the FME engine operates in both an MB-parallel and a frame-parallel manner.

Figure 4.17: 4k×4k Super Hi-Vision FME architecture

Figure 4.18 shows how each scheme reduces the design effort of the existing design. Since most real-time encoder designs reach their maximum performance within 200MHz, I also define frequencies within 200MHz as the low design effort region. As mentioned in the previous section, the direct extension of [18] to the SHV specification requires an unattainable design effort. Even if the frame-parallel scheme in [43] is adopted, the minimum required frequency is still very high. However, with the proposed MRMPF and MCDOP schemes, the minimum required frequency is reduced to 1.16GHz. By further applying 16-Pel processing together with frame and MB parallel processing, the final minimum required frequency is only 145MHz, which is quite reasonable for current technology. Moreover, the MCDOP scheme also achieves a 36% reduction in the number of required PUs, and the sub-sampling technique reduces the cost of each PU by 75%.
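
To make the contribution of each scheme explicit, the following Python sketch computes the step-by-step reduction factors from the minimum required frequencies listed for Fig. 4.18. The frequencies are the ones given in the figure; the variable names and the idea of reporting per-step ratios are illustrative additions.

```python
# Minimum required frequency after each scheme, in MHz (values from Fig. 4.18).
steps = [
    ("Original", 19400.0),
    ("Frame-Parallel", 8630.0),
    ("MRMPF", 2320.0),
    ("MCDOP & Sub-sampling", 1160.0),
    ("16-Pel Processing", 290.0),
    ("MB-Parallel", 145.0),
]

# Reduction factor contributed by each scheme relative to the previous step.
factors = {name: prev / cur
           for (_, prev), (name, cur) in zip(steps, steps[1:])}
overall = steps[0][1] / steps[-1][1]

for name, factor in factors.items():
    print(f"{name}: x{factor:.2f}")
print(f"overall: x{overall:.1f}")
```

The factor of 4 for 16-Pel processing and the factor of 2 for the MB-parallel scheme are consistent with moving from 4 to 16 pixels per cycle and with processing two MBs at once.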

Figure 4.18: Scheme for SHV FME engine. Minimum required frequency after each scheme (A: reference [18], B: reference [40], C: reference [43]):
- Original, [A][B]: 19.40GHz
- Frame-Parallel, [C]: 8.63GHz
- MRMPF, proposed: 2.32GHz
- MCDOP & Sub-sampling, proposed: 1.16GHz (36% in PU Group, 75% in PU Size)
- 16-Pel Processing, proposed: 290MHz
- MB-Parallel, proposed: 145MHz

Figure 4.19: Pixel saving ratio of UPB scheme

<table>
<thead>
<tr>
<th>Designs</th>
<th>MaxSpec.</th>
<th>Min_Freq</th>
<th>Min_Freq'</th>
<th>IP Modules</th>
<th>PU Group</th>
<th>Others</th>
<th>Total</th>
</tr>
</thead>
</table>

Figure 4.19 shows the pixel saving ratio of the proposed UPB scheme. The X axis represents the B0 encoding frame number. Here, I give examples of B frame encoding with QP equal to 28 and 32. In the proposed FME system, the forward (Fwd) interpolations for the current MB on the B0 frame and its co-located current MB on the B1 frame are handled simultaneously. Similarly, the backward (Bwd) interpolations are also processed in parallel. Each of these two current MBs has 2 intermediate modes to be refined based on the MRMPF scheme. The UPB scheme saves pixel loading by analyzing the overlapped part that is required for the interpolation of the two current MBs. The savings for the SHV Sakura_tree and Bees clips are shown in Fig. 4.19. I use Eq. 4.21 to calculate the saving ratio \( Pel_{Sav} \). \( Ld_{Pel} \) is the number of pixels loaded from the related SRAM under the UPB scheme (the overlapped part of two inter modes is loaded only once). \( Olp_{Pel} \) is the overlapped part in the original sequential processing flow. It is clear that the proposed UPB scheme does not affect video quality and only removes redundant memory accesses. The average saving of forward refinement for Sakura_tree is 80.68% and 86.39% when QP equals 28 and 32, respectively. For Bees, as shown in Fig. 4.19(c) and Fig. 4.19(d), the saving ratio is lower than for Sakura_tree because of the large motion in the Bees clip: on average, about 28.67% (QP=28) and 43.70% (QP=32) of pixel loading is saved in forward refinement. For backward refinement, the saving ratio is 77.41% (QP=28) and 84.70% (QP=32) for the Sakura_tree clip, and 29.49% (QP=28) and 43.91% (QP=32) for the Bees clip. As QP increases, the reconstructed frame becomes smoother, which leads to a higher saving ratio for the UPB scheme.

\[
Pel_{Sav} = \frac{Olp_{Pel}}{Ld_{Pel}} \times 100\% \tag{4.21}
\]

The final synthesis result and comparison are shown in Table 4.4. Since no existing work directly targets the SHV specification, I compare my result with extensions of previous works to the SHV application, which is main profile 4k×4k@60fps. The frame-parallel scheme in [43] is applied to the previous designs to enhance their processing capability. The gap between the Min_Freq required for SHV and their own maximum performance is rather large, which means that extremely high design effort is required. To make a fair comparison, my MB-parallel scheme is also added to these designs, so three parallel engines are required and the hardware cost at their maximum work frequency can be evaluated. The optimized design effort (Min_Freq') under the MB-parallel scheme is also shown in Table 4.4. The hardware costs of ([42]+[43]) and ([18]+[43]) are quite small compared with the others. However, both of them employ only a 4-Pel interpolation scheme, which incurs very long interpolation cycles, so the design effort is still very high (2.16GHz in Min_Freq'). ([48]+[43]) uses a fixed one-pass algorithm, which reduces Min_Freq' to 1.08GHz. However, its fixed search window of 25 points around the center greatly increases the PU number, which results in 290.3k gates for the PU group. Moreover, the Min_Freq' of ([48]+[43]) is still far from practical implementation: its maximum performance is only 54MHz, which is only suitable for small video formats such as mobile applications. The performance of [41] is close to the SHV application, and its maximum performance can be kept when it is combined with [43]. However, for 60fps and the 4k×4k image size, even with MB-parallel processing, the Min_Freq' of ([41]+[43]) is still very high (1.03GHz). In my design, the Min_Freq for the SHV specification is only 145MHz owing to the low complexity algorithm and the highly parallel architecture. Although the three 16-Pel based IP modules (87.5k gates) and the 12 PUH and 36 PUQ (188.5k gates) occupy a large hardware cost, my design does not require huge buffers as in [41]. Also, the 1/4 sub-sampling technique greatly reduces the hardware cost of each PU, and the total cost of the PU group is smaller than that of ([48]+[43]) and ([41]+[43]). The total hardware cost of the FME engine is a little higher than that of ([42]+[43]) and ([18]+[43]), and much smaller than that of ([41]+[43]). With 412k gates, the proposed design achieves real-time FME processing for SHV 4k×4k@60fps. Compared with the extension of the most recent work ([41]+[43]), the design effort is reduced by about 85.92%. Taking hardware cost into consideration, the estimated power is reduced by about 93.31%.

The proposed intra predictor generation structure is also synthesized with TSMC 0.18μm technology under the worst case condition. In this dissertation, I focus only on intra predictor generation and propose a fully utilized structure. The synthesis result is shown in Table 4.5. Since two parallel engines are used, the hardware cost of the proposed design is larger than that of the previous one. However, considering the whole encoder design, 30k gates is not a significant cost. The merit of the architecture is clear: since no complex multiplexer is used, the whole architecture is highly pipelined with a simple and regular structure. The maximum work frequency is about 4 times that of the previous design.

Table 4.5: Experimental result and comparison

<table>
<thead>
<tr>
<th>Design</th>
<th>[47]</th>
<th>ours</th>
</tr>
</thead>
<tbody>
<tr>
<td>Technology</td>
<td>0.18μm</td>
<td>0.18μm</td>
</tr>
<tr>
<td>Gate Count</td>
<td>12945</td>
<td>30112</td>
</tr>
<tr>
<td>Max Freq.</td>
<td>55MHz</td>
<td>200MHz</td>
</tr>
<tr>
<td>Max Spec.</td>
<td>SDTV@31fps</td>
<td>4k×2k@60fps</td>
</tr>
</tbody>
</table>

Table 4.6: Comparison of processing cycles for one 4×4 sub-block

<table>
<thead>
<tr>
<th>Design</th>
<th>[47]</th>
<th>ours</th>
</tr>
</thead>
<tbody>
<tr>
<td>I4MB DC</td>
<td>4</td>
<td>1</td>
</tr>
<tr>
<td>I4MB rest modes</td>
<td>26</td>
<td>2</td>
</tr>
<tr>
<td>I16MB DC mode</td>
<td>4</td>
<td>3</td>
</tr>
<tr>
<td>I16MB rest modes</td>
<td>6</td>
<td>3</td>
</tr>
<tr>
<td>Total</td>
<td>40</td>
<td>9</td>
</tr>
<tr>
<td>Req-Freq for 4k×2k@60fps</td>
<td>1.24GHz</td>
<td>175MHz</td>
</tr>
</tbody>
</table>

Moreover, as shown in Table 4.6, for one 4×4 sub-block, instead of using 40 clock cycles for all the modes in I4MB and I16MB, my architecture requires only 9 cycles in total, which saves 77.5% of the processing cycles. The related design effort is also greatly reduced when extending to a higher specification. For example, extending the structure in [47] to 4k×2k@60fps raises the minimum required frequency (Req-Freq) to 1.24GHz, which is an extremely high design effort for existing technology. With the proposed structure and its parallel processing flow, only 175MHz is needed to fulfill the throughput requirement of 4k×2k@60fps real-time processing. As for power consumption, although the proposed engine costs about 132% more hardware than the original design (30112 versus 12945 gates), the final estimated power saving is 67.24% owing to the 85.88% reduction in design effort. Also, the roughly 20k gates of extra hardware will not cause serious performance degradation for the whole encoder system.
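
The percentages quoted for the intra engine can be approximately reproduced from Table 4.5 and Table 4.6. The Python sketch below does so under one explicit assumption that is not stated in the text: estimated power is taken as proportional to gate count times required frequency. With that model the result lands close to the reported 32.76% relative power (67.24% saving); small differences come from rounding of the inputs.

```python
def reduction(old, new):
    """Fractional reduction when going from old to new."""
    return 1.0 - new / old


cycle_saving = reduction(40, 9)                 # Table 4.6, total cycles
effort_reduction = reduction(1240.0, 175.0)     # Req-Freq in MHz, Table 4.6
gate_ratio = 30112 / 12945                      # Table 4.5, ours / [47]

# Assumed power model: power ~ gate count x required frequency.
relative_power = gate_ratio * (175.0 / 1240.0)

print(f"cycle saving:     {cycle_saving:.1%}")
print(f"effort reduction: {effort_reduction:.2%}")
print(f"gate ratio:       {gate_ratio:.2f}")
print(f"relative power:   {relative_power:.2%}")
```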
4.6 Conclusion remarks

In this chapter, two VLSI engines targeting large image sizes are presented. First, for Super Hi-Vision 4k×4k@60fps, a fractional motion estimation engine is proposed. The engine solves the throughput problem through algorithm level optimization and architecture level parallelism. At the algorithm level, the MRMPF and MCDOP schemes optimize the original high complexity FME process and reduce both the design effort and the required number of PUs. At the hardware level, two parallelism schemes, namely 16-Pel processing and the MB-parallel scheme, are proposed. Also, the sub-sampling technique is adopted, which reduces the hardware cost of each PU by 75%. Additionally, a UPB scheme is proposed, which achieves 28.67% to 86.39% savings in pixel loading for the FME process. To solve the memory access conflict of MB-parallel processing, a parity pixel organization scheme is also proposed. With 412k gates at only 145MHz, the proposed FME engine can handle real-time processing of Super Hi-Vision 4k×4k@60fps. Second, for the intra part, a fully utilized and low design effort structure for H.264/AVC intra predictor generation is proposed. The data dependency problem in the conventional flow is analyzed and a parallel flow is given, which achieves a 37.5% reduction in processing time. After that, a fully utilized architecture is given that generates the predictors of all I4MB and I16MB modes with only 22.5% of the cycles of the previous design. Based on the proposed parallel processing flow and the efficient predictor generation architecture, the design achieves real-time intra predictor generation for 4k×2k@60fps at less than 200MHz. Compared with recent works, the proposed FME and intra engines reduce the design effort of the original designs by 85.92% and 85.88%, which leads to only 6.69% and 32.76% of the estimated power of the original FME and intra designs, respectively.
Chapter 5

Analysis of macroblock feature to fast inter mode decision

5.1 Introduction

The mode decision part in H.264/AVC plays an important role in the whole encoding process. Both the intra prediction part and the block matching process of inter prediction are included in the mode decision procedure. In H.264, besides skip mode, there are 7 block modes (as shown in Fig. 5.1) for inter prediction, 9 modes for 4×4 intra prediction and 4 modes for 16×16 intra prediction. The encoding process loops over all these modes and selects the one with the minimum cost. When rate distortion optimization is used, all the prediction modes are involved in a real encoding process, so the complexity is prohibitive for real-time applications.

In [22], a fast intra prediction algorithm is proposed, which greatly speeds up the intra prediction process while still keeping the quality. However, the decision on inter modes is more complicated than that on intra modes, because the motion estimation (ME) process performs block matching between the current image and the reference image, which incurs a huge amount of calculation over all the candidate points within the search window. Splits, occlusions and fast motion in the video increase the ratio of temporal features among frames, which makes it almost impossible to make a pre-decision on inter prediction.

Figure 5.1: Inter Block Modes in H.264/AVC

Many works have addressed this problem. In [49], a pre-encoding scheme is proposed, which extracts a down-sampled small image and restricts the inter block modes to a small subset. Literature [50] uses the mean of absolute frame difference (MAFD) to filter out unpromising inter modes. In [51], an adaptive mode decision process based on all-zero coefficient blocks is proposed. Literature [52] and [53] focus on optimizing early skip mode decision to reduce the complexity of inter mode decision. However, the pre-processing introduced in [49] and [50] increases the computation burden of the whole encoding system. As the image size grows, for example in HDTV applications, the 1/2 down sampling or the MAFD calculation of the original frame increases power dissipation and system latency dramatically. The algorithms based on all-zero block and early skip mode detection [52][53] also have obvious limitations. When several foreground objects move irregularly over a complicated background, or when the quantization parameter decreases, the ratio of all-zero blocks and skip modes decreases significantly, which reduces the efficiency of inter mode filtering. In [54], a very fast mode decision algorithm is proposed, which dramatically reduces complexity for both low-motion and high-motion sequences. However, the compression capability clearly deteriorates, since the bit rate increase is quite large. In this dissertation, the complexity problem of inter mode decision is solved in several stages. First, in the pre-stage (before ME starts), the homogeneity of the current macroblock (MB) and the features of encoded MBs on both the current and previous frames are inspected to detect skip mode and filter out unpromising modes. Second, during the ME process, the motion information is collected to discard unnecessary modes. I focus on the accuracy of the motion vector predictor, the block overlapping situation, the rate distortion cost and the smoothness of the SAD. The details are given in the following sections.

5.2 Pre-stage inter mode decision schemes

The fast inter mode decision algorithm aims to find the most probable candidate modes for the rate distortion based matching process. The early decision can be made either before or during the encoding stage. In this dissertation, I try to narrow down the candidate modes in both stages. In this section, two pre-stage inter mode decision schemes are described in detail.

5.2.1 MV oriented spatial-temporal inter mode check

In conventional video sequences, spatial and temporal redundancy always exists within a frame or between frames. In this dissertation, I propose a spatial-temporal skip mode early detection scheme which is applied before the encoding process (named the pre-stage scheme).

Since the encoding process is executed in raster scan order, the only spatial information available is from the already encoded MBs. Therefore, as shown in Fig.5.2(a), the left-up MB (LU.MB), left MB (L.MB) and upper MB (U.MB) of the current MB (Cur.MB) are used for the mode check. However, because only three MBs provide mode information, the efficiency and correctness are quite limited. Therefore, temporal information is also added. Besides the co-located MB (Co.MB), there are 8 MBs around Co.MB in the previous frame, and I classify all these MBs into three categories, as shown in Eq.5.1 to Eq.5.3. Co.MB is the only element of the C0 category. The left-up (LU.MB), right-up (RU.MB), bottom-left (BL.MB) and bottom-right (BR.MB) MBs belong to the C1 category. The C2 category includes the four MBs in the cross directions around Co.MB, namely U.MB, L.MB, the right MB (R.MB) and the bottom MB (B.MB). The temporal mode check depends only on the dominant category among C1 and C2. Following the rule that motion is continuous across succeeding frames [13], I use the motion vectors (MVs) of the MBs in C1 and C2 to decide the dominant category. As shown in Eq.5.4 to Eq.5.6, the delta MV between C1 and C0 ($\Delta MV(C1, C0)$) is calculated by accumulating the absolute MV differences in the x ($\Delta MV_x(C1, C0)$) and y ($\Delta MV_y(C1, C0)$) directions. For example, Eq.5.5 means that the absolute difference between the $MV_x$ of the MB in the C0 category and that of each MB in the C1 category is calculated first. Then, $\Delta MV_x(C1, C0)$ is obtained as the sum of all these absolute differences. Here, $MV_x$ represents the MV in the x direction. The delta MV between C2 and C0 ($\Delta MV(C2, C0)$) is calculated on the same principle. As shown in Eq.5.7, the category with the minimum MV difference is chosen as the candidate category. For instance, when $\Delta MV(C1, C0)$ is smaller than $\Delta MV(C2, C0)$, the motion vector difference between the co-located MB (C0 category) and the MBs in diagonal positions (C1 category) is smaller than the difference between the C0 and C2 categories. So, the co-located MB is more similar to the C1 category in terms of motion vector, and the mode information of the C1 category is used as the reference for fast mode decision. The pseudo code of the spatial-temporal algorithm is shown in Fig.5.3(a). Before ME, I apply the mode check based on spatial and temporal information. The MV difference is used as the criterion to select the candidate category (C') for the temporal mode check. If all the modes of the MBs in the spatial category (LU.MB, U.MB and L.MB in Fig.5.2(a)) and the temporal categories (C' and C0) are skip modes (mode0), mode0 is selected as the best inter mode. Otherwise, the full encoding modes of H.264/AVC are enabled during the following process.

\[
C0 \in \{Co.MB\} \quad (5.1)
\]

\[
C1 \in \{LU.MB, RU.MB, BL.MB, BR.MB\} \quad (5.2)
\]

\[
C2 \in \{U.MB, L.MB, R.MB, B.MB\} \quad (5.3)
\]

\[
\Delta MV(C1, C0) = \Delta MV_x(C1, C0) + \Delta MV_y(C1, C0) \quad (5.4)
\]

\[
\Delta MV_x(C1, C0) = \sum_{C1} |C1\{MV_x\} - C0\{MV_x\}| \quad (5.5)
\]

Figure 5.2: Spatial-temporal Skip Mode Check

(a) Spatial Check
(b) Temporal MB in C1
(c) Temporal MB in C2

Spatial MB mode check
Loop MBs in C1 category
   Calculate delta MV (C1, C0)
End Loop
Loop MBs in C2 category
   Calculate delta MV (C2, C0)
End Loop
Decide temporal category (C’ + C0)
If MBs in Spatial and Temporal category are all mode0
   Only mode0 for Cur.MB
Else
   Full modes of H.264/AVC

Loop blk8x8_i of current MB, i = [0, 3]
   Edge gradient analysis of blk8x8_i
If homogenous blk8x8_i
   Discard 8x4, 4x8, 4x4 modes
End Loop
If all blk8x8_i are all homogenous
   Disable 8x8 inter modes

(a) Spatial-temporal Skip Mode Check  (b) Edge Based Inter Mode Filtering

Figure 5.3: Pseudo Codes of Pre-Stage Inter Mode Decision

\[
\Delta MV_y(C1, C0) = \sum_{C1} |C1\{MV_y\} - C0\{MV_y\}| \quad (5.6)
\]

\[
\begin{cases}
\Delta MV(C1, C0) < \Delta MV(C2, C0), & C1 \text{ is adopted} \\
\Delta MV(C2, C0) \leq \Delta MV(C1, C0), & C2 \text{ is adopted}
\end{cases} \quad (5.7)
\]
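
The spatial-temporal skip mode check of Fig.5.3(a) and Eq.5.4 to Eq.5.7 can be sketched in Python as follows. This is an illustrative model rather than the JM implementation: the function names and the data layout, in which each temporal MB is a pair `(mv, mode)` with `mv = (x, y)`, are assumptions made for the example.

```python
SKIP = 0  # mode0


def delta_mv(category_mvs, co_mv):
    """Eq. 5.4 to Eq. 5.6: accumulated absolute MV difference in x and y."""
    dx = sum(abs(mx - co_mv[0]) for mx, _ in category_mvs)
    dy = sum(abs(my - co_mv[1]) for _, my in category_mvs)
    return dx + dy


def dominant_category(c1_mvs, c2_mvs, co_mv):
    """Eq. 5.7: C1 only when its difference is strictly smaller, else C2."""
    return "C1" if delta_mv(c1_mvs, co_mv) < delta_mv(c2_mvs, co_mv) else "C2"


def skip_only(spatial_modes, co_mb, c1_mbs, c2_mbs):
    """True when only mode0 needs to be checked for the current MB."""
    co_mv, co_mode = co_mb
    chosen = dominant_category([mv for mv, _ in c1_mbs],
                               [mv for mv, _ in c2_mbs], co_mv)
    temporal = c1_mbs if chosen == "C1" else c2_mbs
    modes = list(spatial_modes) + [co_mode] + [mode for _, mode in temporal]
    return all(mode == SKIP for mode in modes)


# Hypothetical example: static neighbourhood, one moving MB in C2.
co = ((0, 0), 0)
c1 = [((0, 0), 0)] * 4
c2 = [((2, 1), 1), ((0, 0), 0), ((0, 0), 0), ((0, 0), 0)]
print(dominant_category([mv for mv, _ in c1], [mv for mv, _ in c2], co[0]),
      skip_only([0, 0, 0], co, c1, c2))
```

In the example, C1 is selected because its accumulated MV difference is smaller, so the non-skip MB in C2 does not prevent the early skip decision.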

5.2.2 Edge gradient based inter mode filtering

Edge detection is another useful technique in both the image processing and pattern recognition fields. In [22], the Sobel edge operator is used to obtain candidate intra modes. In fact, the same method can also be extended to inter mode filtering. Figure 5.4 shows the mode distribution of different sequences and Fig. 5.5 is the corresponding edge gradient histogram. Here, the edge gradient is obtained by applying the Sobel operator to each MB. Since the gradients of sequences with smooth and regular motion show a similar distribution across encoding frames, I extract the edge gradients of the 20th frame as an example. The difference in gradient distribution between ‘mobile QCIF’ and ‘container QCIF’ is quite large, which is in accordance with subjective observation. As for the mode distribution, the proportion of modes above $8 \times 8$ in ‘container QCIF’ is much larger than in ‘mobile QCIF’. The situations of ‘tempete QCIF’ and ‘grandma QCIF’ are similar, with ‘grandma QCIF’ being more favorable to big inter modes. So, a decrease of gradient in the image increases the probability of big inter modes in the final mode decision stage.

To be compatible with the H.264/AVC encoding flow in JM, the Sobel edge detection oriented inter mode filtering is applied at the MB level. Specifically, before the block matching process, I analyze the current MB and obtain the edge gradient of each pixel based on Eq. 5.8 to Eq. 5.10, where $P(m, n)$ is the pixel value at coordinate $(m, n)$ of the current MB. $G_x$ and $G_y$ are the gradients in the horizontal and vertical directions, respectively. These two gradients are simply summed to get $G(m, n)$ as the gradient of $P(m, n)$.

In the JM software, the inter prediction of the H.264/AVC standard is implemented as a block matching process from mode1 to mode7 sequentially. Based on this mechanism, I set a threshold on each of the four $8 \times 8$ blocks (blk8x8_0 to blk8x8_3) within the current MB. As shown in Eq. 5.11, if the gradient of every pixel within one $8 \times 8$ block (blk8x8_i, $i \in \{0,1,2,3\}$) is below a predefined threshold, this $8 \times 8$ block is regarded as a homogeneous (homo) sub-block. Otherwise, it is an edge $8 \times 8$ block. The edge based inter mode filtering algorithm is shown in Fig.5.3(b). For a homogeneous $8 \times 8$ block, the small inter modes (mode5 to mode7) are removed before the ME process. If all four $8 \times 8$ blocks are homogeneous, even the mode4 inter mode is filtered out.

![Figure 5.4: Inter Mode Distributions](image)

![Figure 5.5: Gradient Distributions of 20th Frame](image)

As for the threshold setting, it is always a trade-off between quality and complexity. The prediction error $e$ in the block matching process can be assumed to be a jointly Gaussian source with zero mean and variance $\sigma^2$. According to [25], the distortion of quantization $D$ is approximated as $QP^2/3$, where QP is the quantization parameter. So, the rate distortion function [26] can be represented as Eq. 5.12, where $R(D)$ is the transmission bit-rate for distortion $D$ and $\sigma^2$ represents the maximum distortion based on the Gaussian model. When the distortion $D$ equals zero, the original signal is reconstructed without any loss of image detail. All the information of the image (including textures and noise) is exactly the same as in the original source image, and the maximum transmission bit-rate is required to keep this information. In fact, this is a limiting case which never happens in a real video encoding system such as H.264/AVC, because the transform and quantization cause some loss of image detail, so distortion between the original source image and the reconstructed one is inevitable. On the other hand, when $D$ is larger than $\sigma^2$, the transmission bit-rate for $D$ becomes zero. This conclusion is in accordance with the QP setting in the H.264 encoding system. As QP increases, the reconstructed frames become smoother, which results in a decline of image detail, and the related residue value also decreases. It means that the quality degradation of an edge abundant image is quite obvious under a big QP. In the extreme case, all the details are removed by one very large QP and the residue information vanishes, which indicates that no transmission bit-rate is required. Thus, from the theoretical analysis of [25] and [26], it is possible to simply set the threshold as a linear function of the QP value. Based on exhaustive experiments, $Thr_G$ in Eq. 5.11 is defined as $4 \times$QP to balance quality and complexity reduction. In the following sections, the related thresholds are also set linearly with QP.
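
The quantities used in this argument can be written down in a few lines of Python. The sketch below evaluates the rate distortion function of Eq. 5.12, the distortion approximation $QP^2/3$ and the threshold $4 \times$QP. The function names are hypothetical, and two choices are assumptions of this example: the logarithm is taken to base 2 (rate in bits), and zero distortion is mapped to an unbounded rate.

```python
import math


def rate(d, sigma2):
    """Eq. 5.12 for a Gaussian source with variance sigma2 (log base 2 assumed)."""
    if d <= 0:
        return math.inf      # lossless limit: unbounded rate
    if d > sigma2:
        return 0.0           # distortion above the variance needs no bits
    return 0.5 * math.log2(sigma2 / d)


def quant_distortion(qp):
    """Quantization distortion approximated as QP^2 / 3, following [25]."""
    return qp * qp / 3.0


def gradient_threshold(qp):
    """Thr_G = 4 x QP, linear in QP."""
    return 4 * qp


for qp in (28, 32):
    print(qp, quant_distortion(qp), gradient_threshold(qp))
print(rate(25.0, 100.0), rate(200.0, 100.0))
```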

\[
\begin{aligned}
G_x(m, n) = |&P(m - 1, n - 1) + 2P(m - 1, n) + P(m - 1, n + 1) \\
&- P(m + 1, n - 1) - 2P(m + 1, n) - P(m + 1, n + 1)|
\end{aligned} \quad (5.8)
\]

\[
\begin{aligned}
G_y(m, n) = |&P(m - 1, n - 1) + 2P(m, n - 1) + P(m + 1, n - 1) \\
&- P(m - 1, n + 1) - 2P(m, n + 1) - P(m + 1, n + 1)|
\end{aligned} \quad (5.9)
\]

\[
G(m, n) = G_x(m, n) + G_y(m, n) \quad (5.10)
\]

\[
\begin{cases}
G(m, n) < Thr_G \ \ \forall P(m, n) \in blk8x8_i, & \text{homo } blk8x8_i \\
\text{otherwise}, & \text{edge } blk8x8_i
\end{cases} \quad i \in \{0, 1, 2, 3\} \quad (5.11)
\]

\[
R(D) = \begin{cases}
\frac{1}{2} \log \frac{\sigma^2}{D}, & 0 \leq D \leq \sigma^2 \\
0, & D > \sigma^2
\end{cases} \quad (5.12)
\]
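
The edge based inter mode filtering of Fig.5.3(b) can be sketched in Python as follows, using the gradient of Eq. 5.8 to Eq. 5.10 and the threshold test of Eq. 5.11 with $Thr_G = 4 \times$QP. Several details are assumptions of this example and are not specified in the text: pixel indices outside the MB are clamped to the MB border, the four 8×8 blocks are indexed in raster order, and the function names are hypothetical.

```python
def gradient(mb, m, n):
    """G(m, n) of Eq. 5.8 to Eq. 5.10; out-of-MB indices are clamped (assumption)."""
    last = len(mb) - 1

    def p(i, j):
        return mb[min(max(i, 0), last)][min(max(j, 0), last)]

    gx = abs(p(m - 1, n - 1) + 2 * p(m - 1, n) + p(m - 1, n + 1)
             - p(m + 1, n - 1) - 2 * p(m + 1, n) - p(m + 1, n + 1))
    gy = abs(p(m - 1, n - 1) + 2 * p(m, n - 1) + p(m + 1, n - 1)
             - p(m - 1, n + 1) - 2 * p(m, n + 1) - p(m + 1, n + 1))
    return gx + gy


def is_homogeneous(mb, i, qp):
    """Eq. 5.11: every pixel gradient of blk8x8_i is below Thr_G = 4 x QP."""
    top, left = 8 * (i // 2), 8 * (i % 2)   # raster order of 8x8 blocks (assumption)
    thr_g = 4 * qp
    return all(gradient(mb, m, n) < thr_g
               for m in range(top, top + 8) for n in range(left, left + 8))


def filter_inter_modes(mb, qp):
    """Return the sub-8x8 modes kept per 8x8 block and whether mode4 is tried."""
    homo = [is_homogeneous(mb, i, qp) for i in range(4)]
    sub_modes = [[] if h else [5, 6, 7] for h in homo]   # 8x4, 4x8, 4x4
    try_mode4 = not all(homo)
    return sub_modes, try_mode4


flat_mb = [[128] * 16 for _ in range(16)]
edge_mb = [[0] * 4 + [255] * 12 for _ in range(16)]   # vertical edge in left half
print(filter_inter_modes(flat_mb, 28))
print(filter_inter_modes(edge_mb, 28))
```

For the flat MB all small modes and mode4 are filtered out, while for the MB with a vertical edge only the two left 8×8 blocks keep mode5 to mode7.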
5.3 Motion feature based fast inter mode decision schemes

In the pre-stage, the unpromising inter modes are filtered out before ME starts. However, as mentioned above, the complexity reduction is quite limited because of the motion features of the MB. In fact, during the ME procedure, it is also possible to skip unnecessary inter modes. The more motion information is available for observation, the more feasible it is to narrow down the candidate modes.

5.3.1 MVP accuracy and block overlapping analysis

In the JM software, the block matching process starts at the motion vector predictor (MVP), which is obtained from the neighboring coded MBs. For sequences with smooth or regular motion, the prediction of the start point is very accurate. Fig. 5.6 shows the distribution of the best integer point (BIP) of the 16×16 mode among typical clips. The search window is divided into several layers. Layer 0 is the MVP point, while layer 1 indicates the 8 points around the MVP. The meaning of the other layers follows by analogy. A large proportion of the BIPs are located at the MVP position, even for the foreman_qcif and carphone_qcif cases. The high accuracy of the MVP also indicates that the current MB is seldom split into small blocks. Since small modes (such as 4×4) are easily trapped in local minima, I only use the information of the 16×16 mode.

![Figure 5.6: BIP Distribution of 16×16 Mode in 100 Frames](image-url)

In my algorithm, the motion information of the 16×16 mode is analyzed after its ME search. If the criterion of Eq. 5.13 is satisfied, the current MB is treated as a big mode MB. In detail, Eq. 5.13 sets constraints on both the MV and the motion cost. First, the MV of the 16×16 mode (MV_{16×16}) must equal its own MVP (MVP_{16×16}). Second, its motion cost (mcost) must be within one empirical threshold (Thr_{MVP}), which is set to 20×QP based on experiments. In this dissertation, mode1 to mode3 in Fig.5.1 are defined as big modes, while mode4 to mode7 are treated as small inter modes. So, I discard mode4 to mode7 during the following ME process when the current MB is a big mode MB. For the remaining MBs, whose best 16×16 MVs are not the MVPs, I further analyze the motion information after the block matching of mode3. The related mode decision criterion is shown in Eq.5.14 to Eq.5.16. It applies when the absolute coordinates of the MVs of block0 in mode2 and mode3 (16×8,0 and 8×16,0 in Fig. 5.1) are the same as those of the MV of mode1, and the MVs of block1 in mode2 and mode3 are displaced by only 8 pixels in the x or y direction. In that case the three previous inter modes overlap each other, which indicates that the current inter modes are good enough to express the motion trend, and ME on mode5 to mode7 is bypassed. Mode4 is retained to keep the video quality of the fast algorithm.
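
The two tests of this subsection can be sketched in Python as follows. The big mode test follows Eq. 5.13 directly. The overlap test is only one possible reading of the description above: the matched position of each block is taken as the block origin inside the MB plus its MV, and block1 of mode2 or mode3 must land exactly 8 pixels below or to the right of the position matched by mode1. The function names, the `(x, y)` tuple layout and this reading are assumptions of the example.

```python
def is_big_mode_mb(mv16, mvp16, mcost16, qp):
    """Eq. 5.13: the 16x16 MV equals its MVP and mcost is within 20 x QP."""
    return mv16 == mvp16 and mcost16 <= 20 * qp


def matched_pos(origin, mv):
    """Position of the matched block relative to the MB origin."""
    return (origin[0] + mv[0], origin[1] + mv[1])


def modes_overlap(mv16, mv16x8, mv8x16):
    """Assumed reading of Eq. 5.14 to Eq. 5.16; MV lists are [block0, block1]."""
    ref = matched_pos((0, 0), mv16)
    return (matched_pos((0, 0), mv16x8[0]) == ref
            and matched_pos((0, 0), mv8x16[0]) == ref
            and matched_pos((0, 8), mv16x8[1]) == (ref[0], ref[1] + 8)
            and matched_pos((8, 0), mv8x16[1]) == (ref[0] + 8, ref[1]))


def modes_after_mode1(mv16, mvp16, mcost16, qp):
    """Big mode MB: mode4 to mode7 are discarded."""
    return [2, 3] if is_big_mode_mb(mv16, mvp16, mcost16, qp) else [2, 3, 4, 5, 6, 7]


def modes_after_mode3(mv16, mv16x8, mv8x16):
    """Overlapping big modes: mode5 to mode7 are bypassed, mode4 is kept."""
    return [4] if modes_overlap(mv16, mv16x8, mv8x16) else [4, 5, 6, 7]


print(modes_after_mode1((1, -2), (1, -2), 500, 28))
print(modes_after_mode3((3, 1), [(3, 1), (3, 1)], [(3, 1), (3, 1)]))
```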

### 5.3.2 Smoothness of sum of absolute difference (SAD)

The SAD value which is obtained after ME process is another useful information. With search point approaching to the potential best one, the SAD value decreases gradually, which leads to less bits in the final encoding stage. On the contrary, the occurrence of big SAD value can indicate the necessity of ME on further small modes, which results in split of current MB. In this dissertation, I fully utilize SAD information to guide mode decision process. In JM software, ME is divided into integer motion estimation (IME) and fractional motion estimation (FME) stages. Since IME is well enough to represent image and object’s overall motion trend, I use motion feature of IME stage in the proposed algorithm. Specifically, during IME on 16×16 mode, the four 8×8 SAD blocks

are recorded, namely the upper-left $8 \times 8$ SAD ($SAD_{8 \times 8_{LU}}$), the upper-right $8 \times 8$ SAD ($SAD_{8 \times 8_{RU}}$), the bottom-right $8 \times 8$ SAD ($SAD_{8 \times 8_{BR}}$), and the bottom-left one ($SAD_{8 \times 8_{BL}}$). If Eq. 5.17 to Eq. 5.20 are all satisfied, the distribution of the four $8 \times 8$ SAD values is quite smooth. So, further processing on small modes is rarely needed and $mode5$ to $mode7$ are discarded in the proposed scheme. When any of Eq. 5.17 to Eq. 5.20 is not satisfied, the ME on mode2 and mode3 is skipped and the algorithm turns directly to mode4 to mode7 for a precise matching process. The $Thr_{SAD}$ in my algorithm is set to $15 \times QP$ based on extensive experiments.

\[
MV_{16 \times 16} = MVP_{16 \times 16}, \quad mcost_{16 \times 16} \leq Thr_{MVP} \tag{5.13}
\]

\[
MV_{16 \times 16} = MV_{16 \times 8,0} = MV_{8 \times 16,0} \tag{5.14}
\]

\[
MV_{x_{16 \times 16}} = MV_{x_{8 \times 16,1}} - 8 \tag{5.15}
\]

\[
MV_{y_{16 \times 16}} = MV_{y_{16 \times 8,1}} - 8 \tag{5.16}
\]

\[
|SAD_{8 \times 8_{LU}} - SAD_{8 \times 8_{RU}}| < Thr_{SAD} \tag{5.17}
\]

\[
|SAD_{8 \times 8_{BL}} - SAD_{8 \times 8_{BR}}| < Thr_{SAD} \tag{5.18}
\]

\[
|SAD_{8 \times 8_{LU}} - SAD_{8 \times 8_{BL}}| < Thr_{SAD} \tag{5.19}
\]

\[
|SAD_{8 \times 8_{RU}} - SAD_{8 \times 8_{BR}}| < Thr_{SAD} \tag{5.20}
\]
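The SAD smoothness test of Section 5.3.2 amounts to four pairwise comparisons between neighboring 8×8 SAD values. The following Python sketch shows one possible reading; the function names `sad_is_smooth` and `modes_after_16x16` are hypothetical, and the returned mode lists only restate the rule given in the text (smooth: mode5 to mode7 are discarded; not smooth: mode2 and mode3 are skipped).

```python
from typing import List


def sad_is_smooth(lu: int, ru: int, bl: int, br: int, qp: int) -> bool:
    """True when all four neighbor differences are below Thr_SAD = 15*QP."""
    thr_sad = 15 * qp
    return (
        abs(lu - ru) < thr_sad      # upper row: left vs right
        and abs(bl - br) < thr_sad  # bottom row: left vs right
        and abs(lu - bl) < thr_sad  # left column: upper vs bottom
        and abs(ru - br) < thr_sad  # right column: upper vs bottom
    )


def modes_after_16x16(lu: int, ru: int, bl: int, br: int, qp: int) -> List[int]:
    """Inter modes still searched after IME on 16x16 mode (illustrative only)."""
    if sad_is_smooth(lu, ru, bl, br, qp):
        return [2, 3, 4]      # mode5 to mode7 are discarded
    return [4, 5, 6, 7]       # mode2 and mode3 are skipped
```

With QP = 28 the threshold is 420, so the SAD set (1000, 1100, 950, 1050) counts as smooth, while (1000, 2000, 950, 1050) does not because the upper row differs by 1000.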

### 5.3.3 Rate distortion cost analysis on big inter modes

In the high complexity mode of H.264/AVC, after ME and intra prediction loop over all inter and intra modes, the rate distortion (RD) costs of each mode are checked exhaustively by minimizing the Lagrangian function, as shown in Eq. 5.21. The $SSD$ is sum of
5.4 Overall algorithm and experiments

The overall flow chart of the proposed algorithm is shown in Fig. 5.7. The parts in bold font are the original JM mode decision flow, which consists of inter prediction, intra prediction and the RD cost check. In the inter prediction part, the ME process executes the block matching process from mode1 to mode7 sequentially. The proposed schemes are annotated with their section numbers in parentheses. Schemes in section 2.1 and 2.2 work before ME starts, and the remaining schemes are involved in the ME process.

Figure 5.7: Overall Flow Chart of Proposed Algorithm

$$\Delta \Gamma = \frac{\Gamma_{pro} - \Gamma_{jm}}{\Gamma_{jm}} \times 100\%, \ \Gamma \in \{MET, Bits\} \tag{5.24}$$
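As a small worked example of Eq. 5.24, the relative change can be computed as below. The function name `delta_percent` and the sample numbers are hypothetical and are not taken from the experiments; they only illustrate the sign convention, where a negative value means that the evaluated method needs less ME time or fewer bits than the JM full mode search.

```python
def delta_percent(value_pro: float, value_jm: float) -> float:
    """Eq. 5.24: relative change of MET or Bits against the JM reference, in percent."""
    if value_jm == 0:
        raise ValueError("the JM reference value must be non-zero")
    return (value_pro - value_jm) / value_jm * 100.0


# Hypothetical measurements: ME time drops from 100 s to 75 s.
delta_met = delta_percent(75.0, 100.0)   # -25.0, i.e. a 25% reduction of ME time
```

Table 5.1 reports $-\Delta MET$, so a reduction of this kind would appear there as a positive number.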

The proposed algorithm is implemented in JM 11.0 software [29]. Several QCIF and CIF clips with different features are used for simulation. I encode 200 frames with RD optimization enabled. The QP value ranges from 28 to 40 in steps of 4. The encoding structure is IPPP under baseline profile with one reference frame. The search range for QCIF

Figure 5.8: Comparison of RD Curves ([A]: Reference [52], [B]: Reference [54], [C]: Reference [50])

Table 5.1: Complexity Analysis based on $-\Delta MET$ (%)  

<table>
<thead>
<tr>
<th>Clips</th>
<th>QP=28</th>
<th>QP=32</th>
<th>QP=36</th>
<th>QP=40</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>[52]</td>
<td>[54]</td>
<td>pro</td>
<td>[52]</td>
</tr>
<tr>
<td>(1)</td>
<td>5.7</td>
<td>38.4</td>
<td>24.6</td>
<td>24.9</td>
</tr>
<tr>
<td>(2)</td>
<td>9.5</td>
<td>45.9</td>
<td>22.5</td>
<td>30.6</td>
</tr>
<tr>
<td>(3)</td>
<td>51.8</td>
<td>45.3</td>
<td>26.7</td>
<td>51.7</td>
</tr>
<tr>
<td>(4)</td>
<td>9.6</td>
<td>47.6</td>
<td>21.4</td>
<td>35.2</td>
</tr>
<tr>
<td>(5)</td>
<td>8.0</td>
<td>47.2</td>
<td>26.2</td>
<td>49.0</td>
</tr>
<tr>
<td>(6)</td>
<td>28.9</td>
<td>45.4</td>
<td>38.1</td>
<td>40.2</td>
</tr>
<tr>
<td>(7)</td>
<td>2.1</td>
<td>51.0</td>
<td>24.8</td>
<td>39.8</td>
</tr>
<tr>
<td>(8)</td>
<td>7.3</td>
<td>47.4</td>
<td>31.8</td>
<td>37.3</td>
</tr>
</tbody>
</table>

Table 5.2: Quality Analysis based on C1 and C2 (C1: $\Delta PSNR$ (dB); C2: $\Delta Bits$ (%))  

<table>
<thead>
<tr>
<th>Clips</th>
<th>Criterion</th>
<th>QP=28</th>
<th>QP=32</th>
<th>QP=36</th>
<th>QP=40</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>[52]</td>
<td>[54]</td>
<td>pro</td>
<td>[52]</td>
</tr>
<tr>
<td>(1)</td>
<td>C1</td>
<td>-0.01</td>
<td>-0.24</td>
<td>-0.04</td>
<td>-0.09</td>
</tr>
<tr>
<td></td>
<td>C2</td>
<td>+0.00</td>
<td>+4.78</td>
<td>+0.90</td>
<td>-0.17</td>
</tr>
<tr>
<td>(2)</td>
<td>C1</td>
<td>-0.08</td>
<td>-0.18</td>
<td>-0.07</td>
<td>-0.09</td>
</tr>
<tr>
<td></td>
<td>C2</td>
<td>+0.33</td>
<td>+1.00</td>
<td>+0.10</td>
<td>-0.27</td>
</tr>
<tr>
<td>(3)</td>
<td>C1</td>
<td>-0.09</td>
<td>-0.07</td>
<td>-0.00</td>
<td>-0.06</td>
</tr>
<tr>
<td></td>
<td>C2</td>
<td>+1.72</td>
<td>+1.57</td>
<td>+0.01</td>
<td>-0.16</td>
</tr>
<tr>
<td>(4)</td>
<td>C1</td>
<td>-0.02</td>
<td>-0.17</td>
<td>-0.01</td>
<td>-0.06</td>
</tr>
<tr>
<td></td>
<td>C2</td>
<td>+0.22</td>
<td>+0.32</td>
<td>+0.32</td>
<td>-0.17</td>
</tr>
<tr>
<td>(5)</td>
<td>C1</td>
<td>-0.02</td>
<td>-0.24</td>
<td>-0.02</td>
<td>-0.01</td>
</tr>
<tr>
<td></td>
<td>C2</td>
<td>+0.29</td>
<td>+1.90</td>
<td>+0.21</td>
<td>+0.25</td>
</tr>
<tr>
<td>(6)</td>
<td>C1</td>
<td>-0.06</td>
<td>-0.08</td>
<td>-0.04</td>
<td>-0.07</td>
</tr>
<tr>
<td></td>
<td>C2</td>
<td>+0.05</td>
<td>+2.09</td>
<td>+0.36</td>
<td>+0.08</td>
</tr>
<tr>
<td>(7)</td>
<td>C1</td>
<td>-0.00</td>
<td>-0.20</td>
<td>-0.03</td>
<td>-0.04</td>
</tr>
<tr>
<td></td>
<td>C2</td>
<td>+0.05</td>
<td>+1.28</td>
<td>+0.00</td>
<td>+0.01</td>
</tr>
<tr>
<td>(8)</td>
<td>C1</td>
<td>-0.01</td>
<td>-0.34</td>
<td>-0.01</td>
<td>-0.02</td>
</tr>
<tr>
<td></td>
<td>C2</td>
<td>+0.05</td>
<td>+4.62</td>
<td>+0.42</td>
<td>+0.23</td>
</tr>
</tbody>
</table>

and CIF are ±16 and ±24 respectively. The experiments and comparisons are shown in Table 5.1, Table 5.2 and Fig. 5.8. I use Eq. 5.24 to analyze the relative change in motion estimation time ($MET$) and in bits. $\Gamma_{pro}$ is the value measured for the evaluated method (my method or others' methods) and $\Gamma_{jm}$ is the corresponding value of the original JM full mode search, which loops over all inter modes. $\Gamma$ can be $MET$ or $Bits$. As for $\Delta PSNR$, it is calculated by subtracting the $PSNR$ of JM from that of the evaluated algorithm. A '+' in Table 5.2 represents a PSNR gain or an increment of bits, and a '-' represents a PSNR drop or a decrease of bits. It is shown that my scheme is superior to [50] in terms of complexity reduction, especially for clips with slow motion such as 'container.cif'. In case of [52], it can achieve high complexity reduction for clips such as 'container.cif' and 'paris.cif'. However, the situation of fast

motion ('football_qcif/cif'), complex background ('coastguard_cif') or camera shake ('foreman_qcif/cif') greatly degrades the efficiency of this algorithm. As for [54], the $\Delta MET$ is always large among different clips. However, the quality trade-off is also very significant. Figure 5.8 compares the RD curves of the original JM algorithm, others' works and my scheme. The RD curves of the proposed scheme and of the algorithms of [52] and [50] are all very close to JM's curve. However, for [54], the quality loss is obvious, especially for fast motion clips such as 'football_qcif/cif' and 'coastguard_cif'. The quality loss of [54] in the 'foreman_qcif/cif' case is also large due to irregular camera shake. The detailed quality analysis is shown in Table 5.2. Since video quality variation is more sensitive at small QP values, I give the PSNR and bit rate analysis for QP equal to 28, 32, and 36 as an example. Cases with $\Delta PSNR$ below -0.1dB or $\Delta Bits$ larger than 1% are marked in bold font. Most bold cases belong to [54], and its bit increment in fast motion clips ('football_qcif/cif') is very large. In the proposed scheme, the quality loss and bit increase are always trivial, while my scheme also achieves a large complexity reduction for static clips and a comparatively large reduction for clips of different motion types. For the QCIF format, the bit increment is always negative with only negligible PSNR loss. In all, for sequences with different motion types, the proposed algorithm can achieve 21.6% to 53.4% complexity reduction for the inter mode decision process.
5.5 Conclusion remarks

This chapter contributes a fast inter mode decision algorithm. In the pre-stage, spatial-temporal information is used to detect skip mode early, and the current MB's homogeneity is extracted to filter out unpromising small modes. In the motion stage, the MVP's accuracy, the block overlapping and the SAD distribution are analyzed to bypass unnecessary inter modes. Furthermore, the RD costs of big modes are obtained at an early stage and compared with historical ones to speed up the mode decision procedure. Experiments show that the proposed algorithm can achieve up to 53.4% complexity reduction with trivial quality loss and bit increment.
Chapter 6

Conclusions and future work

In this dissertation, the gap between software algorithm and hardware implementation is bridged by hardware oriented algorithms and low power, low cost architectures. The application fields range from small image sizes such as QCIF and CIF, through HDTV formats like 720p and 1080p, to the Super Hi-Vision (SHV) application. As shown in Fig. 6.1, the whole thesis can be summarized in four phases.

Firstly, at the hardware algorithm level, this dissertation presents hardware oriented fast motion estimation algorithms. The proposed schemes provide complexity reduction based on the hardware data flow. The complexity reduction of this part falls into three categories. The first one is the MRF technique. With analysis in the frequency domain, the MRF on the low frequency image part is removed. Also, similarity analysis on the 9 center points is executed for further reduction on the stationary part of the image. The second category is the search range (SR) adjustment. By extracting motion features during the block matching process on the first frame, MBs with a small motion trend are restricted to the center region, so redundant search points are eliminated. Moreover, for other moving MBs, a recursive search range adjustment scheme is adopted for further reduction of search points. The third category is the matching pattern (MP). Compared with conventional direct sub-sampling, the proposed adaptive scheme not only takes quality into consideration, but also achieves complexity reduction in a reasonable way. By combining all the schemes, the proposed hardware oriented algorithm achieves an average 88.53% reduction of ME time among different sequences. Also, the proposed scheme can be easily applied to an existing 4-stage based real-time encoder. With some extra control modules, the proposed MRF and SR schemes can be realized in an existing encoder with only 27.68% of the original processing cycles.

Secondly, based on adaptive sub-sampling, two flexible architectures are presented. The pros and cons of the adaptive algorithm for existing fixed architectures are analyzed in this dissertation. The proposed architectures are based on optimization of the existing SAD Tree and PPSAD structures, which are two efficient hardwired structures for various applications. In the proposed structures, pixel organization is applied at both the architecture level and the memory level. So, full data reuse and hardware utilization can be achieved, which results in low power and low processing time. Moreover, circuit optimization is discussed in this dissertation, and further reduction in hardware cost and power dissipation can be attained based on my proposal. Experimental results show that the proposed RSADT and APPSAD can achieve up to 38.8% and 39.8% reduction in power consumption, respectively. On average, about 53.8% power can be reduced by the proposed flexible architectures.

Thirdly, with the expansion of image size, the throughput issue for extremely large images arises. In detail, the hardware accelerator for SHV images has become a hot topic. If existing designs are simply extended to the SHV specification, the hardware size, power and design effort become prohibitive. In this dissertation, two low design effort hardware accelerators for FME and intra processing are presented. With algorithm optimization, parallel architecture, and 2-level (MB and frame level) data flow, the proposed FME engine can handle 4k×4k@60fps real-time processing. As for the intra engine, based on the proposed 2-block data flow and fully utilized predictor generation architecture, this dissertation presents a high speed intra predictor generation engine for the 4k×2k@60fps specification. All in all, 85.92% and 85.88% of the design effort is reduced for the SHV based FME and intra engines, respectively.

Fourthly, the mode decision part is also discussed in this dissertation. Although it is very hard to realize a fast mode decision algorithm in hardware, fast decision for images with different features is very promising in the video compression field. In this dissertation, proposals that fully consider the features of the image are presented. The complexity reduction in the proposed algorithm is realized in a multi-stage way. About 53.4% of the complexity in the inter mode decision part can be reduced, and the proposed algorithm is superior to other schemes for sequences with various motion features.

As for the power issue, as shown in Fig. 6.1, by combining all the hardware oriented algorithms and the fast mode decision algorithm, the final power consumption in the IME part is only 9.32% of the original 4-stage based design. As for the FME and intra parts, compared with extension of recent works, the final estimated power is only 6.69% in the FME engine and 32.76% in the intra part.

In the future, H.265 will come into existence. Questions such as how much H.265 can achieve, or how H.265 compares with H.264/AVC, will become hot topics once the H.265 standard is completed. Also, the ever increasing demand for ultra high resolution images makes low power and low cost real-time encoders attractive to the market. In this dissertation, I mainly give solutions for the FME and intra parts. For the whole encoder, problems such as a high throughput IME engine and efficient arithmetic coding tools remain to be solved. Furthermore, 3-D video processing has attracted much attention in recent years. Some researchers have already proposed algorithms and parallel architectures. However, the current status is still far from satisfactory. Ultra low power and high quality video processing requires deep exploration not only in signal and video processing but also in circuit design and manufacturing technology.

To sum up, this dissertation covers a wide research area in the video compression field. The complexity reduction is achieved in a hardware oriented way. With related flexible structures and low design effort architectures, key issues in ASIC design such as hardware cost, power consumption, and throughput are addressed by several proposals in this dissertation.
Acknowledgements

First of all, I would like to show my deepest gratitude to my loving wife, my parents and all my family members. Owing to your support, I have accomplished a lot in my life.

Secondly, I would like to express my gratitude to my supervisor, Associate Professor Takeshi Ikenaga. Under his guidance, I finished my Ph.D. research and completed this dissertation. Thank you very much for your guidance during my stay at Waseda University. Also, many thanks to Professor Yasuo Matsuyama, Professor Jiro Katto, Professor Shinji Kimura, and Professor Takeshi Yoshimura. Thank you for offering me many suggestions toward the final completion of this dissertation.

Thirdly, I would like to thank all members of Ikenaga laboratory. Thank you for sharing a wonderful time with me. Especially, I would like to thank Dr. Qin Liu for his encouragement and collaboration in my research. Many thanks to Mr. Lkhagvajantsan Damdinsuren, Miss Jia Su, Mr. Lei Wang, Mr. Jiachen Zhou, Mr. Shuijiong Wu, Mr. Zhewen Zheng, Mr. Jingbang Qiu, Mr. Tianci Huang, Mr. Xiaocong Jin, Mr. Jin Zhou, Mr. Bingrong Wang, Mr. Lei Sun, Miss Ying Lu and Miss Chenjiao Guo, for everything you have done for me. Also, many thanks to the Japanese students Mr. Takahiro Mori, Mr. Koichi Nakamura, Mr. Shinsuke Ushiki, Mr. Takahiro Sakayori, Mr. Kodai Kawane, and Mr. Tuyoshi Sasaki, for your kind help in both research and my daily life.

Furthermore, I would like to thank Dr. Yoshiro Tsuboi, Mr. Masaki Nakagawa and Dr. Shunichi Ishiwata of Toshiba Corporation Semiconductor Company for offering suggestions and opinions during research discussion. Also, I would like to thank Dr. Shinichi Sakaida and Mr. Kazuhisa Iguchi of Japan Broadcasting Corporation for fruitful discussion on Super Hi-Vision image evaluation.

Lastly, I would like to acknowledge the support from CREST and Global COE program. Thank you very much for the support of all those international and domestic conferences.
References



of the 2005 Asia and South Pacific Design Automation Conference, volume 1, pages 631–634, January 2005.


[41] Z. Liu, Y. Song, M. Shao, S. Li, L. Li, S. Ishiwata, M. Nakagawa, S. Goto, and


[48] T. Chen, Y. Chen, C. Tsai, S. Tsai, S. Chien, and L. Chen. 2.8 to 67.2mw low-power and power-aware H.264 encoder for mobile applications. In *VLSI Symposium ’07.*


Publications

Journal Papers (with review)

[1] Yiqing Huang, Takeshi Ikenaga, “Highly Parallel Fractional Motion Estimation Engine for Super Hi-Vision 4k × 4k@60fps”, IEICE Electronics, March, 2010 (accepted).


International Conference (with review)


tle, USA, pp. 844-847, May, 2008.


[20] Shuijiong Wu, Yiqing Huang, Qin Liu, Takeshi Ikenaga, “Bit-Usage Analysis Based Frame Layer QP Adjustment for H.264/AVC Rate Control at Low Bit-Rate”, The 24th International Technical Conference on Circuits and Systems, Computers and Communications (ITC-CSCC


Domestic Conference (with review)


Domestic Conference (without review)


Invited Paper


Awards

CSPA 2009 Best Paper Award

ISOCC 2009 Samsung Award

2007 Excellent Student Award of The IEEE Fukuoka Section