Novel VLSI architecture of motion estimation and compensation for H.264 standard by Li, Xiang
Rochester Institute of Technology 
2004 
Recommended Citation 
Li, Xiang, "Novel VLSI architecture of motion estimation and compensation for H.264 standard" (2004). 
Thesis. Rochester Institute of Technology. Accessed from 
Novel VLSI architecture 
of Motion Estimation and Compensation 




A Thesis Submitted
in
Partial Fulfillment
of the Requirements for the Degree of
MASTER OF SCIENCE
in
Electrical Engineering
Dr. Kenneth W. Hsu
Dr. Pratapa Reddy
Dr. Dorin Patru
Dr. Robert J. Bowman (Department Head)
Department of Electrical Engineering 
College of Engineering 
Rochester Institute of Technology 
Rochester, New York 
August 2004 
RELEASE PERMISSION FORM 
Rochester Institute of Technology 
Novel VLSI architecture 
of Motion Estimation and Compensation 
for H.264 standard 
I, Xiang Li, hereby grant permission to any individual or organization to reproduce this




This thesis presents a novel high-performance VLSI
architecture for an H.264 motion estimator, which can be
used as a building block for real-time H.264 video
compression. A full-search block-matching algorithm is
used in this design. A pipeline structure was developed so
that the variable block size processing units work in parallel.
At a clock speed of 125 MHz, the design supports real-time
motion estimation at a frame rate of 25 frames/sec and a
resolution of 640x480. The processing speed is also
independent of the threshold level of the Sum of Absolute
Differences (SAD), which is used to determine the size of
the macro block. The architecture is implemented in
Register Transfer Level VHDL code and then synthesized
with Synopsys Design Compiler, using TSMC 0.25um
technology. The synthesized Application





CHAPTER 1: Introduction
1.1 H.264 features
1.2 Block-matching Algorithm
1.3 Description of global VLSI Block Diagram
CHAPTER 2: Literature review
2.1 Origin of H.264
2.2 H.264 Codec
2.3 Motion Estimation and Compensation
2.4 Existing popular ME/MC Algorithms
CHAPTER 3: Dataflow and VLSI Architecture Design
3.1 Dataflow Diagram for 16x16, 8x8, and 4x4 ME
3.2 VLSI Architecture for 16x16, 8x8, and 4x4 ME
3.3 Dataflow Diagram for Fractional ME
3.4 VLSI Architecture for Fractional ME
CHAPTER 4: Behavior VHDL design
CHAPTER 5: Register Transfer Level VHDL design
5.1 16x16 processing block
5.2 8x8, 4x4 processing blocks
5.3 Fractional MV processing block
CHAPTER 6: Simulation Results and Analysis
6.1: Simulation (Gymnast)
6.2: Simulation (Artist with big movement)
CHAPTER 7: Synthesis of RTL VHDL codes
7.1: Constraints for Synthesis
7.2: Area Report
7.3: Timing Report
7.4: Violation Report
CHAPTER 8: Conclusion
REFERENCE
ACKNOWLEDGMENT







A-7: Memory previous frame
A-8: Memory current frame
A-9: Mux between memory and interconnection
A-10: Transfer Unit
A-11: Bridging unit
A-12: ME top level
A-13: 8x8 AGU
A-14: 8x8 Interconnection
A-15: 8x8 Controller
A-16: 4x4 AGU
A-17: 4x4 Interconnection








APPENDIX B: Source Codes
B-1: Behavior VHDL code







B-2.7: Memory previous frame
B-2.8: Memory current frame
B-2.9: Mux between memory and interconnection
B-2.10: Transfer Unit
B-2.11: Bridging unit
B-2.12: ME top level
B-2.13: 8x8 AGU
B-2.14: 8x8 Interconnection
B-2.15: 8x8 Controller
B-2.16: 4x4 AGU
B-2.17: 4x4 Interconnection














Address Generation Unit, a unit that generates a sequence of memory addresses in a
specific way.
ASIC
Application Specific Integrated Circuit.
Behavior VHDL code
A technique used to describe circuit functionality at a high level. Generally, the
architecture cannot be inferred from the description.
VHDL
VHSIC (Very High Speed Integrated Circuit) Hardware Description Language.
MC
Motion Compensation, a technique that subtracts the motion-estimated previous frame
from the current frame in order to obtain the residual.
ME
Motion Estimation, a technique to predict the current frame from the previous frame by
calculating the motion vectors and residuals.
MV
Motion Vector, used to describe the displacement between two matching blocks in
neighboring frames.
RTL VHDL code
Register Transfer Level VHDL code, a technique used to describe a circuit at the state
machine and data path level. Generally, the architecture can be easily inferred from the
description.
SAD





1.1 H.264 features in this design.
The latest H.264 standard provides adaptive and powerful coding schemes, including
tree-structured motion compensation and quarter-pixel motion vectors. These features
yield a smaller residual through more accurate motion vectors, but require more hardware.
It is not easy to implement an H.264 codec as a real-time system due to its high memory
bandwidth requirement and intensive computation [1]. Variable block sizes and
quarter-pixel motion vectors, the key features of the H.264 standard, demand substantial
computation. Most existing fast motion estimation algorithms are not suitable [2] for
H.264 with its variable block sizes.
In this thesis, a novel VLSI architecture is proposed and implemented; the following
highly desirable features are achieved:
1. Pipelined data processing allows the variable block size computation to work as fast as
the traditional 16x16 block size motion estimation.
2. The full-search algorithm is implemented, which is the optimal solution in block
matching.
3. Sequential inputs from memory reduce the memory bandwidth and pin count.
4. 8x8 and 4x4 block matching requires only local memory access, using very small
on-chip memories.
5. An adaptive design balances the computation complexity and compression ratio for
specific applications.
1.2 Block-matching Algorithm.
There is a significant amount of frame-to-frame redundancy in a full-motion video
sequence. Usually the scenes in successive frames are highly correlated. Motion
estimation/compensation is the inter-frame coding step that reduces the redundant
information and achieves a high data compression ratio.
Motion estimation is in most cases based on a search scheme which tries to find the best
matching position of a macro-block (MB) of the current frame with a block of the same
size within a predetermined or adaptive search range in the previous frame. The position
offset between these 2 matching blocks is called the motion vector (MV). The size of MB is













The key to determining the best motion vector is the SAD (Sum of Absolute Differences),
cf. eq. (1.1):
\[
\mathrm{SAD}_{\min} = \min_{(m,n)} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| A(x+i,\, y+j) - B(x+i+m,\, y+j+n) \right| \qquad \text{(eq 1.1)}
\]
A(x+i, y+j) is a pixel of the macro block from the current frame, and B(x+i+m, y+j+n) is
the pixel of a candidate matching block from the previous frame with candidate MV (m,n),
where m and n range over the search range. N is the size of the macro block.
In this design, the search range is from -8 to +7 in each direction, which means
16x16 = 256 candidate motion vectors in the search area. The size of the macro block can
vary from 16x16 to 8x8 to 4x4; the size of the search area then varies from 32x32 to
24x24 to 16x16.
The motion vector (m,n) can also be represented as a single number, eq 1.2:
\[
MV = 16\,m + n \qquad \text{(eq 1.2)}
\]
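The full search defined by eq 1.1 and the packing of eq 1.2 can be sketched in software as follows. This is only an illustrative model, not the thesis's VHDL: the function names, the `lo`/`hi` search-range parameters, the skipping of candidates that fall outside the frame, and the shift of signed offsets into 0..15 before packing are all assumptions made for the sketch.

```python
def sad(cur, prev, x, y, m, n, size):
    """Sum of absolute differences between the current block at (x, y)
    and the candidate block displaced by (m, n) in the previous frame."""
    total = 0
    for i in range(size):
        for j in range(size):
            total += abs(cur[x + i][y + j] - prev[x + i + m][y + j + n])
    return total


def full_search(cur, prev, x, y, size, lo, hi):
    """Test every candidate (m, n) in [lo, hi] x [lo, hi] and return
    (min_sad, m, n). Candidates leaving the frame are skipped (assumption)."""
    rows, cols = len(prev), len(prev[0])
    best = None
    for m in range(lo, hi + 1):
        for n in range(lo, hi + 1):
            inside = (0 <= x + m and x + m + size <= rows and
                      0 <= y + n and y + n + size <= cols)
            if not inside:
                continue
            s = sad(cur, prev, x, y, m, n, size)
            if best is None or s < best[0]:
                best = (s, m, n)
    return best


def pack_mv(m, n, lo):
    """eq 1.2, MV = 16*m + n, applied to offsets shifted to 0..15.
    The shift by `lo` is a hypothetical convention, not from the thesis."""
    return 16 * (m - lo) + (n - lo)
```

Because every candidate is evaluated, the result is the global minimum of eq 1.1 over the search range, which is why full search is treated as the reference against which fast algorithms are compared.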
The goal of block matching is to find the matching block with the smallest SAD, which
also yields the smallest residual. Since all the residuals and motion vectors are
transmitted to the entropy encoder, choosing a large partition size (e.g. 16x16) means that
a small number of bits are required to signal the choice of motion vector(s) and the type
of partition; however, the motion compensated residual may contain a significant amount
of energy in frame areas with high detail. Choosing a small partition size (e.g. 8x8, 4x4)
may give a lower-energy residual after motion compensation but requires a larger number
of bits to signal the motion vectors and choice of partition(s).
In general, a large partition size is appropriate for homogeneous areas of the frame and a
small partition size may be beneficial for detailed areas. Because the minimum SAD
directly reflects the size of the residual, it is used to determine whether smaller block size
processing is needed: if the minimum SAD of a larger block exceeds the threshold, the
smaller blocks are processed.
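The threshold rule above can be sketched as a small decision function. This is a simplified illustration: the name `select_block_size`, the dictionary layout, and the use of one minimum SAD per block size (rather than one per sub-block) are assumptions, not details taken from the design.

```python
def select_block_size(min_sad, threshold, sizes=(16, 8, 4)):
    """Walk from the largest block size to the smallest and stop at the
    first size whose minimum SAD does not exceed its threshold.

    min_sad and threshold map a block size to a number (hypothetical layout).
    The smallest size is the fallback and needs no threshold.
    """
    for size in sizes[:-1]:
        if min_sad[size] <= threshold[size]:
            return size
    return sizes[-1]


# A flat region matches well at 16x16, so no further processing is needed.
flat = select_block_size({16: 900, 8: 300}, {16: 1000, 8: 200})
# A detailed region fails both thresholds and falls through to 4x4.
detailed = select_block_size({16: 1500, 8: 300}, {16: 1000, 8: 200})
```

Raising the thresholds favors large partitions (fewer motion vector bits, less computation); lowering them favors small partitions (lower-energy residual).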
1.3 Description of global VLSI Block Diagram
A global block diagram of the architecture is shown in Fig. 3.1. It basically consists of
4 pipelined blocks that process 16x16, 8x8, 4x4 and quarter-pixel motion vectors, with


































































Figure 1-2 Global Block Diagram
Every processing block in this architecture has roughly the same processing time in order
to achieve maximum pipeline efficiency. Only local accesses to small memories are
needed, except for the 16x16 processing block. This drastically decreases the memory
bandwidth and pin count, which also decreases the power consumption.
For real-time processing of 640x480 video at a frame rate of 25 frames/sec,
640x480x25x16 = 122.88M clock cycles are needed per second (16 clock cycles are
needed for every pixel in this design).
After synthesis, a clock frequency of 125 MHz is achieved, which is fast enough for
real-time processing.
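The cycle budget can be checked with a few lines of arithmetic. The function name is hypothetical; the figures (640x480, 25 frames/sec, 16 cycles per pixel, 125 MHz) are the ones quoted in the text.

```python
def required_cycles_per_second(width, height, fps, cycles_per_pixel):
    """Clock cycles per second needed to keep up with the video stream."""
    return width * height * fps * cycles_per_pixel


needed = required_cycles_per_second(640, 480, 25, 16)   # 122880000
achieved = 125_000_000                                   # synthesized clock, Hz
headroom = achieved - needed                             # spare cycles per second
real_time = needed <= achieved
```

The margin is small (about 1.7% of the clock rate), so a higher resolution or frame rate would need a faster clock or fewer cycles per pixel.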
CHAPTER 2: Literature review
2.1 Origin of H.264
Broadcast television and home entertainment are being revolutionized by the advent of
digital TV and DVD-video. These applications and many more were made possible by
the standardization of video compression technology. The next standard in the MPEG
series, MPEG4, is enabling a new generation of internet-based video applications, while
the ITU-T H.263 standard for video compression is now widely used in
videoconferencing systems.
MPEG4 and H.263 are standards that are based on video compression ("video coding")
technology from circa 1995. The groups responsible for these standards, the Moving
Picture Experts Group and the Video Coding Experts Group (MPEG and VCEG), are in
the final stages of developing a new standard that promises to significantly outperform
MPEG4 and H.263, providing better compression of video images together with a range
of features supporting high-quality, low-bit-rate streaming video. The new standard,
"Advanced Video Coding" (AVC), is also known by its old working title, H.26L, and by
its ITU document number, H.264. [3]
2.2 H.264 Codec [1]
In common with earlier standards (such as MPEG1, MPEG2 and MPEG4), the H.264
draft standard does not explicitly define a CODEC (enCOder / DECoder pair). Rather,
the standard defines the syntax of an encoded video bit-stream together with the method
of decoding this bit-stream. In practice, however, a compliant encoder and decoder are
likely to include the functional elements shown in Figure 2-1 and Figure 2-2. There is
scope for considerable variation in the structure of the CODEC. The basic functional
elements (prediction, transform, quantization, entropy encoding) are little different from
previous standards (MPEG1, MPEG2, MPEG4, H.261, H.263); the important changes in
H.264 occur in the details of each functional element. The Encoder (Figure 2-1) includes
two dataflow paths, a "forward" path (left to right) and a "reconstruction" path (right to
left). The dataflow path in the Decoder (Figure 2-2) is shown from right to left to
































[Figure: decoder block diagram with Entropy decode, Reorder, Q^-1, T^-1 and Filter blocks]
Figure 2-2 AVC Decoder [4]
2.2.1 Encoder (forward path)
An input frame Fn is presented for encoding. The frame is processed in units of a
macroblock (corresponding to 16x16 pixels in the original image). For each macroblock
that is encoded, a prediction macroblock P is formed based on a reconstructed
frame. P is formed by motion-compensated prediction from one or more reference
frame(s). In the Figures, the reference frame is shown as the previous encoded frame
F'n-1; however, the prediction for each macroblock may be formed from one or two past
or future frames (in time order) that have already been encoded and reconstructed. The
prediction P is subtracted from the current macroblock to produce a residual or difference
macroblock Dn. This is transformed (using a block transform) and quantized to give X, a
set of quantized transform coefficients. These coefficients are re-ordered and entropy
encoded. The entropy encoded coefficients, together with the side information required to
decode the macroblock (such as the macroblock prediction mode, quantizer step size,
motion vector information describing how the macroblock was motion-compensated, etc.),
form the compressed bitstream. This is passed to a Network Abstraction Layer (NAL) for
transmission or storage. [5]
2.2.2 Encoder (reconstruction path)
The quantized macroblock coefficients X are decoded in order to reconstruct a frame for
encoding of further macroblocks. The coefficients X are re-scaled (Q^-1) and inverse
transformed (T^-1) to produce a difference macroblock Dn'. This is not identical to the
original difference macroblock Dn; the quantization process introduces losses, and so
Dn' is a distorted version of Dn.
The prediction macroblock P is added to Dn' to create a reconstructed macroblock uF'n
(a distorted version of the original macroblock). A filter is applied to reduce the effects of
blocking distortion, and a reconstructed reference frame is created from a series of
macroblocks F'n.
2.2.3 Decoder
The decoder receives a compressed bitstream from the NAL. The data elements are
entropy decoded and reordered to produce a set of quantized coefficients X. These are
rescaled and inverse transformed to give Dn' (this is identical to the Dn' shown in the
Encoder). Using the header information decoded from the bitstream, the decoder creates a
prediction macroblock P, identical to the original prediction P formed in the encoder. P is
added to Dn' to produce uF'n, which is filtered to create the decoded macroblock F'n.
It should be clear from the Figures and from the discussion above that the purpose of the
reconstruction path in the encoder is to ensure that both encoder and decoder use identical
reference frames to create the prediction P. If this is not the case, then the predictions P in
encoder and decoder will not be identical, leading to an increasing error or "drift"
between the encoder and decoder. [5]
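The role of the reconstruction path can be illustrated with a toy model. The sketch below is deliberately simplified and is not the H.264 algorithm: it omits the block transform, reordering, entropy coding and the deblocking filter, and uses a plain scalar quantizer with a hypothetical step size. All function names are assumptions.

```python
def quantize(residual, step):
    """Toy scalar quantizer standing in for transform + quantization."""
    return [(v + step // 2) // step for v in residual]


def rescale(coeffs, step):
    """Inverse of the toy quantizer (the Q^-1 stage)."""
    return [v * step for v in coeffs]


def encode_block(cur, pred, step):
    """Forward path plus reconstruction path for one block of pixels."""
    residual = [c - p for c, p in zip(cur, pred)]                # Dn
    x = quantize(residual, step)                                 # X
    recon = [p + d for p, d in zip(pred, rescale(x, step))]      # P + Dn'
    return x, recon


def decode_block(x, pred, step):
    """Decoder: rebuild the block from X and the same prediction P."""
    return [p + d for p, d in zip(pred, rescale(x, step))]


cur = [100, 104, 98, 90]
pred = [96, 100, 100, 100]
x, encoder_recon = encode_block(cur, pred, 4)
decoder_recon = decode_block(x, pred, 4)
```

Here `encoder_recon` differs from `cur` because quantization is lossy, but it equals `decoder_recon`. If the encoder predicted the next frame from `cur` instead of `encoder_recon`, its reference would differ from the decoder's and the mismatch would accumulate as drift.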
2.3 Motion estimation and compensation
2.3.1 Introduction
The AVC CODEC uses block-based motion compensation, the same principle adopted by
every major coding standard since H.261. Important differences from earlier standards
include the support for a range of block sizes (down to 4x4) and fine sub-pixel motion
vectors (1/4 pixel).
2.3.2 Tree structured motion compensation
AVC supports motion compensation block sizes ranging from 16x16 to 4x4 luminance
samples with many options between the two. The luminance component of each
macroblock (16x16 samples) may be split up in 4 ways as shown in Figure 2-3: 16x16,
16x8, 8x16 or 8x8. Each of the sub-divided regions is a macroblock partition. If the 8x8
mode is chosen, each of the four 8x8 macroblock partitions within the macroblock may
be split in a further 4 ways as shown in Figure 2-4: 8x8, 8x4, 4x8 or 4x4 (known as
macroblock sub-partitions). These partitions and sub-partitions give rise to a large
number of possible combinations within each macroblock. This method of partitioning








Figure 2-4 Macro block sub-partitions: 8x8, 4x8, 8x4, 4x4 [5]
A separate motion vector is required for each partition or sub-partition. Each motion
vector must be coded and transmitted; in addition, the choice of partition(s) must be
encoded in the compressed bit-stream. Choosing a large partition size (e.g. 16x16, 16x8,
8x16) means that a small number of bits are required to signal the choice of motion
vector(s) and the type of partition; however, the motion compensated residual may
contain a significant amount of energy in frame areas with high detail. Choosing a small
partition size (e.g. 8x4, 4x4, etc.) may give a lower-energy residual after motion
compensation but requires a larger number of bits to signal the motion vectors and choice
of partition(s). The choice of partition size therefore has a significant impact on
compression performance. In general, a large partition size is appropriate for
homogeneous areas of the frame and a small partition size may be beneficial for detailed
areas. [6]
In our design, 8x4 and 16x8 blocks are not used in order to reduce the complexity.
Example: Figure 2-5 shows a residual frame (without motion compensation). The AVC
reference encoder selects the "best" partition size for each part of the frame, i.e. the
partition size that minimizes the coded residual and motion vectors. The macro block
partitions chosen for each area are shown superimposed on the residual frame. In areas
where there is little change between the frames (residual appears grey), a 16x16 partition
is chosen; in areas of detailed motion (residual appears black or white), smaller partitions
are more efficient. [6]
Figure 2-5 Residual (without MC) showing optimum choice of partitions [6]
2.3.3 Sub-pixel motion vectors
Each partition in an inter-coded macro block is predicted from an area of the same size in
a reference picture. The offset between the two areas (the motion vector) has 1/4-pixel
resolution (for the luma component). Figure 2-6 gives an example. A 4x4 sub-partition in
the current frame (a) is to be predicted from a neighboring region of the reference picture.
If the horizontal and vertical components of the motion vector are integers (b), the
relevant samples in the reference block actually exist (grey dots). If one or both vector
components are fractional values (c), the prediction samples (grey dots) are generated by
interpolation between adjacent samples in the reference frame (white dots).
[Figure panels: (a) 4x4 block in current frame; (b) Reference block: vector (1, -1); (c) Reference block: vector (0.75, -0.5)]
Figure 2-6 Example of integer and sub-pixel prediction [6]
Sub-pixel motion compensation can provide significantly better compression
performance than integer-pixel compensation, at the expense of increased complexity.
Quarter-pixel accuracy outperforms half-pixel accuracy.
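As an illustration of how fractional samples are generated, the sketch below applies the H.264 luma interpolation rules along a single row: half-sample positions use the 6-tap filter (1, -5, 20, 20, -5, 1) with rounding and clipping, and quarter-sample positions average the neighboring integer and half samples. It is a one-dimensional sketch only; the function names are hypothetical, and the interpolation units of this design may be organized differently.

```python
def clip255(v):
    """Clamp a value to the 8-bit pixel range."""
    return max(0, min(255, v))


def half_pel(row, k):
    """Half-sample between row[k] and row[k + 1].
    Needs the six integer samples row[k - 2] .. row[k + 3]."""
    e, f, g, h, i, j = row[k - 2:k + 4]
    return clip255((e - 5 * f + 20 * g + 20 * h - 5 * i + j + 16) >> 5)


def quarter_pel(row, k):
    """Quarter-sample between row[k] and the half-sample to its right."""
    return (row[k] + half_pel(row, k) + 1) >> 1


ramp = [0, 10, 20, 30, 40, 50, 60, 70]
half = half_pel(ramp, 3)        # between the samples 30 and 40
quarter = quarter_pel(ramp, 3)  # between 30 and that half-sample
```

The cost is visible even in this sketch: every half-sample needs six multiplies and adds, which is part of the increased complexity that sub-pixel compensation brings.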
2.4 Existing popular ME/MC Algorithms
Motion estimation has proven to be effective in exploiting the temporal redundancy of
video sequences and is therefore a central part of the ISO/IEC MPEG-1, MPEG-2, MPEG-4
and the H.263 and H.264 video compression standards. These video compression
schemes are based on a block-based hybrid-coding concept, which was extended within
the MPEG-4 standardization effort to support arbitrarily-shaped video objects.
Motion estimation algorithms have attracted much attention in research and industry
for the following reasons:
1. It is the most computationally demanding algorithm of a video encoder (about 60%-
80% of the total computation time) [7], which limits the performance of the
encoder in terms of encoding speed.
2. The motion estimation algorithm has a high impact on the visual performance of
an encoder for a given bit rate.
3. Finally, the method to extract motion vectors from the video material is not
standardized, thus being open to competition.
Full-search block matching is the most hardware-friendly algorithm [8].
Some hardware-based fast search algorithms do exist, but none of them
supports variable block size [9].
Four example fast search algorithms are briefly described below:
1. 2D logarithmic search
2DLOG is a fast search algorithm based on minimum distortion, in which the distortion
metric is calculated only for a sparse sampling of the full-search area. The step size of
the search is reduced by n/2 with every search step.
2. Three Step Search
TSS uses a structure similar to 2DLOG but with SAD as the distortion metric. TSS is one
of the most popular fast motion estimation algorithms; it requires a fixed number of 25
search points and is often used as a reference.
3. New Three-Step Search [10]
NTSS adds additional checking points around the zero MV in the first step of TSS. It is
more robust and yields fewer errors than TSS. However, for larger MVs NTSS takes more
computation power.
4. Diamond search
DS is a search algorithm with a diamond-shaped search pattern. There is an advanced DS
that uses diamond-shaped search areas of different sizes [11]. However, there is no
hardware architecture for this algorithm that also supports variable block size.
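To make the contrast with full search concrete, the following is a minimal sketch of the classic Three Step Search. It assumes a search range of +/-7 and a caller-supplied `cost(m, n)` function returning the SAD of a candidate; both the function name and the interface are hypothetical.

```python
def three_step_search(cost, start_step=4):
    """Classic TSS: test the centre once, then the 8 neighbours of the
    current centre at step sizes 4, 2 and 1, moving the centre to the
    best point after each step.

    Returns (best_mv, best_cost, points_checked)."""
    center = (0, 0)
    best = cost(0, 0)
    checked = 1
    step = start_step
    while step >= 1:
        best_here = center
        for dm in (-step, 0, step):
            for dn in (-step, 0, step):
                if dm == 0 and dn == 0:
                    continue  # centre already evaluated
                cand = (center[0] + dm, center[1] + dn)
                c = cost(*cand)
                checked += 1
                if c < best:
                    best, best_here = c, cand
        center = best_here
        step //= 2
    return center, best, checked


# Synthetic cost surface with a single minimum at (5, -3).
result = three_step_search(lambda m, n: abs(m - 5) + abs(n + 3))
```

TSS evaluates 1 + 3 x 8 = 25 candidates instead of the 256 of a 16x16 full search, but it can settle in a local minimum when the SAD surface is not unimodal, and its data access pattern depends on intermediate results, which complicates a regular hardware dataflow.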
CHAPTER 3: Dataflow and VLSI Architecture Design
The 16x16 dataflow is shown in Table 3-1. With some modification, the 8x8 and 4x4
dataflows are obtained, as shown in Table 3-2 and Table 3-3. These are all based on the
full-search block matching algorithm.
3.1 Dataflow Diagram for 16x16, 8x8, and 4x4 Motion Estimation
In our design, the first processing block size is 16x16, and the tracking range is -8 to +7
pixels. The following dataflow processes the first line of 16 possible matching blocks
with 16 Processing Elements [12]. Current frame data is shifted through all the
Processing Elements, and the previous frame data is broadcast.
The tables below assume that the current block (a) starts from (0,0), and that the search
area (b) in the previous frame also starts from (0,0). In the actual design the start points
may vary depending on the location of the block.
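The shift-and-broadcast scheme can be modeled in software to check that it produces the same SADs as a direct computation. The sketch below is a simplified, assumption-laden model: it handles one line of candidates, processes the rows of the block one after another (the hardware overlaps consecutive rows using a second previous-frame input), and uses hypothetical names throughout.

```python
def pe_array_row_pass(cur_block, search_rows, n=16):
    """Model one line of n candidate blocks on n processing elements.

    cur_block:   n x n pixels a(r, c) of the current block.
    search_rows: n rows of 2n - 1 pixels b(r, t) from the previous frame.
    Returns n SADs; PE k holds the SAD for horizontal displacement k.
    """
    sad = [0] * n
    for r in range(n):
        shift = [None] * n                  # shift[k] is the pixel held by PE k
        for t in range(2 * n - 1):
            new = cur_block[r][t] if t < n else None
            shift = [new] + shift[:-1]      # current pixels move one PE per cycle
            b = search_rows[r][t]           # previous-frame pixel is broadcast
            for k in range(n):
                if shift[k] is not None:    # PE k sees a(r, t - k) and b(r, t)
                    sad[k] += abs(shift[k] - b)
    return sad
```

In cycle t, PE k holds a(r, t - k) and receives b(r, t), so it accumulates |a(r, c) - b(r, c + k)| over all c, which is the SAD of the candidate displaced by k columns. Each memory word is read once and reused by all PEs, which is the source of the bandwidth saving.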




will be provided to the PE
array for calculation.
Data Sequence PE0 PE1 ... PE14 PE15









a(0,15) b(0,15) a(0,14)-b(0,15) a(0,1)-b(0,15) a(0,0)-b(0,15)
a(1,0) b(1,0) b(0,16) a(1,0)-b(1,0) a(0,15)-b(0,16) a(0,2)-b(0,16) a(0,1)-b(0,16)







a(1,2) b(1,2) b(0,18) a(1,1)-b(1,2) a(0,3)-b(0,18)
a(1,14) b(1,14) b(0,30) a(1,13)-b(1,14) a(1,0)-b(1,14) a(0,15)-b(0,30)
a(1,15) b(1,15) b(0,31) a(1,14)-b(1,15) a(1,1)-b(1,15) a(1,0)-b(1,15)
a(2,0) b(2,0) b(1,16) a(2,0)-b(2,0) a(1,15)-b(1,16) a(1,2)-b(1,16) a(1,1)-b(1,16)









a(15,0) b(15,0) b(14,16) a(14,1)-b(14,16)
a(15,1) b(15,1) b(14,17) a(14,2)-b(14,17)












Table 3-1 Dataflow for 16x16 block size, assuming both start points are (0,0)











8+0 a(1,0) b(1,0) b(0,8)
8+1 a(1,1) b(1,1) b(0,9)
8+2 a(1,2) b(1,2) b(0,10)
8+6 a(1,6) b(1,6) b(0,14)
8+7 a(1,7) b(1,7) b(0,15)
2*8+0 a(2,0) b(2,0) b(1,8) b(0,16)
2*8+1 a(2,1) b(2,1) b(1,9) b(0,17)
...
7*8+0 a(7,0) b(7,0) b(6,8) b(5,16)
7*8+1 a(7,1) b(7,1) b(6,9) b(5,17)
...





















4+0 a(1,0) b(1,0) b(0,4)
4+1 a(1,1) b(1,1) b(0,5)
4+2 a(1,2) b(1,2) b(0,6)
4+3 a(1,3) b(1,3) b(0,7)
2*4+0 a(2,0) b(2,0) b(1,4) b(0,8)
2*4+1 a(2,1) b(2,1) b(1,5) b(0,9)
2*4+2 a(2,2) b(2,2) b(1,6) b(0,10)
2*4+3 a(2,3) b(2,3) b(1,7) b(0,11)
3*4+0 a(3,0) b(3,0) b(2,4) b(1,8) b(0,12)
3*4+1 a(3,1) b(3,1) b(2,5) b(1,9) b(0,13)
3*4+2 a(3,2) b(3,2) b(2,6) b(1,10) b(0,14)
3*4+3 a(3,3) b(3,3) b(2,7) b(1,11) b(0,15)
4*4+0 b(3,4) b(2,8) b(1,12) b(0,16)
4*4+1 b(3,5) b(2,9) b(1,13) b(0,17)
4*4+2 b(3,6) b(2,10) b(1,14) b(0,18)
4*4+3 b(3,7) b(2,11) b(1,15) b(0,19)
5*4+0 b(3,8) b(2,12) b(1,16)
5*4+1 b(3,9) b(2,13) b(1,17)
5*4+2 b(3,10) b(2,14) b(1,18)









Table 3-3 Dataflow for 4x4 block size, assuming both start points are (0,0)
3.2 VLSI Architectures for 16x16, 8x8, and 4x4 Motion Estimation
From the Dataflow tables VLSI architecture was developed. The following figure shows




































Figure 3-1 VLSI Architecture for 16x16, 8x8, 4x4 ME.
The 8x8 and 4x4 processing blocks use the same architecture, with minor changes to the
Interconnection, Controller and Address Generator.
Between the 16x16, 8x8, 4x4, and fractional MV blocks there are bridging units that
coordinate and transfer data between the 4 processing blocks.





















Figure 3-2 Bridging Architecture between 2 Different block size stages
The bridging units between 8x8 and 4x4, and between 4x4 and fractional MV, have
similar architectures.
The design details of all these blocks are discussed in the Register Transfer Level code
chapter.
3.3 Dataflow Diagram for Fractional Motion Estimation
The fractional motion estimation dataflow includes interpolations. The following table
shows the dataflow in fractional motion estimation. Here a start point of (0,0) is assumed.







B0W) B(0,0) B,0(0,0) B2(0,0) B3(0,0) A(0,0)-B0(0,0) A(o,oyB?(0,0)
B0(0,1) B0(0,0) B;(0,0) B2(0,0) Bj(0,0) A(0,0)-fi0(0,0) A(0,0)-Bt(0,0)
B2(0,1) B2(0,0) B2(0,0) B2(0,0) B2(0,0) A(0,0)-fio2(0,0) A(0,0)-fi2(0,0)
B(0,2),B(1,2)
B03(0,1) B3(0,0) B,3(0,0) B23(0,0) B3(0,0) A(0,0)-503(0,0) A(0,0)-fl3(0,0)
B0(0,2) B0(0,1) B,(0,1) B2(0,1) B3(0,1) A(0.1)-fi0(0,l) A(0,1)-B(0,1)
B0(0,2) B0(0,1) B,'(0,1) B2(0,1) B3'(0,1) A(0,1)-B'0 (0,1) A(0,l)-Bfl(0,l)
B2(0,2) B2(0,1) B2(0,1) B2(0,1) B32(0,1) A(0,1)-S02(0,1) A(0,l)-52(0,1)
B3(0,2) B03(0,1) B,3(0,1) B3(0,1) B3(0,1) A(0,1)-B03(0,1) A(0,1)-B3(0,1)
B(0,16),B(I,16)
B0(0,16)
B0(0,16) B0(0,15) B,(0,15) B2(0,15) B3(0,15) A(0,15)-Bfl(0,15) A<0,15)-fl(0,15)
B2(0,16) B0(0,15) B^O,^) B2(0,15) B^O,^) A(0,15)-fi^(0,15) A(0,15)-fl0(0,15)
B03(0,16) B2(0,15) B,2(0,15) B22(0,15) B2(0,15) A(0,15)-B02(0,15) A(0,15)-fi2(0,15)
B3(0,15) B3(0,15) B3(0,15) B3(0,15) A(0,15)-fi03(0,15) A(0,15)-fl3(0,15)
Table 3-4 Dataflow diagram for Fractional Motion Estimation [12]
IP1 is the functional block that interpolates between pixel rows, and IP2 is the
functional block that interpolates between pixel columns. There are only 4 PEs, but with
4 latches inside each PE they can process 16 blocks at the same time.
There are 64 searches in total for quarter-pixel precision, namely all combinations of the
offsets -1, -0.75, -0.5, -0.25, 0, 0.25, 0.5, and 0.75 in each direction.
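The count of 64 follows from pairing the eight quarter-pixel offsets in the two directions. A short enumeration, with hypothetical variable names, makes this explicit:

```python
from itertools import product

# Eight quarter-pixel offsets per direction: -1, -0.75, ..., 0.75
offsets = [q / 4 for q in range(-4, 4)]

# Every (vertical, horizontal) pair is one fractional candidate.
candidates = list(product(offsets, repeat=2))
num_candidates = len(candidates)   # 8 * 8
```

Each candidate is a refinement added to the integer motion vector found by the preceding stage, so the fractional stage searches far fewer positions than the 256 of the integer full search.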
3.4 The VLSI architecture of Fractional ME







Figure 3-3 VLSI Architecture for Fractional ME [12]
IP1 and IP2 interpolate the entire search area, and the pixel data are processed in PE array
and comparator.
CHAPTER 4: Behavior VHDL code
High-level behavior code was written for algorithm verification. It also provides the
"known good" motion vectors used to test the Register Transfer Level code.
As shown in the following code, 2 neighboring frames were used to calculate the motion
vectors, and separate processes were used to calculate the 16x16, 8x8, 4x4, and fractional
motion vectors. Compensation was also performed during each process. The motion
vectors are written to *.pgm files that can be opened with a text editor.
The behavior code is for verification purposes only; it cannot be synthesized. For a VLSI
hardware implementation, Register Transfer Level code needs to be developed,
which is covered in the next chapter.
The source code can be found in Appendix B - 1.
The following segment of behavior code contains the key loops for 16x16 motion
estimation. Variables m and n represent the number of the block; in this case, the frame
size is 256x256, so there are 16 blocks in each row and column. Variables p and q
represent the row and column number of the possible matching block; in this case, we
have a search area of 32x32, so there are 256 possible matching blocks. Variables i and j
represent the locations of pixels in the current possible matching block.
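for m in 1 to 16 loop -- block row
for n in 1 to 16 loop -- block column
for p in 0 to 15 loop -- row of the candidate matching block
for q in 0 to 15 loop -- column of the candidate matching block
for i in ((m-1)*16+p-8) to ((m-1)*16+p+7) loop -- pixel row in the candidate block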




if (i>0 and j>0 and i<256 and j<256) then
-- absolute difference, accumulated per candidate (p,q)
if (prev_frame(i,j)>cur_frame(k,l)) then
ad := prev_frame(i,j)-cur_frame(k,l);
else ad := cur_frame(k,l)-prev_frame(i,j);
end if;
temp_sad_array(p,q) := temp_sad_array(p,q) + ad;









CHAPTER 5: Register transfer level VHDL code
5.1 16X16 Processing block
5.1.1 AGU
The AGU's purpose is to generate the series of memory addresses and the necessary
control signals. As shown below, its inputs come from the Controller, the start point
generator (SPU) and the bridging units.
Input ports:
Start_c_row, Start_c_col, Start_p_row, start_p_col are from SPU.
Reset and Start are from Controller.
Start_Transfer and ME_to_sub_ready are from Bridging Units.
Output ports:
Add_c_row, add_c_col, add_p_row, add_p_col, add_pl_row and add_pl_col are
connected to memory address lines.
Done_block is a feedback signal to the Controller, asserted when the address generation
for a 16x16 block and its search area is done.
Start_cmp, Reset_cmp are control signals for Comparator.
Z_cp, Z_pl are the signals that control the muxes, which determine whether the
actual pixel data or a
"00000000"



























Figure 5-1 Address Generator
The source code can be found in Appendix B - 2.1. Synthesis result can be found in
Appendix A - 1.
5.1.2 Comparator
Comparator is the unit that collects the SADs from each PE and then outputs the smallest
SAD and the motion vector related to that SAD. An 8-bit logic vector represents the
motion vector because the search area contains 256 possible matching blocks.
Input ports:
PE0, PE1, ..., PE15 are the SAD inputs from the 16 PEs.
SAD_threshhold is an input that specifies whether the block needs further 8x8 processing.
If the smallest SAD is bigger than the threshold SAD, the block will be processed in
8x8 mode; otherwise, the block will not be processed in 8x8 mode.
Reset_cmp and Start_cmp are control signals from AGU, to start the comparison or reset
the comparator.
Output ports:
SAD, min_SAD_out are the SAD outputs, where min_SAD_out is a test signal that
updates at every round of "table 3-1" instead of at the done_block in AGU.
Mv_out, and mv_out_sub are the motion vector outputs. Mv_out is the one that doesn't














FSM is used to design the comparator. Idle and initial states are used to control the
comparator, while compare and output_mv states are used to compare SADs and generate
motion vector outputs.
The source code can be found in Appendix B - 2.2. Synthesis result can be found in
Appendix A - 2.
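The comparator's behavior can be sketched as a small software model. This is an illustration under assumptions, not the RTL: the class and method names are hypothetical, the state machine is reduced to plain method calls, and the motion vector code is taken to be 16 times the round index plus the PE index, in the spirit of eq 1.2.

```python
class Comparator:
    """Keeps the running minimum SAD and its 8-bit motion vector code."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.reset()

    def reset(self):
        """Clear the stored minimum before a new block (Reset_cmp)."""
        self.min_sad = None
        self.mv = None

    def compare(self, round_index, pe_sads):
        """One round: 16 SADs arrive, one per PE. PE k in round r is
        assumed to correspond to MV code 16 * r + k."""
        for k, s in enumerate(pe_sads):
            if self.min_sad is None or s < self.min_sad:
                self.min_sad = s
                self.mv = 16 * round_index + k

    def needs_8x8(self):
        """True when the smallest SAD exceeds the threshold."""
        return self.min_sad > self.threshold


cmp_unit = Comparator(threshold=100)
cmp_unit.compare(0, [500] * 16)
cmp_unit.compare(3, [500] * 5 + [42] + [500] * 10)
```

After 16 rounds all 256 candidates have been seen, so an 8-bit code is enough to identify the winner.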
5.1.3 Controller
The Controller is the top-level control unit, which coordinates and generates the
control signals for AGU, SPU, and bridging units, and also receives feedback signals from
those units.
Input ports:
Reset and start are the top-level control inputs.
ME_to_ME_sub_ready is the feedback signal from bridging units when the data transfer
between the 16x16 and 8x8 blocks is done.
Output ports:
Reset_AGU, Enable_SPU, Start_AGU are the control signals for AGU and SPU.
Done_Frame is the top-level output asserted when the prediction for the entire frame is done.
Block_num_col, Block_num_row, position are the information for SPU to generate start















The controller is designed as a finite state machine (FSM). To get the location of each
16x16 block, 9 output states are used to represent the 9 situations where the block is at
the: upper left corner, upper edge, upper right corner, left edge, center, right edge,
bottom left corner, bottom edge, bottom right corner.
The source code can be found in Appendix B - 2.3. Synthesis result can be found in
Appendix A - 3.
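The nine situations can be derived from a block's row and column index alone. The sketch below is a hypothetical software equivalent of that classification, not the FSM itself; the function name and the zero-based indexing are assumptions.

```python
def block_position(row, col, n_rows, n_cols):
    """Classify a block by where it lies in the frame (zero-based indices).
    The position matters because a block on an edge or corner has part of
    its search area outside the frame."""
    top, bottom = row == 0, row == n_rows - 1
    left, right = col == 0, col == n_cols - 1
    if top or bottom:
        vertical = "upper" if top else "bottom"
        if left or right:
            return f"{vertical} {'left' if left else 'right'} corner"
        return f"{vertical} edge"
    if left or right:
        return f"{'left' if left else 'right'} edge"
    return "center"


# A 256x256 frame holds 16 x 16 blocks of 16x16 pixels.
positions = {block_position(r, c, 16, 16) for r in range(16) for c in range(16)}
```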
5.1.4 Interconnection
Interconnection sits between the memory and the PEs. Under the control of AGU, the
Interconnection unit distributes the pixel data to the PEs.
Input ports:
c, p, pl are pixel data inputs.
Reset_dff_bit, reset_dff_pixel are the dataflow control inputs.
Output ports:
PE0c, PE0p, ..., PE15c, PE15p are the outputs that connect to the 16 PEs; they are all







The Interconnection is basically made of a series of Dffs and Multiplexers. Reset_dff_bit
resets the Mux, and Reset_dff_pixel resets the D ports of the Dffs.
The source code can be found in Appendix B - 2.4. Synthesis result can be found in
Appendix A - 4.
5.1.5 PE (Processing Element)
The PE (Processing Element) is the unit that calculates the SAD by computing absolute
differences and accumulating them.
Input ports:
A, b are the pixel data inputs; one is from the current frame, the other is from the
previous frame.
Reset is the control signal from the AGU, which sets the SAD output to 0.
Output ports:
SAD (Sum of Absolute Differences) is the output to the Comparator.
Figure 5-5 PE (Processing Element)
The PE implements Equation 1.1. It consists of a comparator, a subtractor, an adder and an
accumulator.
The source code can be found in Appendix B - 2.5. Synthesis result can be found in
Appendix A - 5.
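The datapath described above (compare, subtract the smaller value from the larger, add, accumulate) can be modelled in a few lines. This is a behavioral sketch in Python, not the RTL of Appendix B; the class name ProcessingElement and the method names are hypothetical.

```python
class ProcessingElement:
    """Behavioral model of one PE: accumulates |a - b| until reset."""

    def __init__(self):
        self.sad = 0

    def reset(self):
        # Mirrors the Reset input, which sets the SAD output to 0.
        self.sad = 0

    def step(self, a, b):
        # Comparator + subtractor: subtract the smaller pixel from the larger.
        diff = a - b if a > b else b - a
        # Adder + accumulator: add the absolute difference to the running sum.
        self.sad += diff
        return self.sad
```

Feeding one pixel pair per call corresponds to one accumulation step; after all pixel pairs of a candidate block have been fed in, the sad attribute holds that candidate's SAD.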
5.1.6 SPU (Start Point Generator)
The SPU is the unit that generates the start point addresses for the current block and the
corresponding search area; it works as a part of the Address Generator. In this design, it
is separated from the Address Generator to keep the Address Generator code simpler.
Input Ports:
Block_num_col, Block_num_row, position are the inputs from the Address Generator; they
tell the SPU which block is being processed.
Enable is a control input from the Controller.
Output ports:
Start_c_row, Start_c_col, start_p_row, start_p_col are the start point addresses for the
current block and the search area.







Figure 5-6 SPU (Start Point Generator)
The source code can be found in Appendix B - 2.6. Synthesis result can be found in
Appendix A - 6.
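The start points can be illustrated with the index arithmetic used by the behavioral 16x16 loops in Appendix B, where the current block of block row m starts at (m-1)*16 and the search area starts 8 pixels earlier. The Python sketch below follows that arithmetic; the function name start_points is hypothetical, and how the RTL SPU treats border blocks (where the search start would fall outside the frame) is not modelled here.

```python
BLOCK = 16   # side of a 16x16 current block
SEARCH = 8   # the search window reaches 8 pixels above/left of the block


def start_points(block_row, block_col):
    """Return (start_c_row, start_c_col, start_p_row, start_p_col).

    block_row and block_col are 1-based, as in the behavioral loops.
    Values for border blocks may be negative; border handling is left out.
    """
    start_c_row = (block_row - 1) * BLOCK
    start_c_col = (block_col - 1) * BLOCK
    start_p_row = start_c_row - SEARCH
    start_p_col = start_c_col - SEARCH
    return start_c_row, start_c_col, start_p_row, start_p_col
```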
5.1.7 Memory previous frame
This is the memory for the previous frame. It first reads its data from an image file.
Input ports:
Add_p_row, add_p_col, add_p1_row, add_p1_col are the address line inputs.
Output ports:
















Figure 5-7 Memory Previous Frame
The source code can be found in Appendix B - 2.7. Synthesis result can be found in
Appendix A - 7.
5.1.8 Memory current frame
This is the memory for the current frame. It first reads its data from an image file.
Input ports:
Add_c_row, add_c_col are the address line inputs.
Output ports:









Figure 5-8 Memory Current Frame
The source code can be found in Appendix B - 2.8. Synthesis result can be found in
Appendix A - 8.
5.1.9 Mux between Memory and Interconnection
This Mux provides control over the data inputs of the Interconnection. It determines whether
the actual pixel data or "00000000" goes into the Interconnection.
Input ports:
Z_cp, Z_p1 are the control inputs from the AGU.
In_c, in_p, in_p1 are the pixel data inputs from the memories.
Output ports:
Out_c, out_p, out_p1 are the pixel data outputs to the Interconnection; they can be either
the actual pixel data or "00000000".





The source code can be found in Appendix B - 2.9. Synthesis result can be found in
Appendix A - 9.
5.1.10 Transfer unit
This is the unit that transfers the current block and search area data from the 16x16
memories to the memories of the 8x8 processing block. It generates the memory addresses for
the memories of both stages, and also generates some control signals and feedback for the
bridging unit.
Input ports:
Start_transfer is the control signal from the ME_to_ME_sub bridging unit.
S_c_row, s_c_col, s_p_row, s_p_col are the start points of the current block and search
area; they come from the SPU.
Output ports:
C_row, c_col, p_row, p_col, sub_c_row, sub_c_col, sub_p_row, sub_p_col are the address
outputs for the memories of both stages.
Read_write, me_sub_add_select1, me_sub_add_select2 are the control signals for the
memories in the next stage.





























Figure 5-10 Data Transfer Unit
The source code can be found in Appendix B - 2.10. Synthesis result can be found in
Appendix A - 10.
5.1.11 Bridge unit
This is the unit that coordinates the handshake signals between the 2 stages and controls
the transfer unit. Together with the transfer unit, it pipelines the 2 stages.
Input ports:
Reset is the control input from the top level.
Mv, SAD_in are the motion estimation results from the previous stage.
SAD_threshold determines whether the result needs further processing or not. It is a
top-level input.
ME_sub_ready and ME_ready are the 2 "ready" handshake signals from both stages.
Data_ready is the feedback signal from the transfer unit.
Output ports:
Need_process is the signal that indicates whether the results need further processing or
not.
Start_data_transfer is the control signal for transfer unit.




Figure 5-11 Bridging Unit
The source code can be found in Appendix B - 2.11. Synthesis result can be found in
Appendix A - 11.
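The decision made by the bridging unit can be sketched as a comparison followed by a handshake. The Python model below is an assumption-laden illustration: the comparison direction (SAD above the threshold means further processing) follows the behavioral code, which refines a block only when its SAD is greater than 0, while the condition used here for Start_data_transfer (both stages ready) is a guess at the handshake and is not confirmed by the text. The function name bridge_outputs is hypothetical.

```python
def bridge_outputs(sad_in, sad_threshold, me_ready, me_sub_ready):
    """Return (need_process, start_data_transfer) for one block result.

    need_process: the previous stage's best SAD is still above the threshold.
    start_data_transfer: assumed to fire only when the block needs processing
    and both stages report ready (an assumption of this sketch).
    """
    need_process = sad_in > sad_threshold
    start_data_transfer = need_process and me_ready and me_sub_ready
    return need_process, start_data_transfer
```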
5.1.12 ME top level
This is the structural VHDL code that combines all the units into a 16x16 processing
block.
The source code can be found in Appendix B - 2.12. Synthesis result can be found in
Appendix A - 12.
5.2 The 8x8, 4x4 Processing blocks
They are similar to the 16x16 processing unit. Only the Interconnection, Controller and
AGU have significant changes.
The 8x8 memories contain the current block data, which is a 16x16 frame, and the search
area data, which is a 32x32 frame.
The 4x4 memories contain the current sub-block data, which is an 8x8 frame, and the
search area data, which is a 24x24 frame.
The 8x8 and 4x4 dataflow should follow Tables 3-2 and 3-3.
5.2.1 The 8x8 AGU
The difference between this AGU and the 16x16 AGU discussed in 5.1.1 is that this AGU
generates data according to Table 3-2, following the c, p, p1 and p2 sequence.
It has an extra p2 output, along with an extra control output Z_p2.
The source code can be found in Appendix B - 2.13. Synthesis result can be found in
Appendix A - 13.
5.2.2 The 8x8 Interconnection
The 8x8 Interconnection has one extra p2 input compared to the 16x16 Interconnection. It
also has a different internal dataflow to select the output data between p, p1 and p2.
Everything else is the same as in the 16x16 Interconnection.
The source code can be found in Appendix B - 2.14. Synthesis result can be found in
Appendix A - 14.
5.2.3 The 8x8 Controller
It has the same inputs and outputs as the 16x16 Controller. The major difference is that
instead of 9 output states, only 4 output states exist here, because only 4 sub-blocks
exist in the 16x16 current block.
The source code can be found in Appendix B - 2.15. Synthesis result can be found in
Appendix A - 15.
5.2.4 The 4x4 AGU
It has two extra outputs, p3 and p4, compared to the 8x8 AGU, in order to generate the
data sequence of Table 3-3.
The source code can be found in Appendix B - 2.16. Synthesis result can be found in
Appendix A - 16.
5.2.5 The 4x4 Interconnection
The 4x4 Interconnection has the extra inputs p3 and p4, and a different internal dataflow
to select the output data between p, p1, p2, p3 and p4.
The source code can be found in Appendix B - 2.17. Synthesis result can be found in
Appendix A - 17.
5.2.6 The 4x4 Controller
Same as 8x8 Controller.
Overall, the 16x16, 8x8 and 4x4 stages have very similar architectures and almost the
same processing time for every block, which makes the 3 stages very good candidates for
pipeline processing. With the Transfer unit and Bridging unit between neighboring
stages, the pipeline is achieved.
The source code can be found in Appendix B - 2.18. Synthesis result can be found in
Appendix A - 18.
5.3 Fractional Motion Estimation Processing block
This processing block differs considerably from the 16x16, 8x8 and 4x4 processing
blocks. First, the search area is only +1/-1 pixel wider than the current block. Second,
the search area needs to be interpolated before the search starts. The following is the
RTL design of the functional blocks in the fractional motion estimation processing block.
The interpolation detail is shown in the following example:
Neighboring pixels to be interpolated: A(0,0), A(0,1).
A0(0,0) = A(0,0)
A1(0,0) = 0.75*A(0,0) + 0.25*A(0,1)
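Quarter-pixel samples between two neighboring pixels can be sketched as a linear blend. In the Python sketch below only the first two cases (the pixel itself, and the 0.75/0.25 weighting) come from the text; the weights for the remaining two positions are assumed by continuing the same linear pattern, and the function names quarter_samples and interpolate_row are hypothetical.

```python
def quarter_samples(a, b):
    """Return the four samples at offsets 0, 1/4, 2/4 and 3/4 from a toward b.

    Offset k/4 uses the weights (4 - k)/4 for a and k/4 for b, so k = 1 gives
    0.75*a + 0.25*b as in the example above. k = 2 and k = 3 are assumptions.
    """
    return [((4 - k) * a + k * b) / 4 for k in range(4)]


def interpolate_row(row):
    """Interpolate one row of pixels to quarter-pixel resolution."""
    out = []
    for a, b in zip(row, row[1:]):
        out.extend(quarter_samples(a, b))
    # The last pixel has no right-hand neighbor and is copied unchanged.
    out.append(float(row[-1]))
    return out
```

A row of n pixels produces 4*(n-1)+1 samples in this sketch; applying the same function to the columns of the row-interpolated frame would give the second interpolation direction.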




P, P1 are the pixel data inputs from the frame that needs to be interpolated.
Start_IP1 is the control signal that starts interpolation after every 4 clock cycles.
Output ports:
Quarter_out is the interpolated (row direction) pixel data output.






P, P1 are the pixel data inputs from IP1 (after delays as shown in Figure 3-3).
Output ports:
Out_p1, Out_p2, Out_p3 and Out_p4 are the interpolated pixel data outputs.
The source code can be found in Appendix B - 2.20. Synthesis result can be found in
Appendix A - 20.
5.3.3 Frac_AGU
The fractional AGU differs from the variable block size AGUs because it deals with a
different search area, which has already been interpolated. It also has to be
synchronized with IP1 and IP2.
Input ports:
Start_p_row, Start_p_col are the start point inputs from Frac_SPU.
Start_AGU is the control signal from Frac_Controller.
Output ports:
Reset_PE, reset_latch, output_PE, z_c, start_cmp, start_IP1 are the control signals
sent to the PE, latches, Mux, comparator and IP1.
Add_p_row, add_p_col, add_p1_row, add_p1_col, add_c_row, add_c_col are the address
outputs.
Done_block_sub is the feedback signal to Frac_Controller when the address generation
for 1 search area is done.






The only difference between Frac_PE and the regular PE is its 4 latches that store the
SADs, so 1 PE can handle 4 SADs instead of 1.
Input ports:
C, P are the pixel inputs from the memory and from interpolation unit 2.
Out is the control input that makes the PE shift out the 4 SADs stored in the latches.
Output ports:
SAD4 is the serial SAD output.
The source code can be found in Appendix B - 2.24. Synthesis result can be found in
Appendix A - 24.
5.3.7 Frac_SPU
This unit generates the start point for Frac_AGU.
Input ports:
Position is the input that tells the SPU which block is being processed, so that the SPU
can generate the start point address.
Enable is the control input from Frac_Controller.
Output ports:
Start_p_row and start_p_col are the start point outputs. Since there is only 1 block in
Frac_mem_c, there is no need for a current block start point.
The source code can be found in Appendix B - 2.25. Synthesis result can be found in
Appendix A - 25.
5.3.8 Frac_Memory of current frame
The memory of the current frame in the fractional ME is just a 16x16 block.
The source code can be found in Section VIII - 2.26.
5.3.9 Frac_Memory of previous frame
The memory of the previous frame in the fractional ME is just an 18x18 block, which is 1
pixel wider than the current frame at every edge.
The source code can be found in Section VIII - 2.27.
5.3.10 Frac_ME Toplevel
Structural VHDL code of the top-level fractional ME.
The source code can be found in Appendix B - 2.28.
CHAPTER 6: Simulation result and analysis
6.1 Simulation (Gymnast)
The following frames were used in the simulation. The images show two consecutive frames
of a gymnast at the 28th Olympic Games in Athens, Greece.
Current Frame:
Figure 6-1 Current frame
Previous Frame:
Figure 6-2 Previous frame
After running the VHDL codes on the given 2 frames, the 16x16 MV array, 8x8 MV array,
4x4 MV array and fractional MV array are obtained as follows.
16x16 MV array:
189 176 176 181 17S 191 191 186 176 176 185 187 176 176 184 180
137 136 168 133 136 136 136 137 136 136 136 133 137 136 142 128
169 168 184 184 184 200 168 184 184 183 181 183 168 184 184 166
185 199 200 184 184 184 184 184 184 183 183 183 183 183 200 184
185 183 184 184 184 200 183 183 183 183 167 183 168 184 184 184
185 184 200 183 183 218 216 184 184 182 240 184 184 184 184 184
185 168 168 169 169 18S 185 200 184 213 245 175 168 184 170 184
249 255 169 140 172 182 174 168 168 214 213 158 184 214 250 152
185 137 157 155 136 131 144 16 112 128 136 136 199 172 206 248
137 165 175 140 110 156 115 64 0 208 151 207 241 207 223 5
249 240 154 26 207 190 223 224 191 143 175 158 125 186 200 136
169 191 206 174 175 191 223 255 31 239 225 240 234 204 246 246
233 240 146 166 142 133 138 175 138 9 67 107 200 184 120 56
159 192 150 140 118 232 208 255 252 248 254 254 245 244 128 70
137 139 130 128 129 128 128 128 140 134 128 128 141 129 138 136
95 91 80 82 95 80 80 94 95 80 93 88 95 80 69 83
8x8 MV array:
153 145 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144
1B9 191 183 178 164 176 189 181 176 189 181 191 184 191 190 182 176 176 179 176 185 188 180 187 179 176 176 190 182 176 186 180
137 129 138 130 128 134 136 135 135 135 136 136 136 143 136 137 136 138 136 136 136 143 143 138 138 132 136 136 143 138 135 128
217 1G3 176 183 164 184 200 53 216 151 244 171 120 168 184 168 216 205 56 152 167 172 164 167 153 56 151 195 63 59 54 232
169 183 168 167 164 168 184 69 169 183 202 200 168 184 184 184 168 186 168 168 197 179 131 166 152 168 184 1B1 182 191 189 166
233 167 184 151 152 183 184 183 249 20 74 33 250 168 184 168 164 135 167 230 166 133 164 183 216 167 16B 167 167 169 167 160
185 199 152 71 216 200 184 143 200 184 2 184 1B4 164 184 164 184 183 183 183 167 163 183 183 183 199 136 232 248 200 1B4 184
169 255 199 200 184 184 168 248 249 168 184 184 184 168 184 1B4 1B4 183 214 215 199 167 166 167 200 184 183 168 184 184 232 216
137 119 168 115 168 184 200 184 201 184 190 194 199 183 200 183 201 199 168 183 152 151 183 183 184 167 168 247 150 168 217 168
188 173 173 182 177 184 199 183 185 193 211 226 240 163 160 177 183 215 183 183 181 183 167 183 183 166 184 167 163 185 180 197
185 186 184 217 195 176 231 160 182 197 204 239 247 247 164 183 195 247 134 245 55 34 168 183 167 185 1B3 191 1B4 1B3 170 184
185 184 184 184 199 200 183 155 183 169 200 218 201 199 1B4 250 247 199 182 197 226 28 164 255 247 216 169 184 182 185 168 184
185 184 168 1B4 184 136 111 79 169 153 120 42 111 136 136 185 1B4 152 196 244 226 214 248 239 76 168 184 183 186 170 169 246
169 184 171 223 168 168 169 153 242 202 135 185 201 182 200 200 200 167 229 245 162 24S 191 170 153 168 183 199 167 170 185 24B
47 78 255 255 249 168 243 142 156 205 168 180 174 166 172 193 168 154 197 19 163 197 158 158 176 215 136 152 103 250 169 136
9 255 254 250 157 157 156 57 129 223 152 147 151 127 122 128 128 226 240 221 215 235 245 183 236 219 216 214 240 249 153 152
249 207 156 194 141 156 156 42 168 140 8 128 0 232 60 16 16 96 164 126 119 227 197 140 247 200 216 239 242 239 248 3
185 161 122 120 137 156 171 154 136 164 138 131 129 219 211 109 96 134 128 12B 171 136 135 128 36 198 143 154 206 207 186 161
121 194 137 194 111 141 155 106 141 151 138 136 175 107 128 81 28 24 11 38 140 134 136 223 232 254 223 88 239 136 71 64
218 228 166 181 230 159 142 141 207 71 221 219 132 251 32 16 244 223 220 194 173 151 199 232 224 241 174 207 78 239 11 5
89 163 56 64 158 141 57 9 205 19S 206 169 164 233 224 211 241 244 126 186 173 142 77 142 136 221 235 185 203 19B 187 84
249 212 242 240 245 113 79 43 52 189 207 187 223 223 188 184 191 140 11 158 172 175 158 79 175 207 141 103 168 121 126 104
185 156 131 178 254 7 175 174 175 124 79 106 223 223 159 177 47 60 191 255 237 198 224 254 23 2S5 252 236 245 8 124 67
169 181 153 191 207 127 249 248 191 189 188 170 152 167 236 255 249 143 219 166 172 149 205 227 49 234 170 188 249 247 246 66
89 195 240 252 244 29 27 201 209 158 121 154 142 155 163 207 248 241 131 42 24 134 109 147 203 200 184 211 152 151 248 24
153 167 190 169 147 128 184 153 164 144 135 146 242 39 0 2 137 130 255 7 242 65 242 22 168 233 133 226 40 25 60 88
137 143 168 194 136 134 130 109 138 113 207 213 211 208 192 255 248 241 240 130 208 99 241 252 201 136 135 241 117 14 13 118
252 255 126 208 103 246 172 164 121 144 224 216 227 160 240 255 216 253 231 234 252 249 226 141 140 244 179 222 16 226 224 178
217 175 47 53 131 208 201 219 199 208 176 202 193 213 209 208 72 140 117 218 206 215 210 208 156 132 129 113 70 125 64 195
137 143 143 140 132 143 128 143 136 126 130 133 126 133 126 128 126 138 143 137 130 134 128 132 142 134 134 131 143 136 131 136
95 67 80 91 B3 80 90 82 93 65 90 82 80 BO 95 B7 80 88 80 80 87 93 B5 B0 95 90 B2 80 67 93 90 82
91000000000000000000000000000000
The 4x4 MV array is included on the attached CD.
The quarter-pixel MV array is included on the attached CD.
The simulation results from the behavioral VHDL code and the RTL VHDL code match exactly.
The following figures show the residual frames. Figure 6-3 shows the residual frame
without any ME; Figure 6-4 shows the residual frame with 16x16 ME; Figure 6-5 shows
the residual frame with 8x8 ME; and Figure 6-6 shows the residual frame with 4x4 ME.
The results are significant. The residuals from more precise compensation are smaller:
Figure 6-6 contains less energy than Figure 6-5, and Figure 6-5 contains less energy than
Figure 6-4. Hence, Figure 6-6 will result in the smallest bit stream after the Discrete
Cosine Transform.
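The comparison of residual energy can be made concrete with a small calculation. The Python sketch below takes the residual as the absolute difference between the current frame and its prediction, as the behavioral code does, and measures energy as the sum of squared residual values; that energy definition is an assumption of this sketch, since the text does not define it. The function names residual_frame and energy are hypothetical.

```python
def residual_frame(cur, pred):
    """Per-pixel absolute difference between a frame and its prediction."""
    return [[abs(c - p) for c, p in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(cur, pred)]


def energy(residual):
    """Sum of squared residual values (assumed energy measure)."""
    return sum(v * v for row in residual for v in row)
```

With a toy 2x2 current frame [[10, 20], [30, 40]], a poor prediction [[12, 25], [30, 30]] leaves an energy of 129, while a closer prediction [[10, 21], [30, 38]] leaves an energy of 5, which mirrors the trend from Figure 6-3 to Figure 6-6.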
Residual frame without motion estimation:
Figure 6-3 Residual frame without ME
Residual frame after 16x16 motion compensation:
Figure 6-4 16x16 residual frame
Residual frame after 8x8 motion compensation:
Figure 6-5 8x8 residual frame
Residual frame after 4x4 motion compensation:
Figure 6-6 4x4 residual frame
6.2 Simulation (Artist with big movement)
These 2 frames differ substantially, which shows how the residual behaves when there is
major movement. Figures 6-7 and 6-8 are two consecutive frames. The residuals for the
different block sizes are given in Figures 6-9, 6-10, 6-11 and 6-12. The residual with the
4x4 ME has the lowest energy level.
Previous frame:
Figure 6-8 Current Frame
Figure 6-9 Residual Frame without ME
Figure 6-12 Residual frame with 4x4 ME
The results are significant. The residual from more precise compensation contains less
energy.
CHAPTER 7: Synthesis of the RTL VHDL codes
7.1 Constraints for Synthesis
The Register Transfer Level VHDL codes were synthesized with Synopsys Design
Compiler. Blank memory blocks were used to avoid a long synthesis time.
The following constraints were applied to the design. The worst-case operating condition
is 1.62 V, which is 10% lower than the normal voltage level, and 125 degrees centigrade.
The clock period is set at 7 ns; anything lower than 7 ns will cause timing violations.
The input and output port delays are also set. The TSMC 0.25 um library is linked to this
design.
reset_design
create_clock -period 7 -name my_clk [get_ports me_clk]
set_dont_touch_network [get_clocks my_clk]
set_clock_uncertainty 0.25 [get_clocks my_clk]
set_input_delay 1 -max -clock my_clk [get_ports me_reset]
set_input_delay 1 -max -clock my_clk [get_ports me_start]
set_output_delay 1 -max -clock my_clk [all_outputs]
set_input_delay 0.2 -min -clock my_clk [all_inputs]
set_output_delay -0.3 -min -clock my_clk [all_outputs]
set_operating_conditions -max slow_125_1.62
set all_in_ex_clk [remove_from_collection [all_inputs] [get_ports me_clk]]
set_driving_cell -lib_cell fdeflal -pin Q $all_in_ex_clk
set max_cap [expr [load_of ssc_core_slow/and2al/A]*5]
set_max_capacitance $max_cap $all_in_ex_clk
set_load [expr 3 * $max_cap] [all_outputs]
7.2 Area report
The area after synthesis is 441231 um2, as the area report shows:








Number of ports: 262
Number of nets: 222441
Number of cells: 232198
Number of references: 29
Combinational area: 289784.578125
Noncombinational area: 151447.332031
Net Interconnect area: undefined (Wire load has zero net area)
Total cell area: 441231.906250
Total area: undefined
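The report figures can be cross-checked with simple arithmetic: the combinational and noncombinational areas should add up to the total cell area, and the square root of that total gives the side of an equivalent square. The short Python check below uses only numbers from the report; the variable names are illustrative, and interconnect area is not included because the report leaves it undefined.

```python
import math

combinational = 289784.578125      # um^2, from the area report
noncombinational = 151447.332031   # um^2, from the area report

# Sum of the two cell-area components.
total_cell_area = combinational + noncombinational

# Side length of a square with the same area, in um.
equivalent_side = math.sqrt(total_cell_area)
```

The sum agrees with the reported total cell area of 441231.906250 um2 to within the rounding of the printed values, and the equivalent square side is a little over 664 um, which matches the 664 um x 664 um quoted in the conclusion.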
7.3 Timing report
High mapping effort and the DesignWare_foundation library were used in order to meet the
timing requirements.
The following log is the default timing report, which can be obtained by using the
"report_timing" command. The critical path is shown in the log, along with the delays







Date : Tue May 18 11:19:19 2004
Operating Conditions: slow_125_1.62 Library: ssc_core_slow
Wire Load Model Mode: enclosed
Startpoint: delay5/delay_out_reg
(rising edge-triggered flip-flop clocked by my_clk)
Endpoint: pe_16/SAD_reg[11]
(rising edge-triggered flip-flop clocked by my_clk)
Path Group: my_clk
Path Type: max










clock my_clk (rise edge)
0.00

































































































pe_16/U47/Y (and2b6) 0.18 6.40 r






clock my_clk (rise edge) 7.00 7.00




pe_16/SAD_reg[11]/CLK (fdmflc9) 0.00 6.75 r












The following log is the constraint report, which reports violations of constraints
during synthesis. As the log shows, the constraints are all met, with no violation.
There were some hold time violations at first; the "fix_timing" command was






Date : Tue May 18 11:22:50 2004
****************************************
This design has no violated constraints.
The synthesis was successful, the required speed was achieved and the core area is fairly
small. There is no violation.
CHAPTER 8: Conclusion
H.264 motion estimation has been successfully implemented as an ASIC design that meets
the real-time speed requirement (125 MHz) when 16x16, 8x8 and 4x4 blocks are
processed in parallel. The basic concept of the 16x16 and fractional ME is from paper
[12]; however, novel dataflow and architectures for 8x8 and 4x4 block size processing
were developed and a pipelined global diagram was implemented.
The work shows that the full-search block-matching algorithm is very hardware friendly
because of its fixed search sequence. The processing speed is fixed at 16 clock cycles per
pixel. Also, this algorithm requires only a few memory ports, which greatly reduces the
power consumption.
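The fixed search sequence is easiest to see in a software sketch of full search: every displacement in the window is visited in the same order for every block, and the candidate with the smallest SAD wins. The Python model below is a minimal illustration, assuming a square block and a displacement range of -8 to +7 as in the behavioral code; skipping candidates that fall outside the frame is a simplification of this sketch, and the function names sad and full_search are hypothetical.

```python
def sad(cur, prev, r0, c0, dr, dc, n):
    """Sum of absolute differences for one candidate displacement (dr, dc)."""
    total = 0
    for r in range(n):
        for c in range(n):
            total += abs(cur[r0 + r][c0 + c] - prev[r0 + dr + r][c0 + dc + c])
    return total


def full_search(cur, prev, r0, c0, n=16, lo=-8, hi=7):
    """Visit every displacement in [lo, hi] x [lo, hi] in a fixed order.

    Returns (best_sad, dr, dc), or None if no candidate fits in the frame.
    Candidates reaching outside the previous frame are skipped (assumption).
    """
    rows, cols = len(prev), len(prev[0])
    best = None
    for dr in range(lo, hi + 1):
        for dc in range(lo, hi + 1):
            inside = (0 <= r0 + dr and r0 + dr + n <= rows and
                      0 <= c0 + dc and c0 + dc + n <= cols)
            if not inside:
                continue
            s = sad(cur, prev, r0, c0, dr, dc, n)
            # Strict comparison keeps the first candidate on ties.
            if best is None or s < best[0]:
                best = (s, dr, dc)
    return best
```

Because the loop bounds do not depend on the pixel data, the sequence of memory addresses is known in advance, which is what allows an address generator to be a simple counter-based state machine.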
In the simulation the results are significant: the smaller the block size, the lower the
energy level in the residual frame.
In this design two static frames were used for prediction. A streaming interface is needed
to process an actual video sequence. At 125 MHz this design can handle 640x480 resolution
video at a frame rate of 25 frames/sec. The area of the synthesized chip is 664 um x 664 um.
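The real-time claim can be checked from the figures stated in this chapter: 16 clock cycles per pixel, 640x480 pixels per frame and 25 frames per second. The Python arithmetic below uses only those numbers; the variable names are illustrative.

```python
WIDTH, HEIGHT = 640, 480      # frame resolution in pixels
FPS = 25                      # frames per second
CYCLES_PER_PIXEL = 16         # fixed processing cost stated above
CLOCK_HZ = 125_000_000        # 125 MHz clock

# Clock cycles needed per second to keep up with the video stream.
required_hz = WIDTH * HEIGHT * FPS * CYCLES_PER_PIXEL

# Cycles per second left over at 125 MHz.
headroom_hz = CLOCK_HZ - required_hz
```

The requirement works out to 122.88 million cycles per second, which leaves a margin of 2.12 million cycles per second at 125 MHz.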
REFERENCE
[1] L. D. Vos and M. Schobinger, "VLSI Architecture for a Flexible Block Matching
Processor," IEEE Trans. Circuit and System, vol. 5, no. 5, October 1995.
[2] Peter Kuhn, "Algorithm, Complexity, Analysis and VLSI Architecture for MPEG-4 Motion
Estimation," Kluwer Academic Publishers, Boston, 1999.
[3] ITU-T Rec. H.264 / ISO/IEC 14496-10, "Advanced Video Coding", Final Committee Draft,
Document JVT-G050, March 2003.
[4] Ian E. G. Richardson, "H.264 white paper," www.vcodex.com
[5] Ian E. G. Richardson, "H.264 and MPEG-4 video compression", John Wiley & Sons Ltd.,
West Sussex PO19 8SQ, England, 2003.
[6] Ian E. G. Richardson, "Video Codec Design", John Wiley & Sons Ltd., West Sussex
PO19 8SQ, England, 2003.
[7] Peter Kuhn, "Algorithm, Complexity, Analysis and VLSI Architecture for MPEG-4 Motion
Estimation," Kluwer Academic Publishers, Boston, 1999.
[8] Rajesh T.N. Rajaram, "Optimization of fast search block matching motion estimation
algorithms and their VLSI implementation", Thesis work for the degree of Master of Science
(by research), Department of Electrical Engineering, Indian Institute of Technology,
Madras, June 1999.
[9] S. Seikiguchi, Y. Yamada, and K. Asai, "Block-size Adaptive Motion Compensation for
Complexity Reduction", JVT D-110, 4th Meeting: Klagenfurt, Austria, Jul. 22-26, 2002.
[10] H.M. Jong, L.G. Chen and T.D. Chiueh, "Parallel architectures for three step
hierarchical search block matching algorithm", IEEE Trans. on Circuits and Systems for
Video Technology, Vol. 4, No. 4, 1994, pp. 407-417.
[11] Shan Zhu, Kai-Kuang Ma, "A new diamond search algorithm for fast block matching",
IEEE Transactions on Circuits and Systems on Video Technology, Vol. 9, No. 2, Feb. 2000,
pp. 287-290.
[12] K. M. Yang, M. T. Sun, and L. Wu, "A Family of VLSI Design for the Motion
Compensation Block-Matching Algorithm"




I would like to thank Dr. Hsu for the tremendous amount of assistance he provided me.
I would also like to thank Dr. Reddy and Dr. Patru for their valuable feedback, Dr.
Lukowiak for his help with VHDL, and Suriyadi Gowanan for helping me gather the
references.











































7 Memory Previous Frame
Used blank block with I/O ports.
8 Memory Current Frame
Used blank block with I/O ports.
9 Mux between Memory and Interconnection
Used regular Mux from library.
10 Transfer Unit
11 Bridging Unit











































































































APPENDIX B: Source Codes







generic (max_row : integer := 257;
max_col : integer := 257
);
end beh_me;
architecture beh of beh_me is
type frame is array (-8 to 262, -8 to 262) of integer; -- defined frame
shared variable prev_frame, cur_frame, residual_frame1,
residual_frame2, residual_frame3 : frame;
type inter_frame is array (-1 to 1024, -1 to 1024) of integer; -- defined frame
shared variable inter_prev_frame : inter_frame;
type mv_array is array(1 to 16, 1 to 16) of integer range 0 to 65535; -- defined mv array
shared variable mv_array_col, mv_array_row, mv, sad_out, mode_sub : mv_array;
type mv_array_sub is array(1 to 32, 1 to 32) of integer range 0 to 65535; -- defined mv array
shared variable mv_array_col_sub, mv_array_row_sub,
mv_sub, sad_out_sub, test_mode_sub, mode_sub_sub : mv_array_sub;
type mv_array_sub_sub is array(1 to 64, 1 to 64) of integer range 0 to 65535; -- defined mv array
shared variable mv_array_col_sub_sub, mv_array_row_sub_sub,
mv_sub_sub, sad_out_sub_sub, test_mode_sub_sub : mv_array_sub_sub;
type mv_array_frac is array(0 to 64, 0 to 64) of integer range 0 to 65535; -- defined mv array
shared variable mv_frac : mv_array_frac;
shared variable h1 : line;
shared variable h2 : line;
shared variable h3 : line;
shared variable h4 : line;
shared variable h5 : line;
shared variable h6 : line;
shared variable h7 : line;
shared variable h8 : line;
signal read_ok1, read_ok2 : std_logic;
signal operation_ok1, operation_ok2, operation_ok3, interok, frac_ok : std_logic;
signal write_ok : std_logic;
begin
READING : process
variable l11, l12 : line;
variable pix1, pix2 : integer;
variable row1, row2 : integer;
variable col1, col2 : integer;
variable status1, status2 : boolean;
file fin2 : text open read_mode is
"/home/xxl7341/ME_whole/pgm/shanl.pgm";





















while status1 = true loop
read(l11, pix1, status1);
if (status1 = true and row1 < 256) then
prev_frame(row1, col1) := pix1;
if col1 = 255 then
col1 := 0;
row1 := row1 + 1;
else








while status2 = true loop
read(l12, pix2, status2);
















wait until (read_ok1 = '1' and read_ok2 = '1');
for m in 0 to 254 loop



























ME : process -- 16x16 ME
variable sad_row, sad_col, min_sad, ad, m, n, p, q, i, j, k, l :
integer;
type valid_array is array(0 to 15, 0 to 15) of integer;
type sad_array is array(0 to 15, 0 to 15) of integer;
variable valid_sad_array : valid_array;
variable temp_sad_array : sad_array;
begin
wait until (read_ok1 = '1' and read_ok2 = '1');
for m in 1 to 16 loop
for n in 1 to 16 loop
min_sad := 65535;
sad_row := 1;
sad_col := 1;
for p in 0 to 15 loop
for q in 0 to 15 loop
valid_sad_array(p,q) := 1;
temp_sad_array(p,q) := 0;
for i in ((m-1)*16+p-8) to ((m-1)*16+p+7) loop
for j in ((n-1)*16+q-8) to ((n-1)*16+q+7) loop
k := (i+8-p); l := (j+8-q);
if (i>0 and j>0 and i<256 and j<256) then
if (prev_frame(i,j)>cur_frame(k,l)) then ad := prev_frame(i,j)-cur_frame(k,l);
else ad := cur_frame(k,l)-prev_frame(i,j);
end if;
temp_sad_array(p,q) := temp_sad_array(p,q) + ad;












for r in 0 to 15 loop
for s in 0 to 15 loop
if cur_frame(((m-1)*16+r),((n-1)*16+s)) >
prev_frame(((m-1)*16+r-8+sad_row),((n-1)*16+s-8+sad_col))
then
residual_frame1(((m-1)*16+r),((n-1)*16+s)) :=
cur_frame(((m-1)*16+r),((n-1)*16+s)) -
prev_frame(((m-1)*16+r-8+sad_row),((n-1)*16+s-8+sad_col));
else
residual_frame1(((m-1)*16+r),((n-1)*16+s))
















variable sad_row, sad_col, min_sad, ad, m, n, p, q, i, j, k, l, e, f :
integer;
type valid_array_sub is array(0 to 15, 0 to 15) of integer;
type sad_array_sub is array(0 to 15, 0 to 15) of integer;
variable valid_sad_array_sub : valid_array_sub;
variable temp_sad_array_sub : sad_array_sub;
begin
wait until (operation_ok1 = '1');
for m in 1 to 16 loop
for n in 1 to 16 loop
if (sad_out(m,n) > 0) then
for e in 1 to 2 loop
for f in 1 to 2 loop
min_sad := 65535;
sad_row := 1;
sad_col := 1;
for p in 0 to 15 loop
for q in 0 to 15 loop
valid_sad_array_sub(p,q) := 1;
temp_sad_array_sub(p,q) := 0;
for i in ((m-1)*16+(e-1)*8+p-8) to ((m-1)*16+(e-1)*8+p-1) loop




if (i>0 and j>0 and i<256 and j<256) then
if (prev_frame(i,j)>cur_frame(k,l)) then
ad := prev_frame(i,j)-cur_frame(k,l);
else ad := cur_frame(k,l)-prev_frame(i,j);
end if;
temp_sad_array_sub(p,q) := temp_sad_array_sub(p,q) + ad;











for r in 0 to 8 loop
























for e in 1 to 2 loop


















variable sad_row, sad_col, min_sad, ad, m, n, p, q, i, j, k, l, e, f :
integer;
type valid_array_sub_sub is array(0 to 15, 0 to 15) of integer;
type sad_array_sub_sub is array(0 to 15, 0 to 15) of integer;
variable valid_sad_array_sub_sub : valid_array_sub_sub;
variable temp_sad_array_sub_sub : sad_array_sub_sub;
begin
wait until (operation_ok2 = '1');
for m in 1 to 32 loop
for n in 1 to 32 loop
if (test_mode_sub(m,n)=1 and sad_out_sub(m,n)>0) then
for e in 1 to 2 loop
for f in 1 to 2 loop
min_sad := 65535;
sad_row := 1;
sad_col := 1;
for p in 0 to 15 loop
for q in 0 to 15 loop
valid_sad_array_sub_sub(p,q) := 1;
temp_sad_array_sub_sub(p,q) := 0;
for i in ((m-1)*8+(e-1)*4+p-8) to ((m-1)*8+(e-1)*4+p-5) loop
for j in ((n-1)*8+(f-1)*4+q-8) to ((n-1)*8+(f-1)*4+q-5) loop
k := (i+8-p); l := (j+8-q);
if (i>0 and j>0 and i<256 and j<256) then
if (prev_frame(i,j)>cur_frame(k,l)) then
ad := prev_frame(i,j)-cur_frame(k,l);
else ad := cur_frame(k,l)-prev_frame(i,j);
end if;
temp_sad_array_sub_sub(p,q) := temp_sad_array_sub_sub(p,q) + ad;












for r in 0 to 4 loop






























for e in 1 to 2 loop
















frac : process -- quarter pixel
variable sad_row, sad_col, min_sad, ad, m, n, p, q, i, j, k, l, e, f :
integer;
type valid_array_frac is array(0 to 15, 0 to 15) of integer;
type sad_array_frac is array(0 to 15, 0 to 15) of integer;
variable valid_sad_array_frac : valid_array_frac;
variable temp_sad_array_frac : sad_array_frac;
begin
wait until (interok = '1' and operation_ok3 = '1');
for m_sub in 1 to 16 loop
for n_sub in 1 to 16 loop
if mode_sub(m_sub,n_sub) = 0 then -- 16x16 frac
min_sad := 65535;
sad_row := 1;
sad_col := 1;
for p in 0 to 7 loop
for q in 0 to 7 loop
valid_sad_array_frac(p,q) := 1;
temp_sad_array_frac(p,q) := 0;
for x in 1 to 16 loop
for y in 1 to 16 loop
i := ((m_sub-1)*16+mv_array_row(m_sub,n_sub)-8)*4-4+(x-1)*4+p;
j := ((n_sub-1)*16+mv_array_col(m_sub,n_sub)-8)*4-4+(y-1)*4+q;
k := (i+4-p)/4; l := (j+4-q)/4;
if (i>0 and j>0 and i<1021 and j<1021) then
if (inter_prev_frame(i,j)>cur_frame(k,l)) then
ad := inter_prev_frame(i,j)-cur_frame(k,l);
else ad := cur_frame(k,l)-inter_prev_frame(i,j);
end if;
temp_sad_array_frac(p,q) := temp_sad_array_frac(p,q) + ad;













for i in 0 to 3 loop




for m_sub_sub in m_sub*2-1 to m_sub*2 loop
for n_sub_sub in n_sub*2-1 to n_sub*2 loop




for p in 0 to 7 loop
for q in 0 to 7 loop
valid_sad_array_frac(p,q) := 1;
temp_sad_array_frac(p,q) := 0;
for x in 1 to 8 loop




k := (i+4-p)/4; l := (j+4-q)/4;
if (i>0 and j>0 and i<1021 and j<1021) then
if (inter_prev_frame(i,j)>cur_frame(k,l)) then
ad := inter_prev_frame(i,j)-cur_frame(k,l);
else ad := cur_frame(k,l)-inter_prev_frame(i,j);
end if;
temp_sad_array_frac(p,q) := temp_sad_array_frac(p,q) + ad;











for i in 0 to 1 loop
for j in 0 to 1 loop




for m_sub_sub_sub in m_sub_sub*2-1 to m_sub_sub*2 loop
for n_sub_sub_sub in n_sub_sub*2-1 to n_sub_sub*2 loop -- 4x4 frac
min_sad := 65535;
sad_row := 1;
sad_col := 1;
for p in 0 to 7 loop
for q in 0 to 7 loop
valid_sad_array_frac(p,q) := 1;
temp_sad_array_frac(p,q) := 0;
for x in 1 to 4 loop




((n_sub_sub_sub-1)*4+mv_array_col_sub_sub(m_sub_sub_sub,n_sub_sub_sub)-8)*4-4+(y-1)*4+q;
k := (i+4-p)/4; l := (j+4-q)/4;
if (i>0 and j>0 and i<1021 and j<1021) then
if (inter_prev_frame(i,j)>cur_frame(k,l)) then
ad := inter_prev_frame(i,j)-cur_frame(k,l);
else ad := cur_frame(k,l)-inter_prev_frame(i,j);
end if;
temp_sad_array_frac(p,q) := temp_sad_array_frac(p,q) + ad;
























file f_image_out1 : text open write_mode is "pgm/mv_sub_sub.pgm";
file f_image_out2 : text open write_mode is "pgm/mode_sub_sub.pgm";
file f_image_out3 : text open write_mode is "pgm/mv.pgm";
file f_image_out4 : text open write_mode is "pgm/mv_sub.pgm";
file f_image_out5 : text open write_mode is "pgm/SAD_sub.pgm";
file f_image_out6 : text open write_mode is "pgm/SAD.pgm";
file f_image_out7 : text open write_mode is "pgm/mode_sub.pgm";
file f_image_out8 : text open write_mode is "pgm/mv_frac.pgm";
file f_image_out9 : text open write_mode is "pgm/res1.pgm";
file f_image_out10 : text open write_mode is "pgm/res2.pgm";
file f_image_out11 : text open write_mode is "pgm/res3.pgm";
variable L_OUT1, L_OUT2, L_OUT3, L_OUT4,
L_OUT5, L_OUT6, L_OUT7, L_OUT8, L_OUT9, L_OUT10, L_OUT11 : LINE;
variable CHAR1, CHAR2, CHAR3, CHAR4, CHAR5, CHAR6, CHAR7, CHAR8 : character;
begin

























for x in 1 to 256 loop






for x in 1 to 256 loop






for x in 1 to 256 loop
for y in 1 to 256 loop
write(L_OUT11, residual_frame3(x,y));
write(L_OUT11, CHAR1);
end loop;
writeline(f_image_out11, L_OUT11);
end loop;
for x in 1 to 64 loop






for x in 1 to 32 loop






for x in 1 to 16 loop






for x in 1 to 32 loop






for x in 1 to 32 loop







for x in 1 to 16 loop






for x in 1 to 16 loop






for x in 0 to 63 loop










12.2 Register Transfer Level Codes
8.2.1 AGU
library ieee;






start_p_row, start_p_col : in std_logic_vector(7 downto 0);
reset, start, clk, start_transfer,
ME_to_ME_sub_ready : in std_logic;
done_block, start_cmp : out std_logic;
add_c_row, add_c_col, add_p_row, add_p_col,
add_pl_row, add_pl_col : out std_logic_vector(7 downto 0);
Z_cp, Z_pl : out std_logic;
reset_inter_pixel, reset_inter_bit,
reset_cmp, reset_PE : out std_logic
);
end AGU;
Architecture RTL_AGU of AGU is
type state_type is (idle, output, done_16, finish);
shared variable s_c_row, s_c_col,
s_p_row, s_p_col, c_row, c_col, p_row, p_col, pl_row, pl_col : integer;
shared variable i_cp, i_pl, l_cp, l_pl, line : integer;
shared variable done : bit := '0';








then state <= idle;
























i_cp := 0; i_pl := 0;
l_cp := 0; l_pl := 0;
if start = '1'














if (done = '1') then state <= finish; done_block <= '1';
else state <= output;
if (l_cp < 16) then
Z_cp <= '0';
c_row := s_c_row + l_cp;
c_col := s_c_col + i_cp;
p_row := s_p_row + l_cp;














pl_row := l_pl + s_p_row - 1 ;
pl_col := s_p_col + 16 + i_pl ;
add_pl_row <= conv_std_logic_vector(pl_row,8);
add_pl_col <= conv_std_logic_vector(pl_col,8);




if (i_cp=15 and l_cp=16 and line = 15) then done := '1';
start_cmp <= '1';
elsif (i_cp=15 and l_cp=16) then
i_cp := 0; l_cp := 0; i_pl := 0; l_pl := 0; line := line + 1;
s_p_row := s_p_row + 1; state <= done_16;
elsif (i_cp=15) then
i_cp := 0; i_pl := 0; l_cp := l_cp + 1;
l_pl := l_pl + 1; reset_inter_bit <= '1';
start_cmp <= '0'; reset_PE <= '0';
else i_cp := i_cp + 1; i_pl := i_pl + 1;


























port(mv_ready : out std_logic;
PE0, PE1, PE2, PE3, PE4, PE5, PE6, PE7, PE8,
PE9, PE10, PE11, PE12, PE13,
PE14, PE15 : in std_logic_vector(15 downto 0);
SAD_threshhold : in std_logic_vector(15 downto 0);
reset_cmp, start_cmp : in std_logic;
clk : in std_logic;
SAD, min_SAD_out : out std_logic_vector(15 downto 0);
mv_out, mv_out_sub : out std_logic_vector(7 downto 0)
);
end comparator;
Architecture comparator of comparator is
type state_type is (idle, compare, outputmv, initial);
signal state : state_type;
begin
process(clk, reset_cmp,start_cmp)
variable pe_count,mv : integer range 0 to 256;




then state <= initial;
end if;
if start_cmp = '1' then state <= compare; end if;
if clk'event and clk = '1' then
case state is
when idle =>









if (conv_integer(PE1) < min_SAD )
then min_SAD := conv_integer(PE1);
mv := pe_count+1;
end if;




if (conv_integer(PE3) < min_SAD )
then min_SAD := conv_integer(PE3);
mv := pe_count+3;
end if;
if (conv_integer(PE4) < min_SAD )
then min_SAD := conv_integer(PE4);
mv := pe_count+4;
end if;
if (conv_integer(PE5) < min_SAD )
then min_SAD := conv_integer(PE5);
mv := pe_count+5;
end if;
if (conv_integer(PE6) < min_SAD )
then min_SAD := conv_integer(PE6);
mv := pe_count+6;
end if;
if (conv_integer(PE7) < min_SAD )
then min_SAD := conv_integer(PE7);
mv := pe_count+7;
end if;
if (conv_integer(PE8) < min_SAD )




if (conv_integer(PE9) < min_SAD )
then min_SAD := conv_integer(PE9);
mv := pe_count+9;
end if;
if (conv_integer(PE10) < min_SAD )
then min_SAD := conv_integer(PE10);
mv := pe_count+10;
end if;
if (conv_integer(PE11) < min_SAD )
then min_SAD := conv_integer(PE11);
mv := pe_count+11;
end if;
if (conv_integer(PE12) < min_SAD )
then min_SAD := conv_integer(PE12);
mv :=pe_count+12;
end if;
if (conv_integer(PE13) < min_SAD )
then min_SAD := conv_integer(PE13);
mv := pe_count+13;
end if;




if (conv_integer(PE15) < min_SAD )
then min_SAD := conv_integer(PE15);
mv := pe_count+15;
end if;
min_SAD_out <= conv_std_logic_vector(min_SAD,16); -- test signal
--SAD <= conv_std_logic_vector(min_SAD,16);
if pe_count = 240 then state <= outputmv;
else state <= idle; pe_count := pe_count + 16;
end if;
when outputmv =>
if min_SAD < conv_integer(SAD_threshhold) then
mv_out <= conv_std_logic_vector(mv,8);
else mv_out_sub <= conv_std_logic_vector(mv,8);
end if;
SAD <= conv_std_logic_vector(min_SAD,16);
























reset, start : in std_logic;
ME_to_ME_sub_ready : in std_logic;
clk : in std_logic;
reset_AGU, enable_SPU, start_AGU : out std_logic;
done_frame : out std_logic;
Block_num_col, Block_num_row : out std_logic_vector(3 downto 0);




type state_type is (idle, startup, output1,
output2,output3,output4,output5,output6,output7,output8,output9, finish);
signal state : state_type;
begin
process(clk,reset)
variable col, row : integer range 0 to 15;
variable block_count : integer range 0 to 256;
























reset_AGU <= '0'; enable_SPU <= '0';












block_count := block_count+1; col := col + 1; state <= output2; reset_AGU <= '1';








if (ME_to_ME_sub_ready = '1') then
block_count := block_count + 1; reset_AGU <= '1';
col := col + 1;
end if;
if (block_count = 15) then
state <= output3;







if (ME_to_ME_sub_ready = '1') then
block_count := block_count + 1; reset_AGU <= '1';
col := 0; row := row + 1 ;
state <= output4;








if (ME_to_ME_sub_ready = '1') then
block_count := block_count + 1; reset_AGU <= '1';
col := col + 1 ;
state <= output5;







col := col + 1;
if (ME_to_ME_sub_ready = '1' and sub_block_count = 13) then
block_count := block_count + 1; reset_AGU <= '1'; state <= output6;
elsif (ME_to_ME_sub_ready = '1') then
block_count := block_count + 1; reset_AGU <= '1'; state <= output5;
sub_block_count := sub_block_count + 1; col := col + 1;







if (ME_to_ME_sub_ready = '1' and block_count < 239) then
block_count := block_count + 1; sub_block_count := 0; reset_AGU <= '1';
col := 0; row := row + 1; state <= output4;
elsif (ME_to_ME_sub_ready = '1') then
block_count := block_count + 1; reset_AGU <= '1';
col := 0; row := row + 1 ;
state <= output7;











block_count := block_count + 1; col := col + 1; reset_AGU <= '1';
state <= output8;






position <= "0111";
if (ME_to_ME_sub_ready = '1') then











if (ME_to_ME_sub_ready = '1') then
block_count := block_count + 1; reset_AGU <= '1';
state <= finish; done_frame <= '1';















c, p, pi : in std_logic_vector(7 downto 0);
reset_dff_bit, reset_dff_pixel : in std_logic;
clk : in std_logic;
PE0c, PE0p, PE1c, PE1p, PE2c, PE2p, PE3c, PE3p, PE4c, PE4p, PE5c,
PE5p, PE6c, PE6p, PE7c, PE7p, PE8c, PE8p,
PE9c, PE9p, PE10c, PE10p, PE11c, PE11p, PE12c, PE12p, PE13c, PE13p,
PE14c, PE14p, PE15c,
PE15p : inout std_logic_vector(7 downto 0)
);
end Interconnection;
Architecture RTL_Interconnection of Interconnection is
type Dff_array is array (0 to 14) of bit;







PE1p <= pi; else PE1p <= p;
end if;
if (Dffbit(1) = '0') then
PE2p <= pi; else PE2p <= p;
end if;
if (Dffbit(2) = '0') then
PE3p <= pi; else PE3p <= p;
end if;
if (Dffbit(3) = '0') then
PE4p <= pi; else PE4p <= p;
end if;
if (Dffbit(4) = '0') then
PE5p <= pi; else PE5p <= p;
end if;
if (Dffbit(5) = '0') then
PE6p <= pi; else PE6p <= p;
end if;
if (Dffbit(6) = '0') then
PE7p <= pi; else PE7p <= p;
end if;
if (Dffbit(7) = '0') then
PE8p <= pi; else PE8p <= p;
end if;
if (Dffbit(8) = '0') then
PE9p <= pi; else PE9p <= p;
end if;
if (Dffbit(9) = '0') then
PE10p <= pi; else PE10p <= p;
end if;
if (Dffbit(10) = '0') then
PE11p <= pi; else PE11p <= p;
end if;
if (Dffbit(11) = '0') then
PE12p <= pi; else PE12p <= p;
end if;
if (Dffbit(12) = '0') then
PE13p <= pi; else PE13p <= p;
end if;
if (Dffbit(13) = '0') then









if (clk'event and clk = '1') then
if (reset_dff_bit = '1') then
Dffbit(1) <= '0'; Dffbit(2) <= '0'; Dffbit(3) <= '0'; Dffbit(4) <= '0'; Dffbit(5) <= '0'; Dffbit(6) <= '0';
Dffbit(7) <= '0'; Dffbit(8) <= '0'; Dffbit(9) <= '0'; Dffbit(10) <= '0'; Dffbit(11) <= '0'; Dffbit(12) <= '0';
Dffbit(13) <= '0'; Dffbit(14) <= '0'; Dffbit(0) <= '0';
else
Dffbit(14) <= Dffbit(13); Dffbit(13) <= Dffbit(12); Dffbit(12) <= Dffbit(11); Dffbit(11) <= Dffbit(10); Dffbit(10) <= Dffbit(9);
Dffbit(9) <= Dffbit(8); Dffbit(8) <= Dffbit(7); Dffbit(7) <= Dffbit(6); Dffbit(6) <= Dffbit(5); Dffbit(5) <= Dffbit(4);
Dffbit(4) <= Dffbit(3); Dffbit(3) <= Dffbit(2); Dffbit(2) <= Dffbit(1); Dffbit(1) <= Dffbit(0); Dffbit(0) <= '1';
end if;
if (reset_dff_pixel = '1') then
PE1c <= (others => '0'); PE2c <= (others => '0');
PE3c <= (others => '0'); PE4c <= (others => '0'); PE5c <= (others => '0');
PE6c <= (others => '0'); PE7c <= (others => '0'); PE8c <= (others => '0');
PE9c <= (others => '0'); PE10c <= (others => '0'); PE11c <= (others => '0');
PE12c <= (others => '0'); PE13c <= (others => '0'); PE14c <= (others => '0');
PE15c <= (others => '0');
else
PE15c <= PE14c; PE14c <= PE13c; PE13c <= PE12c; PE12c <= PE11c; PE11c <= PE10c; PE10c <= PE9c;
PE9c <= PE8c; PE8c <= PE7c; PE7c <= PE6c; PE6c <= PE5c; PE5c <= PE4c; PE4c <= PE3c; PE3c <= PE2c;

end process;
end architecture RTL_Interconnection;



a, b : in std_logic_vector(7 downto 0);
reset : in std_logic;
Clk : in std_logic;




SAD : inout std_logic_vector(15 downto 0)
);
end PE;
architecture str_PE of PE is
begin
process(clk)
variable c : integer range 0 to 255;
variable temp : integer range 0 to 65535;
begin




SAD<= (others => '0');
else






















Block_num_col, Block_num_row : in std_logic_vector(3 downto 0);
position : in std_logic_vector(3 downto 0);
enable : in std_logic;




architecture RTL_SPU of SPU is
begin
process(Block_num_col,Block_num_row, position, enable)
variable row, col : integer range 0 to 15;
















































p_row := c_row - 8;







p_row := c_row - 8;















p_row := c_row -16;







































add_p_row,add_p_col : in std_logic_vector(7 downto 0);
add_pl_row,add_pl_col: in std_logic_vector(7 downto 0);
out_pl: out std_logic_vector(7 downto 0);
out_p : out std_logic_vector(7 downto 0);




type frame is array (0 to 255, 0 to 255 ) of integer ;
shared variable prev_frame : frame;
shared variable h1 : line;
shared variable h2 : line;
shared variable h3 : line;
shared variable h4 : line;






variable row1 : integer := 0;
variable col1 : integer := 0;
variable status1 : boolean;














while status1 = true loop
read(lll, pix1, status1);
if (status1 = true and row1 < 256) then
prev_frame(row1,col1) := pix1;
if col1 = 255 then
col1 := 0;
row1 := row1 + 1;
else










if (read_okl = '1') then
































add_c_row, add_c_col : in std_logic_vector(7 downto 0);
out_c : out std_logic_vector(7 downto 0);
clk : in std_logic
);
end mem_current_frame;
Architecture Mem_c of mem_current_frame is
type frame is array (0 to 255, 0 to 255) of integer;
shared variable curr_frame : frame;
shared variable h1 : line;
shared variable h2 : line;
shared variable h3 : line;
shared variable h4,h5 : line;




variable lll : line;
variable pix1 : integer;
variable row1 : integer := 0;
variable col1 : integer := 0;
variable status1 : boolean;














while status1 = true loop
read(lll, pix1, status1);
if (status1 = true and row1 < 256) then
curr_frame(row1,col1) := pix1;
if col1 = 255 then
col1 := 0;
row1 := row1 + 1;
else










if (read_okl = '1') then



































z_cp, z_pl : in std_logic;
clk : in std_logic;
in_c, in_p, in_pl : in std_logic_vector(7 downto 0);





















port( clk : in std_logic;
start_transfer : in std_logic;
s_c_row, s_c_col, s_p_row,
s_p_col : in std_logic_vector(7 downto 0);
data_ready : out std_logic;




sub_c_row,sub_c_col : out std_logic_vector(3 downto 0);
sub_p_row,sub_p_col : out std_logic_vector(5 downto 0);
read_write,ME_add_select : out std_logic;
ME_sub_add_select1,ME_sub_add_select2 : out std_logic
Architecture TransME of TransME is
type state_type is (startup, transfer);








then state <= startup; end if;















read_write <= '0'; data_ready <= '0'; ME_add_select <= '0';
ME_sub_add_select1 <= '1'; ME_sub_add_select2 <= '1';
start_c_row := conv_integer(s_c_row); start_c_col :=
conv_integer(s_c_col);
if conv_integer(s_p_row) = 0 then
start_p_row := conv_integer(s_p_row)+1;
elsif conv_integer(s_p_row) = 240 then
start_p_row := conv_integer(s_p_row)-1;
else start_p_row := conv_integer(s_p_row);
end if;
if conv_integer(s_p_col) = 0 then
start_p_col := conv_integer(s_p_col)+1;
elsif conv_integer(s_p_col) = 240 then
start_p_col := conv_integer(s_p_col)-1;
else start_p_col := conv_integer(s_p_col);
end if;
i_c := 0; i_p := 0; l_c := 0; l_p := 0;
if start_transfer = '1'
then state <= transfer;
else state <= startup;
end if;
when transfer =>
read_write <= '1'; ME_add_select <= '1'; ME_sub_add_select1 <= '0';
ME_sub_add_select2 <= '0';
if l_c < 16 then
num_c_row := start_c_row + l_c;
c_row <= conv_std_logic_vector(num_c_row,8);
sub_c_row <= conv_std_logic_vector(l_c,4);




if i_c < 15 then
i_c := i_c + 1;
else i_c := 0; l_c := l_c + 1;
end if;
end if;
if l_p < 34 then
num_p_row := start_p_row + l_p-1;
p_row <= conv_std_logic_vector(num_p_row,8);
sub_p_row <= conv_std_logic_vector(l_p,6);





else i_p := 0; l_p := l_p + 1;
end if;









port(clk, reset : in std_logic;
mv : in std_logic_vector(7 downto 0);
SAD_threshhold : in std_logic_vector(15 downto 0);
SAD_in : in std_logic_vector(15 downto 0);
ME_sub_ready,ME_ready : in std_logic;
data_ready : in std_logic;
need_process : out std_logic;
start_data_transfer : out std_logic;




type state_type is (reset_state, idle, need_process_state,
not_process_state);
signal state : state_type;






then state <= reset_state;










then state <= idle;





if conv_integer(SAD_in) < conv_integer(SAD_threshhold) then state <=
not_process_state;








then start_data_transfer <= '0';state <= idle;
ME_to_ME_sub_ready <= '1';













use ieee.std_logic_1164.all;
entity ME is
port(me_reset, me_start, me_ME_sub_ready : in std_logic;
me_clk : in std_logic;
me_SAD_threshhold : in std_logic_vector(15 downto 0);
need_process : out std_logic;
mv_out, mv_out_sub, in_c, in_p : out std_logic_vector(7 downto 0);
sub_c_row, sub_c_col : out std_logic_vector(3 downto 0);
sub_p_row, sub_p_col : out std_logic_vector(5 downto 0);
read_write, done, transfer_done : out std_logic;
ME_sub_add_select1, ME_sub_add_select2 : out std_logic
);
end ME;
Architecture str_ME of ME is
component delay_clock
port(clk : in std_logic;
delay_in : in std_logic;
delay_out : out std_logic);
end component;




port(reset, start : in std_logic;
ME_to_ME_sub_ready : in std_logic;
clk : in std_logic;
reset_AGU, enable_SPU, start_AGU : out std_logic;
done_frame : out std_logic;
Block_num_col, Block_num_row : out std_logic_vector(3 downto 0);




port(start_c_row,start_c_col, start_p_row, start_p_col : in
std_logic_vector(7 downto 0);
reset, start, clk, start_transfer,ME_to_ME_sub_ready : in std_logic;
done_block,start_cmp : out std_logic;
add_c_row, add_c_col, add_p_row, add_p_col,
add_pl_row,add_pl_col : out std_logic_vector(7 downto 0);
Z_cp, Z_pl : out std_logic;




port(Block_num_col, Block_num_row : in std_logic_vector(3 downto 0);
position: in std_logic_vector(3 downto 0);
enable : in std_logic;
start_c_row, start_c_col,




port(a,b : in std_logic_vector(7 downto 0);
reset : in std_logic;
Clk : in std_logic;




port(mv_ready : out std_logic;
PE0, PE1, PE2, PE3, PE4, PE5, PE6, PE7, PE8, PE9, PE10,
PE11, PE12, PE13, PE14, PE15 : in std_logic_vector(15 downto 0);
reset_cmp, start_cmp : in std_logic;
clk : in std_logic;
SAD_threshhold : in std_logic_vector(15 downto 0);
SAD, min_SAD_out : out std_logic_vector(15 downto 0);




port(c, p, pi : in std_logic_vector(7 downto 0);
reset_dff_bit, reset_dff_pixel : in std_logic;
clk : in std_logic;
PE0c, PE0p,PE1c, PE1p,PE2c, PE2p,PE3c, PE3p,PE4c, PE4p,PE5c,
PE5p,PE6c, PE6p,PE7c, PE7p,
PE8c, PE8p,PE9c, PE9p,PE10c, PE10p,PE11c, PE11p,PE12c, PE12p,PE13c,
PE13p,PE14c, PE14p,




port(mux_select : in std_logic;
ME_add1,ME_add2 : in std_logic_vector(7 downto 0);





port(add_c_row, add_c_col : in std_logic_vector(7 downto 0);
out_c : out std_logic_vector(7 downto 0);




port(z_cp, z_pl : in std_logic;
clk : in std_logic;
in_c, in_p, in_pl : in std_logic_vector(7 downto 0);




port(add_p_row, add_p_col : in std_logic_vector(7 downto 0);
add_pl_row, add_pl_col : in std_logic_vector(7 downto 0);
out_p : out std_logic_vector(7 downto 0);
out_pl : out std_logic_vector(7 downto 0);




port(clk,reset : in std_logic;
mv : in std_logic_vector(7 downto 0);
SAD_threshhold : in std_logic_vector(15 downto 0);
SAD_in : in std_logic_vector(15 downto 0);
ME_sub_ready, ME_ready : in std_logic;
data_ready : in std_logic;
need_process : out std_logic;
start_data_transfer : out std_logic;
ME_to_ME_sub_ready : out std_logic);
end component;
component TransME
port(clk, start_transfer : in std_logic;
s_c_row, s_c_col, s_p_row, s_p_col : in std_logic_vector(7 downto 0);
data_ready : out std_logic;
c_row,c_col,p_row,p_col : out std_logic_vector(7 downto 0);
sub_c_row,sub_c_col : out std_logic_vector(3 downto 0);
sub_p_row,sub_p_col : out std_logic_vector(5 downto 0);
read_write,ME_add_select : out std_logic;
ME_sub_add_select1,ME_sub_add_select2 : out std_logic
);
end component;
signal me_z_cp, me_z_p1, me_ME_add_select, me_mv_ready : std_logic;
signal me_start_cmp, me_reset_AGU, me_start_AGU,
me_done_block, me_need_process, me_read_write,








me_block_num_row,me_sub_c_row,me_sub_c_col : std_logic_vector(3 downto 0);




me_PE5_c, me_PE5_p,me_PE6_c, me_PE6_p,me_PE7_c, me_PE7_p,me_PE8_c,
me_PE8_p,me_PE9_c, me_PE9_p,




me_mem_out_c, me_mem_out_p, me_mem_out_pl, me_start_c_row,
me_start_c_col,















ME_mux21prow:ME_mux21 port map(mux_select =>
me_ME_add_select,ME_add1=>me_add_p_row,ME_add2=>me_p_row,














SAD_in => me_min_SAD_out,ME_sub_ready => me_ME_sub_ready, ME_ready
=> me_done_block, data_ready => me_data_ready,
need_process => me_need_process,
start_data_transfer => me_start_transfer, ME_to_ME_sub_ready=>
me_ME_to_ME_sub_ready);



























































map(clk=>me_clk,delay_in=>me_z_p1,delay_out=>me_delay_z_p1);
pe_1 : PE port
map(reset=>me_delay_reset_PE,clk=>me_clk,SAD=>me_PE0_SAD,a=>me_PE0_c,b=>me_PE0_p);










pe_7 : PE port
map(reset=>me_delay_reset_PE,clk=>me_clk,SAD=>me_PE6_SAD,a=>me_PE6_c,b=>me_PE6_p);
pe_8 : PE port
map(reset=>me_delay_reset_PE,clk=>me_clk,SAD=>me_PE7_SAD,a=>me_PE7_c,b=>me_PE7_p);







map(reset=>me_delay_reset_PE,clk=>me_clk,SAD=>me_PE11_SAD,a=>me_PE11_c,b=>me_PE11_p);
pe_13 : PE port
map(reset=>me_delay_reset_PE,clk=>me_clk,SAD=>me_PE12_SAD,a=>me_PE12_c,b=>me_PE12_p);
pe_14 : PE port
map(reset=>me_delay_reset_PE,clk=>me_clk,SAD=>me_PE13_SAD,a=>me_PE13_c,b=>me_PE13_p);







sub_c_col <= me_sub_c_col; sub_c_row <= me_sub_c_row; sub_p_row <=
















start_c_row, start_c_col : in std_logic_vector(3 downto 0);
start_p_row, start_p_col : in std_logic_vector(5 downto 0);
reset, start, clk : in std_logic;
done_block_sub, start_cmp : out std_logic;
add_c_row, add_c_col : out std_logic_vector(3 downto 0);
add_p_row, add_p_col, add_pl_row, add_pl_col, add_p2_row, add_p2_col :
out std_logic_vector(5 downto 0);
Z_cp, Z_pl, Z_p2 : out std_logic;




Architecture RTL_AGU_sub of AGU_sub is
type state_type is (idle, output, done_16, finish);
shared variable s_c_row, s_c_col,
s_p_row, s_p_col, c_row, c_col, p_row, p_col, pl_row, pl_col, p2_row, p2_col : integer;
shared variable i_cp, i_pl, l_cp, l_pl, i_p2, l_p2, line : integer;
shared variable done : bit := '0';




if reset = '1' then done_block_sub <= '0'; state <= idle;
























s_p_row := conv_integer(start_p_row)+1;
s_p_col := conv_integer(start_p_col)+1;
i_cp := 0; i_pl := 0; l_cp := 0; l_pl := 0; i_p2 := 0; l_p2 := 0;
if start = '1'














l_p2 := l_p2 + 1; reset_inter_bit <= '1';
<= '0';reset_PE <= '0';
if (done = '1') then state <= finish;
else state <= output;
if (l_cp < 8) then
Z_cp <= '0';
c_row := s_c_row + l_cp;
c_col := s_c_col + i_cp;
p_row := s_p_row + l_cp;











if (l_pl > 0 and l_pl < 9) then
Z_pl <= '0';
pl_row := l_pl + s_p_row - 1;
pl_col := s_p_col + 8 + i_pl;
add_pl_row <= conv_std_logic_vector(pl_row,6);
add_pl_col <= conv_std_logic_vector(pl_col,6);





p2_row := l_p2 + s_p_row -2;






else Z_p2 <= '1'; add_p2_row <= "000000";
add_p2_col <= "000000";
end if;
if (i_cp=7 and l_cp=9 and line = 15) then done := '1'; start_cmp <= '1';
elsif (i_cp=7 and l_cp=9) then
i_cp := 0; l_cp := 0; i_pl := 0; l_pl := 0; i_p2 := 0; l_p2 := 0; line := line + 1;
s_p_row := s_p_row + 1; state <= done_16;
elsif (i_cp=7) then
i_cp := 0; i_pl := 0; l_cp := l_cp + 1; l_pl := l_pl + 1; i_p2 := 0;
start_cmp <= '0'; reset_PE <= '0';























reset, start : in std_logic;
ME_sub_to_ME_sub_sub_ready : in std_logic;
clk : in std_logic;
reset_AGU, enable_SPU, start_AGU : out std_logic;
done_block, ME_sub_ready : out std_logic;
position : out std_logic_vector(1 downto 0)
);
end entity Controller_sub;
Architecture FSM_Controller_sub of Controller_sub is
type state_type is (idle, startup, output1,
output2,output3,output4,finish);
signal state : state_type;
begin
process(clk,reset)
variable block_count : integer range 0 to 4;
variable sub_block_count : integer range 0 to 14;
begin
if reset = '1' then
state <= idle;


















reset_AGU <= '0'; enable_SPU <= '0';
done_block <= '0'; position <= "00"; ME_sub_ready <= '0';
state <= output1;




if (ME_sub_to_ME_sub_sub_ready = '1') then
block_count := block_count+1; state <= output2; reset_AGU <= '1';





if (ME_sub_to_ME_sub_sub_ready = '1') then
block_count := block_count + 1; reset_AGU <= '1';
state <= output3;






block_count := block_count + 1; reset_AGU <= '1';
state <= output4;





if (ME_sub_to_ME_sub_sub_ready = '1') then
block_count := block_count + 1; reset_AGU <= '1';
state <= finish; done_block <= '1';














c, p, pl, p2 : in std_logic_vector(7 downto 0);
reset_dff_bit, reset_dff_pixel : in std_logic;
clk : in std_logic;
PE0c, PE0p, PE1c, PE1p, PE2c, PE2p, PE3c, PE3p, PE4c, PE4p, PE5c,
PE5p, PE6c, PE6p, PE7c, PE7p, PE8c, PE8p,




PE15p : inout std_logic_vector(7 downto 0 )















































































































































































Dffbit15 <= '0'; Dffbit16 <= '0';
'0'; Dffbit24 <= '0'; Dffbit25 <= '0';
Dffbit13 <= Dffbit12;
'1'; Dffbit26 <= Dffbit25;





PE11c <= PE10c; PE10c <= PE9c;









Dffbit11 <= '0'; Dffbit12 <= '0'; Dffbit13 <= '0'; Dffbit14 <= '0';
Dffbit20 <= '0'; Dffbit21 <= '0'; Dffbit22 <= '0'; Dffbit23 <=
Dffbit26 <= '0'; Dffbit10 <= '0';
else
Dffbit16 <= Dffbit15; Dffbit15 <= Dffbit14; Dffbit14 <= Dffbit13;
Dffbit12 <= Dffbit11; Dffbit11 <= Dffbit10; Dffbit10 <=
Dffbit25 <= Dffbit24; Dffbit24 <= Dffbit23; Dffbit23 <= Dffbit22;
Dffbit20 <= '1';
end if;
if (reset_dff_pixel = '1') then
PE1c <= (others => '0'); PE2c <= (others => '0');
PE3c <= (others => '0'); PE4c <= (others => '0'); PE5c <= (others
PE6c <= (others => '0'); PE7c <= (others => '0'); PE8c <= (others
PE9c <= (others => '0'); PE10c <= (others => '0'); PE11c <=
PE12c <= (others => '0'); PE13c <= (others => '0'); PE14c <=
PE15c <= (others => '0');
else
PE15c <= PE14c; PE14c <= PE13c; PE13c <= PE12c; PE12c <= PE11c;
PE9c <= PE8c; PE8c <= PE7c; PE7c <= PE6c; PE6c <= PE5c; PE5c <=












start_c_row, start_c_col : in std_logic_vector(2 downto 0);
start_p_row, start_p_col : in std_logic_vector(4 downto 0);
reset, start, clk : in std_logic;
done_block_sub, start_cmp : out std_logic;
add_c_row,add_c_col : out std_logic_vector(2 downto 0);
add_p_row,add_p_col, add_pl_row, add_pl_col,add_p2_row,add_p2_col,
add_p3_row,add_p3_col,add_p4_row,add_p4_col : out std_logic_vector(4
Z_cp, Z_pl,Z_p2,Z_p3,Z_p4 : out std_logic;
reset_inter_pixel, reset_inter_bit,reset_cmp,reset_PE : out
);
end AGU_sub_sub;
Architecture RTL_AGU_sub_sub of AGU_sub_sub is




shared variable i_cp, i_pl, l_cp, l_pl, i_p2, l_p2, i_p3, l_p3, i_p4, l_p4,
line : integer;
shared variable done : bit := '0';






then state <= idle; done_block_sub <= '0';


































s_p_row := conv_integer(start_p_row)+1;
s_p_col := conv_integer(start_p_col)+1;
i_cp := 0; i_pl := 0; l_cp := 0; l_pl := 0; i_p2 := 0; l_p2 :=
if start = '1'










if (done = '1') then state <= finish;
else state <= output;
if (l_cp < 4) then
Z_cp <= '0';
c_row := s_c_row + l_cp;
c_col := s_c_col + i_cp;
p_row := s_p_row + l_cp;











if (l_pl > 0 and l_pl < 5) then
Z_pl <= '0';
pl_row := l_pl + s_p_row - 1 ;
pl_col := s_p_col + 4 + i_pl ;
add_pl_row <= conv_std_logic_vector(pl_row,5);
add_pl_col <= conv_std_logic_vector(pl_col,5);
else Z_pl <= '1'; add_pl_row <= "00000";
add_pl_col <= "00000";
end if;
if (l_p2 > 1 and l_p2 <6) then
Z_p2<='0';
p2_row := l_p2 + s_p_row -2;





else Z_p2 <= '1'; add_p2_row <= "00000";
add_p2_col <= "00000";
end if;
if (l_p3 > 2 and l_p3 <7) then
Z_p3 <= '0';
p3_row := l_p3 + s_p_row -3;





else Z_p3 <= '1'; add_p3_row <= "00000";
add_p3_col <= "00000";
end if;
if(l_p4 >3 and l_p4 < 8) then
Z_p4<='0';
p4_row := l_p4 + s_p_row -4;





else Z_p4 <= '1'; add_p4_row <= "00000";
add_p4_col <= "00000";
end if;
if (i_cp=3 and l_cp=7 and line = 15) then done := '1'; start_cmp <= '1';
elsif (i_cp=3 and l_cp=7) then
i_cp := 0; l_cp := 0; i_pl := 0; l_pl := 0;
i_p2 := 0; l_p2 := 0; i_p3 := 0; l_p3 := 0; i_p4 := 0; l_p4 := 0; line := line + 1;
s_p_row := s_p_row + 1; state <= done_16;
elsif (i_cp=3) then
i_cp := 0; i_pl := 0; l_cp := l_cp + 1; l_pl := l_pl + 1; i_p2 := 0;
l_p2 := l_p2 + 1;
i_p3 := 0; l_p3 := l_p3 + 1; i_p4 := 0; l_p4 := l_p4 + 1; reset_inter_bit <= '1';
start_cmp <= '0'; reset_PE <= '0';
else i_cp := i_cp + 1; i_pl := i_pl + 1; i_p2 := i_p2 +






















reset, start, ME_4_to_frac_ready : in std_logic;
done_block_sub : in std_logic;
clk : in std_logic;
reset_AGU, enable_SPU, start_AGU : out std_logic;
done_block : out std_logic;




Architecture FSM_Controller_sub_sub of Controller_sub_sub is
type state_type is (idle, startup, output1,
output2,output3,output4,finish);
signal state : state_type;
begin
process(clk,reset)
variable block_count : integer range 0 to 4;






















reset_AGU <= '0';enable_SPU <= '0';
done_block <= '0'; position <= "00";
state <= output1;






block_count := block_count+1; state <= output2; reset_AGU <= '1';





if (ME_4_to_frac_ready = '1') then
block_count := block_count + 1; reset_AGU <= '1';
state <= output3;






block_count := block_count + 1; reset_AGU <= '1';
state <= output4;









block_count := block_count + 1; reset_AGU <= '1';
state <= finish; done_block <= '1';












-- Interconnection network for full search Motion Estimation
-- coded by Xiang Li, 05/07/04
entity Interconnection_sub_sub is
port(
c, p, pl,p2,p3,p4 : in std_logic_vector(7 downto 0);
reset_dff_bit, reset_dff_pixel : in std_logic;
clk : in std_logic;
PE0c, PE0p, PE1c, PE1p, PE2c, PE2p, PE3c, PE3p, PE4c, PE4p, PE5c,
PE5p, PE6c, PE6p, PE7c, PE7p, PE8c, PE8p,
PE9c, PE9p, PE10c, PE10p, PE11c, PE11p, PE12c, PE12p, PE13c, PE13p,
PE14c, PE14p, PE15c,
PE15p : inout std_logic_vector(7 downto 0 )
);
end Interconnection_sub_sub;
























































































































































Dffbit21 <= '0'; Dffbit22 <= '0';




if (clk'event and clk = '1') then
if (reset_dff_bit = '1') then
Dffbit10 <= '0'; Dffbit11 <= '0'; Dffbit12 <= '0'; Dffbit20 <= '0';
Dffbit30 <= '0'; Dffbit31 <= '0'; Dffbit32 <= '0'; Dffbit40 <=
else
Dffbit12 <= Dffbit11; Dffbit11 <= Dffbit10; Dffbit10 <= '1';
Dffbit21 <= Dffbit20; Dffbit20 <= '1'; Dffbit32 <= Dffbit31;











PE11c <= PE10c; PE10c <= PE9c;
PE4c; PE4c <= PE3c; PE3c <= PE2c;
PE3c <= (others => '0'); PE4c <= (others => '0'); PE5c <= (others
PE6c <= (others => '0'); PE7c <= (others => '0'); PE8c <= (others
PE9c <= (others => '0'); PE10c <= (others => '0'); PE11c <=
PE12c <= (others => '0'); PE13c <= (others => '0'); PE14c <=
PE15c <= (others => '0');
else
PE15c <= PE14c; PE14c <= PE13c; PE13c <= PE12c; PE12c <= PE11c;
PE9c <= PE8c; PE8c <= PE7c; PE7c <= PE6c; PE6c <= PE5c; PE5c <=
end process;
end architecture RTL_Interconnection_sub_sub;
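
-- Illustrative sketch only; it is not part of the thesis design.
-- The interconnection networks in this appendix move the current pixel along
-- the processing elements with one signal per stage (PE15c <= PE14c; ...).
-- Assuming every stage is an 8-bit register that shifts by one position per
-- clock and clears on a synchronous reset, the same behaviour could be
-- written with an array and a loop. The names pixel_chain_sketch, N, c_in,
-- c_out and stage are hypothetical, and feeding the first stage directly
-- from c_in is an assumption.
library ieee;
use ieee.std_logic_1164.all;

entity pixel_chain_sketch is
  generic (N : positive := 16);                    -- number of stages
  port (
    clk, reset : in  std_logic;
    c_in       : in  std_logic_vector(7 downto 0); -- pixel entering the chain
    c_out      : out std_logic_vector(7 downto 0)  -- pixel leaving the last stage
  );
end entity pixel_chain_sketch;

architecture rtl of pixel_chain_sketch is
  type pixel_array is array (0 to N-1) of std_logic_vector(7 downto 0);
  signal stage : pixel_array := (others => (others => '0'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        -- synchronous clear of every stage
        stage <= (others => (others => '0'));
      else
        -- each stage takes the value of its predecessor
        stage(0) <= c_in;
        for k in 1 to N-1 loop
          stage(k) <= stage(k-1);
        end loop;
      end if;
    end if;
  end process;
  c_out <= stage(N-1);
end architecture rtl;
-- With N = 16 a pixel presented on c_in appears on c_out sixteen rising
-- clock edges later, provided reset stays low.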









port(p, pi : in std_logic_vector(7 downto 0);
clk,start_IPl : in std_logic;
quarter_p : out std_logic_vector(7 downto 0)
);
end frac_IP1;
architecture frac_IP1 of frac_IP1 is
shared variable n,u,v,x,y : integer;
begin
process(clk,start_IPl )
variable z : integer;
begin











z := x + n*(v - u);
quarter_p <= conv_std_logic_vector(z,8);











port(p,pl : in std_logic_vector(7 downto 0);
clk : in std_logic;




architecture frac_IP2 of frac_IP2 is
begin
process(p,pl)
variable x,y,z,u,v,p2,p4 : integer;
begin
x := conv_integer(p); y:=conv_integer(pl);



















mode : in std_logic_vector(1 downto 0);
start_p_row,start_p_col : in std_logic;
clk,start_AGU,reset_AGU : in std_logic;
reset_PE,reset_latch,
output_PE, Z_c,start_cmp,reset_cmp,done_block,start_ipl : out
add_p_row,add_p_col,add_pl_row,add_pl_col : out std_logic_vector(4
add_c_row,add_c_col : out std_logic_vector(3 downto 0)
);
end frac_AGU;
architecture frac_AGU of frac_AGU is
type state_type is (idle,output,finish);




variable p_row,c_row,p_col,c_col,pl_row,pl_col : integer range 0 to 16;
variable i_p, l_p, s_p_row,s_p_col : integer range 0 to 16;
variable clk_count : integer range 0 to 3;
variable size : integer range 0 to 16;
begin




reset_PE <= '1'; reset_latch <= '1';
output_PE <= '0'; done_block <= '0';
clk_count :=3; Z_c <= '0';reset_cmp <= '1';











s_p_row := conv_integer(start_p_row); s_p_col :=
conv_integer(start_p_col);
l_p := 0; i_p := 0; Z_c <= '0'; output_PE <= '0'; done_block <= '0';
start_cmp <= '0'; reset_latch <= '1';
reset_PE <= '1'; clk_count := 3;
if start_AGU = '1'
then state <= output;
else state <= idle;
end if;
when output =>
output_PE <= '0'; reset_cmp <= '0';
reset_PE <= '0'; reset_latch <= '0';
if(l_p=0)then
Z_c <= '0'; else Z_c <= '1';
end if;
p_row := s_p_row + l_p; p_col := s_p_col + i_p; pl_row := p_row + 1 ;
pl_col := p_col;








clk_count := clk_count - 1;
start_ipl <= '1';


















if (i_p = size and l_p = size) then state <= finish;
elsif (i_p = size) then
i_p := 0; l_p := l_p + 1; state <= output;





if clk_count > 0 then output_PE <= '1'; start_cmp <= '1';
clk_count := clk_count - 1; state <= finish;













reset, start : in std_logic;
done_block_sub : in std_logic;
clk : in std_logic;
reset_AGU, enable_SPU, start_AGU : out std_logic;
done_block : out std_logic;
position : out std_logic_vector(1 downto 0)
);
end entity frac_Controller;
Architecture frac_Controller of frac_Controller is
type state_type is (idle, startup, output1,
output2,output3,output4,finish);
signal state : state_type;
begin
process(clk,reset)
variable block_count : integer range 0 to 4;
--variable sub_block_count : integer range 0 to 14;
begin
if reset = '1' then
state <= idle;


















reset_AGU <= '0';enable_SPU <= '0';
done_block <= '0'; position <= "00";





if (done_block_sub = '1') then
block_count := block_count+1; start_AGU <= '0'; state <=
output2; reset_AGU <= '1';






if (done_block_sub = '1') then
block_count := block_count + 1; start_AGU <= '0'; reset_AGU <= '1';
state <= output3;







block_count := block_count + 1; start_AGU <= '0'; reset_AGU <= '1';
state <= output4;








if (done_block_sub = '1') then
block_count := block_count + 1; start_AGU <= '0'; reset_AGU <= '1';
state <= finish; done_block <= '1';

















SAD1,SAD2,SAD3,SAD4 : in std_logic_vector(15 downto 0);
clk,start_cmp,reset_cmp : in std_logic;
mv : out std_logic_vector(5 downto 0)
);
end frac_comp;
architecture frac_comp of frac_comp is
type state_type is (idle,compare,output);
signal state : state_type;
begin
process(clk)
variable min_SAD : std_logic_vector(15 downto 0);
variable n : integer range 0 to 16;
variable m : integer range 0 to 3;
variable mv_calc : integer range 0 to 64;
begin



















then state <= compare;
else state <= idle;
end if;
when compare =>
if (conv_integer(SAD1) < conv_integer(min_SAD)) then min_SAD := SAD1;
mv_calc := m*16+1; end if;
if (conv_integer(SAD2) < conv_integer(min_SAD)) then min_SAD := SAD2;
mv_calc := m*16+2; end if;
if (conv_integer(SAD3) < conv_integer(min_SAD)) then min_SAD := SAD3;
mv_calc := m*16+3; end if;
if (conv_integer(SAD4) < conv_integer(min_SAD)) then min_SAD := SAD4;
mv_calc := m*16+4; end if;
n :=n + 4;
if (n=16 and m=3) then state <= output;
elsif (n=16) then m := m + 1 ; state <= idle;


















c,p : in std_logic_vector(7 downto 0);
reset, clk, output : in std_logic;
SAD4 : inout std_logic_vector(15 downto 0)
);
end frac_PE;
architecture frac_PE of frac_PE is
signal SAD3, SAD2, SAD1 : std_logic_vector(15 downto 0);
begin
process(clk)
variable q, in_c, in_p, sad_temp : integer;
begin
if clk'event and clk = '1' then






if output = '1' then
q:=0;
else
in_c := conv_integer(c); in_p := conv_integer(p); sad_temp :=
conv_integer(SAD4);
if (in_c > in_p) then q := in_c





















Block_num_col, Block_num_row : in std_logic_vector(3 downto 0);
position : in std_logic_vector(1 downto 0);
enable : in std_logic;
start_p_row, start_p_col : out std_logic
);
end entity frac_SPU;
architecture frac_SPU of frac_SPU is
begin
process( position, enable)
variable p_row,p_col : integer range 0 to 8;
begin








































add_c_row, add_c_col : in std_logic_vector(3 downto 0);
in_c : in std_logic_vector(7 downto 0);
read_write : in std_logic;
out_c : out std_logic_vector(7 downto 0);
clk : in std_logic
);
end mem_frac_c;
Architecture mem_frac_c of mem_frac_c is
type block_type is array (0 to 15, 0 to 15) of integer;
shared variable block_sub : block_type;
begin
mem:process(clk)
variable c : integer;
begin

























add_p_row, add_p_col, add_p1_row, add_p1_col : in std_logic_vector(4
in_p : in std_logic_vector(7 downto 0);
read_write : in std_logic;
out_p, out_p1 : out std_logic_vector(7 downto 0);




Architecture mem_frac_p of mem_frac_p is
type block_type is array (0 to 17, 0 to 17) of integer;
shared variable block_sub : block_type;
begin
mem:process(clk)
variable p, p1, p2 : integer;
begin
































in_c_row, in_c_col : in std_logic_vector(3 downto 0);
in_p_row, in_p_col : in std_logic_vector(4 downto 0);
read_write,
frac_add_select1, frac_add_select2 : in std_logic;
frac_trans_c, frac_trans_p : in std_logic_vector(7 downto 0);
mode : in std_logic_vector(1 downto 0);
mv : out std_logic_vector(5 downto 0);
done : out std_logic
);
end ME_frac;
Architecture ME_frac of ME_frac is
component ME_sub_c_mux31
port(mux_select : in std_logic;
ME_add1, ME_add2 : in
std_logic_vector(3 downto 0);





ME_add1, ME_add2 : in std_logic_vector(4 downto 0);




port(clk : in std_logic;
delay_in : in std_logic_vector(3 downto 0);





port(clk : in std_logic;
delay_in : in std_logic_vector(4 downto 0);




port(delay_in : in std_logic_vector(7 downto 0);
clk, reset : in std_logic;




port(p, p1 : in std_logic_vector(7 downto 0);
clk, start_ipl : in std_logic;




port(p, p1 : in std_logic_vector(7 downto 0);
clk : in std_logic;




port(clk : in std_logic;
delay_in : in std_logic;




port(reset, start : in std_logic;
done_block_sub : in std_logic;
clk : in std_logic;
reset_AGU, enable_SPU, start_AGU : out std_logic;
done_block : out std_logic;




port(mode : in std_logic_vector(1 downto 0);
start_p_row, start_p_col : in std_logic;
reset_AGU, start_AGU, clk : in std_logic;
done_block, start_cmp, output_PE, Z_c, reset_latch, start_ipl : out
std_logic;
add_c_row, add_c_col : out std_logic_vector(3 downto 0);
add_p_row, add_p_col,
add_p1_row, add_p1_col : out std_logic_vector(4 downto 0);






position : in std_logic_vector(1 downto 0);
enable : in std_logic;
start_p_row, start_p_col : out std_logic
);
component frac_PE
port(c, p : in std_logic_vector(7 downto 0);
reset, output : in std_logic;
clk : in std_logic;




port(SAD1, SAD2, SAD3, SAD4 : in std_logic_vector(15 downto 0);
reset_cmp, start_cmp : in std_logic;
clk : in std_logic;




port(add_c_row, add_c_col : in std_logic_vector(3 downto 0);
in_c : in std_logic_vector(7 downto 0);
read_write : in std_logic;
out_c : out std_logic_vector(7 downto 0);




port(z_c : in std_logic;
clk : in std_logic;
in_c: in std_logic_vector(7 downto 0);




port(add_p_row, add_p_col : in std_logic_vector(4 downto 0);
add_p1_row, add_p1_col : in std_logic_vector(4 downto 0);
in_p : in std_logic_vector(7 downto 0);
read_write : in std_logic;
out_p : out std_logic_vector(7 downto 0);
out_p1 : out std_logic_vector(7 downto 0);














signal me : std_logic_vector(1 downto 0);
signal




signal me_mv_out : std_logic_vector(5 downto 0);










cont:frac_controller port map(reset => me_reset, start => me_start,
done_block_sub => me_done_block_sub,

























































































pe_1 : frac_PE port
map(reset=>me_delay_reset_PE,clk=>me_clk,SAD4=>me_PE0_
pe_2 : frac_PE port
map(reset=>me_delay_reset_PE,clk=>me_clk,SAD4=>me_PE1_SAD,c=>me_mux_out_c,p=>me_PE1_p,output=>
pe_3 : frac_PE port
map(reset=>me_delay_reset_PE,clk=>me_clk,SAD4=>me_PE2_SAD
pe_4 : frac_PE port
map(reset=>me_delay_reset_PE,clk=>me_clk,SAD4=>me_PE3_SAD,c=>me_mux_out_c,p=>me_PE3_p,output=>
mv <= me_mv_out; done <= me_done_block;
end architecture ME_frac;
