Implementation of JPEG compression and motion estimation on FPGA hardware by Gopalakrishnan, Ramakrishna
UNLV Retrospective Theses & Dissertations 
1-1-2008 
Implementation of JPEG compression and motion estimation on 
FPGA hardware 
Ramakrishna Gopalakrishnan 
University of Nevada, Las Vegas 
Repository Citation 
Gopalakrishnan, Ramakrishna, "Implementation of JPEG compression and motion estimation on FPGA 
hardware" (2008). UNLV Retrospective Theses & Dissertations. 2347. 
https://digitalscholarship.unlv.edu/rtds/2347 
IMPLEMENTATION OF JPEG COMPRESSION AND MOTION ESTIMATION ON
FPGA HARDWARE
by
Ramakrishna Gopalakrishnan
Bachelor of Engineering
Anna University, Chennai, India
2006
A thesis submitted in partial fulfillment
of the requirements for the
Master of Science Degree in Electrical Engineering 
Department of Electrical and Computer Engineering 
Howard R. Hughes College of Engineering
Graduate College 
University of Nevada, Las Vegas 
August 2008
Thesis Approval
The Graduate College 
University of Nevada, Las Vegas
July 24, 2008
The Thesis prepared by
Ramakrishna Gopalakrishnan
Entitled
"Implementation of JPEG Compression and Motion
Estimation of FPGA Hardware"
is approved in partial fulfillment of the requirements for the degree of 
Master of Science in Electrical Engineering
Examination Committee Member
Examination Committee Member
Graduate College Faculty Representative
Examination Committee Chair
Dean of the Graduate College
ABSTRACT
Implementation of JPEG Compression and Motion Estimation on FPGA Hardware
by
Ramakrishna Gopalakrishnan
Dr. Henry Selvaraj, Examination Committee Chair 
Professor of Electrical Engineering
University of Nevada, Las Vegas
A hardware implementation of JPEG allows for real-time compression in data-intensive
applications, such as high-speed scanning, medical imaging, and satellite image
transmission. Implementation options include dedicated DSP or media processors, FPGA
boards, and ASICs. Factors that affect the choice of platform include cost, speed,
memory, size, power consumption, and ease of reconfiguration. The proposed hardware
solution is based on a Very High Speed Integrated Circuit Hardware Description Language
(VHDL) implementation of the codec, with the preferred realization using an FPGA board
because of speed, cost, and flexibility.
The VHDL language is commonly used to model hardware implementations from a top-down
perspective. The VHDL code may be simulated to correct mistakes and subsequently
synthesized into hardware using a synthesis tool such as the Xilinx ISE suite. The same
VHDL code may be synthesized into a number of different hardware architectures based on
the constraints given. For example, speed was the major constraint when synthesizing the
pipeline of JPEG encoding and decoding, while chip area and power consumption were the
primary constraints when synthesizing the on-die memory because of its large area. Thus,
there is a trade-off between area and speed in logic synthesis.
ACKNOWLEDGEMENTS
Active support from a wide variety of professors in my department has helped my
learning and research on this thesis. They are Dr. Henry Selvaraj, Dr. Muthukumar
Venkatesan, Dr. Emma Regentova and Dr. Laxmi Gewali, who are all distinguished
professors in the Electrical Engineering and Computer Science departments at UNLV. They
shared their insights and allowed me to explore the concepts towards comprehensive
learning while constantly guiding me towards focused research. I am fortunate to have had
Dr. Henry Selvaraj willing to review early drafts of the thesis and offer very
constructive criticism for improvement.
Arun Reddy Toomu and Nachiket Jugade, both colleagues and Graduate Assistants in the
Electrical Department, UNLV, gave me detailed feedback and essential help. This effort
would have been impossible without the active support of my parents. Their unstinting
support and willingness to take on the burdens throughout made all this possible. My deep
thanks to my friends and my other family members.
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
1.1 Introduction to JPEG Compression and Motion Estimation
1.2 Why FPGAs?
1.3 Steps Involved in Hardware Implementation
CHAPTER 2
JPEG COMPRESSION MODULE
2.1 Block Diagram
2.1.1 Discrete cosine transformation
2.1.2 Quantization
2.1.3 Run Length Encoder
2.1.4 Entropy and Variable Length Codeword
2.1.5 Huffman Encoding
CHAPTER 3
MOTION ESTIMATION MODULE
3.1 Block Diagram
3.2 What is Motion Estimation?
3.2.1 Reference frame storage
3.2.2 Current frame storage
3.2.3 Reference frame control
3.2.4 Current Frame Control
3.2.5 SAD Module
CHAPTER 4
HARDWARE IMPLEMENTATION SETUP
4.1 Virtex-4 Board
4.2 Microcontroller Frame Grabber™ (uCFG™)
CHAPTER 5
RESULTS
5.1 JPEG Compression
5.2 Motion Estimation
CHAPTER 6
CONCLUSION AND FUTURE RECOMMENDATIONS
BIBLIOGRAPHY
VITA
LIST OF FIGURES
Figure 1 JPEG Compression Module
Figure 2 Uniform Quantizer
Figure 3 Illustration of codeword generation in Huffman Coding
Figure 4 MPEG Module
Figure 5 Motion Estimator Module
Figure 6 Hardware Implementation Setup
Figure 7 Virtex-4 FX12 Evaluation Board Block Diagram
Figure 8 Configuration Logic Block
Figure 9 PowerPC Processor
Figure 10 XCITE DCI Technology Advantages
Figure 11 Switch Configuration
Figure 12 Pin Configuration - Switches
Figure 13 uCFG High Level Diagram
Figure 14 FIFO Read Point Buffer
Figure 15 Grabbing Uncompressed Field
Figure 16 Downloading Decimated Field
Figure 17 YCbCr to RGB Conversion
CHAPTER 1
INTRODUCTION
1.1 Introduction to JPEG Compression and Motion Estimation
A digital color image is a collection of pixels, with each pixel a 3-dimensional (3-D)
color vector. The vector elements specify the pixel's color with respect to a chosen color
space; for example, RGB, YCbCr, etc. [1, 2]. Joint Photographic Experts Group (JPEG) is
a commonly used standard to compress digital color images [7]. JPEG is "lossy,"
meaning that the decompressed image isn't quite the same as the one you started with.
(There are lossless image compression algorithms, but JPEG achieves much greater
compression than is possible with lossless methods.) JPEG is designed to exploit known
limitations of the human eye, notably the fact that small color changes are perceived less
accurately than small changes in brightness. Thus, JPEG is intended for compressing
images that will be looked at by humans. If you plan to machine-analyze your images,
the small errors introduced by JPEG may be a problem for you, even if they are invisible
to the eye.
The temporal prediction technique used in MPEG video is based on motion
estimation. The basic premise of motion estimation is that in most cases, consecutive
video frames will be similar except for changes induced by objects moving within the
frames. In the trivial case of zero motion between frames (and no other differences
caused by noise, etc.), it is easy for the encoder to efficiently predict the current frame as
a duplicate of the prediction frame. When this is done, the only information necessary to
transmit to the decoder becomes the syntactic overhead necessary to reconstruct the
picture from the original reference frame. When there is motion in the images, the
situation is not as simple. The problem for motion estimation to solve is how to
adequately represent the changes, or differences, between these two video frames.
1.2 Why FPGAs?
For many years, electronic hardware used for computation could be divided into two
main types: general purpose and application specific. General-purpose hardware is
exemplified by microprocessors such as the Intel 80x86 families and the Motorola 68000
family, which serve as the main processing unit in most personal computers. The
architecture of these devices is fixed and includes specific hardware to implement a
limited, pre-defined set of instructions. These microprocessors run programs, which are
lists of instructions to be executed that are stored in external memory. New programs can
be loaded into memory from disk or other storage as needed. The software program
determines the computation to be done, not the hardware. However, general-purpose
computers can be very slow at performing certain kinds of operations, such as those
involving floating-point calculations or complex mathematical functions.
Application-specific computing hardware performs functions very quickly, but the
price of this speed is limited flexibility. As the name implies, this type of hardware can
only perform one function, or a group of closely related functions. The hardware
determines the type of computation to be done. It cannot be reprogrammed to perform
entirely new functions that were not anticipated and included in the original design. If
application-specific hardware is needed to perform a new function, then a new hardware
design will have to be created. Since this type of computation hardware is generally
implemented as carefully designed Application Specific Integrated Circuits (ASICs),
creating a new design takes a great deal of effort and knowledge. Since they are custom
ICs, they are also very expensive to fabricate, and it takes weeks or months to design a
new ASIC and have it fabricated.
In recent years, a new class of computing hardware has been gaining increasing
research interest. Configurable computing hardware has some of the advantages of both
general-purpose and application-specific hardware. This type of hardware may be based
on commercially available Field Programmable Gate Arrays (FPGAs), or on ICs
designed specifically for the purpose. In either case, this type of hardware consists of a
relatively large number of functional units with programmable interconnections. The
functionality of the hardware is determined by how the interconnections between
functional units are configured, and in most, but not all, architectures, how the functional
units themselves are configured. By changing the configuration, the hardware can be
made to perform a completely different function. Since the configuration is specific to the
application at hand, it is in effect a custom computer for the particular design.
In order to map an application to this hardware, we must first design the hardware
configuration needed to perform the necessary functions. This is done with either
schematic capture or, in this case, with a Hardware Description Language (HDL) known
as VHDL. In either case, we must understand digital design and be able to separate an
application into data processing and control elements. The design must then be
partitioned spatially, so that the design is spread across the resources available on the
FPGAs. If the design does not fit in the available FPGAs, then it must also be partitioned
temporally, by allocating functional units to different configurations of the same FPGA.
1.3 Steps Involved in Hardware Implementation 
The major components used in this implementation are
• PC for software interface (Active HDL 7.2 & Xilinx ISE 9.1)
• Virtex-4 FPGA board
• Microcontroller Frame Grabber (uCFG)
• Camera
• RS-232 (9-pin) for serial interface between uCFG and Virtex-4 FPGA board
First, an image frame is captured with the help of a camera. The camera is
connected to a microcontroller frame grabber, which calls for the image and stores it in its
EEPROM. The frame is then segregated into luminance (Y) and chroma (Cr, Cb) signals.
The luminance and chroma values are stored in the FIFO buffer of the frame grabber. The
frame grabber and the Virtex-4 FPGA board are connected serially with a 9-pin RS-232
null-modem cable. The FPGA communicates with the microcontroller (MCU) in the uCFG
through this serial cable with the help of UART transmitter and receiver modules.
When asked, the microcontroller in the frame grabber communicates with the FIFO
buffer and transmits the bits, i.e., the luminance or chrominance values requested by the
FPGA. The luminance or chrominance values are selected using a series of commands
which increase the address of the read pointer accordingly. All the transmitted bits are
stored in a ROM which is instantiated in the FPGA. Once the MCU finishes transmitting
all the bits (8 x 8), the code developed in VHDL comes into the picture. These VHDL
modules are synthesized using Xilinx ISE 9.1. All the necessary place and route
operations are done according to the requirements, and all the inputs and outputs are
assigned to the FPGA pins. Then the code is run on the stored pixel values and a
compressed bit stream of 32 bits is received as the output. The outputs are seen on the
LEDs on the FPGA board.
CHAPTER 2
JPEG COMPRESSION MODULE 
The proposed JPEG standard aims to be generic and support a wide variety of
applications for continuous-tone images [3]. To meet the differing needs of many
applications, the JPEG standard includes two basic compression methods, each with
various modes of operation. A DCT-based method is specified for "lossy" compression,
and a predictive method for "lossless" compression. JPEG features a simple lossy
technique known as the Baseline method, a subset of the other DCT-based modes of
operation.
2.1 Block Diagram
Figure 1 JPEG Compression Module
[Block diagram: Input Image, RGB to YCbCr Conversion, Forward DCT, Quantization (with
Quantization tables), Differential Coder, Run Length Encoder, VLC Encoder, Huffman
Coding (with Huffman Tables), Output Bitstream]
2.1.1 Discrete cosine transformation
The Discrete Cosine Transform (DCT) underlies a lossy compression scheme in which an
N x N image block is transformed from the spatial domain to the DCT domain. Unlike the
related discrete Fourier transform (DFT), the DCT does not have complex values. The DCT
is a separate transform and not the real part of the DFT. It is widely used in image and
video compression applications, e.g., JPEG and MPEG. It is also possible to use the DCT
for filtering using a slightly different form of convolution called symmetric convolution.
The DCT decomposes the signal into spatial frequency components called DCT coefficients
[5]. The lower frequency DCT coefficients appear toward the upper left-hand corner of
the DCT matrix, and the higher frequency coefficients are in the lower right-hand corner
of the DCT matrix. The Human Visual System (HVS) is less sensitive to errors in high
frequency coefficients than it is to errors in lower frequency coefficients. Because of
this, the higher frequency components can be more coarsely quantized, as done by the
quantization matrix. Each value in the quantization matrix is pre-scaled by multiplying
by a single value, known as the quantizer scale code. This value can range from 1 to 112
and is modifiable on a macroblock basis. Quantization is accomplished by dividing each
DCT coefficient by an integer scale factor and rounding the result. This sets the higher
frequency coefficients (in the lower right corner), which are less significant to the
compressed picture, to zero by quantizing in larger steps. The low frequency coefficients
(in the upper left corner) are more significant to the compressed picture and are
quantized in smaller steps. The goal of quantization is to force as many of the DCT
coefficients to zero, or near zero, as possible within the boundaries of the prescribed
bit-rate and video quality parameters. Thus, since quantization throws away some
information, it is a lossy compression scheme.
The data compressed at the transmitter needs to be decompressed at the receiver.
The inverse DCT (IDCT) is used to decompress DCT compressed data in the decoder. DCT
and IDCT are two of the most computation-intensive functions in compression. Therefore,
a fast and optimized DCT/IDCT implementation is essential in improving the performance
of the video coder and decoder.
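The transform pair described above can be illustrated with a minimal software sketch. It
is not the VHDL design of this thesis: it assumes the orthonormal 8 x 8 DCT-II commonly
used with JPEG, evaluates the double sum directly rather than with a fast algorithm, and
uses a hypothetical brightness ramp as the input block. The names dct2 and idct2 are
chosen here for illustration only.

```python
from math import cos, pi, sqrt

N = 8  # block size used by JPEG


def _scale(u):
    # Normalisation factor of the orthonormal DCT-II (assumed convention).
    return 1.0 / sqrt(2.0) if u == 0 else 1.0


def dct2(block):
    """Forward 2-D DCT of an N x N block, evaluated directly (O(N^4))."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * cos((2 * x + 1) * u * pi / (2 * N))
                              * cos((2 * y + 1) * v * pi / (2 * N)))
            out[u][v] = (2.0 / N) * _scale(u) * _scale(v) * total
    return out


def idct2(coeffs):
    """Inverse 2-D DCT, recovering the spatial block from its coefficients."""
    out = [[0.0] * N for _ in range(N)]
    for x in range(N):
        for y in range(N):
            total = 0.0
            for u in range(N):
                for v in range(N):
                    total += (_scale(u) * _scale(v) * coeffs[u][v]
                              * cos((2 * x + 1) * u * pi / (2 * N))
                              * cos((2 * y + 1) * v * pi / (2 * N)))
            out[x][y] = (2.0 / N) * total
    return out


# Hypothetical smooth block: a gentle brightness ramp.
block = [[100 + 2 * x + 3 * y for y in range(N)] for x in range(N)]
coeffs = dct2(block)
restored = idct2(coeffs)
max_error = max(abs(restored[x][y] - block[x][y])
                for x in range(N) for y in range(N))
largest = max(abs(coeffs[u][v]) for u in range(N) for v in range(N))
print("DC term holds the largest magnitude:", abs(coeffs[0][0]) == largest)
print("round trip is exact up to rounding:", max_error < 1e-9)
```

Without quantization the round trip is lossless up to floating-point rounding, which is
consistent with the loss being introduced by the quantization step rather than by the
transform itself.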
2.1.2 Quantization
The first processing step breaks the image into a stream of 8x8 blocks of pixels and
transforms these grayscale values into the frequency domain using a Forward DCT
(FDCT) [6]. Transforming the pixel values into the frequency domain causes most of the
energy to reside in the DC and low frequency terms. This occurs because pixel values do
not vary much within such a small region, and it generally yields greater compression
ratios. The output of the FDCT is a set of 64 basis-signal amplitudes. These
amplitudes, or coefficients, are then uniformly quantized with a 64-element quantization
table (QTABLE).
Figure 2 Uniform Quantizer
[Plot of a mid-tread uniform quantizer: output reconstruction levels versus input
decision levels d1, d2, d3. Annotations: dead zone, where the output is zero for an
input in [-d1, d1]; an input between two decision levels maps to one reconstruction
level; quantization is lossy.]
For 12-bit imagery, each element of the QTABLE can be in the range of 1 to 4095.
This value specifies the scale factor, or step size, that is applied to the corresponding
coefficient. Scaling the coefficients with the QTABLE results in the greatest source of
pixel reconstruction error in JPEG, but also provides the greatest amount of compression.
After quantization, the resulting values are rounded to the nearest integer. Finally, the
coefficients are entropy encoded using either Huffman coding or arithmetic coding.
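The scale-and-round step can be sketched as follows. This is an illustration only: the
2 x 2 coefficient and table values are hypothetical (a real QTABLE has 64 entries), and
rounding halves away from zero is an assumption made here, since the text only says
"nearest integer".

```python
from math import floor


def round_half_away(value):
    # Nearest integer, with exact halves rounded away from zero (assumption).
    magnitude = int(floor(abs(value) + 0.5))
    return magnitude if value >= 0 else -magnitude


def quantize(coeffs, qtable):
    """Divide each coefficient by its step size and round to an integer level."""
    return [[round_half_away(c / q) for c, q in zip(c_row, q_row)]
            for c_row, q_row in zip(coeffs, qtable)]


def dequantize(levels, qtable):
    """Decoder side: multiply each level by its step size."""
    return [[level * q for level, q in zip(l_row, q_row)]
            for l_row, q_row in zip(levels, qtable)]


# Hypothetical values, shown as a 2 x 2 corner of a block for brevity.
coeffs = [[940.0, -31.0],
          [12.0, 3.0]]
qtable = [[16, 11],
          [12, 24]]

levels = quantize(coeffs, qtable)
rebuilt = dequantize(levels, qtable)
print(levels)   # integer levels that would be passed to the entropy coder
print(rebuilt)  # reconstructed coefficients; they differ from coeffs
```

The difference between `coeffs` and `rebuilt` is the reconstruction error that the
paragraph above attributes to the QTABLE, and the small coefficient quantized to zero
shows where the compression gain comes from.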
The amount of compression is controlled by quantizing the coefficients resulting from
the FDCT. Our first attempt at finding an appropriate QTABLE for the cervical image set
looked at differences between the distribution of the noise and signal coefficients. This
proved futile due to the high resolution of the images relative to the 8x8 DCT block size.
That is, since the resolution of the image is so high, anatomical structure is spread among
many 8x8 blocks. This results in blocks with low variance and shifts most of the energy
into the lowest frequency component, known as the DC coefficient. This component is
labeled DC with reference to the terminology of direct current. The remaining
coefficients are all labeled as AC.
2.1.3 Run Length Encoder
Run-length Encoding, or RLE, is a technique used to reduce the size of a repeating
string of characters. This repeating string is called a run; typically RLE encodes a run of
symbols into two bytes, a count and a symbol. RLE can compress any type of data
regardless of its information content, but the content of the data to be compressed affects
the compression ratio. RLE cannot achieve high compression ratios compared to other
compression methods, but it is easy to implement and is quick to execute.
Compression is normally measured with the compression ratio:
Compression Ratio = original size / compressed size : 1
In run-length encoding, a repetitive source such as a string of numbers can be represented
in a compressed form. For example,
1, 4, 5, 1, 4, 5, 1, 4, 5
can be compressed to form
3 (1, 4, 5)
thus giving a compression ratio of 9/4 : 1, which is roughly 2:1.
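A minimal sketch of the count-and-symbol form of RLE described at the start of this
section is given below. The function names and the sample data are hypothetical, and
sizes are counted in stored values rather than bytes. Note that this simple form encodes
runs of a single repeated symbol, not repeated groups of symbols.

```python
def rle_encode(symbols):
    """Encode a sequence as (count, symbol) pairs, one pair per run."""
    runs = []
    for symbol in symbols:
        if runs and runs[-1][1] == symbol:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, symbol])  # start a new run
    return [(count, symbol) for count, symbol in runs]


def rle_decode(runs):
    """Expand (count, symbol) pairs back into the original sequence."""
    decoded = []
    for count, symbol in runs:
        decoded.extend([symbol] * count)
    return decoded


def compression_ratio(original_size, compressed_size):
    return original_size / compressed_size


# Hypothetical data with long runs, e.g. zeros among quantized coefficients.
data = [0, 0, 0, 0, 0, 0, 7, 7, 7, 0, 0, 0]
runs = rle_encode(data)
ratio = compression_ratio(len(data), 2 * len(runs))  # two values per run

# Data without repeated neighbours grows instead of shrinking.
poor = rle_encode([1, 4, 5, 1, 4, 5, 1, 4, 5])
poor_ratio = compression_ratio(9, 2 * len(poor))
print(runs, ratio, poor_ratio)
```

The second case illustrates the remark above that the content of the data determines the
compression ratio: with no adjacent repeats, each symbol costs two stored values.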
2.1.4 Entropy and Variable Length Codeword
Uniform-length codeword assignment is not in general optimal in terms of the
required average bit rate. Suppose some message possibilities are more likely to be sent
than others. Then by assigning shorter codewords to the more probable message
possibilities and longer codewords to the less probable message possibilities, we may be
able to reduce the average bit rate.
Codewords whose lengths are different for different message possibilities are called
variable-length codewords. When the codeword is designed based on the statistical
occurrence of different message possibilities, the design method is called statistical
coding. To discuss the problem of designing codewords such that the average bit rate is
minimized, we define the entropy H as:
\[ H = -\sum_{i=1}^{L} P_i \log_2 P_i \]
where $P_i$ is the probability that the message will be $a_i$. It can be shown that
\[ 0 \le H \le \log_2 L \]
The entropy H can be interpreted as the average amount of information that a
message contains. Suppose L = 2. If $P_1 = 1$ and $P_2 = 0$, H is zero, which is the
minimum possible for L = 2. In this case the message is $a_1$ with probability 1; i.e.,
the message contains no new information. At the other extreme, suppose
$P_1 = P_2 = 1/2$. The entropy H is 1, which is the maximum possible for L = 2. In this
case the two message possibilities $a_1$ and $a_2$ are equally likely. Receiving the
message clearly adds new information.
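The two extreme cases for L = 2 can be checked numerically with a short sketch. The
function name entropy is chosen here for illustration, and terms with zero probability
are skipped, following the usual convention that they contribute nothing to H.

```python
from math import log2


def entropy(probabilities):
    """Entropy in bits; zero-probability terms are skipped by convention."""
    return sum(p * log2(1.0 / p) for p in probabilities if p > 0)


certain = entropy([1.0, 0.0])      # message is always a1
balanced = entropy([0.5, 0.5])     # a1 and a2 equally likely
skewed = entropy([0.9, 0.1])       # hypothetical in-between case
print(certain, balanced, skewed)
```

Any other choice of two probabilities, such as the hypothetical 0.9 and 0.1 above, gives
a value strictly between the two bounds.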
2.1.5 Huffman Encoding
Huffman encoding is a lossless data compression algorithm which uses a small number of
bits to encode common characters. Huffman coding approximates the probability for each
character as a power of 1/2 to avoid complications associated with using a non-integral
number of bits to encode characters using their actual probabilities.
Huffman coding works on a list of weights $\{w_i\}$ by building an extended binary tree
with minimum weighted path length. It proceeds by finding the two smallest weights,
$w_1$ and $w_2$, viewed as external nodes, and replacing them with an internal node of
weight $w_1 + w_2$. The procedure is then repeated stepwise until the root node is
reached. An individual external node can then be encoded by a binary string of 0s (for
left branches) and 1s (for right branches).
Figure 3 Illustration of codeword generation in Huffman Coding
[Tree diagram: messages a1 to a6 with their codes and probabilities, joined pairwise
into nodes whose branches are labeled 0 and 1]
An example of Huffman coding is shown in Figure 3. In the example L = 6, with the
probability for each message possibility noted at each node.
Message    Codeword    Probability
a1         0           P1 = 5/8
a2         100         P2 = 3/32
a3         110         P3 = 3/32
a4         1110        P4 = 1/32
a5         101         P5 = 1/8
a6         1111        P6 = 1/32
In the first step of Huffman coding, we select the two message possibilities that have
the two lowest probabilities. We combine them and form a new node with the combined
probability. We assign '0' to one of the two branches and '1' to the other. Reversing this
affects the codeword but not the average bit rate. We continue with this process until we
are left with one message with probability 1. To determine the specific codeword
assigned to each message possibility, we begin with the last node with probability 1,
follow the branches that lead to the message possibility of interest, and combine the 0s
and 1s on the branches.
For example, a4 has codeword 1110. To compare the performance of Huffman coding
with the entropy H and uniform-length codeword assignment for the above example, we
compute the average bit rate achieved by uniform-length codewords, the average bit rate
achieved by Huffman coding, and the entropy, respectively:
\[ \lceil \log_2 6 \rceil = 3 \text{ bits/message} \]
\[ \tfrac{5}{8}(1) + \tfrac{3}{32}(3) + \tfrac{3}{32}(3) + \tfrac{1}{32}(4) + \tfrac{1}{8}(3) + \tfrac{1}{32}(4) = \tfrac{29}{16} \approx 1.813 \text{ bits/message} \]
\[ H = -\sum_{i=1}^{6} P_i \log_2 P_i \approx 1.752 \text{ bits/message} \]
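The merging procedure and the comparison can be reproduced with a short sketch. It is
one possible software rendering, not the hardware encoder of this thesis: it tracks only
codeword lengths (which is all the average bit rate depends on), uses a binary heap to
find the two smallest probabilities, and takes the six probabilities from the example
table. The function and variable names are chosen here for illustration.

```python
import heapq
from fractions import Fraction
from math import ceil, log2


def huffman_code_lengths(probabilities):
    """Return {message: codeword length} by repeatedly merging the two
    least probable nodes, as in the Huffman procedure."""
    heap = [(p, index, [name])
            for index, (name, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    lengths = {name: 0 for name in probabilities}
    next_id = len(heap)  # unique tie-breaker so lists are never compared
    while len(heap) > 1:
        p_low, _, names_low = heapq.heappop(heap)
        p_next, _, names_next = heapq.heappop(heap)
        merged = names_low + names_next
        for name in merged:
            lengths[name] += 1  # every merge adds one bit to these codewords
        heapq.heappush(heap, (p_low + p_next, next_id, merged))
        next_id += 1
    return lengths


# Probabilities from the example table.
probabilities = {
    "a1": Fraction(5, 8), "a2": Fraction(3, 32), "a3": Fraction(3, 32),
    "a4": Fraction(1, 32), "a5": Fraction(1, 8), "a6": Fraction(1, 32),
}

lengths = huffman_code_lengths(probabilities)
huffman_rate = sum(p * lengths[name] for name, p in probabilities.items())
uniform_rate = ceil(log2(len(probabilities)))
entropy_bits = sum(float(p) * log2(1 / float(p)) for p in probabilities.values())

print("codeword lengths:", lengths)
print("uniform:", uniform_rate, "Huffman:", float(huffman_rate),
      "entropy:", round(entropy_bits, 3))
```

Exact fractions are used for the probabilities so that the average bit rate comes out as
a rational number; the Huffman rate lies between the entropy and the uniform-length rate.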
CHAPTER 3
MOTION ESTIMATION MODULE
3.1 Block Diagram
Figure 4 MPEG Module
[Block diagram: Input Color Frames (2), Subtractor, DCT, Quantized DCT Coefficients,
IDCT, Motion Estimator, Motion Compensator, Motion Vectors, Huffman/Run Length Coder,
Video Out]
3.2 What is Motion Estimation?
Motion estimation is the process that generates the motion vectors that determine
how each motion compensated prediction frame is created from the previous frame.
Block Matching (BM) is the most common method of motion estimation. Typically each
macroblock (16x16 pixels) in the new frame is compared with shifted regions of the
same size from the previous decoded frame, and the shift which results in the minimum
error is selected as the best motion vector for that macroblock. The motion compensated
prediction frame is then formed from all the shifted regions from the previous decoded
frame.
BM can be very computationally demanding if all shifts of each macroblock are
analyzed. For example, to analyze shifts of up to ±15 pixels in the horizontal and vertical
directions requires 31 x 31 = 961 shifts, each of which involves 16 x 16 = 256 pixel
difference computations for a given macroblock. This is known as exhaustive search BM.
Significant savings can be made with hierarchical BM, in which an approximate motion
estimate is obtained from an exhaustive search using a low-pass subsampled pair of
images, and then the estimate is refined by a small local search using the full resolution
images. Subsampling 2:1 in each direction reduces the number of macroblock pixels and the
number of shifts by 4:1, producing a computational saving of 16:1.
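The operation counts quoted above can be tallied with a short sketch. The coarse search
range of ±7 pixels on the subsampled images and the ±1 pixel refinement at full
resolution are assumptions made here for illustration; the text gives only the nominal
4:1 and 16:1 factors, and the exact saving depends on how the ±15 range is rounded after
subsampling and on the size of the refinement search.

```python
def search_cost(max_shift, block_size):
    """Return (shifts, differences per shift, total pixel differences)
    for an exhaustive search over +/- max_shift in both directions."""
    shifts = (2 * max_shift + 1) ** 2
    differences_per_shift = block_size * block_size
    return shifts, differences_per_shift, shifts * differences_per_shift


# Exhaustive search from the text: +/-15 pixels, 16x16 macroblock.
full_shifts, full_diffs, full_cost = search_cost(15, 16)

# Hierarchical search: coarse stage on 2:1 subsampled images (assumed +/-7,
# 8x8 block), then a local refinement at full resolution (assumed +/-1).
coarse_shifts, coarse_diffs, coarse_cost = search_cost(7, 8)
refine_shifts, refine_diffs, refine_cost = search_cost(1, 16)
hierarchical_cost = coarse_cost + refine_cost

print("exhaustive:", full_shifts, "shifts x", full_diffs, "=", full_cost)
print("coarse stage:", coarse_shifts, "shifts x", coarse_diffs, "=", coarse_cost)
print("saving of coarse stage alone:", round(full_cost / coarse_cost, 1))
print("saving including refinement:", round(full_cost / hierarchical_cost, 1))
```

Under these assumptions the coarse stage alone comes out close to the nominal 16:1
figure, and the refinement search reduces the overall saving somewhat.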
Figure 5 Motion Estimator Module
[Block diagram: Current frame control, Reference frame control, Current frame storage,
Reference frame storage, SAD, Motion vectors]
3.2.1 Reference frame storage
This is a very important part of this module. The current macroblocks are 16x16
blocks which contain the current frame information and have to be compared with the
reference macroblocks which are already stored.
3.2.2 Current frame storage
The current frame storage is done in a similar way to the reference frame storage. The
only difference is that segregation has to be done for 16x16 macroblocks. It requires less
memory and is faster, as each time only 256 bytes of luminance pixels have to be read.
3.2.3 Reference frame control
This module is a sliding window controller which sweeps across the 32x32 search area,
i.e. 1024 pixels. The macroblock at each position of the sliding window is latched
and fed to the SAD module for further computation. Because the sliding window covers
the entire search area, the probability of finding the best match increases.
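As a behavioural illustration of the sliding window, the following Python sketch evaluates every 16x16 position inside a 32x32 search area (17 x 17 = 289 candidates) and keeps the one with the lowest SAD. The function names and the list-of-rows image layout are assumptions made for this example; the hardware module feeds the candidates to the SAD block rather than looping in software.

```python
import random

def sad(current, area, ox, oy, n=16):
    """Sum of absolute differences between the current block and the
    n x n window of the search area whose top-left corner is (ox, oy)."""
    return sum(abs(current[r][c] - area[oy + r][ox + c])
               for r in range(n) for c in range(n))

def full_search(current, area, n=16):
    """Slide an n x n window over the whole search area and return
    (best SAD, x offset, y offset)."""
    size = len(area)
    best = None
    for oy in range(size - n + 1):
        for ox in range(size - n + 1):
            cost = sad(current, area, ox, oy, n)
            if best is None or cost < best[0]:
                best = (cost, ox, oy)
    return best

# Synthetic test data: a 32x32 search area and a current macroblock
# copied from offset x=5, y=9, so a perfect match exists there.
rng = random.Random(0)
area = [[rng.randrange(256) for _ in range(32)] for _ in range(32)]
current = [row[5:21] for row in area[9:25]]
best = full_search(current, area)
print(best)
```

The offsets returned here are relative to the top-left corner of the search area; converting them to a motion vector relative to the macroblock position is left out of the sketch.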
3.2.4 Current Frame Control
This module is an 8-byte Shift Register (SR) which shifts in the 256 different luminance
values of the current macroblock. As soon as all 256 values have been clocked into the
shift register, they are latched into the 256-in/256-out structure, which then feeds them
concurrently into the input of the SAD block. The synchronization is somewhat complex,
but not as complex as in the case of logarithmic algorithms.
3.2.5 SAD Module
The SAD module takes the reference and current macroblocks as inputs and outputs the
XY-coordinates of the motion vectors. An important metric used in motion estimation is
the sum of absolute differences (SAD):
$$\mathrm{SAD} = \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} \left| C(x+k,\, y+l) - R(x+i+k,\, y+j+l) \right|$$
Here C denotes the current frame, R the reference frame, and (i, j) the candidate shift of
an M x N block located at (x, y). The absolute
difference operation can be implemented in several ways: serial, per column in parallel,
per row in parallel, and fully parallel. The implementation described in [5] focuses on the
SAD16 operation, which performs the SAD on one row of a macroblock (16x1). All the
input values are 8-bit unsigned binary numbers. By iteration or parallel execution of the
SAD16 operation, the complete SAD operation for the 16x16 macroblock can be
performed. The steps necessary to perform the 16x1 SAD operation are, in more detail:
• Determine the smaller of the two operands: As suggested in [3, 4], it is only
necessary to determine whether (A' + B) produces a carry or not.
• Invert the smaller operand: If no carry was produced then B must be inverted;
otherwise, A must be inverted. This is done using an EXOR operation.
• Pass both operands to an adder tree: After inverting either A or B, the operands
must be passed to an adder tree. Thus, the values (A', B) or (A, B') are passed
on.
• Add a correction term to the adder tree: An additional correction term must
be added to the adder tree, which is 16 in this case, i.e. adding 1 for each of the 16
absolute differences.
• Reduce the 33 addition terms to 2: All 33 addition terms must be reduced to 2
terms before the final addition can be applied. This can be done using an 8-stage
carry save adder tree using 243 carry save adders.
• Add the remaining two terms using an adder: The final two addition terms are
added using an 8-bit carry look ahead adder for the most significant bits. The
result is a 13-bit unsigned binary number. However, as stated in [4, 5], the most
significant bit of this result can be disregarded, resulting in a final 12-bit unsigned
binary number.
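The steps above can be mimicked in software to see why the inversion trick works. The sketch below is a behavioural Python model, not the hardware description from [5]; the name `sad16_model` is hypothetical, and the adder tree is replaced by a plain sum. Inverting the smaller of two 8-bit operands and adding 1 yields |A - B| + 256, so the 16 pairs plus the correction term sum to SAD + 4096, and discarding bit 12 leaves the 12-bit SAD.

```python
import random

def sad16_model(a_row, b_row):
    """Model of the 16x1 SAD using operand inversion (8-bit inputs)."""
    terms = []
    for a, b in zip(a_row, b_row):
        # A' + B produces a carry out of 8 bits exactly when B > A.
        carry = ((a ^ 0xFF) + b) > 0xFF
        if carry:
            terms += [a ^ 0xFF, b]   # A is smaller: pass (A', B)
        else:
            terms += [a, b ^ 0xFF]   # B is smaller or equal: pass (A, B')
    terms.append(16)                 # correction term: +1 per pair
    assert len(terms) == 33          # 32 operands + 1 correction term
    total = sum(terms)               # stands in for the carry save adder tree
    return total & 0xFFF             # drop the MSB of the 13-bit result

rng = random.Random(1)
a = [rng.randrange(256) for _ in range(16)]
b = [rng.randrange(256) for _ in range(16)]
reference = sum(abs(x - y) for x, y in zip(a, b))
print(sad16_model(a, b) == reference)
```

The largest possible row SAD is 16 x 255 = 4080, which fits in 12 bits, so masking the result never loses information.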
CHAPTER 4
HARDWARE IMPLEMENTATION SETUP
[Figure 6: block diagram showing the software on the PC, the Virtex-4 board, the RS-232 link and the uCFG]
Figure 6 Hardware Implementation Setup
4.1 Virtex-4 Board
[Figure 7: block diagram of the evaluation board built around the Xilinx Virtex-4 FPGA]
Figure 7 Virtex-4 FX12 Evaluation Board Block Diagram
4.1.3 Virtex-4 FPGA
The FPGA used in our implementation is a Virtex™-4 FPGA produced by Xilinx.
Virtex™-4 family FPGAs offer the functionality and performance to address the widest
range of demanding applications. The family has added enhancements that accelerate
productivity by simplifying system design and providing the margin that makes it easier
to achieve design targets. The major elements of the Virtex™-4 FPGA are
Configurable Logic Blocks
PowerPC® Processor
Smart RAM
The Virtex™-4 FPGA features arrays of CLBs arranged in columns surrounded on all
sides by input/output blocks (IOBs). The CLB is optimized for area and speed for
compact, high-performance designs. There are four slices per CLB, which can implement
any combinatorial and sequential circuit, and each slice has 4-input look-up tables (LUTs),
flip-flops, multiplexers, arithmetic logic, carry logic, and dedicated internal routing.
Figure 8 Configuration Logic Block
The Virtex™-4 FPGAs provide up to two PowerPC 405 32-bit RISC processor
cores in a single device. Flexible partitioning of the system into hardware and software
supports custom hardware acceleration and co-processing (control plane processing).
Figure 9 PowerPC Processor
The Virtex™-4 Smart RAM hierarchy not only enables compact utilization and high
performance, but also allows any CLB Look-Up Table (LUT) to be configured as a fast,
compact 16-bit shift register, which can implement pipeline registers and buffers for
video and wireless applications.
Virtex-4 FPGAs provide up to 960 user I/Os supporting over 20 single-ended and
differential electrical I/O standards, enabling several parallel system interface standards
on one device. ChipSync™ technology built into every I/O block simplifies
source-synchronous interfacing to the latest high-speed components. In addition, with
XCITE technology, each I/O block delivers on-chip active I/O termination, eliminating
external termination resistors to increase signal integrity, save board space, and
reduce system cost.
To ensure reliable data transfer between a new generation of high-speed devices,
hardware designers are turning to source-synchronous design techniques, in which the
component sending the data generates and issues its own clock signal along with the data
that it transmits. ChipSync technology simplifies component interface design with critical
built-in circuitry that is available in every Virtex-4 I/O. I/O termination is required to
maintain signal integrity. With hundreds of I/Os and advanced package technologies,
external termination resistors are no longer viable. All Virtex-4 I/O structures include
third-generation Xilinx Controlled Impedance Technology (XCITE) on-chip active I/O
termination. These built-in circuits dynamically eliminate drive strength variation
resulting from process, temperature, and voltage fluctuations.
4.1.3 Clocks
The available clock sources on the Virtex-4 FX12 Evaluation Board are listed below.
• Single-ended, 100 MHz Oscillator - FPGA pin “AD11”
• 8-pin DIP Clock Socket - FPGA pin “AD12”
The on-board 100 MHz oscillator is used as the clock source for all designs. The 8-pin
DIP clock socket allows the user to supply their oscillator of choice.
4.1.3 Memory
The Virtex-4 FX12 Evaluation Board is populated with both high-speed RAM and
non-volatile ROM to support various types of applications. The board has 32 Megabytes
(MB) of DDR SDRAM and 4 MB of Flash.
XCITE DCI TECHNOLOGY ADVANTAGES
Advantage | Details
2nd generation technology | Proven in the field and used extensively by customers.
Lowers cost | Fewer resistors, fewer PCB traces and smaller board area result in lower PCB costs.
Absolute I/O flexibility | Any termination on any I/O bank. Non-XCITE technology alternatives deliver limited functionality.
Maximum I/O bandwidth | Less ringing and fewer reflections maximize I/O bandwidth.
Immunity to temperature and voltage changes | Temperature and voltage variations lead to significant impedance mismatches. XCITE technology dynamically adjusts on-chip impedance to such variations, improving reliability.
Eliminates stub reflection | Improves on discrete termination techniques by eliminating the distance between the package pin and resistor.
Increases system reliability | Fewer components on board deliver higher reliability.
Figure 10 XCITE DCI Technology Advantages
4.1.4 DDR SDRAM
Two Micron DDR SDRAM devices make up the 32-bit data bus. Each device
provides 16 MB of memory on a single IC and is organized as 2 Megabits x 16 x 4 banks
(128 Megabit). The Virtex-4 FX12 Evaluation Board can support larger devices, with
addressing support for up to 256 MB (two 1-Gigabit devices). The device has an
operating voltage of 2.5V and the interface is JEDEC Standard SSTL_2 (Class I for
unidirectional signals, Class II for bidirectional signals). The -75 speed grade supports
7.5 ns cycle times with a 2.5-clock read latency (DDR266B).
4.1.5 Flash Memory
Non-volatile data storage is provided in the form of Flash memory. A single Intel
StrataFlash® device makes up the 16-bit data bus. This device provides 4 MB of
memory on a single IC and is organized as 2 Megabits x 16 (32 Megabit). The device has
an operating voltage of 3.0V.
4.1.6 User I/O
Basic user I/O is provided on the Virtex-4 FX12 Evaluation Board in the form of
switches and LED indicators. These peripherals are used to display the compressed bit
stream which is the output of the JPEG compression module.
4.1.7 Push Buttons
Two momentary closure push buttons have been installed on the board and attached
to the FPGA. These buttons are used for logic reset and push functions. Pull-down
resistors hold the signals low (0) until the switch closure pulls them high (1).
Button | Function | Net | FPGA pin
SW1 | Pushing compressed bit stream to the LEDs | SWITCH_PB1 | Y19
SW2 | - | SWITCH_PB2 | Y20
Figure 11 Switch Configuration
4.1.8 DIP-switch
An eight-position DIP-switch (SPST) has been installed on the board and attached to the
FPGA. These switches provide digital inputs to user logic as shown below. The signals
are pulled low (0) by 10K ohm resistors when the switch is open and tied to 3.3V (1)
when the switch is closed.
Switch | Function | Net | FPGA pin
S1-1 | ACDC | SWITCH0 | AB24
S1-2 | ACDC1 | SWITCH1 | AB23
S1-3 | LAST | SWITCH2 | AC25
S1-4 | LOAD | SWITCH3 | AC24
S1-5 | READ_EN | SWITCH4 | AD26
S1-6 | YCbCr 0 | SWITCH5 | AD25
S1-7 | YCbCr 1 | SWITCH6 | AC23
S1-8 | - | SWITCH7 | G10
Figure 12 Pin Configuration - Switches
4.2 Microcontroller Frame Grabber™ (uCFG™)
The uCFG system block diagram is shown in Figure 13. The uCFG provides 4 separate
analog video inputs. All inputs accept either NTSC or PAL composite color video signals,
depending on the board's configuration settings in non-volatile memory. Black & white
(RS-170) video signals are also supported. The uCFG can be set up to operate in
high-quality S-Video mode, where the luma and chroma components are separated into two
discrete signals (Y and C). In this mode, two channels of S-Video are available by using
input pairs 1&3 and 2&4 respectively.
[Figure 13: block diagram containing the video decoder, video field buffer, CPLD, MCU and RS-232 level shifter]
Figure 13 uCFG High Level Diagram
The video inputs are fed to an internal video multiplexer used to select the active
video channel for digitization. The output of the analog multiplexer is provided at the
uCFG's video out terminal for live video preview. This signal is also fed to an internal
video analog-to-digital converter (ADC), where it is converted to an industry-standard
ITU-R BT656 4:2:2 digital video stream. The digital video stream is written to a video
field buffer under the control of an 8-bit microcontroller (MCU) and a complex
programmable logic device (CPLD). Once a single field of color video is stored in the
buffer, the 8-bit MCU can read out the data and transmit it through its serial port
(LVTTL, +3.3V level). Four external trigger inputs with programmable polarity are
provided to trigger acquisition of a video field from the corresponding video channel.
Software triggering is also possible through ASCII commands. For serial interface
compatibility with RS-232 levels, an external level shifter such as the Sipex SP3232 must
be added.
4.1.5 FIFO Read Pointer Control
When a video field grab is commanded, either through the software GRAB/TRIG commands
or a hardware trigger, the uCFG board digitizes the very next video field (odd or even
depending on the request) in the ITU-R BT656 4:2:2 digital video format. This data stream is
stored in a long First-In/First-Out (FIFO) memory field buffer. A FIFO is simply a
memory storage element which is linearly accessible through a read pointer. This
operation essentially freezes the digital data stream for the entire field as-is in the FIFO.
Once the capture is complete, the MCU can read out the contents of the FIFO one byte at
a time, starting from the beginning of the stored digital stream; each read increments the
read pointer to the next data byte in the stored stream. It is also possible to directly
increment the read pointer in the FIFO by an arbitrary amount without reading the data.
This is useful to skip some data samples for horizontal decimation, or even to skip an entire
row's worth of data for vertical decimation. The read pointer can also be reset to point
back to the beginning of the stream with the RRST command. The only restriction is that
the pointer cannot be decremented directly. Figure 14 shows the physical layout of
the uCFG's internal FIFO buffer. As shown, the FIFO is a byte-wide uninterrupted
memory buffer accessible with the help of a read pointer (used to index the memory
location to read). For simplicity, the FIFO in Figure 14 has been split up into separate video
lines even though all the data is actually continuous inside the FIFO. As can be seen, one
line (720 pixels wide) of video data actually occupies 1440 bytes of physical FIFO space.
Note that every second luma (Y) sample is surrounded by its chroma (Cb and Cr) color
components, which are co-sited in space (i.e. belong to the same pixel). Every other luma
sample is alone (due to the 4:2:2 chroma decimation).
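The byte layout described above can be made concrete with a short Python sketch. It assumes the sample order Cb, Y, Cr, Y repeating from FIFO address 0 (as the text states that address 0 holds Cb1 and address 1 holds Y1); the function name `split_422_line` and the synthetic test line are purely illustrative.

```python
def split_422_line(line):
    """Split one 4:2:2 video line (Cb Y Cr Y ...) into Y, Cb and Cr."""
    if len(line) % 4 != 0:
        raise ValueError("a 4:2:2 line must be a multiple of 4 bytes")
    cb = line[0::4]   # offsets 0, 4, 8, ...  one Cb per two pixels
    y = line[1::2]    # offsets 1, 3, 5, ...  one Y per pixel
    cr = line[2::4]   # offsets 2, 6, 10, ... one Cr per two pixels
    return y, cb, cr

# Synthetic 1440-byte line whose byte values encode their own offset
# (modulo 256), which makes the slicing easy to inspect.
line = bytes(i % 256 for i in range(1440))
y, cb, cr = split_422_line(line)
print(len(y), len(cb), len(cr))  # 720 360 360
```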
The host controller is in full control of the FIFO access operations and readout. In
fact, the host is responsible for issuing the appropriate commands to skip data samples
when it requires image decimation. A number of example commands, as well as their
effect on the read pointer, are illustrated in Figure 14.
(a) Different commands increment the read pointer directly. RRST resets the pointer to 0,
the start of the FIFO. SEND 1 1 1 simply reads the current byte and increments the
pointer by 1. RINC with no parameters assumes 1 by default and increments the pointer.
RINC with parameters increments the pointer by the specified amount.
(b) To download a black & white image only (the luma component of the color image) at
the full 720 resolution, simply position the read pointer on the first luma sample of
interest with RINC and issue a SEND 2 n 1, where n is the number of bytes to read. Here
the command will read and return every 2nd sample until a total of n bytes have been
returned on the serial port. To read a 1/2 decimated version, use SEND 4 n 1, which skips
every 2nd luma sample (i.e. returns every 4th sample). The possibilities are quite numerous.
(c) & (d) illustrate color component download. Simply position the pointer at the
appropriate starting position and issue a SEND 4 n 1 to read out every 4th sample.
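The read pointer behaviour can be sketched as a small Python model. This is not the uCFG firmware: the class `FifoModel` and its method names are hypothetical, compression (the third SEND parameter) is left out, and the model assumes that SEND returns the byte at the pointer and then advances the pointer by the skip factor.

```python
class FifoModel:
    """Behavioural model of a FIFO with a forward-only read pointer."""

    def __init__(self, data):
        self.data = bytes(data)
        self.ptr = 0

    def rrst(self):
        """RRST: reset the read pointer to the start of the FIFO."""
        self.ptr = 0

    def rinc(self, amount=1):
        """RINC: advance the pointer without reading (default 1)."""
        if amount < 0:
            raise ValueError("the pointer cannot be decremented")
        self.ptr += amount

    def send(self, skip, count):
        """SEND skip count 1: return `count` bytes, stepping by `skip`."""
        out = bytearray()
        for _ in range(count):
            out.append(self.data[self.ptr])
            self.ptr += skip
        return bytes(out)

# Two synthetic video lines of 1440 bytes each.
fifo = FifoModel(i % 251 for i in range(2 * 1440))
fifo.rrst()
fifo.rinc(1)                   # skip Cb1, point at Y1
luma_line1 = fifo.send(2, 720) # every 2nd byte: all Y samples of line 1
print(len(luma_line1), fifo.ptr)  # 720 1441
```

After the first luma line the pointer sits at address 1441, which is the first Y sample of line 2, so the same SEND can simply be repeated for the next line.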
[Figure 14: layout of the FIFO, drawn as separate video lines of 1440 bytes each (line 1
at addresses 0 to 1439, line 2 starting at 1440, line 3 starting at 2880), with samples
ordered Cb1, Y1, Cr1, Y2, Cb3, Y3, Cr3, Y4, ... Panels: (a) various commands incrementing
the FIFO read pointer; (b) black & white (Y luma) component read with no compression;
(c) Cb chroma component read with no compression; (d) Cr chroma component read with no
compression.]
Figure 14 FIFO Read Point Buffer
To read a full color image, first read all the luma values (Y), then issue a read pointer 
reset command RRST. Then, read all the Cb components and issue another read pointer 
reset. Finally, read out all the Cr components.
It should now be clear how powerful the different FIFO commands are.
They give full control over the position of the read pointer, the skip factor, and the
amount of data to be downloaded. As an example, reading out a sub-image (region of
interest) is quite simple: direct the read pointer to the correct locations
and selectively download the data. This feature could be used to perform a quick
coarse thumbnail image preview, followed by a high-resolution download of an area of
interest. Another example is downloading a black and white image only, but with color
information for a small region of interest within the master image. All these techniques
reduce the downloaded image data size.
To use the compression feature, simply set the 3rd parameter of the SEND command
(compression ratio) to 2, 4 or 8, and a compressed image data stream is returned instead
of the raw sample values. For example, SEND 2 3 2 would return 3 compressed bytes at 2:1
compression. This means that 3 compressed bytes * 2 compressed pixels / byte = 6 image
samples in total, skipping every 2nd sample in the FIFO (i.e. Y samples only). The number
of bytes returned from this call would be 1 integrator reset byte (for decompression) as
header + 3 compressed data bytes = 4 bytes. Simply feed this compressed data vector of
4 bytes into the decompression routine, and the original vector of 6 raw image samples
will be regenerated. Please note that the compression is done on-the-fly as data is
returned on the serial port. The original raw image data in the FIFO remains unaltered
regardless of the way in which the data is read out (i.e. decimated, compressed or not).
The only time the data is altered is when the next field grab is commanded (or when the
power is removed). It is therefore possible to read out the data multiple times with
different settings. This can be useful to first obtain a thumbnail, followed by a full
download.
4.1.5 Grabbing and Downloading an Uncompressed Field
A very common operation performed with the uCFG is to grab a field of video
and download it. Figure 15 demonstrates one possible sequence of commands to grab
and download a full-resolution field of NTSC color video (720x240).
First the video channel select command CSEL is issued (this is optional; otherwise the
last selected channel will be used). The GRAB command is then issued with the odd (O)
field parameter (use E for the even field). After execution of the GRAB
command, the return value is verified. If an error occurred, then either the selected
channel has no valid video signal connected, or the signal is of the wrong standard
(PAL/NTSC). Upon a successful field grab, the triggers are temporarily disabled with the
TREN 0 command. This prevents any new image (in the event of a trigger) from
overwriting the FIFO data while downloading. Next, three passes are performed through
the FIFO. The first pass reads out all the Y luma samples. The second pass reads
out the Cb chroma samples, and the third, the Cr chroma samples. Of course, the samples
could all be read in one pass, but it is simpler on the host side to split the download into
three passes. Likewise, to read black & white data only, a single pass would be performed
to collect the luma samples.
On the first pass, the FIFO read pointer is reset (RRST). As seen in Figure 14, FIFO
address 0 actually points to the Cb1 sample (Cb value of the first pixel of line 1). The read
pointer therefore needs to be shifted by 1 to point to the first Y1 sample. This is done
with the RINC 1 command. Next, for each line, a SEND 2 720 1 command is issued to
send out every 2nd sample in the FIFO from the start position with no compression. This
actually reads out all the Y samples of the first line. The process is then repeated for all
the 240 lines of the image.
[Figure 15: flowchart. CSEL 1, GRAB O and TREN 0 are followed by three download passes,
each repeated until 240 lines are read: Y component (RRST, RINC 1, SEND 2 720 1, read 720
bytes), Cb component (RRST, SEND 4 360 1, read 360 bytes) and Cr component (RRST, RINC 2,
SEND 4 360 1, read 360 bytes). TREN 1 ends the sequence.]
Figure 15 Grabbing Uncompressed Field
In this example, the host computer has enough buffer space to hold 720 bytes of
downloaded data. However, in the case of a resource-limited host (8-bit MCU), it may be
desirable to instead download only a few bytes at a time. This is easy to do by changing
the parameters of the SEND command. Of course, the increased protocol overhead
will result in a longer download time. The process is repeated for the 2nd and 3rd passes
by first positioning the read pointer on the first Cb or Cr sample and then downloading
every 4th data sample for a total of 360 per line (remembering that the chroma
information is subsampled by a factor of 2 in the 4:2:2 standard). Finally, once done, the
triggers are re-enabled with TREN 1.
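The whole three-pass download can be written down as a command list. The Python sketch below only generates the ASCII command strings in the order described above; the function name and its parameters are hypothetical, sending the commands over the serial port and checking the GRAB return value are deliberately left out, and the exact command syntax should be taken from the uCFG documentation.

```python
def full_field_commands(channel=1, field="O", lines=240):
    """Return the command sequence for a full-resolution field download."""
    cmds = [f"CSEL {channel}", f"GRAB {field}", "TREN 0"]
    # (pointer offset of the first sample, SEND command for one line)
    passes = [
        (1, "SEND 2 720 1"),  # Y: every 2nd byte, 720 bytes per line
        (0, "SEND 4 360 1"),  # Cb: every 4th byte, 360 bytes per line
        (2, "SEND 4 360 1"),  # Cr: every 4th byte, 360 bytes per line
    ]
    for offset, send in passes:
        cmds.append("RRST")
        if offset:
            cmds.append(f"RINC {offset}")
        cmds.extend([send] * lines)
    cmds.append("TREN 1")     # re-enable the triggers when done
    return cmds

cmds = full_field_commands()
print(cmds[:6])
print(len(cmds))  # 729
```

No extra pointer adjustment is needed between lines in this sequence, because each SEND leaves the pointer on the matching sample of the next line.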
4.1.5 Downloading a Decimated Field
Figure 16 shows a possible sequence of commands to download an uncompressed
color decimated field of half the full NTSC resolution (360x120). Most of the
steps are the same as before. The first difference is that the SEND command for the luma
pass now returns only every 4th sample (i.e. every 2nd Y). This decimates the image
horizontally by a factor of 2. The second difference is that following the download of the
decimated data for a full line, a RINC 1440 command is issued. This command effectively
skips an entire row (or line) of raw video data in the FIFO, which decimates the number of
lines in the downloaded field vertically by a factor of 2.
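To see how the skip factor and RINC 1440 interact, the following Python sketch traces the FIFO address at which each downloaded line starts. It is an illustration under stated assumptions: the function name is hypothetical, each returned byte is assumed to advance the pointer by the skip factor, and the chroma pass is assumed to use SEND 8 180 1 (every 8th byte, 180 bytes per line).

```python
LINE_BYTES = 1440  # one 720-pixel line of 4:2:2 data

def decimated_line_starts(first_sample, skip, count, lines=120):
    """Trace the read pointer at the start of every downloaded line."""
    ptr = first_sample            # after RRST and an optional RINC
    starts = []
    for _ in range(lines):
        starts.append(ptr)
        ptr += skip * count       # SEND skip count 1 walks across one line
        ptr += LINE_BYTES         # RINC 1440 skips the following line
    return starts

y_starts = decimated_line_starts(1, 4, 360)   # luma pass
cb_starts = decimated_line_starts(0, 8, 180)  # Cb pass
cr_starts = decimated_line_starts(2, 8, 180)  # Cr pass
print(y_starts[:3])   # [1, 2881, 5761]
print(cb_starts[:3])  # [0, 2880, 5760]
```

Every downloaded line starts 2880 bytes after the previous one, i.e. only every other line of the stored field is read.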
[Figure 16: flowchart of the Y, Cb and Cr download passes for the decimated field, each
repeated until 120 lines are read; the chroma passes use SEND 8 180 1 (read 180 bytes).]
Figure 16 Downloading Decimated Field
4.2.4 Converting 4:2:2 YCbCr to RGB for PC Display
The ultimate goal of the image downloading is to display the images to the user on a
PC monitor. To perform this operation correctly, a few things must be explained. First,
PC monitors work in the RGB color space. This means that every pixel has an 8-bit
red, green and blue component. The downloaded data from the uCFG is in the YCbCr
color space and, in addition, the color components are decimated by a factor of 2 (4:2:2).
Before displaying an image on a PC, this data must therefore be upsampled and converted
to the RGB domain.
[Figure 17: one 720-pixel image line shown in three stages. (a) Original 4:2:2 data
stream, in which every 2nd chroma sample is missing. (b) All missing chroma samples have
been interpolated from their respective left and right neighbors. (c) Finally, a YCbCr to
RGB transformation has been applied to each pixel; the data is ready for display on a PC
screen.]
Figure 17 YCbCr to RGB Conversion
Figure 17 illustrates the different steps required to convert a 4:2:2 YCbCr data stream
into a non-decimated RGB stream suitable for display on the PC monitor. Step 1 shows a
single line of downloaded 4:2:2 YCbCr color data samples overlaid on a row of
the 720 pixels forming the image line. It can be seen that, because of the decimation,
some of the chroma samples are missing.
In Step 2, all the missing chroma values are filled in with a simple linear interpolation
performed by taking the average of the previous and next respective samples. For
example, Cr2 = (Cr1 + Cr3)/2. After this step, each pixel has a Y, a Cb, and a Cr value.
Step 3 performs a color space conversion from YCbCr to RGB. Each pixel's YCbCr
values are fed through equation 1 and the resulting RGB output vector is obtained, ready
for display on a PC monitor.
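Steps 2 and 3 can be sketched in Python as follows. Two points are assumptions of this example rather than statements from the text: the last pixel of a line has no right-hand chroma neighbor, so the sketch simply repeats the last sample, and the conversion uses the commonly quoted ITU-R BT.601 studio-range coefficients, which may differ in detail from the equation 1 referred to above. The function names are illustrative.

```python
def upsample_chroma(samples):
    """Fill in every missing chroma value by averaging its neighbors."""
    out = []
    for i, value in enumerate(samples):
        out.append(value)
        # Assumption: at the end of the line, repeat the last sample.
        nxt = samples[i + 1] if i + 1 < len(samples) else value
        out.append((value + nxt) // 2)   # e.g. Cr2 = (Cr1 + Cr3) / 2
    return out

def ycbcr_to_rgb(y, cb, cr):
    """Convert one pixel using BT.601 studio-range coefficients."""
    def clamp(v):
        return max(0, min(255, int(round(v))))
    r = 1.164 * (y - 16) + 1.596 * (cr - 128)
    g = 1.164 * (y - 16) - 0.392 * (cb - 128) - 0.813 * (cr - 128)
    b = 1.164 * (y - 16) + 2.017 * (cb - 128)
    return clamp(r), clamp(g), clamp(b)

print(upsample_chroma([100, 120, 140]))  # [100, 110, 120, 130, 140, 140]
print(ycbcr_to_rgb(16, 128, 128))        # (0, 0, 0)
print(ycbcr_to_rgb(235, 128, 128))       # (255, 255, 255)
```

Applied to a full line, `upsample_chroma` turns the 360 Cb (or Cr) samples into 720 values, one per luma sample, after which each pixel can be converted independently.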
4.1.9 RS-232 Interface
An RS-232 interface is used for communication between the FPGA board and the
Microcontroller Frame Grabber (uCFG).
An RS-232 interface has the following characteristics:
• Uses a 9-pin connector "DB-9" (older PCs use a 25-pin "DB-25").
• Allows bidirectional full-duplex communication (the PC can send and receive
data at the same time).
• Can communicate at a maximum speed of roughly 10 KBytes/s.
The connector has 9 pins, but the 3 important ones are:
• pin 2: RxD (receive data).
• pin 3: TxD (transmit data).
• pin 5: GND (ground).
Using just 3 wires, data can be sent and received.
4.1.9.1 Serial communication
Data is sent one bit at a time; one wire is used for each direction. Since computers
usually need at least several bits of data, the data is "serialized" before being sent. Data is
commonly sent in chunks of 8 bits. The LSB (data bit 0) is sent first, the MSB (bit 7)
last.
4.1.9.2 Asynchronous communication
This interface uses an "asynchronous" protocol. That means that no clock signal is
transmitted along with the data. The receiver has to have a way to "time" itself to the
incoming data bits.
In the case of RS-232, that is done this way:
• Both sides of the cable agree in advance on the communication parameters (speed,
format...). That is done manually before communication starts.
• The transmitter sends a "1" when and as long as the line is idle.
• The transmitter sends a "start" (a "0") before each byte transmitted, so that the
receiver can figure out that data is coming.
• After the "start", data comes at the agreed speed and in the agreed format, so the
receiver can interpret it.
• The transmitter sends a "stop" (a "1") after each data byte.
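The framing rules listed above can be illustrated with a short Python sketch that builds and decodes one frame at the logic level (before any voltage translation). The function names are hypothetical, and the frame format is assumed to be 8 data bits, no parity and one stop bit.

```python
def frame_byte(value):
    """Return the bit sequence for one byte: start, 8 data bits, stop."""
    data_bits = [(value >> i) & 1 for i in range(8)]  # LSB (bit 0) first
    return [0] + data_bits + [1]                      # start = 0, stop = 1

def decode_frame(frame):
    """Recover the byte from a 10-bit frame, checking start and stop."""
    if len(frame) != 10 or frame[0] != 0 or frame[9] != 1:
        raise ValueError("framing error")
    return sum(bit << i for i, bit in enumerate(frame[1:9]))

frame = frame_byte(0x55)
print(frame)                     # [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
print(hex(decode_frame(frame)))  # 0x55
```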
The speed is specified in baud, i.e. how many bits per second can be sent. For
example, 1000 baud means 1000 bits per second, so each bit lasts one
millisecond. Common implementations of the RS-232 interface (like the one used in PCs)
do not allow just any speed to be used; a rate such as 123456 baud is not available.
One of the "standard" speeds has to be chosen. Common values are:
• 1200 baud.
• 9600 baud.
• 38400 baud.
• 115200 baud (usually the fastest available).
At 115200 baud, each bit lasts (1/115200) = 8.7 µs. Transmitting 8 bits of data therefore
takes 8 x 8.7 µs = 69 µs. But each byte requires an extra start and stop bit, so
10 x 8.7 µs = 87 µs are actually needed. That translates to a maximum speed of 11.5 KBytes
per second. At 115200 baud, some PCs with buggy chips require a "long" stop bit (1.5 or
2 bits long), which makes the maximum speed drop to around 10.5 KBytes per second.
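The timing figures above follow directly from the frame length, as the following Python sketch shows. The helper names are illustrative, and the frame is assumed to consist of one start bit, the data bits and the given number of stop bits.

```python
def bit_time_us(baud):
    """Duration of one bit in microseconds."""
    return 1e6 / baud

def bytes_per_second(baud, data_bits=8, stop_bits=1):
    """Maximum throughput with one start bit per transmitted byte."""
    bits_per_frame = 1 + data_bits + stop_bits
    return baud / bits_per_frame

print(round(bit_time_us(115200), 2))                 # 8.68
print(bytes_per_second(115200))                      # 11520.0
print(round(bytes_per_second(115200, stop_bits=2)))  # 10473
```

At this rate, and ignoring any protocol overhead, downloading the 345,600 bytes of one full 720x240 4:2:2 field takes about 30 seconds.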
4.1.9.3 Physical layer
The signals on the wires use a positive/negative voltage scheme.
• "1" is sent using -10V (or between -5V and -15V).
• "0" is sent using +10V (or between 5V and 15V).
So an idle line carries something like -10V.
A VHDL UART is used to communicate between the FPGA and the FIFO buffer of the
Frame Grabber.
4.1.10 VHDL-UART
The VHDL-UART used to transfer bits from the FIFO buffer of the uCFG to the
ROM of the FPGA is given below.
4.1.10.1 UART-Receiver
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity UARTReceiver is
  generic
  (
    frequency    : integer;
    baud         : integer;
    oversampling : integer
  );
  port
  (
    clk            : in  std_logic;
    rxd            : in  std_logic;
    rxd_data       : out std_logic_vector(7 downto 0);
    rxd_data_ready : out std_logic
  );
end entity UARTReceiver;

architecture UARTReceiverArch of UARTReceiver is
  -- defining constants
  constant BIT_SPACE : integer := 10; -- 8 to 11 are common
  constant DIVISOR   : integer := 1600;
  constant FREQ_INC  : integer := (oversampling + 1) * baud / DIVISOR;
  constant FREQ_DIV  : integer := frequency / DIVISOR;
  constant FREQ_MAX  : integer := FREQ_DIV + FREQ_INC - 1;
  -- defining types
  type state_type is (idle, bit0, bit1, bit2, bit3, bit4, bit5, bit6, bit7, stop);
  -- defining signals
  signal state         : state_type := idle; -- receiver's state
  signal rxd_sync_inv  : std_logic_vector(1 downto 0);
  signal rxd_cnt_inv   : std_logic_vector(1 downto 0);
  signal rxd_bit_inv   : std_logic;
  signal baud_divider  : integer range 0 to FREQ_MAX := 0;
  signal data          : std_logic_vector(7 downto 0);
  signal baudover_tick : std_logic := '0';
  signal bit_spacing   : integer range 0 to 15;
  signal next_bit      : std_logic := '0';
begin
  -- assignments
  next_bit <= '1' when bit_spacing = BIT_SPACE else '0';

  -- processes
  -- oversampling tick generator
  baud_gen : process(clk)
  begin
    if clk'event and clk = '1' then
      baud_divider <= baud_divider + FREQ_INC;
      if baud_divider >= FREQ_DIV then
        baud_divider <= 0;
        baudover_tick <= '1';
      else
        baudover_tick <= '0';
      end if;
    end if;
  end process baud_gen;

  rxd_sync_inverted : process(clk) -- inverted to suppress phantom character
  begin
    if clk'event and clk = '1' then
      if baudover_tick = '1' then
        rxd_sync_inv <= rxd_sync_inv(0) & not rxd;
      end if;
    end if;
  end process rxd_sync_inverted;

  -- 2-bit saturating counter used to filter the synchronized input
  rxd_counter_inverted : process(clk)
  begin
    if clk'event and clk = '1' then
      if baudover_tick = '1' then
        if rxd_sync_inv(1) = '1' and rxd_cnt_inv /= "11" then
          rxd_cnt_inv <= unsigned(rxd_cnt_inv) + 1;
        elsif rxd_sync_inv(1) = '0' and rxd_cnt_inv /= "00" then
          rxd_cnt_inv <= unsigned(rxd_cnt_inv) - 1;
        end if;
        if rxd_cnt_inv = "00" then
          rxd_bit_inv <= '0';
        elsif rxd_cnt_inv = "11" then
          rxd_bit_inv <= '1';
        end if;
      end if;
    end if;
  end process rxd_counter_inverted;

  -- receiver state machine: idle, eight data bits, stop
  state_proc : process(clk)
  begin
    if clk'event and clk = '1' then
      if baudover_tick = '1' then
        case state is
          when idle =>
            if rxd_bit_inv = '1' then
              state <= bit0;
            end if;
          when bit0 =>
            if next_bit = '1' then
              state <= bit1;
            end if;
          when bit1 =>
            if next_bit = '1' then
              state <= bit2;
            end if;
          when bit2 =>
            if next_bit = '1' then
              state <= bit3;
            end if;
          when bit3 =>
            if next_bit = '1' then
              state <= bit4;
            end if;
          when bit4 =>
            if next_bit = '1' then
              state <= bit5;
            end if;
          when bit5 =>
            if next_bit = '1' then
              state <= bit6;
            end if;
          when bit6 =>
            if next_bit = '1' then
              state <= bit7;
            end if;
          when bit7 =>
            if next_bit = '1' then
              state <= stop;
            end if;
          when stop =>
            if next_bit = '1' then
              state <= idle;
            end if;
        end case;
      end if;
    end if;
  end process state_proc;

  bit_spacing_proc : process(clk)
  begin
    if clk'event and clk = '1' then
      if state = idle then
        bit_spacing <= 0;
      elsif baudover_tick = '1' then
        if bit_spacing < 15 then
          bit_spacing <= bit_spacing + 1;
        else
          bit_spacing <= 8;
        end if;
      end if;
    end if;
  end process bit_spacing_proc;

  -- shift the received bit in, LSB first
  shift_data_proc : process(clk)
  begin
    if clk'event and clk = '1' then
      if baudover_tick = '1' and next_bit = '1' and
         state /= idle and state /= stop then
        data <= not rxd_bit_inv & data(7 downto 1);
      end if;
    end if;
  end process shift_data_proc;

  output_data_proc : process(clk)
  begin
    if clk'event and clk = '1' then
      if baudover_tick = '1' and next_bit = '1' and
         state = stop and rxd_bit_inv = '0' then
        rxd_data <= data;
        rxd_data_ready <= '1';
      else
        rxd_data_ready <= '0';
      end if;
    end if;
  end process output_data_proc;
end UARTReceiverArch;
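The behaviour of the baud_gen process in the receiver can be explored with a small Python model. This is a sketch under stated assumptions, not a simulation of the synthesized design: the generics are not given in the listing, so a 100 MHz clock (the board oscillator), 115200 baud and oversampling = 7 are assumed here, and the function name is hypothetical. The model follows the VHDL signal semantics, in which the comparison sees the accumulator value from before the clock edge and the reset to 0 overrides the increment.

```python
def tick_edges(frequency, baud, oversampling, n_edges, divisor=1600):
    """Return the clock edges (1-based) on which baudover_tick is set."""
    freq_inc = (oversampling + 1) * baud // divisor   # FREQ_INC
    freq_div = frequency // divisor                   # FREQ_DIV
    acc = 0                                           # baud_divider
    ticks = []
    for edge in range(1, n_edges + 1):
        if acc >= freq_div:      # uses the value from before this edge
            acc = 0              # the reset wins over the increment
            ticks.append(edge)
        else:
            acc += freq_inc
    return ticks

# Assumed generics: 100 MHz clock, 115200 baud, oversampling = 7.
ticks = tick_edges(100_000_000, 115200, 7, 400)
print(ticks)  # [110, 220, 330]
```

With these assumed values the model produces a tick every 110 clock cycles, whereas 100 MHz / (8 x 115200) corresponds to about 108.5 cycles. The difference of roughly 1.4 % arises because the accumulator is reset to zero instead of keeping its remainder; whether this matters depends on the tolerance of the link and on the generics actually used.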
4.2.6.2 UART-Transmitter 
library ieee;
use ieee.std_logic_1164.all; 
use ieee.std_logic_arith.all;
entity UARTTransmitter is 
generic 
(
frequency
baud
);
port
(
elk
integer;
integer
in std_logic;
48
txd
txd_data
txd_start
txd_busy
: out std_logic;
: in std_logic_vector(7 downto 0); 
: in std_logic;
: out std_logic
);
end entity UARTTransmitter;
architecture UARTTransmitterArch of UARTTransmitter is
-- defining types
type state_type is (idle, start, bit0, bit1, bit2, bit3, bit4, bit5, bit6, bit7, stop1, stop2);
-- defining signals
signal state : state_type := idle; -- transmitter's state
signal data : std_logic_vector(7 downto 0);
signal baud_tick : std_logic;
signal busy : std_logic := '0';
-- the upper bound covers the largest value the accumulator can reach
-- before baud_gen resets it
signal baud_divider : integer range 0 to (frequency/100 + baud/100) := 0;
begin
-- assignments
txd_busy <= busy;
busy <= '0' when state = idle else '1';
-- processes
baud_gen : process(clk)
begin
  if clk'event and clk = '1' then
    if busy = '1' then
      baud_divider <= baud_divider + (baud/100);
      if baud_divider > (frequency/100) then
        baud_tick <= '1';
        baud_divider <= 0;
      else
        baud_tick <= '0';
      end if;
    end if;
  end if;
end process baud_gen;
state_proc : process(clk)
begin
  if clk'event and clk = '1' then
    case state is
      when idle =>
        if txd_start = '1' then
          state <= start;
        end if;
      when start =>
        if baud_tick = '1' then
          state <= bit0;
        end if;
      when bit0 =>
        if baud_tick = '1' then
          state <= bit1;
        end if;
      when bit1 =>
        if baud_tick = '1' then
          state <= bit2;
        end if;
      when bit2 =>
        if baud_tick = '1' then
          state <= bit3;
        end if;
      when bit3 =>
        if baud_tick = '1' then
          state <= bit4;
        end if;
      when bit4 =>
        if baud_tick = '1' then
          state <= bit5;
        end if;
      when bit5 =>
        if baud_tick = '1' then
          state <= bit6;
        end if;
      when bit6 =>
        if baud_tick = '1' then
          state <= bit7;
        end if;
      when bit7 =>
        if baud_tick = '1' then
          state <= stop1;
        end if;
      when stop1 =>
        if baud_tick = '1' then
          state <= stop2;
        end if;
      when stop2 =>
        if baud_tick = '1' then
          state <= idle;
        end if;
    end case;
  end if;
end process state_proc;
data_load_proc : process(clk)
begin
  if clk'event and clk = '1' then
    if txd_start = '1' then
      data <= txd_data;
    end if;
  end if;
end process data_load_proc;
txd_proc : process(clk)
begin
  if clk'event and clk = '1' then
    case state is
      when idle  => txd <= '1';
      when start => txd <= '0';
      when bit0  => txd <= data(0);
      when bit1  => txd <= data(1);
      when bit2  => txd <= data(2);
      when bit3  => txd <= data(3);
      when bit4  => txd <= data(4);
      when bit5  => txd <= data(5);
      when bit6  => txd <= data(6);
      when bit7  => txd <= data(7);
      when stop1 => txd <= '1';
      when stop2 => txd <= '1';
    end case;
  end if;
end process txd_proc;
-- busy and txd_busy are driven by the concurrent assignments at the top of
-- the architecture body. A second, clocked driver for the same signals
-- would create a multiple-driver conflict, so no separate busy process
-- is used here.
end UARTTransmitterArch;
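The transmitter listing above can be exercised with a small self-checking testbench. The following is only a sketch and is not part of the original design: the testbench name, the 50 MHz clock, the 115200 baud rate and the test byte 0xA5 are all assumed values chosen for illustration. It pulses txd_start for one clock, waits for the start bit, samples txd near the middle of each bit period, and compares the reassembled byte with the one that was sent.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity UARTTransmitter_tb is
end entity UARTTransmitter_tb;

architecture sim of UARTTransmitter_tb is
  -- Assumed values for illustration only; they are not taken from the thesis.
  constant CLK_HZ     : integer := 50_000_000;
  constant BAUD       : integer := 115_200;
  constant CLK_PERIOD : time    := 1 sec / CLK_HZ;
  constant BIT_TIME   : time    := 1 sec / BAUD;
  constant TEST_BYTE  : std_logic_vector(7 downto 0) := x"A5";

  signal clk       : std_logic := '0';
  signal txd       : std_logic;
  signal txd_data  : std_logic_vector(7 downto 0) := (others => '0');
  signal txd_start : std_logic := '0';
  signal txd_busy  : std_logic;
  signal done      : boolean := false;
begin
  -- Free-running clock that stops once the checks have finished.
  clk <= not clk after CLK_PERIOD / 2 when not done else '0';

  dut : entity work.UARTTransmitter
    generic map (frequency => CLK_HZ, baud => BAUD)
    port map (clk => clk, txd => txd, txd_data => txd_data,
              txd_start => txd_start, txd_busy => txd_busy);

  stim : process
    variable rx : std_logic_vector(7 downto 0);
  begin
    wait for 10 * CLK_PERIOD;
    -- Request one transmission with a single-clock start pulse.
    wait until rising_edge(clk);
    txd_data  <= TEST_BYTE;
    txd_start <= '1';
    wait until rising_edge(clk);
    txd_start <= '0';

    -- Start bit: the line falls from its idle level '1' to '0'.
    wait until txd = '0';
    -- Skip the start bit and move to the middle of data bit 0.
    wait for BIT_TIME + BIT_TIME / 2;
    -- Data bits are sent least significant bit first.
    for i in 0 to 7 loop
      rx(i) := txd;
      wait for BIT_TIME;
    end loop;

    -- The sampling point is now inside the first stop bit.
    assert txd = '1'
      report "stop bit is not high" severity error;
    assert rx = TEST_BYTE
      report "received byte differs from the transmitted byte" severity error;

    wait until txd_busy = '0';
    report "UARTTransmitter testbench finished";
    done <= true;
    wait;
  end process stim;
end architecture sim;
```

Because the transmitter uses ieee.std_logic_arith, which is a Synopsys package rather than part of the IEEE standard, some simulators may need Synopsys package support enabled before the design compiles. The nominal bit time of 1/baud is used for sampling; the accumulator in baud_gen produces a slightly longer bit period, but sampling at mid-bit tolerates that difference over one frame at the rates assumed here.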
The alternative approach is to use MATLAB to capture the frames directly from
the frame grabber and convert them into RGB values, so that the VHDL code
for JPEG compression can use those values to produce the compressed bit stream. The
compressed bit stream is displayed using the 8 LEDs on the Virtex-4 FPGA board. The list
of inputs is as mentioned earlier. Since the output is a compressed bit stream of 32 bits
and the board has only 8 LEDs, SW1 is used as a push button to shift the bits
serially, seven bits at a time, onto the LEDs.
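The push-button display scheme described above could be sketched as follows. This is a minimal illustration, not the code used in this work: the entity name LedPager, the port names, the LSB-first ordering, and the assumption that the push button is active high and already debounced are all hypothetical. The number of bits shown per press is a generic that defaults to seven to match the text; unused LED positions and bits beyond the 32-bit word are shown as '0'.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity LedPager is
  generic (
    BITS_PER_PUSH : positive := 7   -- bits shown per button press (assumed default)
  );
  port (
    clk       : in  std_logic;
    push      : in  std_logic;                      -- SW1, assumed active high and debounced
    data_word : in  std_logic_vector(31 downto 0);  -- 32-bit compressed output word
    leds      : out std_logic_vector(7 downto 0) := (others => '0')
  );
end entity LedPager;

architecture rtl of LedPager is
  constant WORD_BITS  : positive := 32;
  -- Index of the last group of bits: ceil(WORD_BITS / BITS_PER_PUSH) - 1.
  constant LAST_CHUNK : natural :=
    (WORD_BITS + BITS_PER_PUSH - 1) / BITS_PER_PUSH - 1;

  signal chunk     : natural range 0 to LAST_CHUNK := 0;
  signal push_prev : std_logic := '0';
begin
  -- Only eight LEDs are available, so a group cannot be wider than that.
  assert BITS_PER_PUSH <= 8
    report "BITS_PER_PUSH must not exceed the number of LEDs"
    severity failure;

  process(clk)
    variable v   : std_logic_vector(7 downto 0);
    variable idx : natural;
  begin
    if rising_edge(clk) then
      push_prev <= push;
      -- Act once per press, on the rising edge of the button signal.
      if push = '1' and push_prev = '0' then
        v := (others => '0');
        for i in 0 to BITS_PER_PUSH - 1 loop
          idx := chunk * BITS_PER_PUSH + i;
          if idx < WORD_BITS then
            v(i) := data_word(idx);   -- lowest-numbered bits are shown first
          end if;
        end loop;
        leds <= v;
        -- Move to the next group, wrapping after the last one.
        if chunk = LAST_CHUNK then
          chunk <= 0;
        else
          chunk <= chunk + 1;
        end if;
      end if;
    end if;
  end process;
end architecture rtl;
```

With the default of seven bits per press, the 32-bit word takes five presses to display (four groups of seven bits and a final group of four). Setting BITS_PER_PUSH to 8 would use all eight LEDs and show the word in four presses.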
CHAPTER 5
RESULTS
The process was implemented on a Virtex-4 FPGA board. An image containing (8x8)
pixels was separated into its R, G and B components and given as input. The implementation results are as
follows.
5.1 JPEG Compression 
Map Report
Target Device : xc4vfx12
Target Package : ff668
Target Speed : -10
Mapper Version : virtex4 -- $Revision: 1.34 $
Mapped Date : Sat Jul 12 16:05:03 2008
Design Summary
Number of warnings :
Logic Utilization:
  Total Number Slice Registers : 3,265 out of 10,944 29%
    Number used as Flip Flops : 3,241
    Number used as Latches : 24
  Number of 4 input LUTs : 3,272 out of 10,944 29%
Logic Distribution:
  Number of occupied Slices : 2,984 out of 5,472 54%
  Number of Slices containing only related logic : 2,984 out of 2,984 100%
  Number of Slices containing unrelated logic : 0 out of 2,984 0%
  *See Notes below for an explanation of the effects of unrelated logic
Total Number 4 input LUTs : 3,323 out of 10,944 30%
  Number used as logic : 3,272
  Number used as a route-thru : 50
  Number used as Shift registers : 1
Number of bonded IOBs : 42 out of 320 13%
Number of BUFG/BUFGCTRLs : 2 out of 32 6%
  Number used as BUFGs : 2
  Number used as BUFGCTRLs : 0
Number of DSP48s : 4 out of 32 12%
Total equivalent gate count for design : 52,278
Additional JTAG gate count for IOBs : 2,016
Notes:
Related logic is defined as being logic that shares connectivity - e.g. two LUTs are
"related" if they share common inputs. When assembling slices, Map gives priority to
combine logic that is related. Doing so results in the best timing performance.
Unrelated logic shares no connectivity. Map will only begin packing unrelated logic into
a slice once 99% of the slices are occupied through related logic packing.
Note that once logic distribution reaches the 99% level through related logic packing,
this does not mean the device is completely utilized. Unrelated logic packing will then
begin, continuing until all usable LUTs and FFs are occupied. Depending on your timing
budget, increased levels of unrelated logic packing may adversely affect the overall
timing performance of your design.
Delay Summary Report
The Number Of Signals Not Completely Routed For This Design Is : 0
The Average Connection Delay For This Design Is : 1.279
The Maximum Pin Delay Is : 4.318
The Average Connection Delay On The 10 Worst Nets Is : 3.580
Listing Pin Delays by value: (nsec)
5.2 Motion Estimation
Map Report
Target Device : xc4vfx20
Target Package : ff672
Target Speed : -12
Mapper Version : virtex4 -- $Revision: 1.34 $
Mapped Date : Sat Jul 12 13:02:03 2008
Design Summary
Number of Slices : 5313 out of 8544 62%
Number of Slice Flip Flops : 6378 out of 17088 37%
Number of 4 input LUTs : 10162 out of 17088 59%
  Number used as logic : 10034
  Number used as Shift registers : 128
Number of IOs : 24
Number of FIFO16/RAMB16s : 2 out of 68 2%
  Number used as RAMB16s : 2
Number of GCLKs : 1 out of 32 3%
CHAPTER 6
CONCLUSION AND FUTURE RECOMMENDATIONS 
This thesis presented the implementation of JPEG compression and motion estimation
on Virtex-4 FPGA hardware. The modules of the JPEG architecture were designed and
synthesized, and the hardware design of the control block was completed. The detailed pipeline
design, the operators, and the final synthesis results of the modules were also
presented. The resulting architecture, including the control block, contains 3,272 logic cells,
with a device utilization of 29% and an average timing delay of 9.821 ns. The designed
architecture compresses a 640 x 480 pixel gray-level image in
23.8 ms, which allows its use in a hardware JPEG compressor.
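The frame time quoted above can be converted into throughput figures by simple arithmetic. The numbers below are derived only from the stated 640 x 480 image size and the 23.8 ms compression time, and the block count assumes the usual JPEG partition into 8 x 8 blocks; they are not additional measurements.

```latex
\begin{align*}
N_{\text{pixels}} &= 640 \times 480 = 307{,}200 \\
N_{\text{blocks}} &= \frac{307{,}200}{8 \times 8} = 4{,}800 \\
\text{pixel rate} &= \frac{307{,}200}{23.8\ \text{ms}} \approx 12.9 \times 10^{6}\ \text{pixels/s} \\
\text{time per block} &= \frac{23.8\ \text{ms}}{4{,}800} \approx 4.96\ \mu\text{s} \\
\text{frame rate} &= \frac{1}{23.8\ \text{ms}} \approx 42\ \text{frames/s}
\end{align*}
```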
In the future, this implementation can be extended to use the MicroBlaze soft processor.
MicroBlaze includes several configurable interfaces that allow custom peripherals and
coprocessors, as well as Xilinx-provided peripherals, to be connected.
The MicroBlaze Debug Module (MDM) allows debugging of eight
MicroBlazes at a time. An automated partitioning system is under development. The aim
is to provide an automated system that can take advantage of processor
parameterization with custom instructions, variable-width registers, and multiple
execution units, as well as assigning operations to hardware or software. FPGA-based
codesigns allow a very large design space to be explored, and the opportunity to provide
a very high communication bandwidth between the processors and the hardware
means that co-design solutions have a good chance of producing more efficient designs.
VITA
Graduate College 
University of Nevada, Las Vegas
Ramakrishna Gopalakrishnan
Address:
1555 E Rochelle Ave Apt 268 
Las Vegas, NV 89119
Degree:
• Bachelor of Engineering, Electronics and Communication Engineering, 2006
Anna University, India
Thesis Title:
Implementation of JPEG Compression and Motion Estimation on FPGA Hardware
Thesis Examination Committee:
Chairperson, Dr. Henry Selvaraj, Ph.D.
Committee Member, Dr. Emma Regentova, Ph.D. 
Committee Member, Dr. Muthukumar Venkatesan, Ph.D. 
Graduate College Representative, Dr. Laxmi Gewali, Ph.D.
