Methodology and optimizing of multiple frame format buffering within FPGA H.264/AVC decoder with FRExt. by Stotts, Timothy Aaron
Rochester Institute of Technology 
RIT Scholar Works 
Theses 
8-2007 
Methodology and optimizing of multiple frame format buffering 
within FPGA H.264/AVC decoder with FRExt. 
Timothy Aaron Stotts 
Recommended Citation 
Stotts, Timothy Aaron, "Methodology and optimizing of multiple frame format buffering within FPGA 
H.264/AVC decoder with FRExt." (2007). Thesis. Rochester Institute of Technology. Accessed from 
Methodology and optimizing of multiple frame format 
buffering within FPGA H.264/AVC decoder with FRExt. 
by 
Timothy Aaron Stotts 
A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of 
Master of Science in Computer Engineering 
Approved By: 
Supervised by 
Assistant Professor Dr. Marcin Lukowiak 
Department of Computer Engineering 
Kate Gleason College of Engineering 
Rochester Institute of Technology 
Rochester, New York
August 2007 
Dr. Marcin Lukowiak 
Assistant Professor, RIT, Department of Computer Engineering 
Primary Adviser 
Dr. Ken W. Hsu 




Director of Engineering, Xelic, Inc. 
Secondary Adviser 
Thesis Release Permission Form 
Rochester Institute of Technology 
Kate Gleason College of Engineering 
Title: Methodology and optimizing of multiple frame format buffering 
within FPGA H.264/AVC decoder with FRExt. 
I, Timothy Aaron Stotts, hereby grant permission to the Wallace Memorial Library to 
reproduce my thesis in whole or in part. 
Timothy Aaron Stotts 
Date 
Dedication
To Christ Jesus, my one true source of peace.
"Peace I leave with you, my peace I give unto you: not as the world giveth,




A special thank you to each of my advisers for sharing their time and experience; and
especially to Dr. Lukowiak for his patient guidance, and Mark Grabosky at Xelic, Inc. for
encouraging and equipping me to pull through. Thank you also to Thomas Warsaw for many




Digital representation of video data is an inherently resource demanding problem that
continues to necessitate the development and refinement of coding methods. The H.264/AVC
standard, along with its recent Fidelity Range Extensions amendment (FRExt), is quickly
being adopted as the standard codec for broadcast and distribution of high definition video.
The FRExt amendment, while not necessarily affecting the overall decoder architecture,
presents an added complexity of providing efficient memory management for buffering
intermediate frames of various pixel color samplings and depths.
This thesis evaluated the role of designing the frame buffer of a hardware video decoder,
with integrated support for the H.264/AVC codec plus FRExt. With focus on organizing
external memory data access, the frame buffer was designed to provide intermediate data
storage for the decoder, while using an efficient store and load scheme that takes into
consideration each frame pixel format of the video data.
VHDL was used to model the frame buffer. Exploitation of reconfigurability and
post-synthesis FPGA simulations were used to evaluate behavior, scalability and power
consumption, while providing an analysis of approaches to adding FRExt to the memory
management. Real-time buffer performance was achieved for two common frame formats at
1080 HD resolution; and an innovative pipeline design provides dynamic switching of
formats between video sequences. As an additional consequence of verifying the model, a












1.2 Thesis objective
1.3 Thesis chapter overview
2 Video Coding
2.1 Y'CbCr color model
2.1.1 Y'CbCr sub-sampling
2.2 H.264/AVC overview
2.2.1 H.264/AVC coding summary
2.2.2 H.264/AVC Fidelity Range Extensions summary
2.3 Thesis relevance and specifics
2.3.1 H.264/AVC data buffering flow
2.3.2 H.264/AVC data buffering organization
2.3.3 Macroblock pixel types
3 H.264/AVC Research
3.1 Decoder memory case studies and research
3.1.1 Identification of memory components
3.1.2 Optimization techniques
3.2 Analysis of published results
4 Requirements and Modeling
4.1 Augmentation of decoder system
4.2 Algorithms used by the frame buffer
4.2.1 Intra prediction buffering requirements
4.2.2 Deblocking filter buffering requirements
4.2.3 Inter prediction buffering requirements
4.2.4 Combining of buffering mechanisms
4.2.5 Reference picture management
5 Synthesizable Implementation
5.1 External memory storage and control
5.1.1 DDR memory control
5.1.2 Block data controller
5.1.3 Implementation hierarchy
5.1.4 External DDR interface
5.2 Frame organization and addressing
5.2.1 Macroblock identification and frame slotting
5.2.2 Macroblock address mapping with FRExt
5.2.3 Frame storage marking
5.2.4 Sliding window implementation
5.3 Synthesis parameters
5.4 Frame buffer interface and pipelining
5.4.1 Frame buffer RTL interface
5.4.2 Pipeline semantics
5.5 Dual RAM frame buffer
5.5.1 Dual DDR SDRAM design
6 Verification - HDL Model Functionality
6.1 Unit testing
6.2 In-system verification
6.2.1 Augmentation of the decoder system
6.2.2 Testbench redesign
6.2.3 Video sequences
6.2.4 Functional simulation
6.2.5 Post-synthesis simulation
7 Results and Analysis
7.1 Implementation analysis
7.2 Synthesis resource analysis
7.3 DDR timing analysis
7.4 H.264/AVC timing analysis
7.5 Power consumption analysis
7.6 Cost analysis
8 Conclusions
8.1 Synthesizable models
8.2 Proposed system interfacing
Bibliography
A Software Tools and Deliverables
A.1 Software tools
A.1.1 Video processing and display
A.1.2 FPGA design and simulation
A.2 Thesis deliverables
List of Figures
1.1 Digital representation of a picture in terms of data size
1.2 Digital representation of uncompressed video in terms of data size
1.3 The extra dimension of pixel size upon total picture data size
1.4 Internal partitioning of frame buffer design
1.5 Video decoder system partitioning augmented for testing frame buffer
2.1 RGB vs. Y'CbCr decomposition of a "foreman" test frame
2.2 Y'CbCr sub-sampling 4:4:4
2.3 Y'CbCr sub-sampling 4:2:2
2.4 Y'CbCr sub-sampling 4:2:0
2.5 Y'CbCr sub-sampling 4:0:0
2.6 Scope of H.264/AVC Standard: only decoding [24]
2.7 The correlation between source (uncoded) picture frames and encoded slices
2.8 Pixel sampling and depth, increasingly stacked by FRExt profile [24]
2.9 Buffering within the H.264/AVC Hypothetical Reference Decoder [9]
2.10 DPB operation: macroblock in, macroblock out
2.11 Buffer a row of macroblocks to retain neighbor MBs
2.12 Organization of the reference frame buffer
3.1 FPGA hybrid on-chip, off-chip decoder architecture proposed in [21]
4.1 Video decoder system architecture and data flow
4.2 Intra Prediction macroblock neighbor permutations
4.3 Main data path "tap" locations for buffering
5.1 Vendor-supplied Xilinx Spartan 3E DDR SDRAM controller
5.2 Customized Xilinx Spartan 3E DDR SDRAM controller
5.3 Block data controller state machines
5.4 Internal partitioning of frame buffer design
5.5 External DDR interface
5.6 Binary 16x8 sub-macroblock memory maps for each 8-bit sub-sampling
5.7 Single RAM frame buffer inner and outer interfaces
5.8 Dual RAM frame buffer inner and outer interfaces
6.1 Testbench flow with emphasis on data processing
6.2 Testbench flow with emphasis on storage operations
6.3 Elephant's Dream Frames 11290, 11310
8.1 Proposed integration of frame buffer into decoder system
List of Tables
2.1 Compression ratios for various Y'CbCr sub-samplings
2.2 H.264/AVC standard drafting by year [19]
2.3 Some of the alternate names given to the H.264/AVC standard [19]
2.4 The three basic slice types specified by H.264/AVC
2.5 Sub-sampling factor for all sub-sampling ratios
2.6 Binary sizes of a macroblock and frame
4.1 Macroblock stages of operation while passing through the decoder system
4.2 Example buffering size requirements for intra prediction
4.3 Example of reference picture list updates [16]
5.1 External DDR pins
5.2 Unique addresses required to store 16x8 sub-macroblock for all pixel types
5.3 Example (exact) ranges of macroblock numbers
5.4 Example (arbitrary) ranges of slot IDs
5.5 Addressing combinations loaded by block controller
5.6 Interpretation of frame marking booleans
5.7 Example contents of frame buffer metadata
5.8 Structural frame buffer synthesis parameters
5.9 Digital patterns for marking slots
6.1 Performance modifications to the original decoder model
6.2 Performance enhancements to the original decoder model
6.3 Functional corrections to the original decoder model
6.4 Key H.264/AVC test sequences
6.5 Typical simulation CPU time and memory for some sequences
7.1 Single RAM frame buffer synthesis, full pin-out
7.2 Dual RAM frame buffer synthesis, full pin-out
7.3 Comparison of frame buffer synthesis
7.4 Single x16 DDR SDRAM bandwidth
7.5 Striped Dual x16 DDR SDRAM bandwidth
7.6 x16 DDR SDRAM bandwidth variance
7.7 Single x16 DDR SDRAM frames per second
7.8 Striped x16 DDR SDRAM frames per second
7.9 FPGA device power consumption of frame buffer post-synthesis
7.10 Estimated unit device cost in purchase quantity of 100
7.11 Minimum and maximum 1080 HD memory needs
Listings
5.1 RTL pseudo code for store operation
5.2 RTL pseudo code for worst-case load operation
5.3 RTL pseudo code for best-case load operation
5.4 Example incorrect use of data striping
5.5 RTL pseudo code for striped store operation
5.6 RTL pseudo code for striped load operation
Glossary
C++ C++ or C Plus Plus. A widely used object-oriented software programming
language.
CAVLC Context Adaptive Variable Length Coding. An improved, context-adaptive
version of VLC used in the H.264/AVC Baseline Profile.
D
DVD Digital Video Disk or Digital Versatile Disk. A popular optical disk storage
technology used for videos and other applications that require large amounts of
storage.
F
FRExt Fidelity Range Extensions. An amendment to H.264/AVC approved in 2004
providing "professional"




H.264/AVC ITU-T H.264 and ISO/IEC 14496-10. Video coding standard approved in
2003 jointly by ITU-T and ISO/IEC. Delivers significantly better compression
than previous standards, including MPEG-2 and H.263.
HDTV High Definition Television. A number of high-quality resolutions standardized
for television use. Includes 1280x720 and 1920x1080 resolutions, and two
different forms of pixel arrangement (progressive and interlaced).
I
IDR Instantaneous Decoding Refresh. A coded frame composed of only I or SI slices.
The decoding of an IDR frame signals the reference picture list to mark its
entire list of frames as no longer needed for reference.
ISO/IEC International Standards Organization/International Electrotechnical
Commission. ISO is an international body responsible for developing and maintaining
a range of standards across many disciplines. IEC is the commission specifically
responsible for electrical technologies, including MPEG video compression
standards.
ITU-R International Telecommunications Union (ITU) Radiocommunication Sector.
Responsible for regulating the radio frequency spectrum used for wireless
communications by industry and government worldwide.
ITU-T International Telecommunications Union (ITU) Telecommunications
Standardization Sector. Responsible for developing and maintaining joint industry and
government standards for worldwide telecommunications technology.
M
MPEG Moving Picture Experts Group. The group within ISO/IEC that is responsible
for defining and adopting video coding standards.
Q
QCIF Quarter-resolution Common Image Format. Defines an image size of 176 pix




RAM Random Access Memory. Type of reusable data storage for which its contents
can be accessed in any order, and without any physical moving parts.
RIT Rochester Institute of Technology. The author's primary university at the time
of publishing.
VCEG Video Coding Experts Group. A group from the ITU-T responsible for
adopting and defining video compression standards.
VCL Video Coding Layer. The layer in the H.264/AVC standard that contains actual
video information.
Verilog Verilog. A popular computer language used for modeling and describing
hardware.
VHDL Very High Speed Integrated Circuit (VHSIC) Hardware Description Language
(HDL). A popular computer language used for modeling and describing
hardware.
Y'CbCr Y'CbCr or YCC or YPbPr. A digital equivalent of the YUV color model,
containing one luma, one blue chrominance and one red chrominance value.
Although conceptually equivalent to YUV, and sometimes referred to as such,
Y'CbCr is specified by a different set of formulas. Analog component signals
which carry the Y'CbCr data are sometimes termed YPbPr.
YUV YUV. A three component color model defined in terms of one luma and two
chrominance values. YUV is commonly used within analog video broadcast
formats to lossy compress RGB pixels by discarding a significant portion of
the color data, while retaining much of the human perceptible image quality.





The atomic unit of digital graphics technology is the pixel, or "picture element"¹. A
digital image when rendered for display, whether a still picture, a printed graphic, or an
individual frame of a video sequence, consists of a finite, two-dimensional array of points.
Each point, or pixel, is represented by a sequence of binary data that describes the intensity
and color of that individual point. A single image typically consists of a uniform pixel
type; viz., each point is described in exactly the same manner as all of the other points

























Figure 1.1: Digital representation of a picture in terms of data size.
¹Technically, this statement is only true for raster graphics; but since human visual display of still images
and video is rasterized with present technology, the statement holds for this context.
As shown, the quantitative metrics that determine the total binary data size of an image
are: the number of bits necessary to represent a single pixel, and the total number of
pixels. Considering only pictures, or rectangular images, the total binary size of a picture is
computed as follows:
PictureBits = (WidthInPixels * HeightInPixels) * (BitsPerPixel) (1.1)
When considering digital video technology, the appearance of continuous motion is
facilitated by rapid display of a sequence of pictures, with most pictures being a slight
change in appearance from the previous. The data size of an uncompressed video sequence
then increases multiplicatively from the above equation, as shown in Figure 1.2.
VideoBits = (PictureCount) * (PictureBits) (1.2)
Figure 1.2: Digital representation of uncompressed video in terms of data size.
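As a rough illustration of Equations (1.1) and (1.2), the following sketch computes the uncompressed size of one picture and of a short sequence. The function names, the 1920x1080 resolution, the 24 bits per pixel, and the count of 30 pictures are assumptions chosen only for this example; they are not figures taken from the thesis.

```python
def picture_bits(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Equation (1.1): total bits in one uncompressed picture."""
    return (width_px * height_px) * bits_per_pixel


def video_bits(picture_count: int, bits_per_picture: int) -> int:
    """Equation (1.2): total bits in an uncompressed sequence of pictures."""
    return picture_count * bits_per_picture


if __name__ == "__main__":
    # Assumed example values: one 1920x1080 picture at 24 bits per pixel,
    # and a sequence of 30 such pictures.
    one_picture = picture_bits(1920, 1080, 24)
    sequence = video_bits(30, one_picture)
    print(f"picture: {one_picture} bits ({one_picture / 8 / 2**20:.2f} MiB)")
    print(f"sequence of 30: {sequence} bits ({sequence / 8 / 2**20:.2f} MiB)")
```

Even under these modest assumptions the sequence size grows linearly with the picture count, which is the multiplicative growth described above.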
Video coding standards such as H.264/AVC provide a set of tools by which the binary
data of a source video sequence may be modified and compressed into a much smaller
binary representation, while retaining a measure of the human-perceptible visual quality. The
compression operates primarily on the assumption that neighboring pictures contain a large
quantity of similar (redundant) pixels. The compressed, and thus smaller, representation is
then stored or transmitted in various manners. However, when decoding this compressed
video data back into a raw, ready-for-display format, the H.264/AVC standard requires
intermediate storage of multiple uncompressed pictures. As the video sequence is decoded,
some pictures are stored within a frame buffer² for later reference by the decoding
algorithms. This intermediate storage of select pictures produces a potential performance
concern for hardware implementations due to the memory store and load of significant data
quantity.
Very specific to H.264/AVC is the recent FRExt amendment applied to this standard.
While many video coding standards allow for only a single pixel type of a specific bit
size, the FRExt amendment introduces the option to encode a video sequence with one of a
variety of pixel types. Each pixel type uses a different number of sample bits and a different
Y'CbCr sub-sampling to represent itself, thus significantly impacting the binary sizes of
both the compressed and uncompressed data, irrespective of the total number of pixels.
The impact of changing pixel representation upon the total picture data size is depicted
with Figure 1.3. This additional "dimension" by which the picture data representation
may differ between video sequences introduces an additional complexity to both the data
















Figure 1.3: The extra dimension of pixel size upon total picture data size.
²Frame buffer is a more generic term for this type of hardware component. The H.264/AVC standard
terms this component of the decoding process the Decoded Picture Buffer (DPB).
1.2 Thesis objective.
This thesis provides an initial methodology for implementing and optimizing the frame
buffer of a hardware H.264/AVC decoder with FRExt. A frame buffer with external
memory was implemented with the functionality of the H.264/AVC codec plus support for each
of the Y'CbCr pixel formats of the FRExt "High" profiles. The frame buffer was designed
to be a single component, scalable to various memory capacities and frame resolutions, and
capable of efficiently switching frame format mode (pixel type) in-hardware. Additionally,
the organizational and access schemes of the frame buffer were tailored to handle each of
the decoded frame pixel formats, with considerations toward optimization. Figure 1.4 de






























Figure 1.4: Internal partitioning of frame buffer design.
The frame buffer was modeled using the VHDL hardware description language in three
different forms: a zero-time simulation-only behavioral model, and two different
synthesizable descriptions for implementing in hardware. Both hardware models specifically target
Xilinx FPGA technology with external DDR SDRAM memory; one using a single
memory chip, and the other striping data between two memory chips. All three models were
verified against each other for identical functionality within a full simulation-only decoder
system; and the two hardware models were verified both in RTL and post-place-and-route
netlist forms. Finally, each hardware model was verified as parameterizable pre-synthesis
to support any combination of the three H.264/AVC frame buffering needs: intra
prediction, inter prediction, and deblocking filter; to optionally support multiple frame formats;
and to optionally support the sliding window algorithm.
As an additional contribution, a simulation-only VHDL model of an H.264/AVC
Baseline decoder was augmented to support simulation of HD resolutions, and emulate memory
support for multiple pixel formats of the FRExt "High" profiles. These additions included a
new queuing testbench design that would exercise the frame buffer according to the
behavior of a real video sequence. The preexisting, incomplete Baseline software buffer model
was one such component augmented within the decoder, and its operation within the full


























Figure 1.5: Video decoder system partitioning augmented for testing frame buffer.
The decoder operation with respect to these additional coding tools was verified using
reference H.264/AVC codec software [18] written in C++. After several video sequences
were used to demonstrate sufficiently correct behavior of the VHDL decoder in comparison
with the reference software, the decoder was then used as a basis for in-system simulation
and validation of the synthesizable frame buffer description, both pre- and post-synthesis.
1.3 Thesis chapter overview.
This thesis begins with a discussion of video coding concepts and the H.264/AVC
standard in Chapter 2. An overview of basic video compression and color models is presented
along with a synopsis of the H.264/AVC standard. Chapter 3 then provides an overview
of published research on the topic of memory management within a hardware H.264/AVC
decoder. Potential methods of optimizing FRExt within a hardware decoder are
conjectured. Chapter 4 discusses the conceptual modeling of the frame buffer implemented by
this thesis, including requirements, algorithms, and data flow.
The actual frame buffer implementations performed are presented in Chapter 5, with
a detailed look at synthesizable descriptions, and considerations toward functionality and
optimization. Chapter 6 discusses the verification of each model of the frame buffer
component, and how they were verified against each other with representative video
sequences. The results of the verified descriptions are presented in Chapter 7, with simulated
performance analysis for the hardware frame buffer, taking into consideration trade-offs
between speed, power consumption, and complexity. Finally, Chapter 8 concludes the
thesis with a summation of results and potential improvements. It also proposes an interfacing




This chapter briefly discusses the Y'CbCr pixel color model used by many video codecs,
and also provides background on the history, concepts, and application of H.264/AVC.
2.1 Y'CbCr color model.
Y'CbCr is the predominant color model used within digital video coding standards,
including H.264/AVC. Some popular color models, such as RGB and YMCK (commonly used
within displays and printers respectively), produce a range of colors by mixture of three or
four linear channels. These color channels are similar in effect to the mixing of primary
paints on an artist's palette. Unlike linear color models, Y'CbCr represents pixel intensity
as its own component, but with some residual interdependence with the color components.
"Y"




represent the blue and red chroma components respectively. The "luma" component is
gamma-corrected luminosity, and the "chroma" components are gamma-corrected
chrominance. A comparison of linear RGB decomposition and gamma Y'CbCr decomposition
is shown with Figure 2.1.
The goal of the Y'CbCr model is to represent R'G'B' (gamma-corrected RGB) data in a
compressed form by discarding some of the less essential sub-pixel color resolution. Since
the human eye is predominantly sensitive to brightness, and also the color green, compress





(b) R, G, B channels respectively. (c) Y', Cb, Cr components respectively.
Figure 2.1: RGB vs. Y'CbCr decomposition of a "foreman" test frame.
bit size while retaining much of the original appearance. Multiple sub-sampling
formulas exist for Y'CbCr, and several are standardized by the ITU. One conventional Y'CbCr
form used to transform source video data just prior to encoding with H.264/AVC and other
codecs is:
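Y' = K_R \cdot R' + (1 - K_R - K_B) \cdot G' + K_B \cdot B'
C_b = \frac{1}{2} \cdot \frac{B' - Y'}{1 - K_B}
C_r = \frac{1}{2} \cdot \frac{R' - Y'}{1 - K_R}    (2.1)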






The gamma values KR and KB are left unspecified until a specific display technology is
determined; but KR + KB < 0.5 may be assumed. A common choice for current displays
is: KR = 0.2126, KB = 0.0722 [19]. As shown, the luma component does depend on
the original red and blue channels, but is primarily influenced by the green. The choice of
gamma values is dependent upon the intended video display technology; for example, CRT
and LCD displays have somewhat different ideal gamma values, as do conventional and
high definition televisions.
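The conventional transform can be sketched in floating point as shown below. This is a minimal sketch, not the integer form of ITU-R BT.601 discussed next: it assumes the common form Y' = KR·R' + (1 − KR − KB)·G' + KB·B' with Cb and Cr as scaled differences from Y', normalized R'G'B' inputs in the range 0 to 1, and the constants KR = 0.2126 and KB = 0.0722 quoted above. The function names are hypothetical.

```python
KR = 0.2126  # red weight quoted in the text for current displays
KB = 0.0722  # blue weight quoted in the text for current displays


def rgb_to_ycbcr(r: float, g: float, b: float, kr: float = KR, kb: float = KB):
    """Convert normalized gamma-corrected R'G'B' (0..1) to Y'CbCr.

    Y' lies in 0..1; Cb and Cr lie in -0.5..0.5 (assumed normalization).
    """
    y = kr * r + (1.0 - kr - kb) * g + kb * b
    cb = 0.5 * (b - y) / (1.0 - kb)
    cr = 0.5 * (r - y) / (1.0 - kr)
    return y, cb, cr


def ycbcr_to_rgb(y: float, cb: float, cr: float, kr: float = KR, kb: float = KB):
    """Inverse of rgb_to_ycbcr, obtained by solving the same equations."""
    r = y + 2.0 * (1.0 - kr) * cr
    b = y + 2.0 * (1.0 - kb) * cb
    g = (y - kr * r - kb * b) / (1.0 - kr - kb)
    return r, g, b


if __name__ == "__main__":
    # The green weight (1 - KR - KB) dominates the luma, as noted in the text.
    print("green weight:", 1.0 - KR - KB)
    print("pure green  :", rgb_to_ycbcr(0.0, 1.0, 0.0))
    print("round trip  :", ycbcr_to_rgb(*rgb_to_ycbcr(0.2, 0.5, 0.8)))
```

In floating point the round trip is lossless up to rounding error; the data loss mentioned in the text arises only once the samples are quantized to integers.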
With respect to digital video coding, sample values are typically stored and operated
upon as integers. ITU-R BT.601 specifies a form of 8-bit integer matrix multiplication that
can be used to perform transformations between R'G'B' and Y'CbCr [6, 8]. Equation (2.2)
demonstrates transformation of 8-bit per channel R'G'B' to 8-bit per sample Y'CbCr; and
equation (2.3) demonstrates transformation of 8-bit per sample Y'CbCr to 8-bit per channel
R'G'B'. When the matrix coefficients are chosen properly for the target technology, the
transformations alone will incur very little data loss; only so much as results from precision































By itself, transforming a picture from RGB to Y'CbCr does not reduce the size of the
binary representation. Due to the near separation of luminosity and color, sub-sampling
can be used during or following transform to reduce the binary size of the chroma samples.
This form of reduction requires that the picture be partitioned into "macroblocks." Each
macroblock is a base square unit of pixels, possibly 8 by 8 pixels or 16 by 16 pixels in
dimension, depending on the intended use. It is also necessary to specify sub-sampling
with a three value ratio in the form of: f:m:n. Using this ratio, the RGB is transformed
with a relative sampling rate for each component of the Y'CbCr color model, according to
the following rules:
• f is defined as an integer greater than 0.
• m, n are defined as integers greater than or equal to 0.
• When n is greater than 0:
    • f is the horizontal sampling frequency of the luma.
    • m is the horizontal sampling frequency of the first (blue) chroma.
    • n is the horizontal sampling frequency of the second (red) chroma.
    • The vertical sampling frequencies of luma and each chroma are the same.
• When n is equal to 0:
    • f is the horizontal sampling frequency of the luma.
    • m is the horizontal sampling frequency of each chroma.
    • The vertical sampling of each chroma is half the vertical sampling of the luma.
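The ratio rules above can be sketched as a small function that counts the samples stored for one macroblock. This is an illustrative sketch only: the function names, the 16 by 16 macroblock default, and the 8-bit sample depth are assumptions for the example, and the code covers only ratios whose chroma counts divide evenly.

```python
def samples_per_macroblock(f: int, m: int, n: int, mb_size: int = 16):
    """Return (luma, cb, cr) sample counts for one macroblock under f:m:n."""
    if f <= 0 or m < 0 or n < 0:
        raise ValueError("f must be > 0 and m, n must be >= 0")
    luma = mb_size * mb_size
    if n > 0:
        # Each chroma has its own horizontal rate; vertical rate matches luma.
        cb = (mb_size * m // f) * mb_size
        cr = (mb_size * n // f) * mb_size
    else:
        # Both chroma share rate m horizontally; vertical rate is halved.
        cb = cr = (mb_size * m // f) * (mb_size // 2)
    return luma, cb, cr


def bits_per_pixel(f: int, m: int, n: int, bit_depth: int = 8, mb_size: int = 16) -> float:
    """Average stored bits per pixel, assuming every sample uses bit_depth bits."""
    total_samples = sum(samples_per_macroblock(f, m, n, mb_size))
    return total_samples * bit_depth / (mb_size * mb_size)


if __name__ == "__main__":
    for ratio in [(4, 4, 4), (4, 2, 2), (4, 2, 0), (4, 0, 0)]:
        print(ratio, samples_per_macroblock(*ratio), bits_per_pixel(*ratio))
```

Note that the pixel count of the macroblock never changes; only the number of chroma samples kept for it does.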
With respect to digital video coding standards, ITU-R BT.601 specifies the luma ratio
f as constant at 4, representing an analog-to-digital sampling rate of 13.5 MHz as used by
NTSC and PAL in US and European television respectively [8]. Thus, a direct
transformation of RGB data to Y'CbCr without sub-sampling would be represented by the ratio 4:4:4,
and is depicted with Figure 2.2.
(a) Y' (b) Cb (c) Cr
Figure 2.2: Y'CbCr sub-sampling 4:4:4.
As shown, each macroblock is the same digital size with a one-to-one mapping between
samples and pixels. Performing professional quality sub-sampling of 4:2:2 yields a
reduction in the number of chroma samples, and thus compresses the macroblock, as depicted
with Figure 2.3. The number of pixels within the macroblock does not change; only the
number of samples, reducing the internal resolution or quality.
Consumer-grade compression including Digital Video Broadcast typically uses a sub-


















(a) Y' (b) Cb (c) Cr
Figure 2.3: Y'CbCr sub-sampling 4:2:2.
with Figure 2.4. Very old standards such as MPEG-1 used a different alignment for the
sampling process; but newer standards attempt to reduce re-sampling losses between 4:2:2
and 4:2:0.


























(b) Cb (c) Cr
Figure 2.4: Y'CbCr sub-sampling 4:2:0.
To represent the video stream in gray scale for "black and white" content, the chroma
samples are simply discarded. The gray scale sampling ratio of 4:0:0 is depicted with
Figure 2.5. Depending on the intended use, the gamma ratios used during transformation may
be slightly different than with other sampling ratios to preserve subjective monochrome
quality.
As shown with each of these examples, the macroblock remains a 16 by 16 block of
pixels after transformation and after sub-sampling. However, the macroblock can be
represented by a smaller number of samples than there are pixels, compressing the internal data
size. The effects and intended applications of the aforementioned sub-sampling ratios are













(a) Y' (b) Cb (c) Cr
Figure 2.5: Y'CbCr sub-sampling 4:0:0.
Table 2.1: Compression ratios for various Y'CbCr sub-samplings.


















In addition to pixel-level compression, video coding standards provide computational tools
by which to significantly reduce video binary size. H.264/AVC is one such video coding
standard, recognized internationally, and put forth jointly by the ITU and ISO standards
organizations. More specifically, the standard was drafted in cooperation between ITU-T
and IEC, which are respectively the ITU sector and ISO commission responsible for video
coding standards. First drafted by the ITU-T in 2002 as H.26L, and approved in 2003 as
H.264, the standard has undergone a series of revisions and approvals by multiple
organizations. Table 2.2 details the progress of the H.264/AVC video coding standard by year
from the perspective of the ITU-T drafting. Table 2.3 lists the various names by which the
H.264/AVC standard is sometimes known. Henceforth, the standard is referred to by
the name H.264/AVC as an abbreviated merge of the ITU and ISO names, commonly used
in literature [19].
Table 2.2: H.264/AVC standard drafting by year. [19]
Date          ITU event
2002          H.26L drafted.
May 2003      ITU-T H.264 Version 1 approved with Baseline, Main, Extended profiles.
May 2004      ITU-T Corrigendum containing minor corrections.
March 2005    ITU-T H.264 Version 2, with added High, High 10, High 4:2:2, High 4:4:4 profiles (FRExt).
Sept. 2005    ITU-T Corrigendum containing minor corrections and three aspect ratio indicators.
June 2006     ITU-T Amendment, removal of High 4:4:4 profile, and addition of extended-gamut color space.
TBD           ITU-T Replacement of High 4:4:4 with High 4:4:4 Predictive.
Table 2.3: Some of the alternate names given to the H.264/AVC standard. [19]
H.26L               H.264
ISO/IEC 14496-10    JVT
MPEG-4 Part 10      MPEG-4 AVC
The primary goal of the H.264/AVC standard is to provide video compression similar
in technique to the older MPEG-2 and H.263 standards, but with significant increase in
usage scenarios and coding efficiency. Currently, H.264/AVC is starting to become the
common standard for use within cable television Digital Video Broadcast (DVB) and High
Definition (HD) media [19]. H.264/AVC aims to succeed the MPEG-2 format with the
following changes:
• improved network quality of service for mobile and LAN/Internet
• increased visual quality-to-binary size ratio, especially at very low and very high resolutions
• improved visual precision with respect to motion prediction, reducing visual artifacts
• more ideal coding format for HD-DVD movies and Internet TV
With the initial drafting of the H.264/AVC specification, a goal was established to
achieve an improved visual quality to compressed bit size ratio over MPEG-2 and H.263
by a best-case increase of 100% and average minimum of 50%. Although this goal is
ambitious, current implementations suggest that the standard does generally provide
such compression by use of its advanced coding tools [11, 24, 23, 19].
As is convention with many video coding standards, especially those released by ITU-T,
the specification is constrained in scope to the data format of the full video processing
system. The scope is limited to the algorithms and functionality of the decoder, omitting
implementation and architectural details; thus allowing for maximum flexibility of both
decoder and encoder implementations. Figure 2.6 depicts the full video processing system,
from source content to display rendering. As shown, the scope of the H.264/AVC standard
is limited to the decoding stage of the full video processing system. Details of the encoding






Figure 2.6: Scope of H.264/AVC Standard: only decoding [24].
2.2.1 H.264/AVC coding summary.
The H.264/AVC standard specifies high-level organization of operating on raw video frames
to reduce their binary representation to a compressed format with a configurable ratio of
visual quality to size. This reduction works by a combination of lossy and lossless compression.
The lossy compression discards redundant data while retaining much of the original
visual quality.
Whereas H.264/AVC does not specify the implementation details of the block transform
equations, or other signal processing algorithms, it does specify organization and
storage of picture data. Conceptually, each picture, or frame, can be coded as one of the
following:
• A complete, self-contained frame.
• Differences from one past or future reference frame.
• Differences from two past or future reference frames.
To organize this complex frame differencing into a more flexible arrangement, a sequence
of frames is cut into a sequence of slices, as depicted with Figure 2.7. Additionally, the
coding features are grouped into different subsets, or profiles. Depending on the profile
and configuration, the slices do not have to be exact in size to the actual picture frames,
although a one-to-one mapping is common. Additionally, the slices are not necessarily the
same raster-scan order as the pictures, even if organized into a one-to-one mapping. This









Figure 2.7: The correlation between source (uncoded) picture frames and encoded slices.
While H.264/AVC provides for five different slice types, only three slice types are used
by the Main and FRExt profiles, as shown in Table 2.4. As the pictures are mapped into
slices, each slice is divided into a set of macroblocks (MBs), a 16-by-16 pixel base data
unit. Nearly all computational efforts are performed directly on a single MB, with potential
reference to other MBs. Each MB may be of a different type, referring to groupings of
pixels belonging to neighbor MBs.
Table 2.4: The three basic slice types specified by H.264/AVC.
[Surviving table fragment: intra-, inter- (x2) Baseline]
An I-slice contains only macroblocks that use spatial (intra) prediction, possibly referencing
nearby macroblocks within the same slice. A P-slice contains a mixture of both
spatial (intra) prediction and temporal (inter) prediction macroblocks, where each macroblock
uses only one type of prediction for itself. For P-slices, each temporal prediction
vector can have only one previously decoded reference. B-slices are similar to P-slices,
except that those macroblocks using temporal prediction may have two references per
prediction vector.
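The slice rules above can be summarized as a small lookup. The following is a minimal sketch for illustration only; the names `SLICE_RULES` and `macroblock_allowed` are hypothetical and do not come from the standard or from the decoder discussed in this thesis.

```python
# Illustrative summary of the I/P/B slice rules described above.
# All names are hypothetical and chosen only for this sketch.
SLICE_RULES = {
    "I": {"intra": True, "inter": False, "max_refs_per_vector": 0},
    "P": {"intra": True, "inter": True, "max_refs_per_vector": 1},
    "B": {"intra": True, "inter": True, "max_refs_per_vector": 2},
}


def macroblock_allowed(slice_type: str, prediction: str, refs: int = 0) -> bool:
    """Return True if a macroblock with this prediction mode may appear in the slice."""
    rules = SLICE_RULES[slice_type]
    if prediction == "intra":
        # Spatial prediction never uses reference frames.
        return rules["intra"] and refs == 0
    if prediction == "inter":
        # Temporal prediction needs at least one reference, up to the slice limit.
        return rules["inter"] and 1 <= refs <= rules["max_refs_per_vector"]
    raise ValueError(f"unknown prediction mode: {prediction}")
```

For example, `macroblock_allowed("P", "inter", refs=2)` is rejected, while the same macroblock is accepted in a B-slice.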
The slice mapping and slice types have a direct impact upon the effective lossy compression
of the video data. Quantization is used to discard the least important pixel data,
with larger coefficients producing more intense compression at the expense of visual quality.
Temporal prediction provides more effective quantization, and thus those slices using
temporal prediction compress more easily than those using spatial prediction.
The net effect of quantization is that the data is stored with an additional lossless compression
tailored specifically to the format of the video data. One of two forms of entropy
coding (CAVLC and CABAC) can be used to increase the average compressibility of the
already highly compressible data when performing lossless block compression [9, 24].
To address the many usage scenarios, H.264/AVC contains an extensive list of features
available to the coding processes, as required to be supported by the video decoder. To
organize these features into a small quantity of permutations, the specification groups them
into three different feature-set profiles:
• Baseline: contains all but the most complex and least common coding features of the
specification. The intended applications include streaming video, teleconferencing,
and other more general purpose uses. CAVLC is the only option for entropy coding,
and B-slices are unsupported.
• Main: contains a primary subset of features, mostly overlapping with the Baseline
profile, and adding those features most useful for on-demand commercial services.
The intended applications include DVB, distribution of video media, and others that
use high resolutions and data rates. CABAC is the default option for entropy coding,
and B-slices are supported.
• Extended: contains most features of the Baseline profile, plus multiple additional
features that are complex, uncommon, and useful only in a subset of situations. The
intended applications include mobile and wireless devices.
Each of these H.264/AVC profiles uses a consumer-grade video depth of 4:2:0 chroma
formatting and 8 bits per sample [9, 24]. Version 1 of H.264/AVC without the FRExt amendment
only supports the same sample depth and sub-sampling ratio as its predecessor codecs,
including H.263 and MPEG-2.
2.2.2 H.264/AVC Fidelity Range Extensions summary.
The FRExt amendment to the H.264/AVC specification essentially augments the feature set
of the original H.264/AVC to support multiple sampling depths, sub-sampling ratios, color
spaces, larger frame resolutions, and other additional coding tools. While the primary objective
of this amendment was to introduce features necessary for editing and production of
"professional"-grade video, the new features provide additional flexibility to managing the
quality and format of distributed media. As one example, the High profile has already been
adopted to succeed the Main profile for use within some High Definition media, including
HD-DVD, BD-ROM, and some forms of DVB. For sake of brevity, the term Fidelity Range
Extensions is henceforth abbreviated as FRExt in like convention with other literature [24].
Each of the additional decoder profiles specified by FRExt is an incremental increase
of features from the Main profile due to their primarily commercial application. In summary,
the additional decoder profiles are:
• High: contains all of the coding tools of the Main profile. Adds several coding
efficiency tools and monochrome 4:0:0 video. This profile easily replaces the Main
profile for many applications.
• High 10: contains all of the coding tools of the High profile. Adds sampling depth of
up to 10 bits per luma and 10 bits per chroma.
• High 4:2:2: contains all of the coding tools of the High 10 profile, while adding
professional-grade tools, very high resolutions, and the sub-sampling ratio of 4:2:2.
• High 4:4:4: contains all of the coding tools of the High 4:2:2 profile, while adding
sub-sampling of near-lossless 4:4:4. It also adds extremely high data rates and resolution,
with some limited lossless encoding capabilities.









Figure 2.8: Pixel sampling and depth, increasingly stacked by FRExt profile. [24]
With respect to frame formatting alone, the profiles' supported Y'CbCr pixel formats
stack increasingly as shown in Figure 2.8. The chroma sampling ratios are representative
of the internal color quality relative to luminosity quality, ranging from 4:0:0 monochrome
to 4:4:4 near-lossless R'G'B'. The higher chroma ratios increase the color accuracy of the
video. An individual video sequence may also be configured to represent its luma and
chroma samples uniformly with a sample size between 8 and 12 bits inclusive. The larger
sample sizes increase the overall precision of the video.
One anticipated use of FRExt within future consumer products, both software and embedded
hardware, is use of the High and High 10 profiles, which may be used to select several
color depths and monochrome for HD video, without changing the frame resolution,
thus providing an extra degree of subjective visual quality to the video stream. Other possible
embedded applications include professional-grade encoder/decoders for use in video
production, especially those concerned with real-time operation [19].
2.3 Thesis relevance and specifics.
While H.264/AVC does not specify any memory architecture, it does detail the data flow
of buffering intermediate data through the video decoder. It also specifies how the various
Y'CbCr macroblock formats are processed.
2.3.1 H.264/AVC data buffering flow.
In Annex C of the H.264/AVC standard [9], a Hypothetical Reference Decoder (HRD) is
detailed for the sake of providing an example conceptual implementation of the standard in
software. This decoder contains two distinct data buffers, the Coded Picture Buffer (CPB)
and the Decoded Picture Buffer (DPB). The CPB is not considered by this thesis as it is
only a receiving cache of the bitstream and not associated with actual decoding processes
or the frame buffer itself. (It is shown for completeness.) The DPB and its position in the
data flow are shown with Figure 2.9.
Conceptually, the DPB is a random access block of memory where buffered macroblocks
may be stored to and loaded from. Each of these macroblocks are partially or fully




















Figure 2.9: Buffering within the H.264/AVC Hypothetical Reference Decoder. [9]
With a software implementation, the DPB is a section of system RAM allocated on the
operating system heap, and a few pointers provide indexing of various locations within the
DPB. When implementing in hardware, using a single external SDRAM memory for the
DPB may be sufficient; but a memory hierarchy may also be necessary to obtain a higher
level of performance. A few possible hardware architectures are discussed in Chapter 3.
2.3.2 H.264/AVC data buffering organization.
Data entering and exiting the DPB is always on a basis of a macroblock of Y'CbCr samples.
While the internal storage may or may not map the three luma and chroma components
together within the same address space, the interface to the DPB always groups them on a
macroblock basis, as depicted with Figure 2.10.
H.264/AVC specifies three macroblock sizes, each having a luma dimension of: 16x16,
8x8, 4x4. Conceptually, all data store within the DPB is on a 16x16 or 4x4 macroblock






Figure 2.10: DPB operation: macroblock in, macroblock out.
Adjacent macroblock buffering.
Two processes within the decoder system, the Deblocking Filter (DF) process and the Intra
Prediction (IntraP) process, require load and/or store access to a currently selected MB
and also to each of its four already processed neighbor MBs. The currently selected MB
moves according to raster-scan order as a frame is decoded. The MB positions are depicted
with Figure 2.11.
[Figure 2.11 axis label: MBs per Raster-scan Row]
Figure 2.11: Buffer a row of macroblocks to retain neighbor MBs.
For example, while a macroblock is processed by the DF and output into the Current
MB location, an MB from positions A or D may be loaded to assist the filtering calculations.
Similarly, the IntraP process may need to load an MB from any of positions A, B, C, or D.
Access to MBs from any other positions is not needed by these processes, suggesting that
their buffering operations could be considered as either combined with the other decoder
processes, or perhaps using an independent section within the DPB [21].
Frame buffering.
As each frame is decoded, it is potentially stored in its entirety within the DPB for later
reference, depending on the values of metadata and memory instructions parsed from the
bitstream. Typically, an H.264/AVC profile will specify a default frame history depth
of four (4) to six (6) frames to store within the DPB for reference purposes. The maximum
frame buffer length permitted by the standard is fifteen (15) reference frames. The
maximum reference frame size of the DPB for a specific video stream depends on both
the H.264/AVC profile and the IDC Level of the current video stream. For example, the








Figure 2.12: Organization of the reference frame buffer.
The organization of reference frames within the frame buffer section of the DPB is
shown with Figure 2.12. As each frame is added to the buffer, the slice header information
is used to determine whether a particular frame is to be marked for short-term use or for
long-term use. Each frame has an index number for identifying itself within the frame
buffer. Two lists are maintained between frames, lists L0 and L1. The first list is used
with both P-slices and B-slices, whereas the latter is only for B-slices.
When beginning the addition of a frame to the buffer, the current capacity is first
checked. If the buffer is full then a "sliding window" algorithm is used to discard the
oldest short-term frame. A frame may be marked for long-term storage by an explicit
memory storage instruction in the bitstream. Once a frame is marked for long-term storage,
an explicit memory instruction from the bitstream is required to flush the frame out of
the buffer.
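The sliding-window behaviour described above can be sketched in a few lines. This is a simplified toy model, not the normative marking process of the standard; the class `ReferenceFrameBuffer` and its method names are hypothetical, and frames are represented only by integer identifiers.

```python
from collections import deque


class ReferenceFrameBuffer:
    """Toy model of sliding-window reference management (hypothetical names)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.short_term = deque()  # oldest short-term frame sits at the left
        self.long_term = set()

    def add(self, frame_id: int) -> None:
        # When the buffer is full, discard the oldest short-term frame first.
        if len(self.short_term) + len(self.long_term) >= self.capacity:
            if not self.short_term:
                raise RuntimeError("only long-term frames remain; explicit flush needed")
            self.short_term.popleft()
        self.short_term.append(frame_id)

    def mark_long_term(self, frame_id: int) -> None:
        # Models an explicit memory storage instruction from the bitstream.
        self.short_term.remove(frame_id)
        self.long_term.add(frame_id)

    def flush_long_term(self, frame_id: int) -> None:
        # Long-term frames leave the buffer only on an explicit instruction.
        self.long_term.remove(frame_id)
```

In this model a long-term frame is never touched by the sliding window, so it continues to occupy capacity until `flush_long_term` is called.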
One process within the decoder system, the Inter Prediction (InterP) process, requires
load access to arbitrary MBs from previously decoded reference frames. Noting the slice
types from Table 2.4, each individual macroblock makes use of the IntraP or InterP process,
but not both. This suggests that the buffering operations of IntraP and InterP could be
combined to overlap in timing [21].
2.3.3 Macroblock pixel types.
With additional pixel types permitted by the FRExt amendment, the binary size of a macroblock
is relative to two additional factors beyond the pixel dimensions of the macroblock
itself: chroma sub-sampling, and sampling depth. The single pixel data size for a set of
Y'CbCr samples is computed by the sub-sampling factor times the luma sampling depth.
The sub-sampling factor for each ratio is shown with Table 2.5.
Table 2.5: Sub-sampling factor for all sub-sampling ratios.
Sub-sampling ratio    Sub-sampling factor (Y' + Cb + Cr)
4:0:0                 1.0 + 0.0 + 0.0 = 1.0
4:2:0                 1.0 + 0.25 + 0.25 = 1.5
4:2:2                 1.0 + 0.5 + 0.5 = 2.0
4:4:4                 1.0 + 1.0 + 1.0 = 3.0
Using the sub-sampling factor $F_{ss}$ and the pixel dimensions of a macroblock $M_w \times M_h$,
the binary size of a macroblock can be computed as such:
$$MB_{bits} = (M_w \times M_h) \times (luma_{bits} \times F_{ss}) \qquad (2.4)$$
As an example, for a 4x4 macroblock with 4:2:0 sub-sampling and 8 bits of sample depth,
the binary size of the macroblock is:
$$MB_{bits} = (4 \times 4) \times (8 \times 1.5) = 192 \text{ bits} \qquad (2.5)$$
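As a quick check of Equation 2.4, the computation can be written as a short function. This is a minimal sketch; the function and dictionary names are hypothetical, and the factors are those listed in Table 2.5.

```python
# Sub-sampling factors from Table 2.5 (luma plus two chroma contributions per pixel).
SUBSAMPLING_FACTOR = {"4:0:0": 1.0, "4:2:0": 1.5, "4:2:2": 2.0, "4:4:4": 3.0}


def macroblock_bits(width: int, height: int, luma_bits: int, ratio: str) -> int:
    """Equation 2.4: MB_bits = (M_w * M_h) * (luma_bits * F_ss)."""
    return int(width * height * luma_bits * SUBSAMPLING_FACTOR[ratio])


if __name__ == "__main__":
    # The 4x4, 4:2:0, 8-bit example of Equation 2.5.
    print(macroblock_bits(4, 4, 8, "4:2:0"))
    # 16x16 macroblock sizes for sample depths of 8 to 12 bits.
    for depth in range(8, 13):
        print(depth, [macroblock_bits(16, 16, depth, r) for r in SUBSAMPLING_FACTOR])
```

Evaluating the function for a 16x16 macroblock over sample depths of 8 to 12 bits gives the 16x16 macroblock sizes tabulated in this section.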
The binary sizes of all possible 16x16 macroblocks considered by this thesis are detailed
with Table 2.6(a).
Table 2.6: Binary sizes of a macroblock and frame.
(a) 16x16 macroblock, bits
                  chroma sub-sampling
Sample depth    4:0:0   4:2:0   4:2:2   4:4:4
8 bits          2048    3072    4096    6144
9 bits          2304    3456    4608    6912
10 bits         2560    3840    5120    7680
11 bits         2816    4224    5632    8448
12 bits         3072    4608    6144    9216
(b) 1080 HD frame (1920x1088 luma), megabytes (1 MB taken as 2^20 bytes)
                  chroma sub-sampling
Sample depth    4:0:0   4:2:0   4:2:2   4:4:4
8 bits          1.99    2.99    3.98    5.98
9 bits          2.24    3.36    4.48    6.72
10 bits         2.49    3.74    4.98    7.47
11 bits         2.74    4.11    5.48    8.22
12 bits         2.99    4.48    5.98    8.96
As an example of how the macroblock size affects frame storage capacity
within the DPB, Table 2.6(b) details the frame sizes for the largest HD-TV resolution:
1080 HD. Note that while the visible resolution is 1920x1080, the H.264/AVC coded
luma resolution is actually 1920x1088 due to internal cropping constraints of the codec.
As shown, the largest pixel type produces a 1080 HD frame that is 4.5 times the size of
the smallest pixel type. This significant variability in data size is unique to the H.264/AVC
plus FRExt codec.
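The frame sizes and the 4.5 ratio quoted above can be reproduced with a short calculation. The sketch below assumes that the values in Table 2.6(b) are expressed in units of 2^20 bytes, which is what the arithmetic suggests; the function names are hypothetical.

```python
# Sub-sampling factors as in Table 2.5.
FACTOR = {"4:0:0": 1.0, "4:2:0": 1.5, "4:2:2": 2.0, "4:4:4": 3.0}


def frame_bytes(luma_width: int, luma_height: int, luma_bits: int, ratio: str) -> float:
    """Storage for one frame: pixels * bits per pixel, converted to bytes."""
    return luma_width * luma_height * luma_bits * FACTOR[ratio] / 8


def frame_megabytes(luma_width: int, luma_height: int, luma_bits: int, ratio: str) -> float:
    """Same size in units of 2**20 bytes (assumed unit of Table 2.6(b))."""
    return frame_bytes(luma_width, luma_height, luma_bits, ratio) / 2**20


if __name__ == "__main__":
    smallest = frame_bytes(1920, 1088, 8, "4:0:0")
    largest = frame_bytes(1920, 1088, 12, "4:4:4")
    print(round(frame_megabytes(1920, 1088, 8, "4:0:0"), 2))
    print(largest / smallest)
```

The ratio between the largest and smallest pixel types is independent of the resolution, since it depends only on the sample depth and the sub-sampling factor: (12 * 3.0) / (8 * 1.0) = 4.5.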




This chapter discusses published investigations and conclusions related to implementing
the frame buffer of an H.264/AVC decoder, and also presents several inferences made by
this thesis.
3.1 Decoder memory case studies and research.
Since its approval in 2003, the H.264/AVC coding standard [7, 9] has seen an extensive
degree of published research, including the performance optimization of both software and
hardware implementations. The extent of open publishing is possibly due to two major
factors. First, the baseline specification of H.264/AVC is royalty free and open to academia
and industry alike without charge beyond purchasing¹ the specification document, from
either ISO or ITU. This is different from the preceding MPEG-2 standard, for which ISO
charges significant royalties against each implementation [20]. Second, even during the
drafting phase, the scalability of the standard was found to be suitable for applications beyond
the original focus of video conferencing, including the up-and-coming technology of
HDTV. H.264/AVC was found to perform well at both low and high bit rates and resolutions [20].
These significant factors, in addition to others, including industry and academia
trends, have contributed to significant publishing of H.264/AVC research and development
[24].
¹ As of early 2007, ITU-T is now providing free download of the entire H.264.X group of documents and
software. Implementations of profiles other than Baseline, including some of the work performed by this

Figure 3.1: FPGA hybrid on-chip, off-chip decoder architecture proposed in [21].
In this section, a few supporting works are discussed for the purpose of depicting documented
approaches to implementing the external memory of a hardware H.264/AVC decoder
without FRExt.
3.1.1 Identification of memory components.
Before discussing techniques for optimizing memory within the H.264/AVC decoder, it is
necessary to first identify the memory requirements of the processing blocks, and also a
realistic generic architecture for hardware implementation. Considering an FPGA implementation
of the DPB, a few distinct independent sections of buffering become apparent, as
already detailed in Section 2.3.2. Proposed in [21] is a mixed on-chip and off-chip
memory architecture for an FPGA decoder, as depicted with Figure 3.1.
From the figure, the independent buffering needs of the decoder are as follows:
1. Two ping-pong buffers to pass macroblock data in FIFO order between processing stages.
These buffering components are at a maximum size of two macroblocks, and are thus small
enough to potentially fit on-chip in FPGA block RAM. One ping-pong buffer is
placed between the stream parser and the transform unit. The other is placed between
the prediction units and the deblocking filter.
2. A row buffer for data feedback to the intra-prediction unit. For small resolutions,
this should fit on-chip for an FPGA. However, for large resolutions it should only fit
on-chip for very large FPGAs with a significant quantity of block RAM.
3. A row buffer for data feedback to the deblocking filter. For small resolutions, this
should fit on-chip for an FPGA. However, for large resolutions it should only fit
on-chip for very large FPGAs with a significant quantity of block RAM. Also, it
is conceivable that resources are constrained to only allow instantiation of one row
buffer on-chip; in such a case, this row buffer can be combined with the reference
frame buffer.
4. A reference frame buffer for storing an N depth of fully decoded frames. This memory
component is certain to require a memory controller with an external RAM chip
with a significant capacity.
Each of these conceptual buffering stages could be combined into a single buffering unit
with off-chip memory, or could be distributed as described. The greater the distribution
of the memory buffer on-chip, the lower the bandwidth requirements of the external RAM
interface.
3.1.2 Optimization techniques.
Custom SDRAM memory controller for H.264/AVC.
One approach to optimizing the frame buffer performance is to implement a custom memory
controller with bus access specifically tailored to the H.264/AVC decoder architecture
and frame data. Rather than just optimize the data flow logic around the memory, the memory
controller itself can be re-implemented to anticipate the decoder behavior and provide
comparable data access at lower clock rates.
Described in [26], an HDTV H.264/AVC decoder was implemented with a custom
SDRAM controller for off-chip buffering of frame data. First, a standard SDRAM memory
controller was used to control the read and write accesses. For the maximum H.264/AVC
Baseline resolution of 1080 HD (1920x1080), the clock speed requirement of the memory
controller was determined to be 193 MHz. The controller was then reimplemented to
improve performance. Reducing the SDRAM page-active cycle in 2-dimensional read and
write accesses provided a one-third performance enhancement over traditional page per line
architectures. Memory bandwidth was significantly conserved with the additional benefit
of increased flexibility with the implementation of other decoding blocks.
Specifically, the clock speed requirement for the new memory controller was determined
to be 121 MHz, hence an approximate one-third performance improvement of
the controller. The reduction in clock speed requirement also potentially reduces power
consumption and device cost.
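The "one-third" figure can be checked directly from the two clock rates reported from [26]. As a short worked calculation, using only those two values:

```latex
\[
\frac{193\ \text{MHz} - 121\ \text{MHz}}{193\ \text{MHz}}
  = \frac{72}{193}
  \approx 0.37
\]
```

That is, the required clock rate drops by roughly 37%, slightly more than one third.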
Multiple channel memory architecture.
Another possible approach, depending on the target device technology, is to use two memory
controllers with one RAM each for buffering different portions of frame data within
the decoder. In [10], an ASIC H.264/AVC decoder architecture utilizes dual memory controllers
combined with an ARM CPU, system bus, and local bus. The architecture demonstrates
use of two buses and two memory controllers to facilitate a high level of controlled
parallelism between processing blocks. The performance was sufficient to process real-time
1080 HD frames.
Extending this technique, a deblocking filter architecture is described in [11] that uses
a 2-dimensional array of memory modules. Specifically, the architecture uses eight dual-port
SRAM modules, facilitating parallel access of eight pixels. The pixels within a 4x4
macroblock are mapped in a linear shift-rotate manner. This allows an 8-way parallel load
and store of the pixel data of an arbitrary 4x4 block without incurring memory row/column
select conflicts.
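The exact mapping used in [11] is not reproduced here. As a generic illustration of a shift-rotate (skewed) layout, the sketch below spreads a 4x4 block over four hypothetical memory banks so that any single row or any single column touches each bank exactly once. The bank count and all names are assumptions made for this sketch, whereas the architecture in [11] uses eight SRAM modules.

```python
BANKS = 4  # hypothetical bank count for this sketch only


def bank_of(row: int, col: int) -> int:
    """Rotate each row by its row index so rows and columns both hit distinct banks."""
    return (row + col) % BANKS


def banks_for_row(row: int) -> list:
    """Banks touched when reading one full row of the 4x4 block."""
    return [bank_of(row, c) for c in range(4)]


def banks_for_col(col: int) -> list:
    """Banks touched when reading one full column of the 4x4 block."""
    return [bank_of(r, col) for r in range(4)]


if __name__ == "__main__":
    for r in range(4):
        print([bank_of(r, c) for c in range(4)])
```

Because no two pixels of a row or column share a bank, a full row or column can be loaded in one parallel access, which is the property a deblocking filter needs when filtering both vertical and horizontal edges.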
Implementable greedy search algorithm for frame history.
To further increase the hardware performance of the frame buffer (as measured by reducing
the necessary clock speed and RAM capacity), the frame buffer data management may
discard an additional quantity of frame data. This form of lossy optimization potentially
increases performance by a reduction of data transfer across the memory bus at the expense
of reducing the subjective video output quality.
To render the decoded video within "acceptable" levels, an efficient search algorithm
may be used to determine what video data, if any, to discard for each frame buffer operation.
Described in [5], the technique employs a greedy search heuristic to discard the least
important reference frames. In doing this, the prediction performance is increased over
the conventional sliding window memory management, which simply discards the oldest
frame without as much regard to context.
While the implementation complexity of this approach is small, the algorithm requires
experimental fine-tuning specific to the decoder architecture, including the memory accesses.
Additionally, the subjective video output quality may be compromised if the algorithm
is not carefully tuned [5]. This approach is notably of less general application than
the H.264/AVC-specified sliding window algorithm.
Compressed memory bus accesses.
Another popular technique for optimizing decoder power and performance is use of embedded
compression (EC) within the memory architecture. All data is compressed with a
lossless block compression algorithm just prior to memory store, and decompressed immediately
following load from memory. Considering that the majority of data flow that utilizes
the memory architecture is decoded macroblocks, the potential savings in memory capacity
and data bus operations is significant. The reduction in physical memory operations
also potentially reduces power consumption.
Described in [4] is an EC technique used to optimize an H.264/AVC decoder for real-time
operation at low speed and low power consumption. While any form of EC was found
to be adequate for reducing the memory capacity requirements of a general video decoder,
three main constraints were determined as pivotal for reducing power consumption:
1. Use a block-based compression, and independently encode each block.
2. Set a fixed compression ratio for all blocks and use this fixed ratio for the memory
mapping.
3. Store the luminance and chrominance planes jointly in memory.
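One practical consequence of the second constraint is that the address of any compressed block can be computed directly from its index, without a lookup table. The following is a minimal sketch of that idea only; it is not the scheme of [4], and the function name, the 2:1 ratio in the usage note, and the byte sizes are hypothetical.

```python
def compressed_block_address(block_index: int, raw_block_bytes: int,
                             ratio: float, base: int = 0) -> int:
    """Address of a block when every block occupies raw size / ratio bytes.

    With a fixed ratio, every block has the same slot size, so random access
    to block N needs only a multiplication.
    """
    slot = int(raw_block_bytes / ratio)  # fixed slot size shared by all blocks
    return base + block_index * slot
```

For example, a 16x16 macroblock at 4:2:0 and 8 bits per sample occupies 3072 bits, or 384 bytes; with an assumed fixed ratio of 2:1, block 10 would start at byte offset 1920.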
3.2 Analysis of published results.
Conjecturing from the above research, several techniques may be employed to improve
external memory performance within a video decoder with a single decoded frame format.
• Experimentally tune the memory controller parameters such that access bursts and

algorithm with a more lossy algorithm, reducing the
total number of store and load operations; possibly also reducing the frequency of
memory accesses.
• Compress data as the first stage of a store operation, and decompress data as the final
stage of a load operation, using a lossless block compression algorithm tailored for
Y'CbCr data. This would reduce memory bandwidth and capacity requirements at
the expense of additional on-chip logic.
• Use a dedicated memory controller for the frame buffer, with all intra prediction data
being stored either on-chip or via a separate memory controller.
Extrapolating from the above research, several plausible approaches to optimizing FRExt
memory performance include:
• Re-sample data as the first stage of a store operation, and re-sample data as the final
stage of a load operation. This would reduce memory bandwidth and capacity
requirements at the expense of additional on-chip logic and reduction of inter prediction
quality.






This chapter provides an overview of the requirements, algorithms and data flow used for
designing the frame buffer model. The addition of Y'CbCr pixel type variability to the
frame buffer model is also discussed.
4.1 Augmentation of decoder system.
A preexisting Baseline H.264/AVC decoder testbench system was obtained from [21].
While much of the functionality of the H.264/AVC Baseline profile was implemented, the
system was only capable of processing several QCIF frames within reasonable resource requirements.
Portions of the system were optimized for simulation performance, as detailed
in Chapter 6. These optimizations, plus the addition of FRExt support, did not change the
logical design of the models within the decoder system; only the details of the memory
management and the parsing algorithms. Figure 4.1 depicts the unchanged organization
of models within the decoder system. Some significant design changes were made to the
internal testbench mechanisms and structure, and are also explained in Chapter 6.
4.2 Algorithms used by the frame buffer.
As video data is parsed from the H.264/AVC bitstream, each macroblock flows through


























Figure 4.1: Video decoder system architecture and data flow.
to an ASCII file. (In an FPGA system, the output from the decoder system would likely
be display rendering, or a form of local high speed data bus transmission.) The decoding
operations encountered by a single macroblock as it is processed by the system are detailed
with Table 4.1.
Table 4.1: Macroblock stages of operation while passing through the decoder system.







Intra Prediction (only for I-type)
Inter Prediction (only for P/B-types)
Deblocking Filter.
Output: filtered Y'CbCr luma and chroma samples.
Intermediate storage of a significant quantity of picture data, termed here as "frame
buffering", is an algorithmic necessity of the Prediction stages and also the Deblocking




4.2.1 Intra prediction buffering requirements.
The intra prediction stage computes sample values of the current macroblock based upon
the sample values of neighboring macroblocks. It only takes into consideration those neighboring
macroblocks which have already been decoded; and so in raster-scan order, the four
possible neighbors are either one column or one row preceding the current macroblock.
The previously decoded macroblocks referenced by the intra prediction stage contain unfiltered
samples; viz., they have passed through one of the Intra/Inter Prediction stages, but not the
Deblocking Filter.










Figure 4.2: Intra Prediction macroblock neighbor permutations (panels a., b., c., d.).
The first macroblock of a frame never has intra prediction neighbors as no other macroblocks
have yet been decoded within the same frame. However, the second macroblock of
a frame, and all succeeding macroblocks within that same frame, have previously decoded
"neighbors" in one of the four permutations shown with Figure 4.2. Thus, a macroblock
undergoing intra prediction may have one, two, three, or four neighboring macroblocks as
candidate references for prediction. The shown permutations a., b., c., and d. correlate with
the current macroblock being positioned: in the first row, in the first column, in the bulk
of the picture, or in the last column, respectively.
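The four permutations can be derived from the raster-scan index of the current macroblock. The sketch below is illustrative only; it uses descriptive position names (left, above, above-left, above-right) rather than the letter labels of the figures, and the function name is hypothetical.

```python
def decoded_neighbors(mb_index: int, mbs_per_row: int) -> dict:
    """Return raster-scan indices of the already-decoded neighbors of a macroblock."""
    row, col = divmod(mb_index, mbs_per_row)
    neighbors = {}
    if col > 0:
        neighbors["left"] = mb_index - 1
    if row > 0:
        neighbors["above"] = mb_index - mbs_per_row
        if col > 0:
            # Oldest neighbor: one full row plus one macroblock earlier.
            neighbors["above_left"] = mb_index - mbs_per_row - 1
        if col < mbs_per_row - 1:
            neighbors["above_right"] = mb_index - mbs_per_row + 1
    return neighbors
```

With five macroblocks per row, for example, a macroblock in the first row has one neighbor, one in the first column has two, one in the last column has three, and one in the bulk of the picture has four.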
From the shown permutations, the oldest neighbor is position D. This reference macroblock
is at a distance from the current macroblock of: the number of macroblocks per
row, plus one. Thus, the maximum FIFO buffering requirement between reference D and
the current macroblock, inclusive, is the number of macroblocks per row plus two. The
actual maximum binary size of the FIFO is proportional to the frame width $F_w$ and the
binary size of an individual macroblock $M_{bits}$:
$$FIFO_{bits} = \left( \frac{F_w\ \text{pixels}}{16\ \text{pixels per macroblock}} + 2\ \text{macroblocks} \right) \times M_{bits} \qquad (4.1)$$
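Extending Equation 4.1 for the specific case of 720p HD frames with Baseline format
(4:2:0, 8 bits/sample), the FIFO buffer size would be:
$$\left( \frac{1280}{16} + 2 \right) \times 3072 \text{ bits} = 251904 \text{ bits} = 246 \text{ Kbits}$$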






A few other examples of the intra prediction buffer requirements for common SD and
HD resolutions are shown with Table 4.2.
Table 4.2: Example buffering size requirements for intra prediction.
Resolution (luma)   MBs per row   Binary Size
                                  94 Kbits    141 Kbits   235 Kbits
                                  164 Kbits   246 Kbits   410 Kbits
                                  244 Kbits   366 Kbits   610 Kbits
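The intra prediction FIFO sizing (macroblocks per row plus two, times the macroblock size) can be evaluated for several frame widths. In the sketch below, the frame widths of 720, 1280, and 1920 pixels and the macroblock sizes of 2048, 3072, and 5120 bits are assumptions, chosen because they reproduce the binary sizes that survive in Table 4.2; the function name is hypothetical, and 1 Kbit is taken as 1024 bits.

```python
def intra_fifo_bits(frame_width: int, mb_bits: int, mb_width: int = 16) -> int:
    """FIFO size in bits: (macroblocks per row + 2) * bits per macroblock."""
    mbs_per_row = -(-frame_width // mb_width)  # ceiling division
    return (mbs_per_row + 2) * mb_bits


if __name__ == "__main__":
    for width in (720, 1280, 1920):  # assumed frame widths
        sizes = [intra_fifo_bits(width, bits) // 1024 for bits in (2048, 3072, 5120)]
        print(width, width // 16, sizes)
```

Under these assumptions, the 1280-pixel case with 3072-bit macroblocks gives 82 macroblocks of buffering, or 251904 bits, which is the 720p Baseline case discussed above.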
Since the intra prediction stage uses unfiltered reference macroblock samples, the buffer
should obtain its input data from the same source as the deblocking filter obtains its inputs.
This requirement also prevents the intra prediction buffering from combining its data
with the buffering requirements of other stages. The points at which unfiltered and filtered
data should independently be extracted from the main data path for buffering purposes are
shown in Figure 4.3.












Figure 4.3: Main data path "tap" locations for buffering.
4.2.2 Deblocking filter buffering requirements.
Similar to the intra prediction stage, the deblocking filter also requires access to previously
decoded neighbor macroblock samples. The buffering needs of the deblocking filter are
identical in terms of data size and block position, as detailed with Equation 4.1. However,
the deblocking filter references filtered data. As such, the location from which this buffered
data should be obtained is the immediate outputs of the deblocking filter, for unitary feedback
into the same stage, as shown in Figure 4.3.
4.2.3 Inter prediction buffering requirements.
The inter prediction stage computes sample values of the current macroblock by applying
motion vectors to the sample values of previously decoded macroblocks from any frame
stored for reference. The reference frames may be in the past or the future¹ from the
perspective of visual display order. However, the reference frames will always be obtained
from a previously decoded frame that was marked for storage. The reference frames contain
filtered samples; viz., the samples were processed by every applicable stage in the decoder
system, including the Deblocking Filter.
Note that the deblocking filter and inter prediction stages both require buffering of
filtered samples: of the current frame and of reference frames respectively. Thus, an appropriate
location for the reference frame buffering to obtain its inputs is conditionally from
the outputs of the deblocking filter stage.
¹ Display order and decoding order are similar, but not necessarily identical within an H.264/AVC bitstream.
The encoding of a frame with one or more B-slices causes some minor frame reordering; and this
4.2.4 Combining of buffering mechanisms.
Proposed in [21], the deblocking filter could use its own row buffer for unitary feedback
of neighboring macroblock samples. As each macroblock is flushed from the row FIFO, it
is conditionally stored for reference within the frame buffer, depending on whether the macroblock
belongs to a reference frame. Two benefits arise from this approach. First, the reference
frame buffering is delayed by an additional row, providing the system-level controller ample
time to determine whether the macroblocks belong to a reference frame. Additionally,
the separation of the two buffering needs reduces the number of memory accesses placed
upon a single buffering organization, potentially increasing parallelization.
If such parallelization were not possible, then the fact that the deblocking filter and
inter prediction both require buffering of filtered samples permits them to be combined
into a single buffering mechanism. The currently decoding frame could be unconditionally
stored as a spare reference frame, allowing feedback access by the deblocking filter, and
then either re-identified for reference storage or cleared upon completion of decoding
the current frame. This approach simplifies the memory architecture at the expense of
decreased distribution and potentially increased device speed requirements.
4.2.5 Reference picture management.
As each macroblock is decoded to an array of filtered Y'CbCr samples, the samples are
stored within the frame buffer. The samples are organized by frame to represent a full decoded
picture, with each individual frame being marked as: used for short-term reference,
used for long-term reference, or unused for reference. Those NAL units that were previously
parsed from the bitstream with a non-zero nal_ref_idc value are intended for later
reference; and so the frame(s) belonging to such units are marked for short-term storage.
Additionally, if a slice header should contain a non-zero long_term_reference_flag value,
then the frame marking is changed from short-term to long-term storage.
A frame that is unused for reference is soon to be displayed, and does not need to be
stored for later access by the inter prediction stage. As such, it may either never enter the
frame buffer, or be quickly displayed and removed from the frame buffer, depending on
architectural details. A frame that is used for reference remains in the frame buffer until
one of the following scenarios causes it to be removed.
• An instantaneous decoding refresh (IDR) command is decoded from the bitstream.
  All frames are marked as unused for reference, effectively discarding the contents
  of the frame buffer.
• The sliding window algorithm causes a short-term frame to be discarded and
  overwritten with another frame.
• A memory control command is decoded from the bitstream, providing an explicit
  instruction to remove a specific long-term frame.
Sliding window algorithm.
The reference pictures within the frame buffer are managed with a "sliding window" algorithm
that organizes the status of reference frames and, as necessary, discards a short-term
frame to make room for storing a new frame. Prior to the decoding of each slice, a
fixed-length reference picture list is re-constructed of the frames currently available for reference.
The reference picture with the highest PicNum (most recently decoded frame) is placed
at the head of the list, pushing over the other PicNum values toward the tail of the list.
This continues as each new reference frame is added to the frame buffer, and if the frame
buffer exceeds capacity, the reference frame with the lowest PicNum (least recently decoded
frame) is removed from the list and effectively discarded.
Table 4.3: Example of reference picture list updates [16]
Operation                           Reference Picture List
Initial state or IDR                (empty)
Decode frame 250                    250
Decode 251                          251 250
Decode 252                          252 251 250
Decode 253                          253 252 251 250
Assign 251 to LongTermPicNum 0      253 252 250 0
Decode 254                          254 253 252 250 0
Assign 253 to LongTermPicNum 4      254 252 250 0 4
Decode 255                          255 254 252 0 4
When a short-term frame is later marked for long-term storage, the frame is re-identified
by a LongTermPicNum and placed at the tail of the list. The long-term frames are listed
in ascending LongTermPicNum order, and are protected from removal by anything other
than an explicit memory control command. The sliding window will always discard the
oldest short-term frame, and not a long-term frame. Table 4.3 shows an example of these
list updates.
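
The list behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not the thesis implementation: the class name `ReferenceList`, its method names, and the list capacity of five entries (inferred from the last rows of Table 4.3) are all assumptions made for the example.

```python
class ReferenceList:
    """Toy model of the reference picture list of Table 4.3 (hypothetical names)."""

    def __init__(self, capacity=5):
        self.capacity = capacity   # assumed list length, inferred from Table 4.3
        self.short_term = []       # PicNum values, most recently decoded first
        self.long_term = set()     # LongTermPicNum values

    def idr(self):
        # An IDR marks every frame as unused for reference.
        self.short_term.clear()
        self.long_term.clear()

    def decode(self, pic_num):
        # The new frame goes to the head of the list.
        self.short_term.insert(0, pic_num)
        # Sliding window: on overflow, drop the oldest short-term frame only.
        if len(self.short_term) + len(self.long_term) > self.capacity:
            self.short_term.pop()

    def assign_long_term(self, pic_num, long_term_pic_num):
        # Re-identify a short-term frame by a LongTermPicNum.
        self.short_term.remove(pic_num)
        self.long_term.add(long_term_pic_num)

    def listing(self):
        # Short-term frames first, then long-term in ascending LongTermPicNum order.
        return self.short_term + sorted(self.long_term)


refs = ReferenceList()
for pic in (250, 251, 252, 253):
    refs.decode(pic)
refs.assign_long_term(251, 0)
refs.decode(254)
refs.assign_long_term(253, 4)
refs.decode(255)
print(refs.listing())
```

Replaying the operations of Table 4.3 in this model ends with the list 255 254 252 0 4, the final row of the table: frame 250 is discarded, while both long-term entries survive.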




This chapter details the structural design and implementation of the frame buffer, including
FPGA synthesis, with special regard toward performance and FRExt support. Edge-in design
methodology was employed to implement the frame buffer. The low-level, technology-specific
DDR control and interfacing was designed first to establish worst-case available device
resources. Second, the address mapping concepts and sliding window algorithm were
implemented. Finally, internal controls, interfaces, parallelized computations, and pipelining
were iteratively implemented to create a performance-oriented frame buffer.
5.1 External memory storage and control.
The FPGA technology was chosen to be one of the Xilinx Spartan or Virtex device families
due to the immediate availability of Xilinx software and ease of performance comparison
with other literature [21]. The external memory technology was chosen to be DDR
SDRAM for a balance between performance and device selection. Older technologies such
as DRAM and SDRAM do not provide the necessary performance. The newer technology
of DDR2 appears to provide no significant advantage over DDR with the majority of Xilinx
devices.
As an example: the typical maximum frequency of SDRAM is 133 MHz, and thus this
single data rate has an ideal data throughput of 133 Mbps. Considering a typical resolution
of 720p HD (1280x720) at conventional 4:2:0, 8 bits per sample, the maximum SDRAM
throughput for storing frames alone (and not retrieving any data) would be
$$ rate_{max} = \frac{133\ \text{MHz}}{3600\ \frac{\text{mblock}}{\text{frame}} \times 3072\ \frac{\text{bits}}{\text{mblock}}} \approx 12\ \text{frames/second} \tag{5.1} $$
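
The arithmetic of equation (5.1) can be checked with a short Python sketch. The function name `max_store_rate` is hypothetical, and the calculation follows the ideal, storage-only assumption of the text: one bit of throughput per hertz, with no refresh, command, or read overhead.

```python
# 720p HD: 1280x720 luma samples in 16x16 macroblocks.
MACROBLOCKS_PER_FRAME = (1280 // 16) * (720 // 16)   # 80 * 45 = 3600

# 4:2:0 at 8 bits per sample: 256 luma + 128 chroma samples per macroblock.
BITS_PER_MACROBLOCK = 16 * 16 * 8 * 3 // 2            # 3072


def max_store_rate(bits_per_second):
    """Ideal frames per second when the bandwidth is spent only on storing frames."""
    return bits_per_second / (MACROBLOCKS_PER_FRAME * BITS_PER_MACROBLOCK)


print(max_store_rate(133e6))   # single data rate at 133 MHz
print(max_store_rate(266e6))   # double data rate at the same clock
```

Under the same assumption, doubling the data rate at 133 MHz only just reaches the 24 frames/second mark for storage alone, before any loads for inter prediction are counted.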
Considering that 24 frames/second is one standard frame rate for H.264/AVC sequences,
the SDRAM bandwidth is hardly adequate for real time decoding at HD resolutions. Thus
DDR and DDR2 are the more viable technologies. The maximum frequency of DDR2 is
higher than that of DDR for the most recent and expensive Xilinx FPGA family: Virtex 5;
but for all other FPGA devices DDR and DDR2 share approximately the same maximum
operating frequency. DDR was chosen over DDR2 as the target memory technology for the
following reasons:
• DDR support is available on more FPGA technologies than DDR2, so device selection
  is more flexible when using the older technology.
• DDR scales down to lower frequencies than DDR2, potentially allowing better total
  device power savings at lower frame resolutions and rates, by being able to lower the
  system clock.
• DDR2 uses a lower voltage than DDR, allowing potential power savings. However,
  the lower-power variant of DDR provides similar savings, and this technology may
  also be interfaced with Xilinx devices.
• DDR scales up to the same maximum frequencies of 166 MHz and 200 MHz as does
  DDR2 on the Spartan family devices. The much more expensive Virtex devices do
  permit significantly higher operating frequencies for both the FPGA and memory,
  which only DDR2 is able to take advantage of. However, when considering
  applications below 200 MHz, DDR2 does not provide a noticeable bus performance
  advantage, and is actually a more complex interface.
• DDR has lower CAS memory latencies than DDR2 technology; and DDR2 often has
  an additive latency that does not exist with DDR. Thus, DDR can potentially obtain
  higher throughput at the same bus frequency [25].
In addition to memory concerns, the Xilinx FPGA family of Spartan 3E was selected
for the initial implementation. This FPGA family is the least expensive and smallest of
the Xilinx FPGAs that support DDR SDRAM memory interfaces. By initially targeting
the Spartan 3E device, the design constraints were maintained at a worst-case scenario
for each of: operational speed, area, and device resources. This restriction allowed each
performance concern to appear earlier in the design process. Other FPGA families from
Xilinx considered for the eventual targeting by the design were Virtex 2 Pro and Virtex 4. No
on-chip FPGA IP was used that could possibly exclude implementation of the design on
both Spartan and Virtex FPGA families.
5.1.1 DDR memory control.
A reference DDR SDRAM controller for the Xilinx Spartan 3E devices was obtained from
[1], modeled with structural VHDL. Additionally, a simulation-only Micron 512MB DDR
SDRAM model was provided by the same reference, modeled with behavioral Verilog.
Both models are highly parameterizable. As depicted with Figure 5.1, this vendor-supplied
model consists of a low-level DDR controller combined with an embedded testbench that
indicates a simple pass or fail status.
The DDR burst mode is configured by default for four (4) words; and the controller
provides a read/write pipeline depth of four (4) words to facilitate the burst mode. Data
words of size 16 bits are read/written on each positive and negative clock edge. The user
interface of the DDR controller provides handshaking for command waiting, SDRAM
auto-refresh, and valid data clocking, and carries 64 bits of data per unique address. The
vendor-supplied controller model targeted the specific Xilinx device of xc3s500, package
efg320, and speed grade of -4. This device supports an operating frequency of 133 MHz.
The ideal maximum DDR bandwidth for a -4 speed grade device is 266 Mbps.
For later integration within the structural frame buffer, the embedded testbench was















Figure 5.1: Vendor-supplied Xilinx Spartan 3E DDR SDRAM controller.
and connected where the embedded testbench had once been. In addition, the data and
addressing were propagated to an external interface, and the DCM clocking infrastructure
was repositioned to provide the necessary four degrees (0, 90, 180, 270) of phased
clocking throughout the full design hierarchy, as depicted with Figure 5.2. To obtain a 20%
increase in memory bandwidth, the same target FPGA device is also available from Xilinx
at the -5 speed grade with an operating frequency of 166 MHz. The ideal DDR bandwidth



















Figure 5.2: Customized Xilinx Spartan 3E DDR SDRAM controller.
5.1.2 Block data controller.
Since the decoder system computes samples on a raster-scan order macroblock basis, the
frame buffer should also load and store data in terms of grouping samples by macroblock
rather than the natural order of the frame samples. To facilitate this, a block data controller
was designed. The controller idles the DDR operations until receiving a start pulse, during
which it loads two addressing values: the physical RAM address at which to start, and the
number of unique addresses on which to operate. The DDR controller is then driven by the
block controller to sequentially read or write exactly the requested number of addresses,
beginning with the loaded start address. When the specified number of addresses have been
read from or written to, the block controller places the DDR controller back into an idling
"no operation" state.








Figure 5.3: Block data controller state machines.
As shown, one Finite State Machine (FSM) responds to the start pulse: to load, increment,
and hold the physical address. While in the incrementing state, a disable signal is
deasserted to trigger another FSM to place the DDR controller into an active read or write
state. The transition from disable deasserted to disable asserted causes the latter FSM to
return the DDR controller to an idling state.
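
As a rough behavioral illustration of the two cooperating state machines, the following Python sketch steps through one block operation. It is a software model under stated assumptions, not the RTL: the function name `block_controller`, the tuple layout, and the one-address-per-step timing are all hypothetical simplifications.

```python
def block_controller(start_addr, count, write=False):
    """Yield (state, address, disable, ddr_command) for each step of one operation.

    Assumed simplification: one unique address is issued per step, and the
    second FSM is reduced to the ddr_command field driven by `disable`.
    """
    addr = start_addr
    # Start pulse: load the start address; disable is still asserted.
    yield ("load", addr, True, "nop")
    for _ in range(count):
        # Incrementing state: disable is deasserted, so the DDR controller
        # is held in an active read or write state.
        yield ("incr", addr, False, "write" if write else "read")
        addr += 1
    # Count reached: disable is asserted again and the DDR controller idles.
    yield ("hold", addr, True, "nop")


for step in block_controller(0x100, 4):
    print(step)
```

The trace shows the disable signal low for exactly the requested number of addresses, which is the property the surrounding logic relies on.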
5.1.3 Implementation hierarchy.
The RTL hierarchy and internal partitioning of the synthesizable frame buffer is depicted with
Figure 5.4.













Figure 5.4: Internal partitioning of frame buffer design.
1. Clocking and buffering. A Digital Clock Manager (DCM) with internal PLLs is used to
provide 4 degrees of phase-shifting. Additionally, the DDR pins and system data path
have some explicit input and output buffers to achieve full device frequency performance.
2. RAM Command. A block that drives states and commands to the external memory.
3. Address counting. A block that loads and increments the DDR address for each
operation.
4. Block controller. A block that contains the previous two blocks, and also manages
the increment signal and burst pipeline.
5. External memory controller. A controller that provides pipelined burst operation of a
DDR memory, intended for a 16-bit data width memory module.
6. FRExt scaling. A block that computes the address count for a single macroblock
according to the selected pixel formatting.
7. Sliding window. A block that manages look-up tables for the sliding window algorithm,
including storage markings. It discards short-term frames and performs IDR
according to the H.264/AVC standard algorithm.
8. H.264/AVC mapper. A block that contains the previous two blocks, and manages the
address fed to the block controller. It detects the change of inputs and determines
what operation should be performed and how; and then triggers the block controller
to commence a new operation.
5.1.4 External DDR interface.
The DDR interface of the structural frame buffer is standard for wiring a DDR memory
module (not a DIMM) to one FPGA IO pin bank. The memory module connects to the
left side of the FPGA device to minimize latencies and prevent timing errors. While all
four sides (top, bottom, right, left) of the Spartan 3E devices can be configured for DDR
pin operation, the left and right sides provide the necessary routing density and timing to
operate the DDR SDRAM module and FPGA both at maximum frequency [15]. The
selection of the left side over the right side was arbitrary. The FPGA to DDR SDRAM
interface is depicted with Figure 5.5.
The specific memory module connected to the controller is a Micron 512 MB module
with product code MT46V32M16TG6T. The maximum operating frequency is
166 MHz, the same as the faster (-5) speed grade Spartan 3E devices [14].
5.2 Frame organization and addressing.
[Figure 5.5: FPGA to DDR SDRAM interface.]
Table 5.1: External DDR pins.

The internal data organization of the frame buffer was devised as follows. First, each unique
DDR address, from the perspective of the logic above the DDR controller, represents a
storage capacity of 64 bits. Thus every store and load operation performed with the DDR
and block data controllers is a multiple of 64 bits. Second, the data is stored and loaded
with an atomic unit of a 16x8 pixel sub-macroblock of Y'CbCr samples, as a pair of either
the upper two or lower two 8x8 pixel sub-macroblocks.
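The choice of 16x8 pixels per atomic operation was selected as the smallest possible
unit that would divide evenly into 64 bit word operations for all pixel types. From Table 2.6(a),
the possible binary sizes of a 16x16 macroblock, in bits, are given by the following matrix,
with one row per sample size $L_{bits}$ (8 to 12 bits) and one column per sub-sampling
factor $F_{ss}$ (1, 3/2, 2, 3):
$$ M = \begin{bmatrix} 2048 & 3072 & 4096 & 6144 \\ 2304 & 3456 & 4608 & 6912 \\ 2560 & 3840 & 5120 & 7680 \\ 2816 & 4224 & 5632 & 8448 \\ 3072 & 4608 & 6144 & 9216 \end{bmatrix} \tag{5.2} $$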
Finding the matrix Greatest Common Divisor (GCD) of equation (5.2) yields the largest
data word size that evenly divides into all possible 16x16 macroblock sizes.
$$ \gcd(M) = 128\ \text{bits} \tag{5.3} $$
Then, dividing the actual data width $W$ by $\gcd(M)$ yields the factor of a 16x16 macroblock
that forms the atomic unit:
$$ \frac{W}{\gcd(M)} = \frac{64}{128} = \frac{1}{2} \tag{5.4} $$
As shown, half of a 16x16 macroblock is the smallest atomic unit supported for a data word
size of 64 bits. 16x8 is half the size of 16x16, and thus an appropriate atomic unit. Sixteen
columns by eight rows (16x8) was selected instead of eight columns by sixteen rows (8x16)
as a decision to retain raster-scan ordering of samples within the full 16x16 macroblock.
Examining the prediction buffering requirements, this scheme appears to have a greater
probability of reducing the number of memory accesses incurred by the prediction units
and deblocking filter.
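
The matrix of macroblock sizes, its GCD, and the resulting atomic factor can be recomputed with a short Python sketch. The names `macroblock_bits`, `GCD_M`, and `ATOMIC_FACTOR` are hypothetical; the formula 16 x 16 x sample size x sub-sampling factor is the macroblock size used throughout this chapter.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

SAMPLE_BITS = range(8, 13)                                   # 8 to 12 bits per sample
SUBSAMPLING = [Fraction(1), Fraction(3, 2), Fraction(2), Fraction(3)]
WORD_BITS = 64                                               # data width W


def macroblock_bits(l_bits, f_ss):
    """Binary size of one 16x16 macroblock, in bits."""
    return int(16 * 16 * l_bits * f_ss)


# One row per sample size, one column per sub-sampling factor.
M = [[macroblock_bits(l, f) for f in SUBSAMPLING] for l in SAMPLE_BITS]

# GCD over every entry of the matrix.
GCD_M = reduce(gcd, (size for row in M for size in row))

# Fraction of a 16x16 macroblock covered by the atomic unit.
ATOMIC_FACTOR = Fraction(WORD_BITS, GCD_M)

print(M[0], GCD_M, ATOMIC_FACTOR)
```

The limiting entries are the odd sample sizes at 4:2:0, such as 3456 bits for 9-bit samples, which are multiples of 128 but not of 256.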
The memory map of a single 16x8 sub-macroblock for the case of 8 bits per sample
is depicted with Figure 5.6. The luma samples are grouped first, followed by the blue
chroma, and then by the red chroma. Each square in the figure represents the data of
two 64-bit word operations, and the addresses are the raw byte counts. For example, the
shown macroblock at 4:2:0 sampling maps between byte addresses 0x00 to 0xBF inclusive,
totaling 192 bytes. Macroblocks with sample sizes greater than 8 bits are mapped into
memory in like fashion, and the atomic unit of 16x8 sub-macroblock remains 64-bit word
aligned for all pixel types. However, the individual samples are not fully byte aligned due
to a sample size that is not a power of two. A memory map of other sample sizes would
show some of the 64-bit data words as containing data from two of: luma, blue chroma,
red chroma.





















Figure 5.6: Binary 16x8 sub-macroblock memory maps for each 8-bit sub-sampling.
Scaling the binary macroblock sizes of Table 2.6(a) by the atomic factor of equation (5.4),
and also dividing by 64, yields the number of unique DDR controller addresses
necessary to store a 16x8 sub-macroblock. All possible values are detailed with Table 5.2.
Table 5.2: Unique addresses required to store 16x8 sub-macroblock for all pixel types.
sample      chroma sub-sampling
depth       4:0:0   4:2:0   4:2:2   4:4:4
8 bits      16      24      32      48
9 bits      18      27      36      54
10 bits     20      30      40      60
11 bits     22      33      44      66
12 bits     24      36      48      72
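
The entries of Table 5.2 follow directly from the sub-macroblock size divided by the 64-bit word. A minimal Python sketch, with the hypothetical helper name `sub_macroblock_addresses`, reproduces them:

```python
from fractions import Fraction

# Sub-sampling factor F_ss for each chroma format.
CHROMA = {
    "4:0:0": Fraction(1),
    "4:2:0": Fraction(3, 2),
    "4:2:2": Fraction(2),
    "4:4:4": Fraction(3),
}


def sub_macroblock_addresses(l_bits, chroma, word_bits=64):
    """Unique 64-bit addresses needed to store one 16x8 sub-macroblock."""
    bits = 16 * 8 * l_bits * CHROMA[chroma]
    if bits % word_bits != 0:
        raise ValueError("sub-macroblock is not word aligned")
    return int(bits // word_bits)


for l_bits in range(8, 13):
    print(l_bits, [sub_macroblock_addresses(l_bits, c) for c in CHROMA])
```

No combination raises the alignment error, which is the word-alignment property claimed for the 16x8 atomic unit.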
5.2.1 Macroblock identification and frame slotting.
The frame buffer maps DDR memory such that a linear block of RAM is reserved for
each potentially stored frame, up to the maximum length of the buffer, and the maximum
supported frame resolution. The reserved locations for potential frames are termed here as
slots in the frame buffer.
For example, if the frame buffer is synthesized with a maximum length of 6 and a
maximum resolution of 1280x720 luma samples (720p HD), then six slots will exist at
all times within the buffer, each with a reserved memory capacity of 3600 macroblocks.
This is regardless of run-time values, such as the number of reference frames specified
by the most recently decoded video sequence header. The frame buffer always operates
internally with its maximum number of slots, regardless of the needs of the current video
sequence. Also, the slots do not adjust in size with the current frame resolution, but remain
at maximum capacity, even if that capacity is not fully utilized. It does not affect the
H.264/AVC decoding process to have a buffer that is longer than necessary, or slots that are
larger than necessary; errors would only occur if the buffer were too short, or if a slot
were too small.
Each 16x16 macroblock within the frame buffer is identified internally by two values:
the individual macroblock number, and the containing slot identification (ID). The macroblock
number is in the range of 0 to the total number of frame macroblocks minus one
at the current frame resolution. For the example frame resolution of 720p HD (1280x720),
the macroblock number is within the inclusive range of: [0, 3599]. Some example ranges
are shown with Table 5.3.












Arrangement of macroblocks within the buffer is on a slot basis. Each slot contains, in
sequential order, the macroblocks belonging to a single frame. A slot never contains data
from multiple frames, even if the slot's capacity is large enough to do so. The number of
slots within the frame buffer was created as a synthesis parameter viz. it is adjustable
pre-synthesis, but hard-coded within the final netlist. The slot capacity is uniform for all slots.
The slots themselves are arranged sequentially in memory according to their ID, and the
possible range of slot IDs is 0 to the frame length of the buffer minus one. Some plausible
frame buffer lengths and respective slot ID ranges are shown for a Baseline decoder with
Table 5.4.
Table 5.4: Example (arbitrary) ranges of slot IDs.







5.2.2 Macroblock address mapping with FRExt.
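The starting memory address for an individual 16x16 macroblock is computed from: the
macroblock size $M_{bits}$ of equation (2.4), the maximum slot capacity in terms of
macroblocks $M_{max}$, the macroblock number $M_{num}$, the slot ID $S_{id}$, and the data
word width $W$. First, the binary size of a macroblock is:
$$ M_{bits} = 16^2 \cdot (L_{bits} \cdot F_{ss}) \tag{5.5} $$
Second, the binary size of a slot is:
$$ S_{bits} = M_{max} \cdot M_{bits} = 16^2 \cdot M_{max} \cdot L_{bits} \cdot F_{ss} \tag{5.6} $$
Third, the starting address for an arbitrary 16x16 macroblock identified by $M_{num}$ and
$S_{id}$ is:
$$ \begin{aligned} addr &= (S_{id} \cdot S_{bits} + M_{num} \cdot M_{bits}) / W \\ &= (16^2 \cdot S_{id} \cdot M_{max} \cdot L_{bits} \cdot F_{ss} + 16^2 \cdot M_{num} \cdot L_{bits} \cdot F_{ss}) / 64 \\ &= (2L_{bits} \cdot 2F_{ss})(S_{id} \cdot M_{max} + M_{num}) \end{aligned} \tag{5.7} $$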
As shown with equation (5.7), the starting address of a macroblock is the product of the
number of unique addresses per macroblock and a respective number of macroblocks by
which to offset from the first memory address of zero. The value $M_{max}$ is by a significant
margin the largest value in the expression; and the product of the four integers
$2L_{bits} \cdot 2F_{ss} \cdot S_{id} \cdot M_{max}$ likely incurs resource or performance
penalties during synthesis.
For implementation, equation (5.7) is re-written as a pair of unsigned integer expressions
that contain as many powers of two as possible. $U_{ss}$ is the number of unique addresses
used by a 16x8 sub-macroblock for the current sub-sampling. $A_1$ is the starting address for
the first half of the current 16x16 macroblock; $A_2$ is the starting address for the second half
of the current 16x16 macroblock. $S_{cap}$ is a synthesis parameter that specifies the fixed slot
capacity in terms of unique addresses.
$$ U_{ss} = \begin{cases} 2L_{bits} & \text{when } 2F_{ss} = 2 \\ 2L_{bits} + L_{bits} & \text{when } 2F_{ss} = 3 \\ 4L_{bits} & \text{when } 2F_{ss} = 4 \\ 4L_{bits} + 2L_{bits} & \text{when } 2F_{ss} = 6 \\ \text{undefined} & \text{otherwise} \end{cases} \tag{5.8} $$
$$ \begin{aligned} A_1 &= (S_{id} \cdot S_{cap}) + (M_{num} \cdot 2U_{ss}) \\ A_2 &= (S_{id} \cdot S_{cap}) + (M_{num} \cdot 2U_{ss}) + U_{ss} \end{aligned} \tag{5.9} $$
If the entire memory address space were to be used for slots, then the slot capacity could be
computed from the number of address bits and number of slots:
$S_{cap} = 2^{A_{bits}} \div S_{max}$. Were some of the memory to be reserved for other uses,
then the slot capacity could be computed as $S_{cap} = (2^{A_{bits}} - R_{addr}) \div S_{max}$,
where $R_{addr}$ is the number of addresses reserved for other uses.
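
A software sketch of equations (5.8) and (5.9) may help when checking the address arithmetic. It is written with shifts and adds to mirror the powers-of-two intent, but it is only a model: the function names and the example slot capacity are assumptions, not values from the design.

```python
def unique_addresses(l_bits, two_f_ss):
    """U_ss of equation (5.8): addresses per 16x8 sub-macroblock."""
    if two_f_ss == 2:
        return l_bits << 1
    if two_f_ss == 3:
        return (l_bits << 1) + l_bits
    if two_f_ss == 4:
        return l_bits << 2
    if two_f_ss == 6:
        return (l_bits << 2) + (l_bits << 1)
    raise ValueError("undefined sub-sampling factor")


def start_addresses(s_id, m_num, u_ss, s_cap):
    """A_1 and A_2 of equation (5.9) for one 16x16 macroblock."""
    a1 = s_id * s_cap + m_num * (u_ss << 1)
    return a1, a1 + u_ss


def slot_capacity(addr_bits, s_max, reserved=0):
    """S_cap when the address space, less any reserved tail, is split evenly."""
    return ((1 << addr_bits) - reserved) // s_max


u_ss = unique_addresses(8, 3)          # 8-bit samples at 4:2:0
s_cap = 3600 * 2 * u_ss                # hypothetical: a slot sized for exactly 720p
print(start_addresses(1, 0, u_ss, s_cap), slot_capacity(23, 8))
```

With the slot capacity set to exactly $M_{max} \cdot 2U_{ss}$, the first address of slot 1 equals the value given by equation (5.7), which is a quick consistency check between the two forms.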
As shown, the equations (5.8) and (5.9) provide the necessary start and count values
for 16x16 macroblock operations, and also for the atomic operation size of 16x8
sub-macroblock. These values are sent to the block controller in the combinations detailed
with Table 5.5.
Table 5.5: Addressing combinations loaded by block controller.
Operation Start Address Address Count
16x16 macroblock








5.2.3 Frame storage marking.
To categorize frames, two booleans of metadata exist within the frame buffer logic for each
slot to represent the H.264/AVC storage marking. One boolean is named short. The other
is named long. Together the two booleans determine the H.264/AVC storage marking of a
slot, and the interpretation of these is shown with Table 5.6.
Table 5.6: Interpretation of frame marking booleans.
long    short   interpretation
true    false   Slot contains frame that is long-term reference.
true    true    Slot contains frame that is long-term reference.
false   true    Slot contains frame that is short-term reference.
false   false   Slot contains frame that is unused for reference.
Marking of a slot is permitted to occur at any time, and may be any of: unused, short-term,
long-term. A requested marking change is applied immediately to the slot metadata
associated with the current slot ID. Its effect is immediately noticed by the sliding window
algorithm.
The arbitrary marking of each slot during any macroblock operation is permitted for the
sake of flexibility when interfacing with a system-level H.264/AVC decoder controller.
However, H.264/AVC does specify that a short-term frame should not be later marked as
unused; and that a long-term frame should only later be marked as unused when an explicit
memory command is decoded from the bitstream. Compliance with these H.264/AVC
details is left to the discretion of the system-level control.
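
The interpretation of the two marking booleans in Table 5.6 reduces to a priority check in which long takes precedence over short. A minimal Python sketch, using the hypothetical function name `interpret_marking`:

```python
def interpret_marking(long, short):
    """Map the two per-slot booleans to a storage marking, as in Table 5.6."""
    if long:
        # long dominates: (true, false) and (true, true) are both long-term.
        return "long-term reference"
    if short:
        return "short-term reference"
    return "unused for reference"


for long in (True, False):
    for short in (False, True):
        print(long, short, interpret_marking(long, short))
```

Because long dominates, promoting a short-term frame only requires setting one boolean; the short boolean does not have to be cleared at the same time.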
5.2.4 Sliding window implementation.
The sliding window algorithm determines which slot contains a specific previously stored
frame, and also which slot should be overwritten when next storing a new frame. The data
of each frame is identified according to its original PicNum from the H.264/AVC slice
header; and this information is recorded within a lookup table internal to the buffer logic,
as depicted with Table 5.7.
Table 5.7: Example contents of frame buffer metadata.
(a) Marking Table















In this example, slot 4 contains a long-term frame; and as such, the table entry for
slot 4 will never be overwritten with a different PicNum value, until this slot's marking is
explicitly cleared. Slot 3 is inferred to not be used for reference at all, and thus will be the
first slot to be overwritten by a new frame. Were slot 3 marked for short-term, then the
next slot to be overwritten would be slot 2, as it holds the lowest PicNum not marked as
long-term.
To automatically discard the oldest short-term frame, and only as necessary, a linear
search algorithm is executed at the start of each macroblock operation. The algorithm
loops one time over the lookup table and markings to compute in parallel three independent
values:
• The sole slot ID that contains the same PicNum value as requested by the current
  macroblock operation.
• Any one slot ID that is unmarked.
• The one slot ID that is not marked as long-term, and also has the oldest PicNum of
  all other slots not marked as long-term.
If a slot matching the requested PicNum is found, then that slot is selected. Otherwise,
if an unmarked slot exists, then that slot is selected. Otherwise, the oldest non-long-term
slot is selected. The only possibility of all three searches failing would be if all slots within
the frame buffer were marked as long-term; and since this possibility is prevented by the
H.264/AVC standard, the condition is not explicitly handled. Additionally, the condition of
duplicate PicNum entries in the lookup table is ignored by the search algorithm as impossible.
The condition is guaranteed to be prevented by a post-reset routine that initializes the
lookup table with unique values.
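
The single-pass search and its three-way priority can be sketched in Python as follows. This is a sequential software model of what the hardware computes in parallel; the function name `select_slot`, the tuple layout, and the example PicNum values are hypothetical, and "oldest" is taken to mean the lowest PicNum, ignoring any wrap-around.

```python
def select_slot(slots, requested_pic_num):
    """Pick a slot ID from a list of (pic_num, long, short) entries.

    Priority: matching PicNum, then any unmarked slot, then the oldest
    slot that is not long-term. Returns None if all three searches fail.
    """
    match = unmarked = oldest = None
    for slot_id, (pic_num, long, short) in enumerate(slots):   # one pass only
        if pic_num == requested_pic_num:
            match = slot_id
        if not long and not short and unmarked is None:
            unmarked = slot_id
        if not long and (oldest is None or pic_num < slots[oldest][0]):
            oldest = slot_id
    for choice in (match, unmarked, oldest):
        if choice is not None:
            return choice
    return None


# Hypothetical contents: slot 3 unused, slot 4 long-term, slots 0-2 short-term.
table = [(254, False, True), (253, False, True), (252, False, True),
         (255, False, False), (251, True, False)]
print(select_slot(table, 253), select_slot(table, 256))
```

With these example values a request for an already stored PicNum returns its slot, while a new PicNum lands in the unmarked slot 3; if slot 3 were marked short-term instead, slot 2 would be chosen as the oldest non-long-term entry.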
5.3 Synthesis parameters.
The parameters detailed in Table 5.8 determine both the available features and buffer
performance of the final FPGA netlist. The default values are those determined to best suit
a general HD decoder application, but can be adjusted pre-synthesis. Chapter 7 discusses
some of the timing characteristics and trade-offs for changing these values.
Of particular note are the three Xilinx IP Core booleans. These booleans enable drop-in
modular netlists to optimize three key timing bottlenecks that the Xilinx XST synthesis
tool was unable to optimize itself. The parameter XILINX_DDR_CNT implements a
23-bit clock-enabled binary counter with value load. The counter is internal to the block
controller. The RTL description of this binary counter with value load would not synthesize
optimally without using an IP Core due to its width.
Table 5.8: Structural frame buffer synthesis parameters.
Description
Support multiple frame formats; otherwise 4:2:0, 8-bit only.
Provide the sliding window algorithm; else, S_id = F_num mod MAX_FRAMES.
Support using tail of memory space
Use Xilinx IP Core for: address counter; otherwise, RTL.
Use Xilinx IP Core for: address multiplication; otherwise, RTL.
Use Xilinx IP Core for:
Frame number F_num range is: [0, MAX_FRAME_NUM].
Slot ID S_id range is: [0, MAX_REF_FRAMES - 1].
How many of sub-sampling factor F_ss to support from set: {1, 1.5, 2, 3}.
Smallest sample size L_bits, in bits.
Largest sample size L_bits, in bits.
RAM_SIZE positive 64 Size of external RAM in MBytes.

The parameter XILINX_MAP_MULT implements two multipliers. Both are sequential
pipelined multipliers to permit their operation at high frequencies. The first is 4-bit by 22-bit
with 22-bit output, to multiply the slot ID $S_{id}$ and slot address capacity $S_{cap}$. It assumes
the frame buffer has no more than 16 slots total. The output value is the address offset at
which the current slot begins within the total RAM address space. The second multiplier
is 13-bit by 7-bit with 20-bit output; it multiplies the current macroblock number $M_{num}$ by
the number of unique addresses needed to store a 16x16 macroblock ($2U_{ss}$). The output
value is the address offset at which the current 16x16 macroblock begins within a slot. The
23-bit sum of the two multiplier outputs provides the starting address of the current 16x16
macroblock $A_1$ according to its macroblock number $M_{num}$ and slot ID $S_{id}$, as previously
demonstrated with equation (5.9). This final address value is conditionally offset by $U_{ss}$
for operations requiring the starting address of the 2nd 16x8 sub-macroblock, $A_2$.
The parameter XILINX_MAP_FD implements a 23-bit D flip-flop register to buffer the
23-bit starting address prior to load by the 23-bit binary counter. It forces the XST tool to
group the 23 bits together as a registered bus during synthesis, preventing timing hazards
between the address mapper and the block controller.
5.4 Frame buffer interface and pipelining.
The eventual integration of the frame buffer into a full decoder system is predicated on the
ability to pipeline the load and store operations with low latency and high throughput.
Some careful data flow decisions were made to optimize performance; and these details
affect how the frame buffer must be interfaced to other components.
5.4.1 Frame buffer RTL interface.
The inner and outer interfaces of the frame buffer are shown with Figure 5.7. The left side
of the figure depicts the x16 DDR SDRAM pin-out, as previously shown with Figure 5.5
and Table 5.1. It also shows the necessary FPGA pins for system clock and system reset.
The right side depicts the system interface of the frame buffer where it can be connected to
a full H.264/AVC decoder system. This interface would connect directly to the pipeline of
an H.264/AVC system, placed and routed internal to the FPGA.
Frame buffer ports: addressing.
Five input ports are used for the sole purpose of selecting a specific 16x16 macroblock or
16x8 sub-macroblock from within the frame buffer. The ports chroma and luma select
the sub-sampling ratio and sample bit size for the current video sequence, respectively.
Change of these values only makes sense to occur at the beginning of each independent
video sequence; and as such, change of these values incurs an implicit IDR that clears the









































































luma(2:0) : { 0b100, 0b010, 0b011, others.
The ports frame, mblock, subsel, and extra specify frame number, macroblock number,
sub-macroblock selection, and extra buffer space respectively. Change of their values
automatically triggers the frame buffer to begin a new store or load operation; and the input
values that were held constant up until the change are used for the current operation. The
new values presented to the inputs are used for the next operation, which begins by another
change of these inputs. The frame number input is valid for any value, and the macroblock
number is valid from 0 up to one less than the number of macroblocks permitted per slot.
If a number higher than this limit is specified, the resulting behavior is undefined. The
sub-macroblock input is used to select the operation size as one of: 16x16 macroblock, 1st 16x8
sub-macroblock, 2nd 16x8 sub-macroblock. The extra buffer space signal is used to toggle
between two buffering locations: the frame slots, and any extra address space that may exist
beyond the last slot, if the device were so configured pre-synthesis. This functionality
allows the end of the memory space to be used for buffering extra data that is independent of
the sliding window algorithm; and is intended to be used for row buffering of the unfiltered





1st 16x8, 2nd 16x8.
extra(0:0) : { 0b0, 0b1 } ↦ { frame slots, extra space }
Frame buffer ports: data operation.
Upon changing one or more of the addressing ports, a data operation is automatically set
into motion, and the output done falls from 0b1 to 0b0 to indicate an operation is underway.
The input write specifies whether the current operation is a store or load from memory, and
should be changed within one clock period before or after changing of the addressing
ports:
write : { 0b0, 0b1 } ↦ { load, store }
The system controller is expected to hold the write input at a value of 0b1 only while
an explicit store operation is being performed; and in that case, for the full length of the
operation. During a load operation, the input must be held at 0b0. The value of this input
is ignored during any idle period when no operation is being performed.
A store operation is performed against the data clock data_ck by presenting a new 32-
bit data word on every rising edge
that the increment signal incr is high. After changing
one or more address inputs, the pseudo code for performing a store operation is as shown
59
with Listing 5.1.

















drive ( data_i )
end if
end loop
Similarly, a load operation is performed against the data clock data_ck by latching a
new 32-bit data word on every falling edge that the increment signal incr is high. The
pseudo code for performing a load operation with worst-case hand-shaking is shown with
Listing 5.2.
Listing 5.2: RTL pseudo code for worst-case load operation.
write <= '0'













capture ( data_o )
end if
end loop
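The common rule behind both handshakes, that a word moves only on a clock edge where incr is high, can be illustrated with a small behavioural sketch in Python. This is not the thesis's RTL: the function name `transfer` and the idea of feeding it a pre-recorded list of incr samples are assumptions made purely for illustration.

```python
def transfer(incr_trace, words):
    """Return (edge index, word) pairs for every clock edge on which a word moves.

    incr_trace: hypothetical samples of `incr`, one per active edge of data_ck
                (rising edges for a store, falling edges for a load).
    words:      the 32-bit words driven to data_i, or arriving on data_o.
    """
    source = iter(words)
    moved = []
    for edge, incr in enumerate(incr_trace):
        if incr:  # the data port is only used while incr is high
            moved.append((edge, next(source)))
    return moved


# incr drops low for two edges, as it might while the memory is busy.
trace = [1, 1, 0, 0, 1, 1]
print(transfer(trace, [0xA0, 0xA1, 0xA2, 0xA3]))
```

The word order is preserved regardless of where the stall falls, which is why a single data path needs no further bookkeeping beyond watching incr and done.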
For both operations, the port done indicates whether a macroblock operation has
completed; and the port incr indicates whether the current clock edge should be used to
read or write data on the respective data port, data_i or data_o. The data input and output
ports may be used simultaneously on the boundary where a store follows a load, because the
incr signal of the load may remain asserted for zero to two cycles past the rise of done. This
approach increases the efficiency, but is optional due to its complex nature. An example
pseudo code for the best-case hand-shaking of a load is shown with Listing 5.3. When the
first parallel process finishes, the next macroblock addressing operation may begin, and the
second parallel process captures the remaining few words off the pipeline.
































wait until falling_edge ( data_ck )




During a store operation, the data output port data_o is undefined, and during a load
operation, the data input port is ignored. Were these data ports to be synthesized as an
external bus, it would be practical to combine them into a 32-bit bidirectional port. (This
would require use of worst-case hand-shaking for the load operation.) However, since the design
intention is for the system interface to be later integrated internal to a full decoder system,
the data input and data output ports were kept independent.
Frame buffer ports: initialization.
After reset, the frame buffer requires approximately 350 µs to initialize the DDR SDRAM
memory module, and also to calibrate the Digital Clock Management (DCM). The port
done, in addition to its normal functionality, doubles as a ready indicator post-reset. While
the FPGA is held in reset, the port stays low at 0b0. It does not rise until the FPGA has both
come out of reset and completed initialization, which occurs approximately 350 µs after
the FPGA has come out of reset. Once it rises to 0b1, the frame buffer is ready to perform
operations.
Frame buffer ports: storage marking.
One signal remains that is used explicitly for marking the storage status of slots within
the frame buffer: marking. This port is continuously monitored for input of a specific
pulse pattern. While the port is held low at 0b0, no marking operation occurs; however,
holding the signal high at 0b1 for a particular number of clock cycles will cause the
slot of the current operation to be marked according to Table 5.9.
Table 5.9: Digital patterns for marking slots.
Marking Pattern   Scope          Interpretation
0b00000           n/a            Does nothing.
0b00010           current slot   Clear markings to indicate unused for reference.
0b00110           current slot   Mark as used for short-term reference.
0b01110           current slot   Mark as used for long-term reference.
0b11110           all slots      Clear markings to indicate unused for reference.
As shown, holding the signal steady at 0b0 does not incur any change in the slot markings.
Providing a pulse width of one, two, or three clock cycles will cause the currently operating
slot to be marked as an unused, short-term, or long-term reference respectively. The currently
operating slot is determined from the frame input of the previous operation, and not the
currently input frame signal, as per the pipeline semantics discussed in Section 5.4.2.
Providing a pulse of four clock periods, or just holding the signal high at 0b1, triggers all
slots to be marked as unused for reference. This operation implements the H.264/AVC
Instantaneous Decoding Refresh (IDR) to reset the frame buffer to an empty state.
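The pulse-width encoding can be sketched as a small decoder in Python. This is only a behavioural illustration: the function name `decode_marking`, the action labels, and the choice to act on a one to three cycle pulse when the signal falls (and on the all-slots clear as soon as the fourth high cycle is seen) are assumptions, not details taken from the design.

```python
def decode_marking(samples):
    """Translate a sampled `marking` waveform into a list of marking actions.

    samples: one 0/1 value per clock cycle (hypothetical trace).
    A pulse of 1, 2, or 3 cycles marks the current slot; a pulse that reaches
    4 cycles, or a signal simply held high, clears all slots once.
    """
    actions = []
    width = 0
    for bit in samples:
        if bit:
            width += 1
            if width == 4:          # fourth high cycle: IDR-style clear
                actions.append("clear_all")
        else:
            if width == 1:
                actions.append("unused")
            elif width == 2:
                actions.append("short_term")
            elif width == 3:
                actions.append("long_term")
            width = 0               # a low cycle ends the pulse
    return actions


print(decode_marking([0, 0, 1, 1, 0]))
```

Holding the signal high across an entire data operation therefore yields exactly one clear of all slots in this model, which matches the convenient IDR usage described in Section 5.4.2.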
5.4.2 Pipeline semantics.
The purpose for the five addressing inputs being used for the next operation, rather than
the current one, is twofold. First, the high-level controller of the full decoder system should
have the addressing values ready at least one macroblock operation in advance, since the frame
buffering occurs toward the end of the system pipeline; and so the frame buffer makes
use of this early information availability. Second, the addressing computations and sliding
window algorithm perform large multiplications and a linear search routine respectively.
This processing requires a significant number of pipelined clock cycles to operate at high
frequencies. To minimize delays between adjacent frame buffer operations, the addressing
and windowing computations for the next operation run in parallel with the current store or
load data operation.
The ports write and marking, however, do affect the current operation, and thus are
staged one operation later than the five addressing inputs. With this control flow, the
simplest way to perform an Instantaneous Decoding Refresh (IDR) marking all slots as unused
for reference is to hold the marking signal high at 0b1 for the full data operation prior
to the store of the first macroblock of an IDR-triggering frame. (Technically, though, only
4 clock periods of holding the signal are necessary, and an IDR can be performed
regardless of the currently selected slot.)
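The one-operation lead of the addressing inputs can be made concrete with a short Python sketch. It is a simplified model under stated assumptions: the function name `operations` is hypothetical, each address is reduced to a (frame, mblock) tuple, and initialization after reset is ignored.

```python
def operations(presented):
    """List the addresses actually operated on, given the values presented over time.

    presented: sequence of address tuples as seen on the inputs, one per
               observation (hypothetical sampling of the addressing ports).
    A change of the inputs starts an operation that uses the values held
    *before* the change; the new values wait for the next change.
    """
    executed = []
    held = presented[0]
    for new in presented[1:]:
        if new != held:
            executed.append(held)   # the change triggers work on the old values
            held = new
    return executed


# Three addresses are presented, but only the first two have been operated on;
# (0, 2) is executed once the inputs change again.
print(operations([(0, 0), (0, 1), (0, 1), (0, 2)]))
```

In this model the write and marking inputs would be sampled together with each entry of the returned list, since they apply to the operation in progress rather than to the one being set up.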
5.5 Dual RAM frame buffer.
To further improve the overall frame buffer bandwidth, a frame buffer with two DDR
SDRAM interfaces was designed to target the same Xilinx Spartan 3E family as the design
with a single RAM. By using two DDR interfaces on the FPGA, the memory bandwidth
can theoretically be doubled, depending on the usage pattern.
There are several possible approaches to using two independent memories, each with its
own advantages and disadvantages.
• Logically append the two RAMs as a single memory from the perspective of the
addressing mapper. This could provide a very minor performance improvement due to
independent auto refresh and bank charging, and also decrease the material cost of
the DDR SDRAM by using two less dense chips instead of one dense chip. However,
the extra I/O pins required for two DDR interfaces would significantly increase the
total power consumption and FPGA package size. Also, it would be likely for one
RAM to idle in a power-down state whenever the other was operating, causing an
inherently inefficient use of the FPGA operating frequency.
• Independently use one RAM for row buffering and one RAM for frame buffering.
This would increase the overall bandwidth, but would not be guaranteed to evenly
distribute the performance across both memories. For example, at any moment the
row buffering could be operating at maximum throughput and the frame buffering
idling, or the opposite. This approach would make it simpler to keep the frame and
row buffering independent, and would also increase the performance. However,
the maximum operating frequency would still be much more than 50% of the
single-RAM frequency due to the inherent inefficiencies; and this could negatively impact
power consumption and device speed requirements.
• Stripe the data across both RAMs at a sub-macroblock level. By using one RAM for
half of a macroblock, and the other RAM for the other half, the throughput can be
nearly doubled with very little inherent memory inefficiency. Whether performing
row buffering or frame buffering would not change the efficiency of the memory
throughput. This could permit a lower operating frequency, by almost 50%, for both
the entire device and the memories, providing significant power savings for the FPGA
without requiring very fast memories. It also permits both RAMs to be of
equivalent size, and their capacity fully utilized. The only inherent complexity is
the data interface between the frame buffer and the full decoder system. The data
striping would cause the two sub-macroblocks to read or write into different internal
pipelines, with two independent hand-shaking controls.
The method selected for using two RAMs was to stripe the data across both RAMs,
with each RAM having its own independent DDR controller and block controller. The
addressing mapper would then use the same address and command inputs for both controllers.
However, each DDR memory would have its own data path.
5.5.1 Dual DDR SDRAM design.
The inner and outer interfaces of the dual memory frame buffer design are shown with
Figure 5.8. The left side of the figure depicts the x16 DDR SDRAM pin-out for two
memories; exactly double the interface described previously with Figure 5.5 and Table 5.1. It
also shows the necessary FPGA pins for system clock and system reset. The right side de



































































Figure 5.8: Dual RAM frame buffer inner and outer interfaces.
This design simply reuses the same internal components of the single memory design
to stripe data across two memories for improved performance. The Xilinx DDR controller
plus the custom block controller (previously shown with Figure 5.2 and Figure 5.3) are
instantiated twice, whereas the addressing mapper is still only instantiated once. The
physical FPGA now has a x16 DDR RAM interface on both the left and right sides of the device
package, instead of just the left side. For sake of convention, the left interface is numbered
as 0, and the right as 1.
The system interface differs from the single RAM design only by having two data
paths with independent handshaking. The 32-bit data input data_i is replaced with two
32-bit inputs: data0_i and data1_i. The 32-bit data output data_o is replaced with two
32-bit outputs: data0_o and data1_o. Both paths share the same clock and are controlled in
parallel by the same addressing inputs. However, the increment signal incr is replaced with
two increment signals: incr0 and incr1. In addition, selection of 16x8 sub-macroblocks
with subsel will provide an extra 32 bits of data for those pixel formats that do not divide
evenly (small chroma, odd bit sizes).
Dual DDR SDRAM data striping.
The relative timing of the increment signals between the two independent data paths is not
guaranteed to synchronize in any manner. Each operation does perform the same number
of increments on each path, but delays such as auto refresh may occur on one DDR path
and not the other at any given time.
To maintain data integrity, the data must be carefully organized at this dual interface.
If the 32-bit words of a 16x16 macroblock were simply written with an indexing value on
each increment signal, there would be no guarantee that the data would maintain identical
order for each operation. As an example, a simple but unreliable use of this interface for
performing a striped store operation is depicted with the pseudo-code of Listing 5.4. In
this example, the next available 32-bit data word is written on every clock edge that one
of the two paths is ready to receive a new word. However, this usage would require that
both paths are always synchronized in their increment signals; which is not the case for this
design.

drive ( data0_i , nextvalue )
end if

drive ( data1_i , nextvalue )
end if
end loop
A reliable stripe operation can only be performed when the two paths are operating on
two independent sets of data organized in the system data path prior to interfacing with
the frame buffer. The intended usage is that the 1st 16x8 sub-macroblock always be read
from or written to data path 0, and the 2nd 16x8 sub-macroblock on data path 1. Using this
scheme, the data integrity is maintained. As an example, a store operation is performed
against the data clock data_ck by presenting a new 32-bit data word to data0_i on every
rising edge that the increment signal incr0 is high; and also presenting a new 32-bit data
word to data1_i on every rising edge that the increment signal incr1 is high. After changing
one or more address inputs, the pseudo code for performing a striped store operation is as
shown with Listing 5.5.














drive ( data0_i , nextvalue_of_block1 )
end if
if incr1 = '1' then
drive ( data1_i , nextvalue_of_block2 )
end if
end loop
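The difference between the unreliable and the reliable use of the dual interface can be reproduced with a small Python model. It is a sketch under stated assumptions: each RAM is reduced to a list that keeps words in arrival order, the traces of (incr0, incr1) samples are invented for illustration, and the function names are hypothetical.

```python
def run_paths(trace, source0, source1):
    """Move one word per path on every edge where that path's incr is high."""
    it0, it1 = iter(source0), iter(source1)
    moved0, moved1 = [], []
    for incr0, incr1 in trace:
        if incr0:
            moved0.append(next(it0))
        if incr1:
            moved1.append(next(it1))
    return moved0, moved1


def naive_round_trip(store_trace, load_trace, words):
    """Unreliable use: one shared word stream feeds whichever path is ready."""
    shared = iter(words)
    ram0, ram1 = [], []
    for incr0, incr1 in store_trace:
        if incr0:
            ram0.append(next(shared))
        if incr1:
            ram1.append(next(shared))
    it0, it1 = iter(ram0), iter(ram1)
    out = []
    for incr0, incr1 in load_trace:
        if incr0:
            out.append(next(it0))
        if incr1:
            out.append(next(it1))
    return out


def striped_round_trip(store_trace, load_trace, block1, block2):
    """Reliable use: each path carries its own independent block of words."""
    ram0, ram1 = run_paths(store_trace, block1, block2)
    return run_paths(load_trace, ram0, ram1)


words = list(range(8))
store_trace = [(1, 1)] * 4                                      # paths in step
load_trace = [(1, 0), (1, 0), (1, 1), (1, 1), (0, 1), (0, 1)]   # path 1 delayed
print(naive_round_trip(store_trace, load_trace, words))
print(striped_round_trip(store_trace, load_trace, words[:4], words[4:]))
```

Both traces perform four increments per path, yet the shared-stream version returns the words in a different order once path 1 is delayed during the load, while the per-path version returns both blocks intact.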
Similarly, a striped load operation is performed against the data clock data_ck by
latching a new 32-bit data word from data0_o for the 1st sub-macroblock on every falling edge
that the increment signal incr0 is high; and also latching a new 32-bit word from data1_o
for the 2nd sub-macroblock on every falling edge that the increment signal incr1 is high.
Example pseudo code for performing a striped load operation is shown with Listing 5.6.
Listing 5.6: RTL pseudo code for striped load operation.
write <= '0'





wait until falling_edge ( data_ck )
if incr0 = '1' then





capture ( data1_o , nextvalue_of_block2 )
end if
end loop
As a side effect of this implementation, and the fact that sub-macroblock selection is
not boundary-aligned for some frame formats, the striping could be performed differently
than by splitting the full 16x16 macroblock into top and bottom halves. Other valid
combinations are possible, and could improve timing when interfacing to the system data path.
A few of these potential combinations are:
• Path 0 is the 1st 16x8 sub-macroblock. Path 1 is the 2nd 16x8 sub-macroblock.
• Path 0 is the 1st 8x16 sub-macroblock. Path 1 is the 2nd 8x16 sub-macroblock.
• Path 0 is the evenly indexed 32-bit words of the 16x16 macroblock. Path 1 is the
oddly indexed 32-bit words of the 16x16 macroblock.
In any case, the organization scheme has no bearing on the internal frame buffer
performance, as the buffer never interprets the data words themselves.
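The three organizations listed above can be written as simple splitting functions. The Python sketch below assumes, for illustration only, an 8-bit luma-only 16x16 macroblock packed row-major into 64 32-bit words (four words per row); the function names and this packing are assumptions, not the thesis's data path.

```python
ROWS = 16
WORDS_PER_ROW = 4  # assumption: 16 eight-bit samples per row, 4 samples per word


def split_top_bottom(words):
    """Path 0 gets the 1st 16x8 sub-macroblock, path 1 the 2nd."""
    half = len(words) // 2
    return words[:half], words[half:]


def split_left_right(words):
    """Path 0 gets the 1st 8x16 sub-macroblock, path 1 the 2nd."""
    left, right = [], []
    for row in range(ROWS):
        line = words[row * WORDS_PER_ROW:(row + 1) * WORDS_PER_ROW]
        left += line[:WORDS_PER_ROW // 2]
        right += line[WORDS_PER_ROW // 2:]
    return left, right


def split_even_odd(words):
    """Path 0 gets the evenly indexed words, path 1 the oddly indexed words."""
    return words[0::2], words[1::2]


macroblock = list(range(ROWS * WORDS_PER_ROW))
for split in (split_top_bottom, split_left_right, split_even_odd):
    path0, path1 = split(macroblock)
    print(split.__name__, len(path0), len(path1))
```

Each split hands the same number of words to both paths, which is all the frame buffer requires; the system data path only has to apply the inverse of whichever split it chose when loading.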
Chapter 6
Verification - HDL Model Functionality
This chapter details how the decoder testbench was partially redesigned for an improved
verification methodology and for support of the FRExt amendment. It also details how the
frame buffer designs were tested for proper operation.
6.1 Unit testing.
To facilitate incremental implementation and verification of each feature of the synthesizable
frame buffer model, a stand-alone testbench was created. This testbench performs a sequence
of five unit tests, each specifically targeting several edge conditions of the single RAM
design. Each test may be executed any number of times; and the ordering between tests
is configurable to many permutations. The primary purpose of the unit testbench is to
quickly simulate many significant edge conditions to check for obvious breakage after each
modification of the frame buffer model during development.
For simplicity, no real video data is used. Instead, the input data is generated as a long
repeating sequence of counting values for easy visual identification. The starting value is
0xffff0001; and succeeding values decrement the first 16 bits and increment the latter
16 bits. Two FIFOs record every 32-bit word that is driven to the frame buffer
inputs or captured from its data output. Each test analyzes this recorded trace of the data
buses to determine pass or fail.
The five unit tests are as follows:
1. WRITE_READ_SHORT. This test stores a few macroblocks to a few frames. It
then reads them back in the same order, and compares the values to check if they are
identical.
2. WRITE_READ_LONG. This test stores a long sequence of macroblocks to a few
frames. It then reads them back in the same order and compares the values to check
if they are identical. It repeats this multiple times.
3. MAPPING_SHORT. This test writes a few macroblocks per frame for a sequence
of frames that is 3 times the capacity of the frame buffer. It marks some frames as
short-term and some as unused for reference. Values are read back from old frames,
some of which are on the edge of being discarded by the sliding window. Values are
then compared to determine whether a frame was discarded prematurely.
4. BACK_AND_FORTH. This test emulates the inter prediction behavior exhibited by
a decoder system with one reference frame capacity. It first writes an initial frame
0. Then it stores one macroblock of frame 1, loads one macroblock of frame 0, and
repeats. The next frame 2 does the same, storing one current macroblock and
loading a macroblock of the previous frame. This repeats for a long sequence, and
tests the buffer's ability to switch back and forth between different frame IDs and
different store/load commands.
5. WRITE_READ_EXTRA. This test emulates the switching between intra prediction
and deblocking filter behavior. It assumes the frame buffer supports the extra signal
to indicate switching between reference frames and an unfiltered frame row stored
as "extra" data at the tail of the buffer. This repeats for a sequence that verifies the
back-and-forth switching between accessing reference frame data and the "extra"
macroblock buffer, and tests that their values and addresses do not overlap. It also




A preexisting H.264/AVC decoder system was obtained from [21]. The system consisted
of:
• Two pre-processing and post-processing executables that convert Y'CbCr data
between binary and ASCII text form. These executables are written in the C language.
• A purely behavioral VHDL description of the early H.26L draft of the H.264/AVC
version 1 standard. Approximately 90% of the baseline standard is implemented and
tested with processing of a few frames.
• Synthesizable VHDL models of the transformation data path and the deblocking filter
of the full decoder system.
6.2.1 Augmentation of the decoder system.
While much of the functionality of the H.264/AVC Baseline profile was implemented by
[21], the system was only capable of processing several QCIF frames within any reasonable
length of time or quantity of workstation RAM. This also implied that the system
was not verified for proper decoding beyond a sequence of two frames. Due to the
performance necessary for verifying the frame buffer with a significantly longer and higher
resolution sequence of frames (an estimated minimum of a 30 frame sequence with 5 reference
frames), multiple modifications were made to the testbench for the sole purpose of
increasing performance by several orders of magnitude. The changes made to the original
testbench are itemized with Table 6.1. The first grouping of modifications reduced the
statically allocated memory and CPU requirements. The latter grouping of modifications
allows the testbench to simulate an indefinite sequence of frames without consuming more
memory than necessary.
Table 6.1: Performance modifications to the original decoder model.
Iterative modifications.
CPU & Static Memory.
1. Removed all non-behavioral descriptions.
2. Flattened the testbench hierarchy to one testbench.
3. Unwrapped top-level procedure calls to be a single call, with only variables.
4. Removed unnecessary data structures (while retaining debugging structures).
5. Moved all signal-based data structures to shared memory.
6. Moved all temporary procedure data structures to shared memory.
Dynamic Memory.
7. Adapted NAL (bit stream) parsing to be incremental.
8. Adapted allocation of memory for slice/frame data to be incremental.
9. Adapted write-out and deallocation of slice/frame data to be incremental.
To verify that the performance modifications did not incur decoding errors, the original
1-frame and 2-frame test sequences from [21] were processed by the testbench prior to
modification, and the results saved. Then, iteratively between each performance
modification, the original 1-frame and 2-frame results were compared byte-wise with the outputs
of the modified testbench. Completely identical results were obtained each time. The
improvements in simulation performance are itemized with Table 6.2.
Table 6.2: Performance enhancements to the original decoder model.
Estimated simulation capability with ModelSim on Linux.
(Intel T7200: 2.0 GHz, 4 MB L2 cache, 2 GB RAM)





                          4 frames     19 frames     N frames
                          1 frames     5 frames      5 frames
Total Macroblocks (MBs)   396          43757         N * 2303
Simulation RAM            1069 MB      820 MB        570 MB
Simulation Duration       60 minutes   130 minutes   N * 2 minutes
Average Time / 100 MBs    15 minutes   18 seconds    5 seconds
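As a consistency check on the last row, the average time per 100 macroblocks follows directly from the total duration and the macroblock count, taking the durations as 60 minutes, 130 minutes, and N * 2 minutes. The short Python calculation below is only a worked example; the helper name is hypothetical.

```python
def seconds_per_100_mbs(total_minutes, total_macroblocks):
    """Average simulation time, in seconds, spent per 100 macroblocks."""
    return total_minutes * 60 * 100 / total_macroblocks


first = seconds_per_100_mbs(60, 396)       # first column
second = seconds_per_100_mbs(130, 43757)   # second column
third = seconds_per_100_mbs(2, 2303)       # third column; N cancels out

print(round(first / 60, 1), "minutes")
print(round(second, 1), "seconds")
print(round(third, 1), "seconds")
```

The three results round to 15 minutes, 18 seconds, and 5 seconds, in agreement with the tabulated averages.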
Additional enhancements that could have been implemented, but were not, include:
incremental reading of the NAL bitstream from file, and dynamic allocation/deallocation
of non-frame, non-slice data. However, the memory savings at even extremely large frame
sizes are estimated to be less than 100 MB.
Baseline profile corrections.
The Baseline model was found to have increasing discrepancy with the reference decoder
when processing QCIF sequences of more than two frames. A significant number of minor
algorithm modifications were necessary to correct stream parsing and inter prediction, and
are itemized with Table 6.3.
Table 6.3: Functional corrections to the original decoder model.
Iterative modifications.
1. Parsing of slice header fields with respect to reference lists.
2. Parsing of CAVLC 'te' golomb.
3. Short-term marking of reference frames.
4. Wrapping frame list creation.
5. Sliding window discard of old reference frames.
While implementing each correction, the output decoded video stream was observed to
contain fewer visual artifacts. In addition, to locate these inherent bugs
it was necessary to implement stream parsing "trace" logging procedures and debugging
procedures. The trace procedures essentially dump the internal data structures for
comparison with the reference software's "trace" function, so that the bitstream
parsing of the model can be compared with that of the reference software.
The only remaining issue with the decoder system appears to be incorrect computation
of the inter prediction values, in a way that compounds between P frames very quickly.
Tracing the H.264/AVC bitstream and the reference picture lists L0 and L1, and comparing the
computations with [9], revealed that the inter prediction is correctly accessing the reference
samples to perform prediction. So this outstanding issue of partially incorrect computation
of P slices does not appear to affect the memory management, though it does degrade
the visual quality. The issue was not resolved for its




Some significant changes were made to the original decoder testbench [21] to support
in-system verification of the synthesizable frame buffer. The original system used simple
comparison of results between a behavioral model and a synthesizable model of each Unit
Under Test (UUT), as depicted with Figure 6.1. As an example, the Deblocking Filter used
this form of integration within the decoder system. Both models are implemented with
similar clocking behavior; the same inputs are fed into both models; and their outputs are
compared for identical value after the processing of each macroblock. Discrepancies are
logged for later analysis. The behavioral model uses primarily VHDL procedures wrapped
within an active clocked process; the synthesizable model uses pipelines to process input
data to output data with passive synchronous processes.










Figure 6.1: Testbench flow with emphasis on data processing.
While adequate for in-system testing of data processing components, such as the
Deblocking Filter and Transform Unit, the immediate comparison of output based directly
upon input was found to be restrictive for testing of a storage-based UUT. External
memory storage lends itself to a series of operations that semi-arbitrarily switch between data
input and data output across a data bus; and this behavior may also vary substantially
between implementations.
To keep system integration of the frame buffer simple and simulation efficient, but
increase the testing flexibility, a new testbench structure was designed. Similar to the other
test benches, the newer structure still compares data output based upon data input.
However, the testbench uses an operation queue to exercise the synthesizable model, and places
the data comparison functionality directly inside of the behavioral model, as depicted with
Figure 6.2.








Figure 6.2: Testbench flow with emphasis on storage operations.
The non-queued, zero-time operation of the behavioral model in combination with a
queue-driven interface to the synchronous synthesizable model provides three additional
system-level testing features:
• Operations fed to the frame buffer may be filtered and/or reordered to more closely
resemble real on-chip system behavior.
• The behavioral model may perform its work in zero simulation time as a purely
software model, without any clocking. This permits a single behavioral model to
compare itself against different synthesizable models of different behaviors.
• The synthesizable model may be fed inputs that are a mixture of the current operation
and the next operation, depending on the synchronous pipeline input requirements of
the synthesizable model. For example, the addressing information could be fed to
the model one operation preceding or lagging behind the data.
The decision to implement the operation queue within the testbench should not be
misunderstood as though the frame buffer were implemented without its own internal pipeline.
In actuality, there is a queue within the synthesizable frame buffer model. This internal
queue is designed with the intention of integrating the frame buffer model into a full
synchronous FPGA decoder system. The testbench queue is used solely for the purpose of
providing an interface between the software system and a synchronous synthesizable model.
Were the entire system a direct wiring between synthesizable models, the testbench
operation queue would be unnecessary and omitted.
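The queue-driven structure can be outlined in software. The Python sketch below is a loose analogy rather than the VHDL testbench itself: the class names, the dictionary-based memories, and the `run` function are all hypothetical, and the clocked UUT is replaced by a stand-in that simply executes one queued operation per call.

```python
from collections import deque


class BehavioralModel:
    """Zero-time reference model that also owns the data comparison."""

    def __init__(self):
        self.memory = {}
        self.mismatches = []

    def store(self, key, words):
        self.memory[key] = list(words)

    def expected(self, key):
        return list(self.memory[key])

    def compare(self, key, expected, actual):
        if expected != actual:
            self.mismatches.append(key)


class ClockedStandIn:
    """Hypothetical stand-in for the synthesizable frame buffer (the UUT)."""

    def __init__(self):
        self.memory = {}

    def execute(self, op, key, words):
        if op == "store":
            self.memory[key] = list(words)
            return None
        return list(self.memory[key])


def run(operations, uut=None):
    """Replay (op, key, words) operations and return the keys that mismatched."""
    reference = BehavioralModel()
    uut = uut or ClockedStandIn()
    queue = deque()
    # Decoder side: the reference acts immediately and queues work for the UUT.
    for op, key, words in operations:
        if op == "store":
            reference.store(key, words)
            queue.append((op, key, words, None))
        else:
            queue.append((op, key, None, reference.expected(key)))
    # UUT side: queued operations are replayed one at a time, at the UUT's pace.
    while queue:
        op, key, words, expected = queue.popleft()
        actual = uut.execute(op, key, words)
        if op == "load":
            reference.compare(key, expected, actual)
    return reference.mismatches
```

Capturing the expected data at the moment a load is queued is what lets the reference run ahead in zero time without being confused by later stores to the same location.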
6.2.3 Video sequences.
To provide data input for the decoder system, a variety of test video sequences were
obtained from [2], and the tools listed in Appendix A.1.1 were used to convert the media between
raw Y'CbCr frame data and other uncompressed formats. In particular, after converting
and truncating a sequence to raw Y'CbCr binary data of approximately 100 frames, the
JM H.264/AVC reference software [18] was used to encode a section of the sequence,
creating H.264/AVC bitstreams for system input. All bitstreams created for testing were
between 2 frames and 100 frames in length. The key sequences used for testing of the full
decoder system are listed in Table 6.4.
Table 6.4: Key H.264/AVC test sequences.
Test             Video        Length   Size                     I-P   Ref.
baseline_1       "Foreman"    2        QCIF (176x144)           2     1
baseline_2       "Foreman"    50       QCIF (176x144)           2     5
baseline_3       "Foreman"    50       CIF (352x288)            20    5
baseline_4       "Elephant"   19       Wide XGA (1024x576)      2     5
baseline_5       "Elephant"   50       1080p (1920x1088)        20    3
baseline_480p    "Elephant"   50       D1 NTSC (720x480)        20    5
baseline_720p    "Elephant"   50       HD 720p (1280x720)       20    5
baseline_1080p   "Elephant"   50       HD 1080p (1920x1080)     20    5
The "Foreman" test sequence is the classical MPEG sequence with a moderate degree
of both spatial (intra) and temporal (inter) movement. To use a realistic HD sequence,
the source 1080p frame images of the open source "Elephant's Dream" movie were
downloaded for a single scene, and encoded into various H.264/AVC bitstreams, scaled to
various resolutions. This movie was selected as a readily available sequence of lossless format
that could provide full HD resolutions. From the movie, a 101 frame sequence was
extracted and used for testing; the sequence starts with movie frame 11280 and ends with
frame 11380. The sequence contains a high degree of both spatial and temporal
movement; and shows a detailed animated character quickly following a receding wall. The
detail is down to the individual pixel, even at 1080p, since the native resolution was higher
than 1080p. The speed of the movement increases slightly each frame. Frames 11290 and
11310 are shown with Figure 6.3. This sequence was chosen from the movie as one that
would encode with a high degree of inter prediction, with significant movement from both
the focused subject and the environment, plus camera panning. It was expected to generate
P frames with moderately intense motion vectors; and after encoding this was confirmed
by examining the bitstream parsing trace of the JM H.264/AVC reference software [18].
Figure 6.3: Elephant's Dream Frames 11290, 11310.
6.2.4 Functional simulation.
The synthesizable frame buffer designs were integrated into the decoder system with
individual testbench queuing structures as described in section 6.2.2. A combination of VHDL
packages and generics were used to instantiate the VHDL system for a specific mode of
operation, including which frame buffers were enabled, and the operation clock speeds.
Additionally, the testbench optionally enabled or disabled support for each of the buffering
paths: intra prediction, inter prediction, deblocking filter.
By processing the test sequences of Table 6.4, the inter prediction and deblocking filter
memory operations were tested to be correct without error. Proper operation of the memory
testing was demonstrated by disabling the sliding window algorithm, and observing errors
eventually introduced after several frames had processed. Re-enabling the sliding window
algorithm caused the sequences to complete without memory errors.
The approximate JM encoding times and ModelSim execution times for each sequence
are detailed with Table 6.5. These times are for the pixel format of 8-bit 4:2:0, but should
be within ±10% with respect to other pixel formats. These execution times are similar between
an Intel T7200 processor and an AMD Athlon FX-51 on the Linux operating system, each









baseline_1   1 sec.    1 MB     2 min.    200 MB   20 min.   300 MB
baseline_2   30 sec.   60 MB    25 min.   300 MB   2 hours   400 MB
baseline_3   1 min.    150 MB   3 hr.     350 MB   1 day     400 MB















Emulation of frame formats.
Testing of multiple frame formats was found to be difficult because no tools could be found
that support a sample size other than 8-bit. Even the JM reference software, while
supporting all of the chroma sub-samplings, did not yet support a sample size greater than
8 bits.
To emulate the twenty frame formats, it was necessary to emulate support for the change
in macroblock size. As such, the behavioral model of the frame buffer creates extra chroma
samples and extra sample bits as needed to provide full data input to the frame buffer. These
generated values are shifted XNORs of real video data from other samples and bits within
the same macroblock, to provide reliable testing of the memory.
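The exact combination of shifts and operands is not spelled out here, so the Python sketch below shows only one plausible way such filler bits could be derived. Every detail of it is an assumption made for illustration: the function name `extend_sample`, the choice of a single neighbouring sample shifted right by one bit, and the placement of the filler in the low-order bits.

```python
def extend_sample(sample, neighbour, bits):
    """Widen an 8-bit sample to `bits` bits using filler derived from real data.

    The filler is the XNOR of the sample with a shifted neighbouring sample,
    masked to the number of extra bits (hypothetical scheme).
    """
    extra = bits - 8
    mask = (1 << extra) - 1
    filler = ~(sample ^ (neighbour >> 1)) & mask   # shifted XNOR, masked
    return (sample << extra) | filler


wide = extend_sample(0xAB, 0x3C, 10)
print(hex(wide), hex(wide >> 2))
```

Because the filler depends on neighbouring data rather than being constant, a word written to the wrong address or a bit stuck in the wider sample is more likely to show up as a mismatch on read-back, while the original 8-bit value remains recoverable from the upper bits.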
Additional items verified.
The deblocking filter and inter prediction accesses were verified for each of the formats at
small and large resolutions. The memory accesses were demonstrated to store and load the
data without errors, and pipeline correctly for real decoding behaviors. The intra prediction
was partially verified; but due to simulation issues, it could not be verified as thoroughly
as the other two forms of buffering. The unit test WRITE_READ_EXTRA described in
section 6.1 verifies the functionality of using the extra signal: the two data regions are
independent, and the sliding window algorithm is not inadvertently triggered by use of
the extra space. However, the full system testbench has an unresolved queuing issue that
prevents a full frame from completing when intra prediction is enabled; and it is fairly
certain that the issue is only due to a bug in the verification environment.
One important edge condition verified is the boundary between changing frame formats.
Using a loop to repeat decoding of the same sequence over and over, but changing formats
on the boundaries, demonstrated both correct memory operation and zero stalling effect
upon the pipeline. Dynamic switching of frame formats between any operation was found
to have zero penalty and zero errors.
6.2.5 Post-synthesis simulation.
To verify correct synthesis of the frame buffer designs, the RTL descriptions were swapped out
and post-synthesis Verilog netlists swapped in to the system. Operation of the testbench
did not change, and identical functionality between RTL descriptions and post-synthesis
netlists was demonstrated.
The primary caveat with the post-synthesis simulations was the Digital Clock Management
(DCM) constraints provided by Xilinx for achieving phased clock performance. These
over-constraints were the only method by which to force the Xilinx software to correctly
place and route the DCM with acceptable timing. Xilinx documents that these constraints
have been thoroughly tested on real FPGA hardware; however, they do incur timing errors
with the Modelsim simulator. Thus, a full operating frequency post-place-and-route
simulation does not simulate correctly due to DCM errors. Visual examination of the waveforms
confirmed that the DCM was the only portion of the netlist generating signal error
'X' values during simulation.
To work around the DCM constraints, it was necessary to adjust them manually after
synthesis, but just prior to generating the Verilog simulation netlist. By re-adjusting these
over-constraints, both designs were able to verify post-synthesis at approximately 75% of
the target operating frequency. Full operating frequency simulation was not possible due to
other complex clock phasing performed by the DDR controllers; but the operation of these
controllers was verified in hardware by Xilinx prior to distribution. As such, correct FPGA





This chapter provides an analysis of the final frame buffer designs, including their handling
of edge conditions and overall performance. Implementation efficiency, synthesis results,
timing analysis, and power analysis are discussed in detail.
7.1 Implementation analysis.
Inherent to the pipeline is a degree of inefficiency between the data controllers and the
external memory. This is unavoidable due to the complex nature of bidirectional synchronous
data flow which requires one or more levels of handshaking. However, the efficiency
should still be as close to 100% as possible.
The efficiency of the data pipeline was measured by performing a sequence of continuous
operations with the frame buffer, without any idling at the decoder system interface.
The percent efficiency is then equal to the number of clock periods for which the DDR
SDRAM chip(s) are performing essential commands, and are not simply delaying, divided
by the total clock periods during that span of time.





\[ \text{efficiency} = \frac{\text{active DDR periods}}{\text{total frame buffer periods}} \times 100\% \]
Note that the active DDR periods do include commands such as auto refresh. The only
commands which would be considered non-essential would be unnecessary delay, no-op,
or power-down cycles where the DDR controller simply did not operate the memory at full
potential. This measurement is not the efficiency of the memory itself, but rather of the
pipeline data flow.
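The definition above can be sketched as a count over a per-clock command trace. This is only an illustrative Python sketch: the command labels and the function name `pipeline_efficiency` are hypothetical, chosen for this example, and the thesis does not define a trace format. The split follows the text: auto refresh counts as essential, while delay, no-op, and power-down cycles do not.

```python
from collections import Counter

# Hypothetical command labels, one per clock period (assumed for illustration).
ESSENTIAL = {"ACTIVE", "READ", "WRITE", "PRECHARGE", "AUTO_REFRESH"}
NON_ESSENTIAL = {"NOP", "DELAY", "POWER_DOWN"}


def pipeline_efficiency(commands):
    """Return the percentage of clock periods spent on essential DDR commands."""
    counts = Counter(commands)
    unknown = set(counts) - ESSENTIAL - NON_ESSENTIAL
    if unknown:
        raise ValueError(f"unknown commands: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty command trace")
    active = sum(counts[name] for name in ESSENTIAL)
    return 100.0 * active / total
```

For example, a trace of one row activation, eight writes, and one lost handshaking period gives 9 essential periods out of 10.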
By examination of the design RTL in combination with simulation waveforms, the
following internal inefficiencies were determined:
- A burst operation is shortened when the RAM row changes, or at the end of an
  operation that is not a multiple of the pipeline depth. The total impact of any
  overhead loss of the unit burst operation is thus magnified by these scenarios.
- The Xilinx DDR controller has zero clock periods lost due to the operation of burst
  mode, for each burst.
- The Xilinx DDR controller can handle up to eight (8) addresses per burst before
  losing performance, and so this maximum pipeline depth is always used.
- The interface between the Xilinx DDR controller and the custom block controller has
  one clock period lost due to handshaking, per burst operation.
- The interface between block controller and system interface always has one clock
  period lost for one full write operation due to handshaking.
- The interface between block controller and system interface has either zero or two
  clock periods lost for one full read operation, depending on the system handshaking
  implementation. No clock periods are lost if the system interface continues latching
  data via the incr signal for one or two periods following the done signal. However,
  if handshaking is simply performed on the done signal, up to two periods may be
  lost.
- Each DDR controller experiences a 30 clock period auto refresh during which the pipeline
  is paused; this occurs on an average of every 31 8-address burst operations. For
  the single memory design, this does not incur any loss of efficiency. However, for the
  dual memory design, the pipeline could experience an auto-refresh of one memory
  and not the other during the same operation, causing a loss of 30 periods during that
  operation. The worst case scenario is that the auto refresh never overlaps between
  the two memories; and 30 periods are lost on an average of 31 8-address bursts. The
  best case scenario is that the auto refresh always occurs within the same operation for
  the two memories; zero periods are lost on an average of 31 8-address bursts.
From these observations, the frame buffer efficiency is reduced primarily when there are
shortened burst operations, when the system interface implements the load operation with
simplified handshaking, or when the auto refresh cycles are not synchronized between
memories. In all cases, the efficiency increases hyperbolically with the operation size.
As an example computation, the efficiency of a sequence of 30% load and 70% store of
8-bit, 4:2:0 16x16 macroblock operations is computed as follows.
1. From equation 5.2, the operation size is 3072 bits.
2. The block controller feeds the pipeline with a maximum burst size of eight unique
addresses, or sixteen 32-bit data words. As such, one burst without shortening is
16 * 32 = 512 bits.
3. Dividing the operation size by maximum burst length: 3072 ÷ 512 = 6. Thus, a
single RAM design would require 6 bursts; and a dual RAM design would require 3
bursts. No pipeline truncations ever occur.
4. For a typical x16 DDR RAM, the row boundary occurs every 1024 addresses of
16-bit data; or every 256 addresses of 64-bit data. The burst size divided by word size:
512/64 = 8 addresses. Since 256 mod 8 = 0 and 3072/64 mod 8 = 0, no column
pipeline adjustments ever occur. So every burst is maximum length.
5. Auto-refresh occurs approximately every 31 ÷ 6 = 5.167 operations for the single
RAM design, and approximately every 31 ÷ 3 = 10.33 operations for the dual RAM
design.
6. At an operating frequency of 150 MHz, one clock period is 6.667 ns of time.
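The arithmetic of steps 2, 3, 5 and 6 can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes that the striped dual RAM design splits each operation evenly across both memories, which is how the burst counts in step 3 read.

```python
def bursts_per_operation(operation_bits, burst_bits=512, memories=1):
    """Bursts each memory needs for one operation, and whether the last is shortened.

    Assumes an even split of every operation across the memories.
    """
    share = burst_bits * memories
    bursts = -(-operation_bits // share)       # ceiling division
    truncated = operation_bits % share != 0    # last burst is not full length
    return bursts, truncated


def operations_per_refresh(bursts, bursts_per_refresh=31):
    """Average number of operations between auto-refresh cycles."""
    return bursts_per_refresh / bursts


def clock_period_ns(frequency_hz):
    """Length of one clock period in nanoseconds."""
    return 1e9 / frequency_hz
```

With the 3072-bit operation size, `bursts_per_operation(3072)` gives 6 full-length bursts for one memory and `bursts_per_operation(3072, memories=2)` gives 3, matching the list above.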
A 50 frame reference QCIF sequence was simulated for both frame buffer designs. The
total number of macroblock operations was 8007 with 38% load and 62% store. The time
required to perform this continuous sequence of operations with the single RAM design
was 10881507 ns ÷ 6.667 ns ≈ 1632144 clock periods. Based on the above scenario, the
efficiency of the typical-case single RAM design is:











The time required to perform this continuous sequence of operations with the dual RAM
design was 5834368 ns ÷ 6.667 ns ≈ 875111 clock periods. Based on the above scenario,
the efficiency of the typical-case dual RAM design is:











Using these same calculations for 9-bit, 4:2:0 demonstrates the lower efficiency of the
odd sample sizes.
1. From equation 5.2, the operation size is 3456 bits.
2. The block controller feeds the pipeline with a maximum burst size of eight unique
addresses, or sixteen 32-bit data words. As such, one burst without shortening is
16 * 32 = 512 bits.
3. Dividing the operation size by maximum burst length: 3456 ÷ 512 = 6.75. Thus, a
single RAM design would require 7 bursts; and a dual RAM design would require 4
bursts. Pipeline truncations do occur for the last burst of every operation.
4. For a typical x16 DDR RAM, the row boundary occurs every 1024 addresses of
16-bit data; or every 256 addresses of 64-bit data. The burst size divided by word size:
512/64 = 8 addresses. Since 256 mod 8 = 0 and 3456/64 mod 8 = 6, pipeline
column adjustments do occur. Each operation is 3456/64 = 54 addresses, so row
changing occurs every 256/54 = 4.741 operations. 6/8 = 75% of these operations
incur a column adjustment. So about every 4.741/.75 = 6.321 operations,
pipeline truncations occur due to row changing. These truncations lose an estimated
3 clock periods each. For the dual RAM, this occurs every 6.321 * 2 = 12.64
operations.
5. Auto-refresh occurs approximately every 31 ÷ 6 = 5.167 operations for the single
RAM design, and approximately every 31 ÷ 3 = 10.33 operations for the dual RAM
design.
6. At an operating frequency of 150 MHz, one clock period is 6.667 ns of time.
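The row-change arithmetic of step 4 can be reproduced with the short Python sketch below. The function and key names are hypothetical; the default parameters are the values quoted in the list (64-bit words, 256 addresses per row, 8 addresses per burst), and the dual RAM interval is simply doubled as in the text.

```python
def row_change_stats(operation_bits, word_bits=64, row_addresses=256,
                     burst_addresses=8, memories=1):
    """Estimate how often a row change truncates a burst, in operations."""
    addresses = operation_bits // word_bits        # addresses per operation
    leftover = addresses % burst_addresses         # addresses in a partial burst
    ops_per_row = row_addresses / addresses        # operations per RAM row
    adjust_fraction = leftover / burst_addresses   # share needing an adjustment
    if adjust_fraction == 0:
        interval = None                            # bursts always align to rows
    else:
        interval = ops_per_row / adjust_fraction * memories
    return {
        "addresses": addresses,
        "leftover": leftover,
        "ops_per_row": ops_per_row,
        "adjust_fraction": adjust_fraction,
        "ops_between_truncations": interval,
    }
```

For the 3456-bit operation this yields 54 addresses, a leftover of 6, and a truncation about every 6.321 operations; for the 3072-bit operation of the earlier example the interval is `None`, since no column adjustments occur.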
Using the same 50 frame QCIF sequence, the total number of full 16x16 macroblock
operations was 8007 with 38% load and 62% store. The time required to perform this
continuous sequence of operations with the single RAM design was
12516063 ns ÷ 6.667 ns ≈ 1877316 clock periods. Based on the above scenario,
the efficiency of the typical-case single RAM design is:














The time required to perform this continuous sequence of operations with the dual RAM
design was 6805388 ns ÷ 6.667 ns ≈ 1020757 clock periods. Based on the above scenario,
the efficiency of the typical-case dual RAM design is:













7.2 Synthesis resource analysis.
Synthesis of the two frame buffers was found to take a reasonable degree of FPGA real
estate on a few low-cost, mid-capacity Xilinx Spartan 3E devices. Table 7.1 shows the
synthesis frequencies and area used by the single RAM design; and Table 7.2 shows the
same for the striped dual RAM design. Note that the dual RAM design could not map both
DDR interfaces plus the system interface to the physical IO pins.
Table 7.1: Single RAM frame buffer synthesis full pin-out.
Device         Max Frequency   Gate Count   Slices Used
xc3s500e-4     127.0 MHz       28463        18%
xc3s500e-5     146.8 MHz       28211        17%
xc3s1200e-4    133.1 MHz       28196        8%
xc3s1200e-5    145.6 MHz       28241        9%
To provide a more accurate comparison of FPGA area and frequency between the two
designs, the system interface pins of the dual RAM design were combined down to the same
number as the single RAM design. This combination used XOR and addition to prevent











the synthesis tool from optimizing away any internal IO, and is only valuable for demonstration
purposes. Table 7.3 shows the single RAM design in comparison with the reduced pin-out
dual RAM design. From the results, it is apparent that the dual RAM design takes
approximately 50% more gates to implement; and this is expected, due to the duplication of
both the DDR and block controllers. Unexpectedly, the single RAM design synthesizes at
a lower frequency than the dual RAM; but this could likely be improved by experimentally
tuning the synthesis constraints.









(a) Single RAM
Freq.        Gates    Slices
146.8 MHz    28211    17%
133.1 MHz    28196    8%
145.6 MHz    28241    9%
(b) Dual RAM
Freq.        Gates    Slices




7.3 DDR timing analysis.
Both the single RAM and dual RAM designs of the frame buffer place and route on each
of the -5 speed grade Spartan 3E FPGA devices with an operating frequency of 150.8 MHz.
As such, an operating frequency of 150 MHz is used to compare the typical DDR
performance of the two designs. Frames per second cannot be used directly to evaluate the
buffer DDR performance as the average number of macroblock operations necessary for a
video sequence depends significantly on the degrees of spatial and temporal prediction used
during the original encoding of the sequence. Different sequences of the same pixel format
and frame resolution may incur a different magnitude of memory operations depending on
H.264/AVC slice types, and other details.
The principal metric by which to measure the frame buffer DDR performance is the data
bandwidth of the buffer: the quantity of data per second that may be loaded and stored. To
evaluate this, the frame buffers were connected to the decoder testbench with a worst-case
handshaking implementation. Decoding of a 50 frame QCIF sequence was simulated, once
for each of the 20 possible frame formats. With only inter prediction buffering enabled, and
only for full 16x16 macroblock operations, the sequence required a total of 8007 loads and
stores. Using the measured simulator time between the start of the first operation and end
of the last at 150MHz, the bandwidth of each frame format was computed in two forms:
macroblocks per second, and gigabits per second. These measurements are expressed for
the single RAM design with Table 7.4.
Table 7.4: Single x16 DDR SDRAM bandwidth.
(a) Macroblocks per second, in thousands.
sample size    chroma sub-sampling
               4:0:0    4:2:0    4:2:2    4:4:4
8-bit          1091     736      555      372
9-bit          923      640      494      325
10-bit         880      572      445      298
11-bit         767      515      405      267
12-bit         736      494      372      249
(b) Gigabits per second.
sample size    chroma sub-sampling
               4:0:0    4:2:0    4:2:2    4:4:4
8-bit          2.082    2.105    2.117    2.129
9-bit          1.982    2.059    2.121    2.093
10-bit         2.097    2.047    2.124    2.134
11-bit         2.011    2.027    2.127    2.103
12-bit         2.105    2.121    2.129    2.137
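As a rough illustration of how the two forms of bandwidth relate, the Python sketch below derives both from an operation count, an elapsed simulator time, and an operation size. The function name is hypothetical. The sample inputs are the 8-bit 4:2:0 single RAM figures quoted in section 7.1 (8007 operations, 10881507 ns, 3072 bits); treating a gigabit as 2^30 bits is an assumption made here because it reproduces the 2.105 entry of the table, not something the text states.

```python
def bandwidth(operations, elapsed_ns, operation_bits):
    """Return (macroblock operations per second, bits per second)."""
    seconds = elapsed_ns * 1e-9
    macroblocks_per_second = operations / seconds
    bits_per_second = macroblocks_per_second * operation_bits
    return macroblocks_per_second, bits_per_second


# Sample figures from section 7.1 for 8-bit 4:2:0 on the single RAM design.
mb_per_s, bits_per_s = bandwidth(8007, 10881507, 3072)
thousand_mb_per_s = mb_per_s / 1e3       # about 736
binary_gigabits = bits_per_s / 2**30     # about 2.105 (assumed 2**30 scaling)
```

The same function applied to any other frame format only needs the operation size from equation 5.2 and the measured simulator time.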
The numeric trends shown by these measurements confirm the efficiency analysis of the
DDR pipeline. As expected, the number of macroblocks per second decreases as the binary
size of the macroblock unit increases. Also, the bandwidth marginally increases by 5% as
the macroblock unit increases in size from 8-bit 4:0:0 to 12-bit 4:4:4. This is expected
as the larger macroblock sizes increase the quantity of full length DDR burst operations,
without pipeline flushing. Another trend revealed by these measurements is that the even
sample sizes are more efficient than the odd size samples. From the efficiency analysis
in section 7.1, this can be explained as the odd sizes causing an increased frequency of
pipeline flushing due to DDR row strobing occurring within the macroblock operations,
and even in the middle of some bursts; and not only on the boundaries between them. In
other words, the macroblock units with odd sample sizes have a much higher percentage
of starting and ending addresses that are not physically aligned to the RAM, causing lower
performance.
Table 7.5: Striped Dual x16 DDR SDRAM bandwidth.
(a) Macroblocks per second, in thousands.
sample size    chroma sub-sampling
               4:0:0    4:2:0    4:2:2    4:4:4
8-bit          2115     1440     1091     736
9-bit          1633     1177     923      640
10-bit         1563     1118     880      572
11-bit         1489     960      767      515
12-bit         1440     923      736      494
(b) Gigabits per second.
sample size    chroma sub-sampling
               4:0:0    4:2:0    4:2:2    4:4:4
8-bit          4.035    4.121    4.165    4.210
9-bit          3.504    3.787    3.963    4.118
10-bit         3.727    4.000    4.194    4.094
11-bit         3.905    3.776    4.022    4.053
12-bit         4.121    3.963    4.210    4.242
The measured bandwidth of the striped dual RAM design is shown with Table 7.5. Comparing
this with the single RAM design, the performance is nearly double for the majority of
formats. This is also expected from the efficiency analysis. The potentially unsynchronized
auto-refresh and the striping operation boundaries cause a minor loss of efficiency that
prevents a speedup of 100%. Examining the table, it also becomes apparent that the odd
sample sizes negatively impact the efficiency of the dual RAM striping, much beyond that
of the single RAM. This is expected as the odd bit sizes encounter frequent row strobing
latencies mid-operation, flushing the pipeline. While the odd sizes negatively impact both





Table 7.6: x16 DDR SDRAM bandwidth variance.
(a) Std. dev. by sub-sampling.
              chroma sub-sampling
              4:0:0    4:2:0    4:2:2    4:4:4
Single RAM    .0554    .0397    .00477   .0199
Dual RAM      .248     .147     .111     .0797
(b) Std. dev. by sample size.
              sample bit sizes
              8        9        10       11       12
Single RAM    .0201    .0601    .0389    .0567    .0137
Dual RAM      .0746    .263     .201     .126     .125
For both designs the worst performing frame format is the smallest odd size of 9-bit
4:0:0; and the best performing format is the largest even size of 12-bit 4:4:4. For the
striped dual RAM design, the difference between these two formats is 0.738 gigabits/sec,
or approximately 20%. Table 7.6 details the variance of the DDR bandwidth according to
both chroma sub-sampling format and sample size. Each value is a standard deviation of
the respective column or row of Tables 7.4 and 7.5. The total standard deviation of all
the frame format bandwidths is .0445 for the single RAM, and .191 for the striped RAM.
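The standard deviations can be recomputed from the gigabit values of Table 7.4, as in the Python sketch below. The variable names are hypothetical. Using the sample (n - 1) standard deviation is an assumption; it is the form that reproduces the single RAM entries .0554, .0201 and .0445 from the tabulated bandwidths.

```python
from statistics import stdev

# Single RAM bandwidth in gigabits per second, from Table 7.4.
# Each row is a sample size; columns are 4:0:0, 4:2:0, 4:2:2, 4:4:4.
SINGLE_RAM_GBPS = {
    8:  [2.082, 2.105, 2.117, 2.129],
    9:  [1.982, 2.059, 2.121, 2.093],
    10: [2.097, 2.047, 2.124, 2.134],
    11: [2.011, 2.027, 2.127, 2.103],
    12: [2.105, 2.121, 2.129, 2.137],
}

# Column-wise deviation: one chroma sub-sampling across all sample sizes.
column_400 = [row[0] for row in SINGLE_RAM_GBPS.values()]
std_400 = stdev(column_400)

# Row-wise deviation: one sample size across all chroma sub-samplings.
std_8bit = stdev(SINGLE_RAM_GBPS[8])

# Deviation over all twenty frame formats.
all_values = [v for row in SINGLE_RAM_GBPS.values() for v in row]
std_total = stdev(all_values)
```

Swapping in the values of Table 7.5 would give the corresponding dual RAM figures.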
7.4 H.264/AVC timing analysis.
To evaluate the video decoding performance of the frame buffer designs, a 50 frame video
sequence was processed at 150 MHz for six common frame formats at the three most common
digital television resolutions. Buffering of inter prediction and deblocking were enabled;
and buffering of intra prediction was not enabled since it would not affect a statistical
computation regarding the P Frames. The ratio of I Frames to P Frames was 1 : 20, and
the length of the reference list was 5 frames. Using this data, the following heuristic was
applied to determine an estimated achievable frames per second:
1. Run the simulation for as many frames as the simulator tools can handle; enabling
and disabling the structural frame buffer as needed.
2. Find a period of P frames that appears representative of the sequence; and use this as
a basis for the frames per second calculation.
It was necessary to use such a heuristic for accurate statistics, for two reasons. First, the
Modelsim simulator would occasionally crash, especially when approaching 0.5 seconds
of simulator time for which the frame buffer was enabled.¹ Second, the start of a test
sequence is not representative of a full length video; and a mid-section should be used where
possible. P Frames with a high degree of temporal (inter) prediction incur more memory
operations than the I Frames do, and demonstrate the most intensive portion of the video.
¹ The H.264/AVC decoder testbench was adapted to be capable of dynamically enabling and disabling
simulation of the structural frame buffer models in parallel with the behavioral model. Simulator time refers
to the quantity of hardware operation time represented by the simulation; it does not refer to human time.




The test sequence used for these measurements contained a moderately intensive
motion sequence, that starts slowly and grows in speed, as described in section 6.2.3.




























































The estimated video decoding performances of the two designs in terms of frames per
second are shown with Table 7.7 and Table 7.8. Due to the fact that each one of these
simulations takes between 10 hours and 4 days to simulate on a high-end single processor
workstation, some values are omitted. The worst-case of 9-bit sample size was simulated
for the lowest resolution to demonstrate that while that bit size is less efficient on the DDR
bus, the frames per second is still lower than the 8-bit sample size, and higher than the
10-bit sample size. This is expected. Those values in italics are scaled estimates rather than
actual simulation due to Modelsim issues, or lack of computational power to complete all





7.5 Power consumption analysis.
Xilinx FPGA devices.
Using the Xilinx XPower tool, the estimated power usage of each post-place-and-route
synthesis result is detailed with Table 7.9. These estimates are for the device in DDR
operation, but do not include the DDR SDRAM chips themselves. These estimates total the
power consumed at all three voltage levels supported by the FPGA IO pins, as some pins
were configured pre-synthesis for different voltage standards than other pins. The operating
frequency of each synthesis result is the maximum operating frequency, as previously
detailed with Tables 7.1, 7.2, and 7.3. As expected, the dual DDR design shows an increase of






















power consumption over the single DDR design. This would be due to both the additional
IO pins and the additional slices occupied by the design.
Micron DDR SDRAM.
To estimate the power usage, a representative x16 128 MB DDR memory from Micron is
analyzed for continuous active read and write cycles. The specific model is MT46V8M16TG-6T.
The equations were taken from [12] and the device values from [13].
The active standby state:
p(ACT_STBY) = IDD3N * VDD
p(ACT_STBY) = 70mA * 2.7V
p(ACT_STBY) = 189mW (7.6)
The active row selection power usage, using the simplified average case model:
p(ACT) = (IDD0 - IDD3N) * VDD
p(ACT) = (125mA - 70mA) * 2.7V
p(ACT) = 148.5mW (7.7)














The total write operation power:
p(TOT) = p(ACT) + p(WR) + p(ACT_STBY)
p(TOT) = 148.5mW + 116.7mW + 189mW
p(TOT) ≈ 454mW (7.9)




num of RD cycles
p(DQ)
= (num of DQ + num of DQS) * AJfjf
4CK
p(DQ)














= (145mA -70mA)*2-* 2.7V
p(RD) = 115.7mW (7.11)
The total read operation power:
p(TOT) = p(ACT) + p(RD) + p(DQ) + p(ACT_STBY)
p(TOT) = 148.5mW + 115.7mW + 70.8mW + 189mW
p(TOT) ≈ 524mW (7.12)
From these computations, assuming a realistic case of 60% writes and 40% reads as the
average operation of the frame buffer, the DDR SDRAM power consumption would average
at: .6 * 454mW + .4 * 524mW = 482mW.
These results show that the DDR memory consumes a fair amount of power due to the
burst length of 4 and also the higher DDR voltage of 2.7 V. Were the pins to be driven in
low power DDR mode, the performance would be similar with significant power savings.
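The totals above can be re-derived with a short Python sketch. The names are hypothetical; the current and voltage values and the write, read and DQ power terms are taken directly from the equations of this section rather than recomputed from the datasheet, since the intermediate factors of those three terms are not reproduced here.

```python
VDD_V = 2.7          # supply voltage used in equations 7.6 and 7.7
IDD3N_MA = 70.0      # active standby current
IDD0_MA = 125.0      # active row selection current

# Component powers quoted in the text, in milliwatts.
P_WR_MW = 116.7
P_RD_MW = 115.7
P_DQ_MW = 70.8

p_act_stby = IDD3N_MA * VDD_V               # equation 7.6
p_act = (IDD0_MA - IDD3N_MA) * VDD_V        # equation 7.7
p_write_total = p_act + P_WR_MW + p_act_stby
p_read_total = p_act + P_RD_MW + P_DQ_MW + p_act_stby


def average_power(write_fraction=0.6):
    """Weighted average of write and read power for a given write share."""
    return write_fraction * p_write_total + (1.0 - write_fraction) * p_read_total
```

With the default 60% write share, `average_power()` evaluates to roughly 482 mW, in line with the figure above; other load and store mixes can be explored by changing `write_fraction`.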
7.6 Cost analysis.
The hardware components required for the frame buffer are one (1) Spartan 3E device plus
either one (1) or two (2) DDR SDRAM memories, depending on the design configuration.
The approximate cost of these devices for a quantity of one hundred (100) is shown with
Table 7.10. Note that Micron quotes the cost of DDR SDRAM memories the same for both
100 and 200 quantity, and so the estimated cost is applicable to both the single and dual
memory frame buffer designs. Also, the low power equivalent memory models are the
same price, but require a minimum purchase of one thousand (1000) quantity. The memory
Table 7.10: Estimated unit device cost in purchase quantity of 100.
Device               Model                  Freq.      Capacity       Unit Cost
Xilinx Spartan 3E    XC3S500E-4 FG320C      133 MHz    10476 Cells    $36.00
Xilinx Spartan 3E    XC3S500E-5 FG320C      166 MHz    10476 Cells    $40.00
Xilinx Spartan 3E    XC3S1200E-4FG400C      133 MHz    19512 Cells    $55.00
Xilinx Spartan 3E    XC3S1200E-5FG400C      166 MHz    19512 Cells    $64.00
Micron x16 DDR       MT46V8M16TG-6T:D TR    166 MHz    128 MBytes     $3.04
Micron x16 DDR       MT46V16M16FG-6:F TR    166 MHz    256 MBytes     $7.84
Micron x16 DDR       MT46V32M16FN-6IT:F     166 MHz    512 MBytes     $14.29
prices quoted are the better price listed between both the on-line Micron store and Avnet
[3], a certified Micron distributor. The FPGA prices quoted are round estimates provided
verbally by the Avnet sales office, also a certified vendor for Xilinx devices [3].²
Table 7.11: Minimum and maximum 1080 HD memory needs.
(a) 3 reference frames, MBytes
sample size    chroma sub-sampling
               4:0:0    4:2:0    4:2:2    4:4:4
8-bit          8.0      12.0     16.0     24.0
9-bit          9.0      13.4     18.0     26.9
10-bit         10.0     15.0     20.0     29.9
11-bit         11.0     16.4     22.0     32.9
12-bit         12.0     18.0     24.0     35.9
(b) 15 reference frames, MBytes
sample size    chroma sub-sampling
               4:0:0    4:2:0    4:2:2    4:4:4
8-bit          31.8     47.8     63.7     95.9
9-bit          35.8     53.8     71.7     107.5
10-bit         39.8     59.8     79.7     119.5
11-bit         43.8     65.8     87.7     131.5
12-bit         47.8     71.7     95.7     143.4
² The prices of these electronic devices can change significantly in a short period of time; and there is
little value in quoting with more detail than round dollars for devices over $20 per unit. In particular, the
precise Xilinx FPGA prices change very frequently within a few dollars due to promotions, prototypes, and
manufacturer revisions. The real purpose here is to show relative cost between device models.
The assumption of this design is that the main data path of the H.264/AVC decoder
will be instantiated on the same FPGA device as the frame buffer. Based on performance
and cost estimates detailed by other literature [22], these listed Xilinx FPGA devices are
the least expensive and slowest models that are plausible for a real-time decoder system
implementation. Based on the performance analysis of section 7.3, one of the above Xilinx
devices may be selected based on operating frequency; and also the memory chip(s) based
on necessary capacity.
The required memory capacity would be proportional to the number of reference frames;
with 3 being the minimum and 15 the maximum. As such, the number
of slots would be a minimum of 4 up to a maximum of 16. The memory needs of 1080 HD
for each frame format are shown with Table 7.11. Add one additional Megabyte to each




This chapter discusses the verification results and performance analysis of the two
synthesizable H.264/AVC frame buffer models, with special regard to FRExt support. Based
on the measured and analytical results, improvements are suggested for the frame buffer
designs. Additionally, approaches to system interfacing are proposed; and a full hardware
H.264/AVC plus FRExt decoder architecture is proposed that could make efficient use of
this frame buffer.
8.1 Synthesizable models.
The frame buffer models were able to synthesize and verify near or at the maximum operating
frequency of their target FPGA devices, demonstrating efficient design partitioning.
Additionally, the designs synthesized with efficient FPGA resource utilization on some
of the lowest cost devices that could plausibly fit a full H.264/AVC decoder design. The
models and their DDR interfaces also demonstrated potential for power savings. These
results, combined with the ability to exclude from synthesis some of the buffering features,
and adjust the external memory capacity, demonstrate that the designs scale well for many
different scenarios.
General frame buffer improvements.
The primary area of potential improvement is the total quantity of data that the frame buffer
loads from memory after storing it. While the full 16x16 macroblock data transfer makes
efficient use of the DDR bandwidth for store operations, the load of a full macroblock is
seldom needed. The designs do provide a smaller transfer size of 16x8 sub-macroblock;
but this level of granularity could be increased specifically for load operations, likely also
increasing the total frames per second for all formats.
One difficulty with providing macroblock size granularity with the frame buffer is the
poor degree of base 2 sizes for some of the frame formats, such as 9-bit 4:2:0 and 11-bit
4:2:0. From Table 5.2, these formats use an odd number of addresses for the 16x8
sub-macroblock unit. As such, dividing that data unit further into even units would cause some
load operations to provide too much or too little data, depending on the implementation.
This lack of an even division into the data bus width does not prevent higher granularity;
but does significantly increase the complexity when interfacing between the system and
frame buffer. Some data would have to be understood by the system as irrelevant to the
current operation, for a few select frame formats.
A possible solution for allowing higher granularity is to register the data and increment
signals one additional clock cycle prior to output, and have an early cut-off of the increment
signal as needed. At a minimum, this would allow one more even division of size without
outputting any irrelevant data. Another plausible solution would be for the frame buffer
to output a data mask in parallel with the data output bus. This mask would always be a
value of zero, except for the last 32-bit data word of a load operation, indicating which
bits should be ignored. Combining of the increment cut-off with the data mask would
allow even sub-macroblock partitioning to many degrees at the cost of increasing interface
complexity. Furthermore, the best-case operation handshaking of section 5.4.1 might be
required to prevent introduction of additional pipeline inefficiencies.
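The proposed data mask can be illustrated with a small Python sketch. Everything here is a hypothetical reading of the idea, not part of the implemented design: the function name, the convention that a set mask bit means "ignore this bit", and the choice that valid bits occupy the low end of the last word are all assumptions. The example sizes come from repeatedly halving the 3456-bit operation of 9-bit 4:2:0.

```python
def load_word_masks(payload_bits, word_bits=32):
    """Return one ignore-mask per output word of a load operation.

    Every mask is zero except possibly the last one, whose set bits
    mark the positions that carry no valid data (assumed to be the
    high-order bits of the final word).
    """
    if payload_bits <= 0:
        raise ValueError("payload_bits must be positive")
    words = -(-payload_bits // word_bits)                  # ceiling division
    valid_in_last = payload_bits - (words - 1) * word_bits
    all_ones = (1 << word_bits) - 1
    valid_bits = (1 << valid_in_last) - 1
    masks = [0] * words
    masks[-1] = all_ones ^ valid_bits
    return masks
```

An 864-bit unit fills 27 words exactly, so every mask is zero; a 432-bit unit needs 14 words and the last mask flags the upper 16 bits as irrelevant.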
DDR striping improvements.
As mentioned in the efficiency analysis of section 7.1, the lack of synchronized auto-refresh
cycles between the two DDR controllers can potentially cause additional stalling of the
pipeline beyond what is actually necessary. To synchronize the auto-refresh between the
two controllers, it would be necessary to combine some of the internal finite state machines.
Doing this would also permit instantiation of a single block controller instead of two, saving
device resources in addition to increasing pipeline efficiency. The primary reason to
keep the two DDR controllers completely independent, however, is to increase the maximum
operating frequency. Combining the controllers could potentially cause unavoidable
latencies between the left and right side banks of the FPGA device, causing issues with
place and route. In either case, combining the auto-refresh cycles appears to be
simple to achieve in RTL, but a very experimental and manual task to maintain synthesis
performance.
8.2 Proposed system interfacing.
Integration of this frame buffer into a full H.264/AVC decoder system could be achieved
by multiplexing the inputs and outputs to each of the inter prediction, intra prediction,
and deblocking filter units. The data buses would require a degree of FIFO buffering
between units and the buffer. The multiplexer for input and the multiplexer for output
would each be controlled by either the H.264/AVC system controller, or a form of
competition between units. The two prediction units should never have to compete on the
same H.264/AVC slice; so a round-robin on-demand access between prediction and deblocking
stages would also be a realistic approach.
Assuming that the H.264/AVC decoder is a hardware-only implementation with a pipelined
data path, Figure 8.1 shows a proposed method of interfacing the frame buffer. It only
shows the three stages that require significant buffering, and one manner of how to connect


















































Figure 8.1: Proposed integration of frame buffer into decoder system.
would be to arbitrate access to the frame buffer between all three stages. First, a chip-select
signal could be used to choose one unit to access the external memory, as shown on the
left side. Shown on the right, the system controller multiplexes the data buses to the same
device that was selected.
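The round-robin on-demand access mentioned earlier in this section could behave like the following Python sketch. It is a behavioral illustration only: the class name, the unit labels, and the grant policy details are assumptions, and a hardware arbiter would express the same rotation with a small state machine driving the chip-select.

```python
class RoundRobinArbiter:
    """Grant the frame buffer to one requesting unit at a time.

    Priority rotates so that the unit served most recently is
    considered last on the next grant (on-demand round robin).
    """

    def __init__(self, units):
        self.units = list(units)
        if not self.units:
            raise ValueError("at least one unit is required")
        self.last = len(self.units) - 1   # first unit has priority initially

    def grant(self, requests):
        """Return the unit to select, or None if nobody is requesting."""
        count = len(self.units)
        for offset in range(1, count + 1):
            index = (self.last + offset) % count
            if self.units[index] in requests:
                self.last = index
                return self.units[index]
        return None
```

With units `["prediction", "deblocking"]`, two simultaneous requests are served in alternation, while a lone requester is granted access every time without waiting for the idle unit.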
Intra prediction interfacing.
The address connections for the intra prediction would always hold the extra signal high;
and use the macroblock number and sub-selection signals to indicate its buffering needs;
using the macroblock number as a circular FIFO pointer value. It would not connect to the
frame number signal at all as to not interfere with the sliding window algorithm. When
the intra prediction is selected by the control, its addressing signals would propagate to the
frame buffer, and automatically incur a new data operation. As shown in Figure 8.1, the
intra prediction is positioned with a two macroblock ping-pong buffer between itself and
the frame buffer. This macroblock buffer, with areas labeled A and B, would be distributed
memory, each bank the maximum macroblock size supported by the system. When the
intra prediction unit is performing computations, area A would be used to load values from
the extra buffering space of the frame buffer, in parallel with writing new data into area B.
New values would continue to be loaded from external memory into bank A as needed for
the oldest 3 of the 4 neighboring macroblocks. Once B is filled with the new unfiltered
data, the deblocking filter could then read the data from B simultaneously as B is written
out to external memory. For its first neighbor, the intra prediction reads back from B and
begins writing results to A. B would then be used to load more data from the external
memory, as A continues to fill with the prediction results. The banks would toggle once
with every succeeding operation.
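The bank toggling described above can be modeled with a very small Python sketch. It only tracks which bank plays which role; the class and property names are hypothetical, and the actual data movement and timing between the prediction unit, the deblocking filter and external memory are not modeled.

```python
class PingPongBanks:
    """Track the roles of two macroblock banks that swap each operation."""

    def __init__(self, names=("A", "B")):
        if len(names) != 2:
            raise ValueError("exactly two banks are expected")
        self._names = tuple(names)
        self._load_index = 0   # bank currently loading from external memory

    @property
    def load_bank(self):
        """Bank being filled from the frame buffer."""
        return self._names[self._load_index]

    @property
    def result_bank(self):
        """Bank receiving new results from the processing unit."""
        return self._names[1 - self._load_index]

    def toggle(self):
        """Swap the roles of the two banks for the next operation."""
        self._load_index = 1 - self._load_index
```

Starting with A as the load bank and B as the result bank, one call to `toggle()` gives the arrangement in which B loads more data while A fills with prediction results.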
Inter prediction interfacing.
The address connections for the inter prediction would always hold the extra signal low
and use the frame number, macroblock number, and sub-selection signals to indicate its
buffering needs. When this unit is selected by the controller, its addressing values propagate
to the frame buffer to cause a new operation. As shown, the inter prediction is positioned
to share the same ping-pong buffer as the intra prediction, because the two units never process
the same input macroblock: while one is active, the other is not. The inter prediction
addresses only historical frame numbers, never the current one. When the results are
completely written to one of banks A or B, that bank may propagate down to the deblocking
filter in the same fashion as the results from the intra prediction. The other bank would
continue loading new data from memory. It would be possible to operate this way each
cycle with or without toggling the banks.
Deblocking filter.
The address connections for the deblocking filter would always hold the extra signal low
and hold the frame number at the frame currently being processed. The macroblock number
would always be the actual macroblock number (and not a FIFO pointer as in the intra
prediction). These values would incur a new operation in the same fashion as for the other units.
Similar to the intra prediction, the deblocking filter would read historical macroblocks
into bank C while outputting results to bank D. Once a cycle completes, the banks toggle,
and the bank with results (D in this case) is written to memory for storage and output to
the video driver stage of the system. In parallel with this, it is also read back as the first
neighbor for the next cycle while results are put in the opposite bank.
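The addressing conventions of the three units can be summarized in one place. The Python sketch below is a behavioral illustration under stated assumptions: the record BufferRequest, the function names, the FIFO depth of 64, and the use of a modulo operation to form the circular FIFO pointer are hypothetical and are not taken from the thesis design. Whether the deblocking filter drives the sub-selection signals is also an assumption. A frame number of None stands for the intra prediction's unconnected frame number signal.

```python
from dataclasses import dataclass
from typing import Optional

FIFO_DEPTH = 64  # hypothetical depth of the extra buffering space, in macroblocks


@dataclass(frozen=True)
class BufferRequest:
    extra: bool                  # True = extra signal held high
    frame_number: Optional[int]  # None = frame number signal not connected
    mb_number: int
    sub_select: int


def intra_request(mb_number, sub_select, fifo_depth=FIFO_DEPTH):
    """Intra: extra high, no frame number, MB number used as circular pointer."""
    return BufferRequest(True, None, mb_number % fifo_depth, sub_select)


def inter_request(ref_frame, current_frame, mb_number, sub_select):
    """Inter: extra low, historical frame numbers only."""
    if ref_frame == current_frame:
        raise ValueError("inter prediction never addresses the current frame")
    return BufferRequest(False, ref_frame, mb_number, sub_select)


def deblock_request(current_frame, mb_number, sub_select):
    """Deblocking: extra low, current frame, actual macroblock number."""
    return BufferRequest(False, current_frame, mb_number, sub_select)


def select_request(selected, requests):
    """Controller multiplexer: only the selected unit reaches the frame buffer."""
    return requests[selected]


requests = {
    "intra": intra_request(mb_number=70, sub_select=1),
    "inter": inter_request(ref_frame=2, current_frame=5, mb_number=70, sub_select=0),
    "deblock": deblock_request(current_frame=5, mb_number=70, sub_select=0),
}
active = select_request("intra", requests)
```

Keeping the three request builders separate mirrors the text: each unit fixes some signals permanently and varies the rest, and the controller only chooses which unit's signals propagate.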
Controller.
The controller would be responsible for marking the current frame, sometime before its
completion, as one of: short-term, long-term, or unused for reference. It would also trigger
IDR and change of frame format where needed. In addition, when a long-term frame is last
accessed by the inter prediction, the controller would signal the frame buffer to mark it
as unused. Combining this control with prediction and filtering units that each understand
multiple frame formats provides an architecture for real-time decoding of a series of
H.264/AVC plus FRExt sequences, potentially each of a different frame format,
without pause between the sequences.
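The marking duties of the controller can be sketched as a small state table. The Python below is a hypothetical illustration: the names Marking and ReferenceMarker and their methods are assumptions, and the sliding window handling of short-term frames is deliberately left out. The idr method follows the H.264/AVC rule that an IDR picture causes all reference frames to be marked as unused for reference.

```python
from enum import Enum


class Marking(Enum):
    UNUSED = "unused for reference"
    SHORT_TERM = "short-term"
    LONG_TERM = "long-term"


class ReferenceMarker:
    """Tracks the reference marking of each buffered frame (illustrative only)."""

    def __init__(self):
        self.marks = {}  # frame number -> Marking

    def mark_current(self, frame_number, marking):
        """Mark the current frame sometime before its completion."""
        self.marks[frame_number] = marking

    def idr(self):
        """On IDR, every buffered frame becomes unused for reference."""
        for frame_number in self.marks:
            self.marks[frame_number] = Marking.UNUSED

    def long_term_last_access(self, frame_number):
        """After the last inter access to a long-term frame, release it."""
        if self.marks.get(frame_number) is Marking.LONG_TERM:
            self.marks[frame_number] = Marking.UNUSED

    def references(self):
        """Frame numbers still usable as references."""
        return sorted(f for f, m in self.marks.items() if m is not Marking.UNUSED)


marker = ReferenceMarker()
marker.mark_current(0, Marking.LONG_TERM)
marker.mark_current(1, Marking.SHORT_TERM)
marker.long_term_last_access(0)   # frame 0 is released, frame 1 remains
```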
Bibliography
[1] Spartan 3E DDR reference design for the Spartan 3E starter kit.
ftp://ftp.xilinx.com/pub/applications/misc/s3e_starter_revd_mig_ddrf 1 ].zip, Septem
ber 2006. Revision D.
[2] Xiph.org testmedia: collection of test sequences and clips for evaluating compression
technology, http://media.xiph.org/, May 2007.
[3] Avnet. Avnet ElectronicsMarketing Home, http://www.em.avnet.com, August 2007.
[4] A. Bourge and J. Jung. Low-power H.264 video decoder with graceful degradation.
In S. Panchanathan and B. Vasudev, editors, Visual Communications and Image Pro
cessing 2004, volume 5308, pages 372-383, San Jose, CA, USA, Jan 2004. SPIE.
[5] H. Chung and A. Ortega. Efficient memory management control for H.264. In Image
Processing, 2004. ICIP '04. 2004 International Conference on, volume 2, pages
777-
780, 2004.
[6] K. Holm and O. Gustafsson. Low-complexity and low-power color space conversion
for digital video. In Norchip Conference, 2006. 24th, pages 179-182, 2006.
[7] ISO/IEC 14496-10: Information Technology. Coding of audio-visual objects
- Part
10: Advanced Video Coding, March 2003. Approved ISO/IEC FDIS Draft.
[8] ITU-R Rec. BT.601 version 6. Studio encoding parameters of digital television for
standard 4:3 and wide-screen 16:9 aspect ratios, January 2007. In Force ITU-R
Recommendation .
[9] ITU-T Rec. H.264 version 4. Advanced video codingfor generic audiovisual services,
March 2005. In Force ITU-T Recommendation.
[10] H. Kang, K. Jeong, J. Bae, Y. Lee, and S. Lee. MPEG4 AVC/H.264
decoder with
scalable bus architecture and dual memory controller. In Circuits and Systems, 2004.
ISCAS '04. Proceedings of the 2004 International
Symposium on, volume 2, pages
145-148, 2004.
[11] L. Li, S. Goto, and T. Ikenaga. An efficient deblocking
filter architecture with 2-
dimensional parallel memory for H.264/AVC.
In Design Automation Conference,
2005. Proceedings of the ASP-DAC
2005. Asia and South Pacific, volume 1 , pages
623-626, 2005.
103
[12] Micron, Inc. Calculating Memory System Power for DDR, March 2005. TN-46-03.
Rev. B.
[13] Micron, Inc. Micron 128 MB Double Data Rate (DDR) SDRAM Datasheet, April
2007. DDR: Rev. F; Core DDR Rev. A.
[14] Micron, Inc. Micron 512 MB Double Data Rate (DDR) SDRAM Datasheet, April
2007. DDR: Rev. L; Core DDR Rev. A.
[15] K. Palanisamy. Interfacing Spartan-3 Devices with 166 MHz or 333 Mb/s DDR
SDRAMMemories. Xilinx, Inc., October 2004. XAPP 768c v2.0.
[16] I. E. G. Richardson. H.264 andMPEG-4 Video Compression. John Wiley and Sons,
Ltd., 2003.
[17] I. E. G. Richardson. H.264/MPEG-4 Part 10 tutorials and white papers.
http://www.vcodex.com/h264.html, March 2007.
[18] K. Suhring. H.264/AVC reference software, http://iphome.hhi.de/suehring/tml/, 04
2007. Version JM 12.2.
[19] G. J. Sullivan, P. N. Topiwala, and A. Luthra. The H.264/AVC advanced video coding
standard: overview and introduction to the fidelity range extensions. In Applications
ofDigital Image Processing XXVII, volume 5558, pages 454X74, Denver, CO, USA,
November 2004. SPIE.
[20] P. N. Topiwala. Status of the emerging ITU-T/H.264 / ISO/MPEG-4, Part 10 video
coding standard. In Applications ofDigital Image Processing XXV, volume 4790,
pages 261-277, Seattle, WA, USA, November 2002. SPIE.
[21] T. Warsaw. VHDL modeling of an H.264/AVC video decoder. Master's thesis,
Rochester Institute of Technology, Department of Computer Engineering, August
2005.
[22] T. Warsaw and M. Lukowiak. Architecture design of an H.264/AVC decoder for
real-time FPGA implementation. In Application-specific Systems, Architectures and
Processors, 2006. ASAP '06. International Conference on, pages 253-256, 2006.
[23] T Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan. Rate-constrained
coder control and comparison of video coding standards. Circuits and Systems for
Video Technology, IEEE Transactions on, 13:688-703, 2003.
[24] TWiegand, G. J. Sullivan, G. Bjntegaard, and A. Luthra. Overview of the H.264/AVC
video coding standard. Circuits and Systemsfor Video Technology, IEEE Transactions
0/7,13:560-576,2003.
104
[25] Xilinx, Inc. XilinxMemory Interface Generator (MIG) User Guide, April 2007. Ver
sion 1.7.2.
[26] J. Zhu, L. Hou, W. Wu, R.Wang, C. Huang, and J. Li. High performance synchronous
DRAMs controller in H.264 HDTV decoder. In Solid-State and Integrated Circuits




Software Tools and Deliverables
A.1 Software tools.
A collection of proprietary and open source software tools was used in carrying out this
thesis. The more crucial of these tools and their versions are listed here. Some tools were
continuously upgraded in parallel with performing the thesis project, or multiple versions
were used for the sake of comparison or availability. As such, only the newest version of each
tool is listed here. Open source software listed with a date was obtained via repository
snapshot on that date, rather than from an official release. All of the tools were run
primarily on a 2006/2007 GNU/Linux distribution.
A.1.1 Video processing and display.
PRODUCT               VERSION     URL
MJPEG Tools           1.9.0 RC2   http://mjpeg.sourceforge.net/
FFmpeg                20070616    http://ffmpeg.org/
Linux Media Player    20070622    http://www.mplayerhq.hu/
ImageMagick           6.3.4       http://www.imagemagick.org/
H.264.2 Reference     JM 10.1a    http://www.itu.int/rec/T-REC-H.264.2-200503-I/en
H.264/AVC Reference   JM 12.3     http://iphome.hhi.de/suehring/tml/
A.1.2 FPGA design and simulation.
PRODUCT                  VERSION   URL
ModelSim SE              6.2d      http://www.model.com/
Xilinx ISE Foundation    9.2i      http://www.xilinx.com
Xilinx Platform Studio   9.1i      http://www.xilinx.com
GHDL                     0.26      http://ghdl.free.fr/
A.2 Thesis deliverables.
The following items were created by this thesis and submitted on DVD-ROM media.
Complete, synthesizable single RAM frame buffer model in VHDL.
Complete, synthesizable dual RAM frame buffer model in VHDL.
Augmented VHDL H.264/AVC decoder test bench with queuing structures.
Created test sequences from "Foreman" and "Elephant".
Configurations of all tools, all synthesis runs, and all simulation runs, including: JM reference software, Xilinx Foundation, and ModelSim.
Automated command-line build, simulate, synthesize, view, and analyze environment for targeting every possible device, frame format, RTL vs. netlist, and data input permutation mentioned in the thesis. Uses GNU Make and some Bourne shell scripts.