Network-on-Chip Based H.264 Video Decoder on a Field Programmable Gate Array by Barge, Ian
Marquette University
e-Publications@Marquette
Master's Theses (2009 -) Dissertations, Theses, and Professional Projects
Recommended Citation
Barge, Ian, "Network-on-Chip Based H.264 Video Decoder on a Field Programmable Gate Array" (2017). Master's Theses (2009 -).
411.
http://epublications.marquette.edu/theses_open/411
NETWORK-ON-CHIP BASED H.264 VIDEO DECODER
ON A FIELD PROGRAMMABLE GATE ARRAY
by
Ian J. Barge, B.S.
A Thesis submitted to the Faculty of the Graduate School,
Marquette University,
in Partial Fulfillment of the Requirements for
the Degree of Master of Science
Milwaukee, Wisconsin
May 2017
ABSTRACT
NETWORK-ON-CHIP BASED H.264 VIDEO DECODER
ON A FIELD PROGRAMMABLE GATE ARRAY
Ian J. Barge, B.S.
Marquette University, 2017
This thesis develops the first fully network-on-chip (NoC) based h.264 video decoder
implemented in real hardware on a field programmable gate array (FPGA). This thesis starts with
an overview of the h.264 video coding standard and an introduction to the NoC communication
paradigm. Following this, a series of processing elements (PEs) are developed which implement
the component algorithms making up the h.264 video decoder. These PEs, described primarily in
VHDL with some Verilog and C, are then mapped to an NoC which is generated using the
CONNECT NoC generation tool. To demonstrate the scalability of the proposed NoC based
design, a second NoC based video decoder is implemented on a smaller FPGA using the same
PEs on a more compact NoC topology. The performance of both decoders, as well as their
component PEs, is evaluated on real hardware. An analysis of the performance results is
conducted and recommendations for future work are made based on the results of this analysis.
Aside from the development of the proposed decoder, a major contribution of this thesis
is the release of all source materials for this design as open source hardware and software. The
release of these materials will allow other researchers to more easily replicate this work, as well as
create derivative works in the areas of NoC based designs for FPGA, video coding and decoding,
and related areas.
ACKNOWLEDGMENTS
Ian J. Barge, B.S.
There are many people I must thank for making this thesis possible. First, I thank my
parents, since without them many of the great opportunities I have had would have been out of reach.
Second, I thank my adviser, Dr. Cris Ababei, for all of the work he has invested in me over the
course of this thesis. Third, I thank Dr. Edwin Yaz and Dr. Henry Medeiros for serving on my
thesis committee. I thank Prof. William Barnekow and Dr. Russ Meier, who played an important
role in my undergraduate education. Finally, I thank all of the members of the Marquette
Embedded System Lab for the feedback and encouragement they have provided throughout this
thesis.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
1 INTRODUCTION
  1.1 Problem Statement
  1.2 Objectives
  1.3 Previous Work
    1.3.1 H.264 Decoder Designs for FPGAs
    1.3.2 NoC based H.264 Decoder Simulations
  1.4 Contributions
  1.5 Thesis Organization
2 DESCRIPTION OF H.264 ALGORITHM
  2.1 Input Format
  2.2 Color Format
  2.3 Entropy Decoder
    2.3.1 Exponential-Golomb
    2.3.2 Context Adaptive Variable Length Coding
    2.3.3 Context Based Adaptive Binary Arithmetic Coding
  2.4 Inverse Quantization and Inverse Transform
  2.5 Intra Prediction
  2.6 Inter Prediction
    2.6.1 Motion Vector Prediction
    2.6.2 Luma Motion Compensation
    2.6.3 Chroma Motion Compensation
  2.7 Deblocking Filter
3 INTRODUCTION TO NETWORK-ON-CHIP
  3.1 Overview of Network-on-Chip
  3.2 Network-on-Chip Design Parameters
  3.3 Comparison to Other Communication Schemes
  3.4 Network-on-Chip Tool For FPGAs
4 NETWORK-ON-CHIP BASED H.264 DECODER ARCHITECTURE
  4.1 Introduction
  4.2 Partitioning the H.264 Algorithm into Processing Elements
    4.2.1 NAL Parsing and Entropy Decoding
    4.2.2 IQIT
    4.2.3 Intra Prediction
    4.2.4 Deblocking Filter
    4.2.5 Luma Motion Compensation
    4.2.6 Chroma Motion Compensation
    4.2.7 Buffer Control
    4.2.8 Display Driver
  4.3 H.264 Algorithms on NoC Based Decoder
    4.3.1 Intra Prediction Process
    4.3.2 Inter Prediction Process
    4.3.3 IQIT Process
    4.3.4 Deblocking Process
  4.4 Network-on-Chip Parameters
    4.4.1 Virtual Channel Selection
  4.5 Mapping H.264 Nodes to Network-on-Chip
5 H.264 ALGORITHM NODES FOR NOC BASED DECODER
  5.1 Introduction
  5.2 NIOS II Based Nodes
    5.2.1 Flit Formatter
    5.2.2 Send State Machine
    5.2.3 Receive State Machine
    5.2.4 Parser Node
    5.2.5 Buffer Node
  5.3 Network Interface Component
  5.4 Generic State Machine for Hardware Nodes
  5.5 Inverse Quantization Inverse Transform Node
    5.5.1 Parsing and Input Packet Format
    5.5.2 Zig-Zag
    5.5.3 Inverse Quantization
    5.5.4 Inverse Transform
    5.5.5 Packet Generation and Output Packet Format
    5.5.6 Simulation
  5.6 Luma Motion Compensation Node
    5.6.1 Parsing and Input Packet Format
    5.6.2 Sample Register
    5.6.3 Interpolator
    5.6.4 Packet Generation and Output Packet Format
    5.6.5 Simulation
  5.7 Chroma Motion Compensation Node
    5.7.1 Parsing and Input Packet Format
    5.7.2 Interpolator
    5.7.3 Packet Generation and Output Packet Format
    5.7.4 Simulation
  5.8 Intra Prediction Node
    5.8.1 Parsing and Input Packet Format
    5.8.2 Intra Prediction Core
    5.8.3 Packet Generation and Output Packet Format
    5.8.4 Simulation
  5.9 Deblocking Filter Node
    5.9.1 Parsing and Input Packet Format
    5.9.2 Deblocking Filter
    5.9.3 Simulation
  5.10 Display Control Node
    5.10.1 Parsing and Input Packet Format
    5.10.2 Color Space Transformation
    5.10.3 VGA Controller
    5.10.4 VGA Digital to Analog Converter
  5.11 Compilation for FPGA
6 SCALABILITY
  6.1 Introduction
  6.2 Scaling Design to Fit a Smaller Target FPGA
    6.2.1 Porting Process
    6.2.2 Porting Results
  6.3 Compilation for FPGA
7 PERFORMANCE TESTING AND PROFILING
  7.1 Introduction
  7.2 Test Videos
  7.3 Buffer Node Profiling
    7.3.1 Discussion of Profiling Results
  7.4 Performance Comparisons
    7.4.1 Discussion of Performance Comparisons
8 FURTHER OPTIMIZATIONS AND FUTURE WORK
  8.1 Overview
  8.2 Future Work Targeting Performance
    8.2.1 Parser and Buffer Node Optimization
    8.2.2 Further Partitioning
    8.2.3 Combined Display and Deblocking
    8.2.4 Alternative Communication Pattern
    8.2.5 Parallelization of Inter and Deblocking
  8.3 Other Future Work
REFERENCES
LIST OF TABLES
5.1 Modes supported by the flit formatter.
5.2 Resource utilization of the proposed 3x3 NoC based h.264 decoder.
6.1 Resource utilization of the proposed 2x2 NoC based h.264 decoder.
7.1 Profiling Results from the 3x3 NoC Based Decoder. Times indicated are in units of seconds.
7.2 Profiling Results from the 2x2 NoC Based Decoder. Times indicated are in units of seconds.
7.3 Comparison of the NoC based decoders with an open source software based decoder running on the NIOS II core and HPS core. All reported numbers are in units of frames per second.
7.4 Comparison of USHA decoder and the 3x3 and 2x2 NoC Based Decoders
LIST OF FIGURES
2.1 Top level diagram of h.264 decoder. Based on diagram from [1].
2.2 Simplified depiction of h.264 encoding and decoding.
2.3 Organization of NAL stream. Based on description from [2].
2.4 Format of encoded Exp-Golomb data based on description from [3].
2.5 Zig-Zag scan order shown on a 4x4 block [4].
2.6 Depiction of 4x4 intra prediction. Grey samples are from neighboring macroblocks. White blocks are from current macroblock. The arrows show the direction of the 8 non-DC prediction modes.
2.7 Inputs (gray blocks) and outputs (white blocks) for luma sub-pixel motion compensation portion of inter prediction [4]. Here each row or column of gray samples corresponds to a potential input to Eq. 2.8. The white samples are the outputs of various applications of this equation as well as Eq. 2.9.
2.8 Inputs and outputs for Chroma sub-pixel motion compensation portion of inter prediction. Based on diagram from [4].
2.9 Pseudocode of normal deblocking rules.
3.1 Example of communication parallelism in NoC.
3.2 Examples of NoC topologies. 3x3 Mesh (Left), 3x4 Torus, 8 point Star (Right)
4.1 Mapping of h.264 decoder to an NoC.
5.1 Nios II Node used for both the parser node and the buffer node.
5.2 Flit formatter component used in NIOS II Nodes
5.3 Simulation results showing the flit formatting and CPU-FPGA hand shaking.
5.4 Nios II Node NoC receive state machine.
5.5 Start inter prediction command.
5.6 Start intra prediction command.
5.7 Allocate frame command.
5.8 New frame command.
5.9 High Level design of the NoC interface component used by each of the nodes in the network.
5.10 General structure of the state machines used in the hardware-only nodes.
5.11 High level design of inverse quantization inverse transform node.
5.12 Request packet format for the inverse quantization inverse transform node.
5.13 Response packet format for the inverse quantization inverse transform node.
5.14 Simulation of the IQIT node.
5.15 High level design of luma motion compensation node.
5.16 Request packet format for the luma motion compensation node.
5.17 Response packet format for the luma inter prediction node.
5.18 Simulation of the luma motion compensation node.
5.19 High level design of chroma motion compensation node.
5.20 Request packet format for the chroma motion compensation node.
5.21 Response packet format for the chroma inter prediction node.
5.22 Simulation of the chroma motion compensation node.
5.23 High level design of intra prediction node.
5.24 Command packets recognized by the intra prediction node.
5.25 Organization of the intra core. Note that the upper three bytes of the input samples written to address four are ignored.
5.26 Response packet format for the intra prediction node.
5.27 Simulation of the intra prediction node showing the 16x16 plane prediction mode.
5.28 High level design of the Deblocking Filter Node.
5.29 Deblocking Filter Node request packet format.
5.30 Response packet format for the Deblocking Filter Node
5.31 Simulation results showing the correct operation of the deblocking node.
5.32 High level design of the Display Node.
5.33 Display node write pixel command packet format.
6.1 Scaled down version of the proposed NoC based h.264 decoder.
7.1 3x3 implementation decoding the "hall" test video sequence.
7.2 2x2 implementation decoding the "akiyo" video sequence.
7.3 Diagram of timer start/stop positions within the buffer node software.
7.4 Average time spent in each section of the buffer node code for the 3x3 NoC based decoder.
7.5 Average time spent in each section of the buffer node code for the 2x2 NoC based decoder.
8.1 Architecture with modified buffer and parser node.
8.2 A dual buffer node architecture of a 3x4 NoC based decoder. Note that in this architecture, the NoC topology is increased from 3x3 to 3x4.
8.3 Diagram of the current communication pattern (left) and an alternative which may allow for better parallelism (right).
CHAPTER 1
INTRODUCTION
1.1 Problem Statement
This thesis develops an h.264 video decoder on a field programmable gate array (FPGA).
This design uses a network-on-chip (NoC) as the communication infrastructure. The
implementation of this video decoder begins with the implementation of each of the algorithms
which make up the h.264 standard. Each of these algorithms is implemented either directly on the
FPGA fabric using VHDL or on a NIOS II soft core processor. When applicable, the functionality
of each of these algorithms is verified using hardware description language (HDL) simulation.
Each algorithm is then mapped to the NoC with the goals of providing communication
parallelism and minimizing communication delay between algorithms with frequent
communication.
1.2 Objectives
The primary objectives of this thesis are 1) to demonstrate, for the first time, a working h.264
video decoder based on an NoC communication infrastructure on an FPGA and 2) to
study the performance and resource utilization of this design. An additional objective of
this work is to analyze the scalability of the proposed design. Finally, to enable future work in
NoC based video decoders, the source materials for this thesis will be made publicly available.
1.3 Previous Work
Previous work includes full and partial h.264 decoder implementations on FPGAs which
do not use NoCs as the communication infrastructure as well as studies which develop NoC
based h.264 decoders in simulation but do not test them on real hardware. These two areas of
related works are discussed in the following subsections.
1.3.1 H.264 Decoder Designs for FPGAs
Examples of full implementations of non-NoC based h.264 decoders on FPGAs include
[5, 6, 7]. Partial implementations of h.264 decoders for FPGAs include [8, 9, 10, 11]. The study in
[8] presented implementations of inverse quantization inverse transform (IQIT), intra prediction,
inter prediction, and deblocking modules. In addition, that study presented a method for
debugging these modules on the FPGA using an available hardwired processor. However, that
study did not report an implementation of the entropy decoder. The solution in [9] provided optimization
techniques for the entropy decoder and intra prediction modules. The study in [10] reported a
pipelined design of the intra prediction module, while the study in [11] reported a design solution
for a CAVLC decoder, which is a component of the entropy decoder module.
1.3.2 NoC based H.264 Decoder Simulations
Several simulations of NoC based h.264 decoders have been reported in [1, 12, 13, 14]. The
work in [1] studied the area occupied by various NoC components on FPGAs. It also reported
simulation results, obtained using MLDesigner, on the bandwidth required between different modules of
the decoder. The study in [1] reported simulations on a 3x4 mesh topology. Similarly, the study in
[12] presented simulation results for a network topology consisting of 2 star networks connected
by a 3x3 grid. Both of these studies conducted a similar analysis to find the traffic between
different modules of the decoder. The study in [13] uses linear programming to map h.264
modules onto mesh and fat-tree NoC architectures in a way that maximizes throughput and
reduces power relative to a random mapping. A comparison between a generic NoC architecture
based decoder and an NoC architecture tailored to the h.264 decoder is reported in [14]. They
reported significant improvements in area, power and performance in the custom NoC versus the
general NoC. The study in [15] reported synthesis results for an NoC based h.264 decoder
targeting a Virtex 4 FPGA implementation. The paper briefly mentions the decoder running on an
FPGA but it does not include test results about it. The work in [16] further discusses the h.264
implementation on an FPGA but reports results only as HDL level simulations.
The study in [17] proposes a unified software and hardware architecture for video
decoding where the communication infrastructure is implemented with an array of modified NoC
routers. The processing elements are light weight processor tiles that enable software and
hardware implementations to coexist, while a programmable interconnect enables dynamic
interconnection of the tiles. While they reported an FPGA prototype, the source codes are not
publicly available. An important note on this design is that while an NoC is used for some
communication, a shared program and data bus is used by each tile as well, making the
communication infrastructure a hybrid of NoC and bus techniques.
1.4 Contributions
This thesis develops the first fully NoC based h.264 video decoder which is verified on
real hardware. Additionally, this thesis studies the performance and resource utilization of this
NoC based decoder. Finally, all of the source materials for this thesis are made open source and
publicly available to enable future research in related areas.
1.5 Thesis Organization
The remainder of this thesis is organized as follows. Chapter 2 gives an introduction to
the h.264 decoding standard. Chapter 3 gives a background on the NoC communication
infrastructure. Chapter 4 gives a high level overview of the NoC based h.264 decoder
implemented in this thesis while Chapter 5 discusses the components of this architecture in more
detail. A modified version of the proposed h.264 decoder is presented in Chapter 6 to
demonstrate its scalability. Results from the performance testing and profiling are presented in
Chapter 7. Finally, recommendations for future work are discussed in Chapter 8.
CHAPTER 2
DESCRIPTION OF H.264 ALGORITHM
To provide some context for the decoding algorithm, a brief overview of the encoding
process is given in this section. An h.264 encoder consists of a variety of algorithms which
process an input video stream before sending it to an entropy coding stage which performs the
actual compression. The goal of these algorithms is to reduce the entropy (see Eq. 2.1) of the
video stream as much as possible before the video stream reaches the entropy coder. An h.264
encoder uses prediction based techniques and transform based techniques to reduce the entropy
of the transmitted video. A simplified encoding and decoding process is shown in Fig. 2.2.
Figure 2.1: Top level diagram of h.264 decoder. Based on diagram from [1].
The prediction based techniques for reducing entropy use either spatial or temporal
redundancy in the input video stream to predict the pixel values for a section of the frame being
encoded. Since the decoder has identical prediction algorithms, the encoder only needs to
transmit the error of the prediction instead of the actual pixel values. However, before
transmission further processing is done on these error values. This additional processing comes in
the form of the transformation and quantization stage of the encoding process. A transformation
similar to the discrete cosine transform, referred to in this thesis as the transform, is used to
transform the error values, or residuals, for a section of the frame into the frequency domain.
After transformation into the frequency domain, a quantization matrix is applied to the residuals
which preserves more of the low frequencies than the high frequencies. After prediction,
transformation, and quantization, the residuals and the information required to perform the
prediction algorithms are compressed using an entropy coder and stored or transmitted. On the
decoder side, the entropy coded data is decoded, the prediction algorithms are employed, and the
inverse quantized inverse transformed (IQIT) residuals are added to the prediction results to
reconstruct the original video.
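The relationship between prediction, residual, and reconstruction described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the transform and quantization stages are omitted (so the reconstruction here is lossless), the sample values are made up, and the function names `encode_residual` and `reconstruct` are hypothetical rather than taken from the thesis source.

```python
def encode_residual(original, prediction):
    """Encoder side: keep only the prediction error (residual) for each sample."""
    return [o - p for o, p in zip(original, prediction)]


def reconstruct(prediction, residual, bit_depth=8):
    """Decoder side: add the residual to the prediction and clip to the sample range."""
    max_value = (1 << bit_depth) - 1
    return [min(max(p + r, 0), max_value) for p, r in zip(prediction, residual)]


# Made-up row of four samples and a flat prediction (illustrative values only).
original = [100, 102, 101, 99]
prediction = [100, 100, 100, 100]

residual = encode_residual(original, prediction)  # small values clustered near zero
decoded = reconstruct(prediction, residual)       # matches the original samples
```

Because the decoder forms the same prediction as the encoder, only the small residual values need to be transmitted.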
A high level description of the h.264 decoding algorithm is presented in the block
diagram from Fig. 2.1. In this diagram the prediction algorithms are shown as intra prediction, a
spatial prediction algorithm, and inter prediction, a motion based prediction algorithm.
Additionally, an h.264 video decoder must have a frame buffer for storing the current frame as
well as a frame buffer for storing the previous frame or frames. Finally, a deblocking filter is
necessary for removing artifacts from the decoded frame.
\[ H(x) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i) \tag{2.1} \]
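As a rough numerical illustration of Eq. 2.1, the sketch below estimates the entropy of a list of symbols from their relative frequencies. The helper name `shannon_entropy` and the sample data are assumptions made for this example; the point is only that residuals clustered around a few values have lower entropy than the raw samples they were derived from.

```python
import math
from collections import Counter


def shannon_entropy(samples):
    """Estimate the entropy of Eq. 2.1, in bits per symbol, from symbol frequencies."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Made-up samples: a smooth ramp, and the same ramp after predicting each
# sample from its left neighbor (the first sample is sent unchanged).
raw = [10, 11, 12, 13, 14, 15, 16, 17]
residual = [raw[0]] + [b - a for a, b in zip(raw, raw[1:])]

raw_entropy = shannon_entropy(raw)            # eight distinct symbols
residual_entropy = shannon_entropy(residual)  # mostly the same symbol
```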
2.1 Input Format
The coded video input is made up of network abstraction layer (NAL) units, which can
either be in a packet or a stream of bytes [2]. The NAL units can be further divided into video
coding layer (VCL) and non-VCL units. VCL NAL units contain the information required to
reconstruct the pictures. Non-VCL NAL units contain parameter sets and other additional data. A
depiction of the VCL portion of the NAL input stream is shown in Fig. 2.3. Access units contain
the information for a picture. Each coded video sequence is independently decodable and the
NAL stream contains one or more coded video sequences making up the entire video. The parameter
sets can be either part of the NAL stream or communicated through an additional channel.
Figure 2.2: Simplified depiction of h.264 encoding and decoding.
Figure 2.3: Organization of NAL stream. Based on description from [2].
2.2 Color Format
H.264 uses the luminance chroma blue-difference chroma red-difference color space
referred to as LCbCr and also as YCbCr. Within the context of this thesis, YUV is also synonymous
with LCbCr. This color space is used to separate the light intensity of a pixel from its color. This is
done to allow sub-sampling of the color components of the image since details are more apparent
in light intensity than light color [2]. Each frame in h.264 uses one chroma sample per channel for
every four luma samples. In total, there are half as many chroma samples as there are luma samples.
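The sample counts implied by this sub-sampling can be checked with a short calculation. The sketch below assumes that each chroma channel is halved in both the horizontal and vertical directions, which is consistent with the statement that the two chroma channels together hold half as many samples as the luma channel. The function name and the 352x288 frame size are illustrative assumptions, not values taken from the thesis.

```python
def sample_counts(width, height):
    """Return (luma, chroma_per_channel, chroma_total) sample counts for one frame.

    Assumes each chroma channel is sub-sampled by two horizontally and vertically.
    """
    luma = width * height
    chroma_per_channel = (width // 2) * (height // 2)
    return luma, chroma_per_channel, 2 * chroma_per_channel


# Hypothetical 352x288 frame used only as an example.
luma, chroma_per_channel, chroma_total = sample_counts(352, 288)
```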
2.3 Entropy Decoder
There are three entropy decoders used in an h.264 video decoder [18]. These entropy
decoders are: Exponential-Golomb, context adaptive variable length coding (CAVLC) and context
based adaptive binary arithmetic coding (CABAC). Exponential-Golomb is used for everything
except the transform domain coefficients, where either CAVLC or CABAC is used [18]. An
overview of each of these decoders is given below.
2.3.1 Exponential-Golomb
Exponential-Golomb or Exp-Golomb is a variable length coding scheme. The format for
Exp-Golomb is shown in Fig. 2.4. In this diagram the bits d[i] represent a codeword which can be
mapped to a non-negative integer. Once the codeword is found it can be decoded using Eq. 2.2. The diagram
and equation given below are based on descriptions in [3], where more details are provided on the
decoding process. Additionally, it is important to note that this coding only works for
non-negative numbers [19]. Integers are mapped to unsigned numbers either directly (in the case
that all of the integers are non-negative), by alternately assigning positive and negative numbers
to successive code words or through a predefined mapping specified in the standard [19].
Figure 2.4: Format of encoded Exp-Golomb data based on description from [3].
\[ d_{decoded} = 2^{n} + d_{encoded} - 1 \tag{2.2} \]
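A minimal software sketch of Eq. 2.2 is given below. It assumes the usual Exp-Golomb layout in which n leading zero bits are followed by a single 1 bit and then the n-bit value d_encoded, and it represents the bit stream as a string of '0' and '1' characters purely for readability. The function names `decode_ue` and `map_signed` are hypothetical, and `map_signed` shows only the alternating positive/negative mapping mentioned above, not the predefined table mappings in the standard.

```python
def decode_ue(bits, pos=0):
    """Decode one unsigned Exp-Golomb codeword starting at bit index pos.

    bits is a string of '0'/'1' characters. Returns (decoded_value, next_pos).
    """
    n = 0
    while bits[pos + n] == "0":  # count the leading zero bits
        n += 1
    start = pos + n + 1          # skip the separating 1 bit
    d_encoded = int(bits[start:start + n], 2) if n else 0
    return (1 << n) + d_encoded - 1, start + n  # Eq. 2.2


def map_signed(code_num):
    """Map 0, 1, 2, 3, 4, ... to 0, +1, -1, +2, -2, ... (alternating signs)."""
    magnitude = (code_num + 1) // 2
    return magnitude if code_num % 2 else -magnitude


# Three codewords packed back to back: '1', '010', '011'.
stream = "1010011"
values = []
pos = 0
while pos < len(stream):
    value, pos = decode_ue(stream, pos)
    values.append(value)
```

Since each codeword announces its own length through its leading zeros, codewords can be read back to back without separators, as the loop over `stream` shows.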
2.3.2 Context Adaptive Variable Length Coding
Context adaptive variable length coding (CAVLC) is one of the methods used to encode
transform domain coefficient values in h.264. The process for decoding CAVLC is described in
section 9.2 of [4] as well as in [3]. The main inputs to the CAVLC encoder, and therefore the
outputs of the CAVLC decoder, are the residuals (error values) of the prediction algorithms used
in h.264. These residuals are in the transform domain and are reordered using a zig-zag scan as
illustrated in Fig. 2.5. This zig-zag order allows efficient encoding because CAVLC takes
advantage of the large number of zeros towards the end of this array during encoding. Additional
inputs are the maximum number of non-zero coefficients and the index of the current 4x4 block
being decoded for the luma or either chroma channel [4].
Figure 2.5: Zig-Zag scan order shown on a 4x4 block [4].
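The reordering can be sketched as a table look-up. The example below assumes the commonly used 4x4 frame zig-zag order, expressed as raster indices (row * 4 + column), and assumes that this order matches Fig. 2.5. The names ZIGZAG_4X4, zigzag and inverse_zigzag are hypothetical.

```python
# Raster index (row * 4 + column) visited at each zig-zag scan position.
# Assumed to match the order illustrated in Fig. 2.5.
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]


def zigzag(block):
    """Flatten a 4x4 block (list of rows) into zig-zag order."""
    return [block[idx // 4][idx % 4] for idx in ZIGZAG_4X4]


def inverse_zigzag(coeffs):
    """Place 16 zig-zag ordered coefficients back into a 4x4 block."""
    block = [[0] * 4 for _ in range(4)]
    for scan_pos, idx in enumerate(ZIGZAG_4X4):
        block[idx // 4][idx % 4] = coeffs[scan_pos]
    return block
```

Because the high frequency coefficients end up at the tail of the scanned array, a block with only a few low frequency coefficients produces a long trailing run of zeros.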
2.3.3 Context Based Adaptive Binary Arithmetic Coding
Context based adaptive binary arithmetic coding (CABAC) is an entropy coding
technique based on binary arithmetic coding. This method is about 5-15% more efficient than
CAVLC, but is not required for the baseline profile [2]. Because this decoder is not required, this
thesis does not include CABAC in the entropy decoder implementation.
2.4 Inverse Quantization and Inverse Transform
Data from the entropy coding stage is in the frequency domain and needs to be
transformed back into the spatial domain using the inverse transform described in [20].
Additionally, the quantization step which occurs in the encoder must be reversed here as well,
which happens before transformation back into the spatial domain. The inputs to this process are
the coefficients from the entropy decoding stage and the outputs are the luma or chroma residuals
to correct the prediction results in either the inter or intra prediction blocks [2]. The transform is
applied to 4x4 blocks, and in some cases may be preceded by an additional 2x2 Hadamard transform
(see Eq. 2.5) [4], which is used for smooth areas in the video stream
[2]. The inverse transform equation is given by Eq. 2.3 [20] and the inverse quantization equation
is given by Eq. 2.4 [4]. In Eq. 2.4 Qp is the quantization parameter, which is provided by the
input stream, and LevelScale4x4 is a look-up table defined in the standard [4]. Additionally, cij
and dij are samples from the quantized and inverse quantized residuals respectively.
$$
x_r = \left(
\begin{bmatrix}
1 & 1 & 1 & 1/2 \\
1 & 1/2 & -1 & -1 \\
1 & -1/2 & -1 & 1 \\
1 & -1 & 1 & -1/2
\end{bmatrix} X_r
+ 2^{5}
\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}
\right) \gg 6 \tag{2.3}
$$
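$$
d_{ij} =
\begin{cases}
\left(c_{ij} \cdot \mathrm{LevelScale4x4}(Q_p \bmod 6, i, j) + 2^{3 - Q_p/6}\right) \gg (4 - Q_p/6) & Q_p < 24 \\
\left(c_{ij} \cdot \mathrm{LevelScale4x4}(Q_p \bmod 6, i, j)\right) \ll (Q_p/6 - 4) & \text{otherwise}
\end{cases} \tag{2.4}
$$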
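$$
f =
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & 1 & -1 & -1 \\
1 & -1 & -1 & 1 \\
1 & -1 & 1 & -1
\end{bmatrix}
\ast
\begin{bmatrix}
c_{00} & c_{01} & c_{02} & c_{03} \\
c_{10} & c_{11} & c_{12} & c_{13} \\
c_{20} & c_{21} & c_{22} & c_{23} \\
c_{30} & c_{31} & c_{32} & c_{33}
\end{bmatrix}
\ast
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & 1 & -1 & -1 \\
1 & -1 & -1 & 1 \\
1 & -1 & 1 & -1
\end{bmatrix} \tag{2.5}
$$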
The resulting output samples are referred to as residuals and are added to the prediction
result for either intra or inter predictions depending on the type of frame.
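The inverse quantization of Eq. 2.4 and the Hadamard transform of Eq. 2.5 can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this thesis. Qp/6 is taken as integer division, and for Qp of 24 or more the sketch shifts left by Qp/6 - 4, as in the standard [4]. The LevelScale4x4 table is not reproduced here. Instead, level_scale is a hypothetical caller-supplied function standing in for that look-up table. All function names are hypothetical.

```python
def inverse_quantize(c, qp, level_scale):
    """Apply Eq. 2.4 to a 4x4 block c of quantized coefficients.

    level_scale(m, i, j) stands in for the LevelScale4x4 look-up table
    defined in the standard; its values are supplied by the caller.
    """
    shift = qp // 6  # Qp/6 as integer division (assumption of this sketch)
    d = [[0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            scaled = c[i][j] * level_scale(qp % 6, i, j)
            if qp < 24:
                d[i][j] = (scaled + (1 << (3 - shift))) >> (4 - shift)
            else:
                d[i][j] = scaled << (shift - 4)
    return d


# 4x4 Hadamard matrix from Eq. 2.5.
H = [[1, 1, 1, 1],
     [1, 1, -1, -1],
     [1, -1, -1, 1],
     [1, -1, 1, -1]]


def matmul4(a, b):
    """Multiply two 4x4 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def hadamard_4x4(c):
    """Eq. 2.5: f = H * c * H."""
    return matmul4(matmul4(H, c), H)
```

Passing the table as a function keeps the sketch independent of the table contents, so the shift and rounding behavior of Eq. 2.4 can be checked in isolation.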
2.5 Intra Prediction
An intra picture is a picture which does not use motion based prediction. This allows it to
be decoded without previous frames from the input stream [2]. Intra prediction is used to decode
these types of pictures. A 4x4 luma macroblock which is part of an intra frame can either be
directly encoded or encoded using one of nine modes. Eight of these modes copy sample values from
neighboring macroblocks in one of eight directions. The ninth mode determines the DC value from the
neighboring samples and copies this value into the entire macroblock [2]. An important aspect of
this prediction mode is that the encoder selects which of these prediction modes has the lowest
error, resulting in improved compression. For 16x16 luma macroblocks, there are four intra
prediction modes: vertical, horizontal, DC and plane. 8x8 luma intra prediction is also supported
in some profiles of the decoder, but is not used in this thesis. Chroma intra prediction is also
supported for 4x4 chroma blocks. Because of the similarities between luma and chroma intra
prediction, the luma algorithms are reused to also perform chroma intra prediction in this design.
The intra prediction equation for 4x4 diagonal down left is shown in Eq. 2.6 as an example.
Additional modes can be found in the h.264 standard [4].
$$
pred[x, y] =
\begin{cases}
(p[6,-1] + 3 \cdot p[7,-1] + 2) \gg 2 & x = y = 3 \\
(p[x+y,-1] + 2 \cdot p[x+y+1,-1] + p[x+y+2,-1] + 2) \gg 2 & \text{otherwise}
\end{cases} \tag{2.6}
$$
Figure 2.6: Depiction of 4x4 intra prediction. Grey samples are from neighboring macroblocks.
White blocks are from current macroblock. The arrows show the direction of the 8 non-DC
prediction modes.
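A minimal sketch of the diagonal down left mode of Eq. 2.6 is given below. It assumes the eight reference samples p[0,-1] through p[7,-1] from above the block are passed as a list named top, and that the operator lost before the final 2 in the extracted equation is a right shift. The function name and the pred[y][x] indexing are hypothetical choices for this example.

```python
def intra4x4_diagonal_down_left(top):
    """Predict a 4x4 block from the eight samples above it (Eq. 2.6).

    top[k] corresponds to p[k, -1] for k = 0..7.
    Returns the prediction as pred[y][x].
    """
    if len(top) != 8:
        raise ValueError("expected 8 reference samples")
    pred = [[0] * 4 for _ in range(4)]
    for y in range(4):
        for x in range(4):
            if x == 3 and y == 3:
                # Bottom right corner: p[7,-1] is weighted three times.
                pred[y][x] = (top[6] + 3 * top[7] + 2) >> 2
            else:
                k = x + y
                pred[y][x] = (top[k] + 2 * top[k + 1] + top[k + 2] + 2) >> 2
    return pred
```

Every sample with the same value of x + y receives the same prediction, which is what produces the diagonal structure of this mode.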
2.6 Inter Prediction
This type of prediction uses a reference picture which has already been decoded and a
motion vector to predict the output picture [2]. Inter prediction can be broken down into three
steps. First, motion vector prediction determines the required motion vectors for the next two
steps. Second, luma motion compensation applies the motion vector to the luma component of
the reference frame or frames. Third, chroma motion compensation applies the motion vector
to the chroma components of the reference frame or frames, but uses a different algorithm for
doing so.
2.6.1 Motion Vector Prediction
The motion vectors used to determine the inter prediction result for a particular
macroblock are correlated with neighboring macroblocks. The h.264 coding standard uses this as
another opportunity to reduce the total entropy of the transmitted video by using the neighboring
motion vectors to predict the motion vector of the current macroblock. In the case of P-slices
(inter predicted slices which are not bi-directional), the predicted motion vector can be
either one of the motion vectors from a neighboring macroblock or sub-macroblock, or the median
of the three neighbors used in the prediction method [4].
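The median case can be sketched as below, assuming the median is taken separately on the horizontal and vertical components of the three neighboring motion vectors. The names median_mv, mv_a, mv_b and mv_c are hypothetical, and the selection of which neighbors to use is not covered by this sketch.

```python
def median_mv(mv_a, mv_b, mv_c):
    """Component-wise median of three (x, y) motion vectors."""
    return tuple(sorted(component)[1]
                 for component in zip(mv_a, mv_b, mv_c))
```

For neighbors (1, 5), (3, -2) and (2, 9) the predicted vector is (2, 5), which need not equal any single neighbor.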
2.6.2 Luma Motion Compensation
P-type macroblocks can be sized by powers of 2 between 16x16 and 4x4 and may be
non-square. B-type macroblocks (inter predicted macroblocks which are bi-directional) can be
sized 16x16 to 8x8 and can also be non-square. Motion vectors are quarter-sample accurate. A
variety of interpolation filters are used to achieve this accuracy. Up to two 6-tap FIR filters are
used for interpolation when half pixel accuracy is needed, and averaging is done between the
integer and half numbered samples to determine quarter sample values [2]. Fig. 2.7 shows the
input samples and interpolator results for luma sub-pixel motion compensation. Equations 2.7-2.12
give the operations used for sub-pixel luma motion compensation and show how to calculate
selected example outputs. In these equations x represents a vector of six luma samples, x_0 ... x_5.
Similarly, X represents a 6x6 block of luma samples where each row is referenced as X_i.
$$
\mathrm{clip}(x) =
\begin{cases}
255 & x > 255 \\
0 & x < 0 \\
x & \text{otherwise}
\end{cases} \tag{2.7}
$$
$$
\mathrm{filter}(x) = (x_0 - 5x_1 + 20x_2 + 20x_3 - 5x_4 + x_5 + 16) / 2^{5} \tag{2.8}
$$
Figure 2.7: Inputs (gray blocks) and outputs (white blocks) for luma sub-pixel motion
compensation portion of inter prediction [4]. Here each row or column of gray samples
corresponds to a potential input to Eq. 2.8. The white samples are the outputs of various
applications of this equation as well as Eq. 2.9.
$$
\mathrm{average}(x) = (x_0 + x_1 + 1) / 2 \tag{2.9}
$$
$$
j(X) = \mathrm{clip}\left(\mathrm{filter}\left(
\begin{bmatrix}
\mathrm{filter}(X_0) \\
\mathrm{filter}(X_1) \\
\mathrm{filter}(X_2) \\
\mathrm{filter}(X_3) \\
\mathrm{filter}(X_4) \\
\mathrm{filter}(X_5)
\end{bmatrix}
\right)\right) \tag{2.10}
$$
$$
b(X) = \mathrm{clip}(\mathrm{filter}(X_2)) \tag{2.11}
$$
$$
f(X) = \mathrm{clip}(\mathrm{average}(b, j)) \tag{2.12}
$$
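Equations 2.7-2.12 can be expressed directly in code. The sketch below follows the equations as written in this chapter rather than the bit-exact procedure in the standard [4]. It assumes that the divisions by 2^5 and by 2 are integer divisions implemented as arithmetic right shifts. The function names filter6, sample_b, sample_j and sample_f are hypothetical.

```python
def clip(x):
    """Eq. 2.7: limit a value to the 8 bit sample range."""
    return max(0, min(255, x))


def filter6(x):
    """Eq. 2.8: six tap FIR filter with rounding, over samples x[0..5]."""
    x0, x1, x2, x3, x4, x5 = x
    return (x0 - 5 * x1 + 20 * x2 + 20 * x3 - 5 * x4 + x5 + 16) >> 5


def average(a, b):
    """Eq. 2.9: rounded average of two samples."""
    return (a + b + 1) >> 1


def sample_b(X):
    """Eq. 2.11: half sample value from row X[2] of a 6x6 block."""
    return clip(filter6(X[2]))


def sample_j(X):
    """Eq. 2.10: filter every row, then filter the six results."""
    return clip(filter6([filter6(row) for row in X]))


def sample_f(X):
    """Eq. 2.12: quarter sample value as the average of b and j."""
    return clip(average(sample_b(X), sample_j(X)))
```

The filter taps sum to 32, so a flat region passes through unchanged, while a sharp edge can overshoot the 8 bit range, which is why the clip of Eq. 2.7 follows each filter stage.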
2.6.3 Chroma Motion Compensation
Chroma motion compensation uses the same motion vectors used by luma motion
compensation at the same location. However, the interpolation method used by chroma motion
compensation is considerably different from luma motion compensation. Interpolation for chroma
motion compensation is performed using a two-dimensional linear interpolator (see Eq. 2.13).
Additionally, only 4 reference samples are needed for each predicted sample, versus the 36 reference
samples required for each predicted sample in luma motion compensation. The locations of the
input samples A, B, C, and D are shown in Fig. 2.8.
Figure 2.8: Inputs and outputs for Chroma sub-pixel motion compensation portion of inter
prediction. Based on diagram from [4].
$$
p = \left((8 - xFrac)(8 - yFrac)A + xFrac(8 - yFrac)B + (8 - xFrac)\,yFrac\,C + xFrac\,yFrac\,D + 32\right) \gg 6 \tag{2.13}
$$
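Eq. 2.13 can be sketched in a few lines. The example assumes xFrac and yFrac are integers in the range 0 to 7, and that the operator lost before the final 6 in the extracted equation is a right shift. The function name chroma_interpolate is hypothetical.

```python
def chroma_interpolate(a, b, c, d, x_frac, y_frac):
    """Eq. 2.13: bilinear interpolation between samples A, B, C and D.

    x_frac and y_frac are the fractional offsets, assumed to be 0..7.
    """
    return ((8 - x_frac) * (8 - y_frac) * a
            + x_frac * (8 - y_frac) * b
            + (8 - x_frac) * y_frac * c
            + x_frac * y_frac * d
            + 32) >> 6
```

The four weights always sum to 64, so the shift by 6 normalizes the result and the added 32 rounds it to the nearest integer. With both fractions equal to zero the output is simply A.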
2.7 Deblocking Filter
Since h.264 has a block oriented structure, artifacts are common at the boundaries of these
blocks [2]. The h.264 decoding algorithm uses a special digital filter at these block edges to
remove artifacts between blocks while preserving any true edges which occur at the boundaries of
blocks, which avoids blurring the image. To achieve this, threshold functions α and β are used to determine if
the boundary of a block is a true edge or an artifact. The threshold functions α and β take as input the same
quantization parameter referenced in the IQIT step of the algorithm. The deblocking filter uses the
4 samples nearest the boundary from both macroblocks on the boundary and changes up to 3 of
these samples. The pseudocode description for the normal mode of the deblocking filter process is
shown in Fig. 2.9.
Algorithm: Normal Deblocking
d0 ← |p0 − q0| < α(QP)
d1 ← |p1 − p0| < β(QP)
d2 ← |q1 − q0| < β(QP)
d3 ← |p2 − p0| < β(QP)
d4 ← |q2 − q0| < β(QP)
if d0 and d1 and d2 then
    FILTER(p0, q0)
end if
if d3 or d4 then
    FILTER(p1, q1)
end if
Figure 2.9: Pseudocode of normal deblocking rules.
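The decision rules of Fig. 2.9 can be written as a small function. The sketch below only evaluates the conditions and reports which sample pairs would be filtered. The FILTER operation itself is not specified in this section and is therefore not implemented. The arguments alpha and beta are assumed to be the threshold values already evaluated for the current QP, and the function name is hypothetical.

```python
def normal_deblock_decisions(p, q, alpha, beta):
    """Evaluate the rules of Fig. 2.9 for one line of samples.

    p = (p0, p1, p2) and q = (q0, q1, q2) are the samples on each side
    of the block edge, ordered by increasing distance from the edge.
    Returns (filter_p0_q0, filter_p1_q1) as booleans.
    """
    d0 = abs(p[0] - q[0]) < alpha
    d1 = abs(p[1] - p[0]) < beta
    d2 = abs(q[1] - q[0]) < beta
    d3 = abs(p[2] - p[0]) < beta
    d4 = abs(q[2] - q[0]) < beta
    return (d0 and d1 and d2), (d3 or d4)
```

A large step across the edge makes d0 false, which is how a true edge in the image is left unfiltered.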
CHAPTER 3
INTRODUCTION TO NETWORK-ON-CHIP
3.1 Overview of Network-on-Chip
Network-on-Chip (NoC) is an emerging on-chip communication technology for
integrated System-on-Chip (SoC) designs. This communication technique allows multiple
modules on the same integrated circuit (IC) to communicate concurrently using packets. These
packets are routed from the sender to the receiver through a series of routers and channels
organized according to a chosen topology. The packet switched behavior of an NoC allows a
much higher level of parallelism in communication relative to a more conventional bus based
communication scheme [21]. Additionally, NoCs have much better scalability when compared to
both bus and point to point communication schemes. Finally, the physical properties of buses
often make them difficult to operate at high frequencies [21].
Modern SoCs consist of several modules on the same IC. These modules need to
communicate and access shared resources in order to implement the desired behavior in a specific
application. Several communication techniques exist to solve this problem with NoC being the
most recent. In an NoC based system each module on an IC is connected to a port on a router.
Connecting a module to a port also associates that module with an address. This address is then
used by other components in the network to communicate with this module. A module which is
mapped to a port in an NoC based system is also referred to as a node or as a processing element
(PE).
Communication in an NoC based system occurs using packets. A packet consists of one
or more flits. A flit is the unit of data in an NoC. The size of a flit is often related to the physical
width of the channels, which are the links between two routers. However, this is not necessarily
the case as a flit is defined in terms of flow control [22]. Each flit in a packet is passed off from one
router to the next as it progresses towards its destination node. As a flit leaves one router it
releases the resources it previously held allowing them to be used for another flit. These resources
include flit buffer locations and the channel the flit was previously routed through. Because these
resources are free they can be used to route other flits, even flits from other packets, through the
network. An example of this, and of how it allows for high levels of parallelism in an NoC based
system, is shown in Fig. 3.1. In this figure, A, B, and C are flits which
are simultaneously injected into the network.
Figure 3.1: Example of communication parallelism in NoC.
3.2 Network-on-Chip Design Parameters
NoCs have a variety of design parameters which impact the performance, resource
utilization, and operation of the NoC. These parameters include the NoC topology, which can be a
regular mesh, or grid shape, a tree shape, or a variety of other topologies including topologies
specific to a given application. Examples of various NoC topologies are shown in Fig. 3.2.
Another important parameter in an NoC is the width, in bits, of the flit. This parameter impacts
both the area utilization and performance since a wider flit will give better network bandwidth
but also require more circuitry and larger buffers in the routers and channels.
Figure 3.2: Examples of NoC topologies. 3x3 Mesh (Left), 3x4 Torus (Center), 8 point Star (Right)
3.3 Comparison to Other Communication Schemes
There are two other communication schemes worth comparing against NoCs. These
communication schemes are buses and point to point links. Bus based designs suffer from two
major limitations. First, buses scale poorly in performance since only one component can be
writing to the bus at a time [21]. On a system with a large number of components this becomes a
problem because the overall performance degrades as a result of many components waiting for
bus access. Additionally, as buses grow so does the critical path in the bus. This means that buses
connecting many components not only suffer from the waiting problem, but may also need to be
run at lower clock frequencies. A network-on-chip solves both of these problems. First, NoCs
scale very well in performance since a unit of communication, in this case a flit, only needs to
control the router and channel it is currently on. This means that there is much less waiting in an
NoC based design than in a bus based design since simultaneous communications will often not
require the same resources in NoCs. Additionally, as an NoC grows in size, the critical path
remains constant, assuming sufficient resources. In other words, a 4x4 or 5x5 NoC should be
capable of running at the same speed as a 3x3 NoC as long as the CAD tools are able to do their
job properly.
Point to point communication is very fast for small designs. In some sense, each of the
nodes within this design uses point to point communication between sub-modules because any
more substantial communication infrastructure would not be worth the performance or area cost.
However, for large designs, point to point communication becomes a problem very quickly. This
is because point to point designs scale very poorly in area, and eventually in performance as
critical path length increases. NoCs scale much better in area since adding an additional node
only requires a linear growth in communication infrastructure, for most common topologies,
rather than the quadratic growth in links required to connect every pair of nodes point to point.
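To make the area scaling concrete, the following sketch counts bidirectional links in a mesh and in a design where every pair of nodes has its own point to point link. It is a simplified model that ignores router area and assumes one node per router. The function names are hypothetical.

```python
def mesh_links(rows, cols):
    """Number of router to router links in a rows x cols mesh."""
    return rows * (cols - 1) + cols * (rows - 1)


def fully_connected_links(n):
    """Number of links needed to connect every pair of n nodes directly."""
    return n * (n - 1) // 2
```

For 9 nodes this gives 12 mesh links against 36 point to point links, and for 25 nodes it gives 40 against 300, so the gap widens quickly as the design grows.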
3.4 Network-on-Chip Tool For FPGAs
This thesis uses the CONNECT Network on Chip generator [23]. This tool allows users to
specify a variety of NoC parameters such as flit width, topology, number of virtual channels, flow
control method, flit buffer depth and a few others. The routing algorithm in CONNECT is based
on look up tables [24] which route each packet in a fixed manner. The CONNECT NoC tool
generates and provides Verilog code implementing the specified NoC. This thesis uses a 3x3 mesh
NoC with 64 bit wide flits, 8 flit deep buffers, 2 virtual channels and peek flow control. A
modified version of the architecture presented in Chapter 6 uses a 2x2 mesh network with
multiple ports per router.
CHAPTER 4
NETWORK-ON-CHIP BASED H.264 DECODER ARCHITECTURE
4.1 Introduction
This chapter describes the NoC based h.264 decoder architecture developed in this thesis.
The design process starts with partitioning the h.264 decoding algorithm into a set of modules
which will later be mapped to nodes in the network. This chapter also gives a behavioral
overview of these nodes and gives the rationale for the chosen mapping.
4.2 Partitioning the H.264 Algorithm into Processing Elements
The h.264 decoding algorithms discussed in Chapter 2 are partitioned into eight nodes or
processing elements (PEs). One PE is used for NAL parsing and entropy decoding. Additionally, a
separate PE is dedicated to each of the following seven functions: IQIT; intra prediction; sub-pixel
luma motion compensation; sub-pixel chroma motion compensation; reference and working
frame buffer control together with integer motion compensation; deblocking filter; and display driver.
These PEs closely follow the algorithms introduced in Chapter 2 with some exceptions. A more
detailed discussion of each PE and how they interact with each other is included below.
4.2.1 NAL Parsing and Entropy Decoding
The task of this PE is to parse the input NAL stream and perform necessary entropy
decoding functions on the data. These two functions are grouped together since the entropy decoder
is only required by the NAL parser. This PE interacts with two other PEs during normal
operation. This PE sends transform domain residuals to the IQIT node, which then processes
them and forwards the results to the buffer node. The parser node also sends information directly
to the buffer node. This information includes prediction modes, parameters for those prediction
modes, and the coordinates of the macroblock which those predictions should be performed on.
Additionally, the parser node also sends commands to start a new frame and to start the video
sequence.
4.2.2 IQIT
The IQIT node receives transform domain residuals from the parser. The IQIT node then
performs inverse quantization and inverse transform procedures on this data. After the IQIT
process has been completed, the residuals are sent to the buffer node and added to the prediction
results. The IQIT node does not perform the Hadamard transform used for smooth areas. Instead,
this is done by the parser node before transmission to the IQIT node.
4.2.3 Intra Prediction
The intra prediction node processes one block of up to 16x16 pixels at a time. The
reference pixels for this prediction are provided by the buffer node, while the parameters for this
prediction are provided by the parser node, but are routed through the buffer node before arriving
at intra prediction. Because of this, the intra prediction node only communicates directly with the
buffer, so a placement goal when performing the mapping is to place these nodes near each other.
4.2.4 Deblocking Filter
The deblocking filter node is used immediately before displaying the completed frame.
This node accepts pixels near a macroblock boundary from the buffer node, performs the
deblocking procedure on them, and sends the results back to the buffer node.
4.2.5 Luma Motion Compensation
The luma motion compensation algorithm is divided between two processing elements.
Integer luma motion compensation occurs on the same PE as the reference buffer since this
portion of inter prediction is bound exclusively by frame buffer access time. Sub-pixel motion is a
separate PE because it is computationally intensive and has good potential for parallelization.
Sub-pixel motion compensation uses up to two successive six tap FIR filters to interpolate pixel
values. Additionally, a third two point FIR filter may be used when quarter pixel accuracy is
required. The luma motion compensation node performs interpolation for eight luma samples at a
time. By interpolating eight samples at a time the luma motion compensation node is able to
match filter output to the throughput of the network. Although some profiles of the h.264
standard use multiple reference frames for inter prediction, the design implemented in this thesis
uses only a single reference frame.
4.2.6 Chroma Motion Compensation
Chroma motion compensation implements the sub-pixel motion compensation algorithm
for chroma samples. Similarly to the luma motion compensation node, this node performs
interpolation for eight samples at a time. In this case the eight samples are made up of 2x2 blocks,
one for each chroma channel. Implementing the chroma and luma motion compensation
algorithms on different nodes allows both of these algorithms to run in parallel, taking advantage
of the parallel communication made available by the NoC.
4.2.7 Buffer Control
The buffer node controls access to both the current (working) frame buffer and the reference frame buffer.
This node receives parameters from the parser node which trigger intra prediction, inter
prediction, deblocking and display events. When a packet containing a
command to perform a prediction action is received, the buffer node packages up any relevant
information for that prediction and sends it to the respective node or nodes. This node also
receives residuals from the IQIT node, which are added to the working frame buffer at the
specified location. Because the parser is capable of overwhelming the rest of the network during
certain algorithms, the buffer node also controls the rate at which the parser node sends commands using
one flit acknowledgment packets.
4.2.8 Display Driver
The display driver receives eight bit LCbCr pixels and converts these values into six bit
RGB values. These six bit RGB values are stored in RAM local to the display node and used to drive
an open source hardware VGA driver originally intended for use with the Raspberry Pi [25]. Six
bit RGB values are chosen since this matches the VGA driver's precision, although eight bit LCbCr values
could be used to determine full eight bit RGB values.
4.3 H.264 Algorithms on NoC Based Decoder
The following subsections provide a high level description of how each of the algorithms
involved in the decoding process is executed across the whole system.
4.3.1 Intra Prediction Process
The intra prediction process begins on the parser node when an intra-predicted
macroblock is parsed. At this point, the parser sends all the relevant information regarding this
macroblock to the buffer. This information contains a pre-formatted intra prediction request flit,
along with the buffer coordinates of the macroblock where intra prediction will be performed.
After forwarding this information to the buffer the parser node is free to continue with the parsing
process. When the buffer node receives the intra prediction information packet from the parser,
it collects the required reference samples to perform intra prediction and forwards these samples
along with the included pre-formatted intra prediction request to the intra prediction node. At
this point the buffer node idles until the intra prediction node sends a response. While it would be
possible to avoid idle time by having the buffer progress to the next command at this point, care
must be taken to ensure the current intra prediction result is received before dispatching any
future intra prediction requests since intra prediction has a data dependency on the data in the
working frame buffer. Ultimately, the intra prediction response time is low enough that removing
this idle time is unlikely to improve performance noticeably. Once an intra prediction response
packet is received by the buffer the samples are parsed and added to the buffer at the previously
specified location. After this an acknowledgment packet is sent to the parser to indicate that the
buffer is ready to receive more commands.
4.3.2 Inter Prediction Process
At a high level of abstraction the inter prediction process is in many ways similar to the
intra prediction process, despite the fact that the algorithms themselves are quite different. First,
the parser provides 16 motion vectors per inter prediction command packet. Each of these motion
vectors corresponds to a 4x4 block within a 16x16 macroblock. Additionally, each motion vector
acts on the luma channel and on both chroma channels. Since each 4x4 luma block and each pair of
corresponding 2x2 chroma blocks is sent as one packet, each incoming inter prediction command results in
32 packets exiting the buffer node (16 luma and 16 chroma). This means there is significant opportunity to take advantage
of the NoC’s parallelism during inter prediction.
4.3.3 IQIT Process
The IQIT process starts when the parser node parses a 4x4 residual block in the transform
domain. This block is then sent to the IQIT node along with coordinates specifying the macroblock
these residuals should be added to. At the IQIT node, the 4x4 block is inverse zig-zag scanned, inverse
quantized and inverse transformed, then repackaged and sent to the buffer node along with the
coordinates originally specified by the parser. Once the buffer receives the residuals, they are
added to the correct location and an acknowledgment is sent to the parser node.
4.3.4 Deblocking Process
Deblocking is triggered by a new frame command sent from the parser to the buffer node.
During deblocking, the buffer sends samples from each of the macroblock edges to the deblocking
filter which processes them and returns the result. Considerable parallelism could be achieved
here by using multiple deblocking filter nodes, but this has not been investigated in this thesis.
4.4 Network-on-Chip Parameters
The NoC used in this thesis is a 3x3 mesh topology with 64 bit flits. This topology was
chosen because the number of provided ports closely matches the number of nodes in the system.
The flit width of 64 bits was chosen because it allows most of the packets in the system to be
relatively short, without using too much of the FPGA resources.
4.4.1 Virtual Channel Selection
The way each node interfaces with the NoC port it is attached to allows it to receive
packets on any virtual channel (VC). However, each node only sends packets on one of the VCs.
All nodes except the parser node send on VC zero, while the parser node sends on VC one.
4.5 Mapping H.264 Nodes to Network-on-Chip
The mapping used in the proposed h.264 decoder design was done manually using
information known about the behavior of each node. The mapping seeks to place frequently
communicating nodes near each other. Additionally, the mapping aims to minimize the potential
for congestion in the network. Thus, the parser is placed close to the IQIT and frame control
nodes. Additionally, the buffer node should be close to intra, inter and deblocking. Nodes which
have a connection to resources off chip are also placed on the outside edges of the network to
allow for better physical placement of the circuit. These nodes are the VGA controller, the parser
node and the buffer node, which have access to the VGA DAC, Memory Module 0 and Memory
Module 1 respectively. The mapping used in this thesis is shown in Fig. 4.1.
Figure 4.1: Mapping of h.264 decoder to an NoC.
CHAPTER 5
H.264 ALGORITHM NODES FOR NOC BASED DECODER
5.1 Introduction
This chapter describes the design and implementation of each of the PEs or nodes used in
the proposed h.264 decoder. Two main categories of nodes are used in this system. The NIOS II
based nodes are discussed first. These nodes consist of a NIOS II soft core processor accompanied
by some custom hardware written in VHDL. The other type of node is a hardware only node
which uses a state machine to interact with the NoC and control the computational component of
the node. The hardware only nodes are written in VHDL, with one node, the display node,
utilizing an open source Verilog component for color space conversion [26].
5.2 NIOS II Based Nodes
Both the parser node and the buffer node are implemented on essentially identical NoC
nodes, shown in Fig. 5.1, built around the NIOS II core running different software. These nodes
have one NIOS II core as the main computational component. In addition to the NIOS II core, they
each also have a DDR2 DIMM which serves as main memory, and some custom hardware
described in VHDL. The custom hardware as well as the software running on each of the NIOS II
cores are described below. In the actual VHDL implementation the flit formatter, as well as the
Send and Receive State Machines are a single module. However, their behavior is largely
independent, so they are presented separately below.
Figure 5.1: Nios II Node used for both the parser node and the buffer node.
Table 5.1: Modes supported by the flit formatter.
Format code   Description
0             Pack lowest 8 bits of tx0 through tx7
1             Pack lowest 16 bits of tx0 through tx3
2             Concatenate tx0 with tx1
3             tx0 (zero fill)
4             tx0 (sign fill)
5             Intra prediction set command (alias for mode 0)
6             Intra prediction start command (alias for mode 0)
7             Pack lowest 8 bits of tx0 through tx5 and lowest 16 bits of tx6
8             IQIT header
9             IQIT body (alias for mode 0)
5.2.1 Flit Formatter
The goal of the flit formatter is to accelerate the process of packing data into flits before
transmission to another node. Since this is a very common process, particularly on the buffer
node, it is worth the relatively small amount of resources required. The flit formatter is depicted
in Fig. 5.2. This component has eight 32 bit wide inputs for data, a mode selection input and a 64
bit wide output. The flit formatter supports ten modes as described in Table 5.1.
Figure 5.2: Flit formatter component used in NIOS II Nodes
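The packing behavior listed in Table 5.1 can be modeled in software as shown below. This is a behavioral sketch only, not the VHDL used in this design. It assumes that tx0 occupies the most significant bits of the flit, which appears consistent with the send_data values in the simulation of Fig. 5.3. The layout of the IQIT header (mode 8) is not described in this section, so the sketch does not attempt it. The function names are hypothetical.

```python
def pack_fields(values, widths):
    """Concatenate fields, first value in the most significant bits."""
    flit = 0
    for value, width in zip(values, widths):
        flit = (flit << width) | (value & ((1 << width) - 1))
    return flit


def format_flit(mode, tx):
    """Model of the flit formatter for the modes of Table 5.1.

    tx is a list of eight 32 bit unsigned inputs, tx[0] .. tx[7].
    Returns the 64 bit flit as an integer.
    """
    if mode in (0, 5, 6, 9):          # modes 5, 6 and 9 alias mode 0
        return pack_fields(tx[:8], [8] * 8)
    if mode == 1:
        return pack_fields(tx[:4], [16] * 4)
    if mode == 2:
        return pack_fields(tx[:2], [32] * 2)
    if mode == 3:                      # zero fill
        return tx[0] & 0xFFFFFFFF
    if mode == 4:                      # sign fill
        value = tx[0] & 0xFFFFFFFF
        if value & 0x80000000:
            value |= 0xFFFFFFFF00000000
        return value
    if mode == 7:
        return pack_fields(tx[:7], [8] * 6 + [16])
    # Mode 8 (IQIT header) layout is not given in the text.
    raise ValueError("mode %d is not modeled in this sketch" % mode)
```

With tx0 through tx7 set to 1 through 8, mode 0 yields 0102030405060708 and mode 7 yields 0102030405060007 in hexadecimal.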
5.2.2 Send State Machine
The send state machine is used to send flits from the NIOS II node to the rest of the
network. The state machine is responsible for ensuring the NoC is ready to accept a flit as well as
ensuring exactly one flit is sent. This is done using a simple four state handshaking procedure.
Verification of this procedure was done in simulation. Simulation results, showing how the
handshaking procedure properly controls the "send flit" control signal, are shown in Fig. 5.3.
Each of the formats supported by the flit formatter is shown in this simulation as well.
[Waveform for Fig. 5.3. Signals in /noc_control_plus_tb: clk, send_data, send_flit, send_cmd_cpu, send_ack, format_select, tx_0 through tx_7. Inputs tx_0 through tx_7 hold 00000001 through 00000008. Upper panel: format_select 00 to 04, send_data values 0102030405060708, 0001000200030004, 0000000100000002, 0000000000000001. Lower panel: format_select 05 to 09, send_data values 0102030405060708, 0102030405060007, 0001020300280607, 0102030405060708.]
Figure 5.3: Simulation results showing the flit formatting and CPU-FPGA hand shaking.
5.2.3 Receive State Machine
The receive state machine is used to coordinate the processor and the NoC interface while
reading flits from the network. The state machine diagram for this component is shown in Fig. 5.4.
Once a flit is read in by the processor it is written to a statically allocated array of packet structures
based on the packet’s id number. This id number must be located in the least significant byte of
the first flit of a packet. Using a statically allocated packet buffer which is indexed by an id allows
the buffer node to only copy the data once prior to parsing. Efficiently handling memory access is
especially important to the performance of the buffer node since the buffer node receives a very
large number of packets.
Figure 5.4: Nios II Node NoC receive state machine.
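The id-indexed packet buffer described in Section 5.2.3 can be sketched as follows. The actual node software is written in C, so this Python model is only an illustration. It assumes a table of 256 slots, one for each possible value of the one byte id, and assumes the caller already has all flits of a packet. The names NUM_IDS, packet_id and store_packet are hypothetical.

```python
NUM_IDS = 256  # one slot per possible one byte id (assumption)


def packet_id(first_flit):
    """The id is the least significant byte of the first flit."""
    return first_flit & 0xFF


def store_packet(table, flits):
    """Write a packet into its slot of the preallocated table.

    table is a list of NUM_IDS slots allocated once at start-up.
    Returns the id so the caller can dispatch on it.
    """
    pid = packet_id(flits[0])
    table[pid] = list(flits)
    return pid
```

Because the slot is selected directly from the id, the receiver performs a single copy per packet and no search is needed before parsing.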
5.2.4 Parser Node
The parser node software is derived from a conventional open source h.264 decoder [27].
The decoder was modified to remove all of the code that performs the prediction algorithms, IQIT,
and all writes to the reference or working frame buffers. In any instance where this modified
software would have initiated one of these actions, it instead sends the required information to
the relevant node, which then performs the action and may hand off additional work to another node.
None of the data produced by any of these algorithms is required by the parser node, so the only
packets sent to the parser are acknowledgments from the buffer node after each command is
processed. NAL units, which form the input of the parser node, are read in using Altera’s HostFS
file system [28].
5.2.5 Buffer Node
At the top level, the buffer node is a state machine running on a Nios II core which
provides several ways to access and modify the reference and working frame buffers. A special id
number, 255, is set aside for use with the buffer. Any packet with this id which arrives at the
buffer node is parsed as a buffer command. All of the commands sent from the parser directly to
the buffer are shown in Figs. 5.5-5.8. The command used by the IQIT node is shown in Fig. 5.13.
The behavior initiated by each of these commands is covered in the rest of this section.
Figure 5.5: Start inter prediction command.
Figure 5.6: Start intra prediction command.
Figure 5.7: Allocate frame command.
Figure 5.8: New frame command.
Buffer Response to Start Inter Command
Upon receiving a command to start an inter prediction block, the buffer node first parses
the motion vectors out of the command packet. For each of these 16 motion vectors, the buffer
node finds the integer and fractional components of the vector. The buffer node performs the
integer portion of motion compensation while writing the reference samples to the input of the flit
formatter by using the integer components of the motion vector as a pair of offsets into memory.
These integer compensated samples are then sent to the luma and chroma sub-pixel interpolation
nodes where they are processed simultaneously. Once the results are received, the samples are
written to the working frame buffer and the buffer node sends an acknowledgment to the parser
node.
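The split into integer and fractional components can be illustrated with a short sketch. It assumes luma motion vectors are stored in quarter-sample units, as in the h.264 standard; the thesis text does not state the representation used by the buffer node, and the function names here are hypothetical.

```python
def split_luma_mv(mv):
    """Split one quarter-sample motion vector component into (integer, fraction).

    The arithmetic shift floors toward negative infinity, so the fraction is
    always in 0..3 and integer * 4 + fraction == mv, also for negative vectors.
    """
    return mv >> 2, mv & 3


def reference_origin(block_x, block_y, mv_x, mv_y):
    """Return the integer-compensated block origin and the fractional offsets."""
    int_x, frac_x = split_luma_mv(mv_x)
    int_y, frac_y = split_luma_mv(mv_y)
    return (block_x + int_x, block_y + int_y), (frac_x, frac_y)
```

In this model the first tuple plays the role of the pair of offsets into memory, and the second is what would be forwarded to the sub-pixel interpolation nodes.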
Buffer Response to Start Intra Command
Upon receiving a command to perform intra prediction, the buffer writes the neighbor
samples from the working frame buffer at the specified coordinates to the intra prediction node.
Then, in the case of luma intra prediction, the buffer forwards the intra request flit unaltered to the
intra prediction node and waits for the prediction result. In the case of chroma intra prediction,
the buffer must make small changes to the intra request flit before forwarding it to the intra
prediction node.
Buffer Response to New Frame Command
A new frame command triggers several actions in the buffer control node. First,
deblocking is performed on the current frame. After deblocking, the entire frame is written to the
display node. Finally, the reference frame pointer is updated to point at the current frame buffer.
Similarly, the current frame buffer pointer is updated to point at the old reference frame buffer.
Buffer Response to Allocate Frame Command
Upon receiving an allocate frame command, the buffer allocates memory for both the
reference and working frame buffers based on the specified size. After allocating a frame, the
buffer sends an acknowledgment back to the parser. This command also plays an important role in
the parser node. Since it is sent only once by the parser, it is used to let the parser get one
acknowledgment ahead of the buffer. In other words, the parser does not wait for the
acknowledgment after sending an allocate frame command, but instead waits for an
acknowledgment before sending each subsequent command. This allows the parser and buffer to do
meaningful work concurrently, improving performance.
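The benefit of staying one acknowledgment ahead can be seen in a toy timing model. This sketch is an assumption-laden illustration, not a measurement: it assumes a fixed parse time and a fixed buffer processing time per command, ignores network latency, and uses the hypothetical name `decode_time`.

```python
def decode_time(n_commands, parse, process, wait_after_send):
    """Total time to issue and complete n_commands.

    wait_after_send=True : parser blocks right after each send (no overlap).
    wait_after_send=False: parser waits for the previous acknowledgment just
                           before each send, except for the first command.
    """
    t_parser = 0    # time at which the parser is next free
    ack_ready = 0   # time at which the most recent acknowledgment arrives
    for i in range(n_commands):
        t_parser += parse                          # prepare command i
        if not wait_after_send and i > 0:
            t_parser = max(t_parser, ack_ready)    # wait before sending
        ack_ready = t_parser + process             # buffer handles command i
        if wait_after_send:
            t_parser = ack_ready                   # wait right after sending
    return max(t_parser, ack_ready)
```

With three commands, a parse time of 2 and a processing time of 3, the blocking scheme takes 15 time units while the one-ahead scheme takes 11, because parsing of the next command overlaps with the buffer's work on the current one.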
Buffer Response to IQIT Packet
From the perspective of the buffer, each IQIT packet looks like a command from the
parser. However, instead of triggering additional algorithms to run elsewhere in the network,
IQIT packets are simply added to the working frame buffer at the specified coordinates.
5.3 Network Interface Component
Figure 5.9: High Level design of the NoC interface component used by each of the nodes in the
network.
The NoC interface component, shown in Fig. 5.9, is responsible for interacting with the port
each node is assigned to and for providing a set of control and status signals to the node. This
component also fulfills the buffering requirement described in the CONNECT NoC [23] README file.
Specifically, the NoC interface provides a FIFO buffer with the same depth as the router buffers
for each of the virtual channels in the network. In the NoC used in this thesis, this means two
buffers which can hold eight flits each. The NoC interface component provides a set of signals to
the attached node specifying whether the port is ready to accept a flit, whether there are any
flits currently in either buffer, and whether the flit currently being read is a tail flit.
Additionally, this component allows the node to read and dequeue flits from either VC’s buffer
individually, as well as initiate a send to a specified address.
5.4 Generic State Machine for Hardware Nodes
While each of the hardware nodes uses a slightly different state machine for interacting
with the NoC interface, they all follow the general pattern shown in Fig. 5.10. The state machine
initializes to idle. A transition out of idle occurs whenever either of the virtual channels
contains an unread flit. During the select VC state, the virtual channel containing the flit is
saved to a register, and one clock cycle later the receive process begins. For nodes which only
accept a small number of packets, the receive counter may be omitted, and the receive loop
unrolled, to save clock cycles. The receive loop is the section of the state machine containing
the states "rx", "dequeue" and "wait rx". Both the rx and dequeue states transition to the next
state after one clock cycle. The wait rx state transitions to rx once more flits have arrived.
Unlike the idle state, which transitions when data is in either buffer, the wait rx state only
transitions when flits arrive in the previously selected VC buffer. After a tail flit is received,
the state machine transitions to the second counter reset state. This state, as well as the
generate response state, is often omitted. The second counter reset is omitted when the response
packet is short and has a fixed length. An example of this is the chroma motion compensation node,
which always sends a two flit response. The generate response state is omitted when the entire
response can be calculated prior to beginning the transmission process. The chroma motion
compensation node is again an example of a node with this behavior, as is the deblocking filter
node. An example of a node which requires a generate response state is the luma motion
compensation node, which uses this state to calculate the second half of the predicted block.
When a generate response state is used, it transitions to wait tx after one clock cycle. Wait tx
transitions to tx when the NoC interface indicates that the network is ready to receive a packet.
Depending on whether the state machine has a generate response state, the tx state transitions
back to either generate response or wait tx after one clock cycle, or into idle if the entire
packet has been sent. How a node determines that it has finished sending a packet varies between
nodes.
Figure 5.10: General structure of the state machines used in the hardware-only nodes.
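The general pattern described in this section can be summarized as a next-state function. This is a behavioral Python sketch rather than the VHDL used in the nodes. The state and signal names are hypothetical, and two details are assumptions because the text does not spell them out: the tail flit check is placed on the exit from dequeue, and the variant shown is the one that includes both the second counter reset and the generate response states.

```python
def next_state(state, s):
    """Return the state for the next clock cycle.

    `s` is a dict of input signals sampled in the current cycle.
    """
    if state == "idle":
        # Leave idle when either virtual channel holds an unread flit.
        return "select_vc" if (s["data_in_vc0"] or s["data_in_vc1"]) else "idle"
    if state == "select_vc":
        return "rx"                       # VC choice is registered here
    if state == "rx":
        return "dequeue"                  # always one cycle
    if state == "dequeue":
        # Assumed placement of the tail check.
        return "reset_tx_counter" if s["was_tail"] else "wait_rx"
    if state == "wait_rx":
        # Only the previously selected VC can restart the receive loop.
        return "rx" if s["data_in_selected_vc"] else "wait_rx"
    if state == "reset_tx_counter":
        return "generate_response"
    if state == "generate_response":
        return "wait_tx"                  # always one cycle
    if state == "wait_tx":
        return "tx" if s["ready_to_send"] else "wait_tx"
    if state == "tx":
        return "idle" if s["packet_done"] else "generate_response"
    raise ValueError(f"unknown state: {state}")
```

A node without a generate response state would return "wait_tx" instead of "generate_response" from the last branch.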
5.5 Inverse Quantization Inverse Transform Node
The IQIT node is shown in Fig. 5.11. The purpose of this node is to perform the inverse
quantization and inverse transform procedures. Inputs can come from any node; however, only
the parser node sends data to this node in the current design. The results produced by
this node are sent to the buffer node, where they are added at the location the parser node
specified in its request.
Figure 5.11: High level design of inverse quantization inverse transform node.
5.5.1 Parsing and Input Packet Format
The IQIT node accepts packets in the format shown in Fig. 5.12. The first flit contains a
set of parameters required either by the IQIT node itself, or parameters which the IQIT node is
required to forward to the buffer node in its response packet. The next two flits are the zig-zag
ordered transform domain residuals with the last residual in the most significant position of the
first of these flits. The header fields required to be passed on to the buffer node are LCbCr select, y
coordinate, x coordinate and id. The other fields are used by the IQIT node itself.
Figure 5.12: Request packet format for the inverse quantization inverse transform node.
5.5.2 Zig-Zag
The input residuals coming from the parser are in the order produced by "Zig-Zag"
scanning, which is done for better entropy compression. The purpose of this module is to reorder
these residuals into the regular flattened 2-d order required by the subsequent modules.
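The reordering can be sketched in a few lines. The table below is the 4x4 frame zig-zag scan from the h.264 standard, expressed as row-major indices; the function name is hypothetical, and the packing of residuals into flits is not modeled.

```python
# Row-major index visited at each zig-zag scan position (4x4 frame scan).
ZIGZAG_4X4 = (0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15)


def inverse_zigzag(scanned):
    """Reorder 16 zig-zag ordered residuals into a flattened row-major block."""
    if len(scanned) != 16:
        raise ValueError("expected 16 residuals")
    block = [0] * 16
    for scan_pos, raster_pos in enumerate(ZIGZAG_4X4):
        block[raster_pos] = scanned[scan_pos]
    return block
```

Feeding in the scan positions 0 through 15 yields the familiar zig-zag pattern when the result is read as four rows of four.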
5.5.3 Inverse Quantization
This component implements the inverse quantization procedure defined in the standard
[4]. The main inputs to this module are the quantized transform domain residuals, as a flattened
2-d array of length 16, and the quantization parameter. An additional input to this module is
a signal indicating whether the inverse quantizer should bypass the DC value. This is used when
the parser has already processed the DC value using the Hadamard transform.
5.5.4 Inverse Transform
The inputs to this module are the inverse quantized residuals in the transform domain,
and the outputs are the final inverse quantized, inverse transformed residuals in the spatial
domain. The inverse transform is the same as defined earlier in Chapter 2. This transform is
applied to all 16 inputs simultaneously.
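For reference, the 4x4 inverse core transform of the h.264 standard can be written as a row pass followed by a column pass of the same butterfly. This sketch assumes that the transform referred to above is the standard one and that the final rounding shift is applied in this module; it processes the block sequentially, whereas the hardware computes all 16 outputs at once. Function names are hypothetical.

```python
def _butterfly(d0, d1, d2, d3):
    """One 1-d pass of the h.264 4x4 inverse core transform."""
    e0 = d0 + d2
    e1 = d0 - d2
    e2 = (d1 >> 1) - d3
    e3 = d1 + (d3 >> 1)
    return e0 + e3, e1 + e2, e1 - e2, e0 - e3


def inverse_transform_4x4(coeffs):
    """coeffs: 16 dequantized coefficients, row-major. Returns 16 residuals."""
    rows = [_butterfly(*coeffs[4 * r:4 * r + 4]) for r in range(4)]
    out = [0] * 16
    for c in range(4):
        col = _butterfly(rows[0][c], rows[1][c], rows[2][c], rows[3][c])
        for r in range(4):
            out[4 * r + c] = (col[r] + 32) >> 6   # rounding and scaling
    return out
```

A block containing only a DC coefficient of 64 produces a flat residual block of ones, which is a quick sanity check for an implementation.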
5.5.5 Packet Generation and Output Packet Format
The output packet of the IQIT node consists of three flits, as shown in Fig. 5.13. The first
flit contains information to select the color channel and the location of the residuals in the
buffer, as well as information to identify the packet as an IQIT command packet. Additionally,
the first flit contains a field, Sign Mask, which indicates the sign of each of the 16 spatial
domain residuals in the next two flits. The second and third flits contain the inverse quantized,
inverse transformed residuals as absolute values, in flattened 2-d array order. The buffer node
must either add or subtract these residuals from the selected channel’s buffer based on the value
of the corresponding bit in the Sign Mask field. This mask avoids the need for 9 bit residuals.
Figure 5.13: Response packet format for the inverse quantization inverse transform node.
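The sign and magnitude encoding can be sketched as follows. Several details are assumptions made for illustration: bit i of the mask is taken to correspond to residual i, a set bit is taken to mean negative, and the result is clipped to the 8 bit sample range when it is applied. The function names are hypothetical.

```python
def pack_residuals(residuals):
    """Return (sign_mask, magnitudes) for 16 residuals in the range -255..255."""
    mask = 0
    magnitudes = []
    for i, r in enumerate(residuals):
        if not -255 <= r <= 255:
            raise ValueError("residual does not fit in 8 bits plus sign")
        if r < 0:
            mask |= 1 << i          # assumed convention: 1 means subtract
        magnitudes.append(abs(r))   # fits in one byte
    return mask, magnitudes


def apply_residuals(samples, mask, magnitudes):
    """Add or subtract each magnitude from the matching sample, then clip."""
    out = []
    for i, (s, m) in enumerate(zip(samples, magnitudes)):
        v = s - m if (mask >> i) & 1 else s + m
        out.append(max(0, min(255, v)))
    return out
```

Sixteen sign bits plus sixteen one-byte magnitudes take 144 bits, the same as sixteen 9 bit signed values, but every magnitude stays byte aligned within a flit.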
5.5.6 Simulation
Initial testing of the IQIT node was performed by comparing the simulation output, shown in
Fig. 5.14, against a variety of outputs from the IQIT algorithm in the open source decoder [27] on
which the parser and buffer nodes are based. In addition to verifying the IQIT algorithm itself,
this simulation also shows the send and receive behavior of the IQIT node. In this simulation, the
IQIT node receives three flits and dequeues them from the appropriate VC. After receiving the
entire packet, the IQIT node waits until the network is ready to accept flits and transmits the
expected three flit response packet.
[Waveform traces for Fig. 5.14 omitted. Signals shown (/iqit_test): clk, recv_data, is_tail_flit, data_in_buffer, dequeue, select_vc_read, send_data, set_tail_flit, send_flit, ready_to_send.]
Figure 5.14: Simulation of the IQIT node.
5.6 Luma Motion Compensation Node
The luma motion compensation node, depicted in Fig. 5.15, performs the sub-pixel
motion compensation algorithm for luma samples. This algorithm consists of a series of FIR filters
which are used to interpolate between a set of reference samples. The luma motion compensation
node is capable of interpolating 8 samples at a time. The number of samples to interpolate at a
time is chosen to match the width of a flit in the NoC. Because motion compensation only has a
data dependency on the reference frame, and not on the current frame, both the luma and chroma
motion compensation nodes are good targets for parallelization by adding multiple instances of
each node, although this is left as future work.
Figure 5.15: High level design of luma motion compensation node.
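As an illustration of the FIR filtering performed by this node, the sketch below implements the six tap half-sample filter from the h.264 standard, with taps (1, -5, 20, 20, -5, 1). It covers only one filter of the series and assumes that the node's filters match the standard; the function names are hypothetical.

```python
def clip255(v):
    """Saturate to the 8 bit sample range."""
    return max(0, min(255, v))


def half_sample(e, f, g, h, i, j):
    """Half-sample value between g and h from six neighboring integer samples."""
    return clip255((e - 5 * f + 20 * g + 20 * h - 5 * i + j + 16) >> 5)


def half_sample_row(row):
    """Half-sample values for every position in `row` with six neighbors."""
    return [half_sample(*row[k:k + 6]) for k in range(len(row) - 5)]
```

A row of nine reference samples yields four half-sample outputs, which is consistent with a 9 sample wide reference region being used for a 4 sample wide block.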
5.6.1 Parsing and Input Packet Format
The input packet format is shown in Fig. 5.16. In total, a luma motion compensation
packet is 20 flits long. The first flit is the header flit, which contains only the identifier and
the fractional component of the motion vector being used for inter prediction. The remaining flits
contain the rows to be stored in the reference sample register. A total of 9 rows are required
to interpolate a 4x4 block; with 2 flits per row, this brings the total number of flits to 18 plus
the header. The rows for the reference sample come from the reference buffer, which is the most
recently decoded video frame.
Figure 5.16: Request packet format for the luma motion compensation node.
5.6.2 Sample Register
The Sample Register holds one 9x9 block of reference samples. The output of the Sample
Register is a 7x9 block containing the reference samples required to perform the luma motion
compensation algorithm on one half of the 4x4 block in the Sample Register.
5.6.3 Interpolator
The interpolator contains the FIR filter required to perform the sub-pixel motion
compensation algorithm on half of the 4x4 block contained in the Sample Register. This results in
8 samples of output from the interpolator each clock cycle which is enough to fill one flit.
5.6.4 Packet Generation and Output Packet Format
The format of the luma motion compensation response packet is shown in Fig. 5.17. The
header of this packet contains only the identifier for the packet, while the other two flits each
contain two rows of the final 4x4 luma inter prediction result.
Figure 5.17: Response packet format for the luma inter prediction node.
5.6.5 Simulation
Fig. 5.18 shows the simulation output of the luma motion compensation node. This
simulation uses a fractional motion vector component of zero. Therefore, the primary purpose of
this simulation is to determine that the input samples of the request packet are being properly
parsed, saved in the Sample Register, presented to the interpolator, and packaged by the response
generator. Additionally, this simulation shows the proper NoC interfacing behavior of the node.
[Waveform traces for Fig. 5.18 omitted. Signals shown (/inter_test): clk, recv_data, is_tail_flit, data_in_buffer, dequeue, select_vc_read, send_data, set_tail_flit, send_flit, ready_to_send.]
Figure 5.18: Simulation of the luma motion compensation node.
5.7 Chroma Motion Compensation Node
The chroma motion compensation node performs the sub-pixel portion of the chroma
inter prediction algorithm. A high level design of this node is shown in Fig. 5.19. This node
performs this algorithm for 2x2 chroma blocks from the same location in both chroma channels
simultaneously. This node is very similar to the luma motion compensation node, with a few
exceptions. First, the interpolator is different, because chroma motion compensation uses a
different interpolation filter than luma. Second, the reference sample register is smaller, since
chroma motion compensation only requires a 3x3 block of samples for interpolation. Finally, the
chroma motion compensation node has two copies of the reference sample buffer and the
interpolator in order to perform motion compensation on both channels simultaneously.
Figure 5.19: High level design of chroma motion compensation node.
5.7.1 Parsing and Input Packet Format
Fig. 5.20 shows the input packet format for the chroma motion compensation node. The
header contains all of the parameters required for the interpolator as well as the id number used
by the buffer node. The next three flits contain all of the reference samples required by the
interpolators for both channels. The order of the samples was chosen to allow efficient reading
from the reference buffer.
Figure 5.20: Request packet format for the chroma motion compensation node.
5.7.2 Interpolator
Each interpolator is a 2-d linear interpolator as defined earlier in Chapter 2. Both of the
interpolators have access to all of the samples in the reference buffer for their channel
simultaneously, and produce the prediction for their channel in one clock cycle.
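The 2-d linear interpolation can be sketched with the weighted average used for chroma samples in the h.264 standard, where the fractional offsets are in eighths of a sample. Chapter 2 is not reproduced here, so agreement with its definition is assumed, and the function names are hypothetical. The hardware evaluates all four outputs of a channel in one clock cycle, whereas this model simply loops.

```python
def chroma_sample(a, b, c, d, x_frac, y_frac):
    """Bilinear blend of the 2x2 neighborhood a b / c d, fractions in eighths."""
    return ((8 - x_frac) * (8 - y_frac) * a
            + x_frac * (8 - y_frac) * b
            + (8 - x_frac) * y_frac * c
            + x_frac * y_frac * d
            + 32) >> 6


def chroma_block_2x2(ref, x_frac, y_frac):
    """ref: 3 rows of 3 reference samples. Returns 2 rows of 2 predictions."""
    return [[chroma_sample(ref[r][c], ref[r][c + 1],
                           ref[r + 1][c], ref[r + 1][c + 1],
                           x_frac, y_frac)
             for c in range(2)]
            for r in range(2)]
```

With both fractions at zero the upper left 2x2 corner of the 3x3 reference block passes through unchanged, and with both at eight the lower right corner passes through.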
5.7.3 Packet Generation and Output Packet Format
The output packet format is shown in Fig. 5.21. The header flit contains only the id used
by the buffer node. The second flit contains each 2x2 block of prediction results for the two
chroma channels.
Figure 5.21: Response packet format for the chroma inter prediction node.
5.7.4 Simulation
Fig. 5.22 shows the results from simulating the test bench for the chroma motion
compensation node. This test bench has one of the chroma channels using motion vectors of zero
for x and y, while the other channel uses motion vectors of eight, which is the full scale value for
the fractional motion vectors. The result is that one channel passes through the upper corner of
the provided reference sample block, while the other passes through the lower corner of the
provided reference sample block.
[Waveform traces for Fig. 5.22 omitted. Signals shown (/chroma_motion_test): clk, recv_data, is_tail_flit, data_in_buffer, dequeue, select_vc_read, send_data, set_tail_flit, send_flit, ready_to_send.]
Figure 5.22: Simulation of the chroma motion compensation node.
5.8 Intra Prediction Node
The high level design of the intra prediction node is shown in Fig. 5.23. The intra
prediction node performs the intra prediction algorithm for one block from any of the luma or
chroma channels at a time. Since the intra prediction mode may be different for blocks in the same
location in different channels, the intra prediction node does not operate on all channels
simultaneously. These blocks can be either 16x16 or 4x4. All of the modes specified in Chapter 2
are supported by this node for both block sizes.
Figure 5.23: High level design of intra prediction node.
5.8.1 Parsing and Input Packet Format
The intra prediction node accepts two different types of messages. Both of these messages
are shown in Fig. 5.24. The first packet type is a write command, which writes four neighbor
samples at a time to the registers used by the intra prediction core. The second type of
packet initiates the intra prediction process and contains the intra prediction mode to be used as
well as all of the relevant information to perform the prediction algorithm.
Figure 5.24: Command packets recognized by the intra prediction node.
5.8.2 Intra Prediction Core
The intra prediction core is organized as a 2-d array of combinational functions
implementing each of the intra prediction equations defined in the standard [4]. This array of
functions is surrounded by registers containing each of the reference samples required by the
algorithm, as shown in Fig. 5.25.
Figure 5.25: Organization of the intra core. Note that the upper three bytes of the input samples
written to address four are ignored.
5.8.3 Packet Generation and Output Packet Format
The output packet format is shown in Fig. 5.26. The intra node sends a variable length
packet to the buffer node after each intra prediction command is received. The length of the
packet depends on the block size in the intra prediction command. For a block size of 4, the
packet is 5 flits long and the least significant half of each of flits one through four is
reserved. If a block size of 8 is received, the intra node responds with 9 flits, although the
intra core does not fully support this block size. If the block size is 16, the response is 33
flits long, which is the largest packet size sent by any node in the system. When reading the
diagram in Fig. 5.26 for a block size of 16, note that each row takes up two flits instead of one.
This means that when the flit number is even and greater than zero, the most significant byte of
the flit contains the ninth element of its respective row instead of the first.
Figure 5.26: Response packet format for the intra prediction node.
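The three packet lengths quoted above follow from one rule, sketched here under the assumptions that each flit carries eight one-byte samples, that every row starts in a new flit, and that flit zero is a single header flit. The function name is hypothetical.

```python
SAMPLES_PER_FLIT = 8   # assumed: 64 bit flits, one byte per sample


def intra_response_flits(block_size):
    """Number of flits in the intra prediction response for a square block."""
    if block_size not in (4, 8, 16):
        raise ValueError("block size must be 4, 8 or 16")
    # Ceiling division: a 4 sample row still occupies a whole flit.
    flits_per_row = -(-block_size // SAMPLES_PER_FLIT)
    return 1 + block_size * flits_per_row
```

For a block size of 4 only half of each data flit is used, which corresponds to the reserved half mentioned in the text.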
5.8.4 Simulation
Partial simulation output for the intra prediction core is shown in Fig. 5.27. This
simulation writes a set of input values to the intra core, and writes both the reference samples, and
the prediction results for each of the modes to a comma separated value (CSV) file. The included
image is of the 16x16 plane prediction mode. The inputs are the leftmost and topmost cells, while
the rest of the image is the prediction result. The CSV file from the simulation was imported into
Excel for shading.
Figure 5.27: Simulation of the intra prediction node showing the 16x16 plane prediction mode.
5.9 Deblocking Filter Node
The high level design of the deblocking filter node is shown in Fig. 5.28. The role of this
node is to perform the deblocking filter algorithm as described in Chapter 2. Like the luma and
chroma motion compensation nodes, this node is a good candidate for parallelization by adding
multiple instances of this node. Although some dependencies on previously deblocked samples
exist, they are much weaker than, for example, the dependency intra prediction has on previous
intra prediction results.
Figure 5.28: High level design of the Deblocking Filter Node.
5.9.1 Parsing and Input Packet Format
Fig. 5.29 shows the packet format accepted by the deblocking filter node. The first flit
contains the set of parameters defining the threshold levels for the deblocking process. The
second flit contains the four samples from each macroblock nearest to the edge being processed by
the deblocking filter.
Figure 5.29: Deblocking Filter Node request packet format.
5.9.2 Deblocking Filter
The deblocking filter component implements the conditional filtering equations defined
in the standard [4] for each of the samples provided by the input packet. All eight of the resulting
samples are calculated in a single clock cycle and the result is sent back to the buffer node. The
response packet of the deblocking filter node is shown in Fig. 5.30.
Figure 5.30: Response packet format for the Deblocking Filter Node
5.9.3 Simulation
The simulation included in Fig. 5.31 contains an input sequence to the deblocking filter
which surpasses the threshold level of the conditional filters in the deblocking filter node. Because
of this, filtering does not occur and the response packet contains the original samples it received
from the buffer node.
[Waveform traces for Fig. 5.31 omitted. Signals shown (/db_test): clk, recv_data, is_tail_flit, data_in_buffer, dequeue, select_vc_read, send_data, set_tail_flit, send_flit, ready_to_send.]
Figure 5.31: Simulation results showing the correct operation of the deblocking node.
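The pass-through behavior in this simulation follows from the threshold test that gates the deblocking filter. The sketch below shows the sample-level condition from the h.264 standard; the alpha and beta values in the usage note are hypothetical and are not taken from the test bench, and the function name is an assumption.

```python
def edge_is_filtered(p1, p0, q0, q1, alpha, beta, bs=1):
    """Return True when the samples across an edge qualify for filtering.

    p1, p0 are the two samples nearest the edge on one side, q0, q1 on the
    other. A large step across the edge is treated as real image content.
    """
    return (bs != 0
            and abs(p0 - q0) < alpha
            and abs(p1 - p0) < beta
            and abs(q1 - q0) < beta)
```

For example, with alpha = 10 and beta = 3, samples 100, 100 | 104, 104 are filtered, while 100, 100 | 180, 180 exceed the alpha threshold and are returned unchanged.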
5.10 Display Control Node
The display node, shown in Fig. 5.32, provides a display buffer for the decoded video as
well as color space transformation functionality to convert the LCbCr pixels into RGB pixels.
Additionally, the display node contains logic to drive a VGA controller.
Figure 5.32: High level design of the Display Node.
5.10.1 Parsing and Input Packet Format
Fig. 5.33 shows the packet format accepted by the display node. Each of these packets is
one flit long and contains two LCbCr pixels and an address. The pixel buffer is 320
pixels wide and 200 pixels high. The pixel address is calculated as shown in Eq. 5.1.
pixel_addr(x, y) = x + y ∗ 320 (5.1)
Figure 5.33: Display node write pixel command packet format.
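Eq. 5.1 is a row-major address calculation, which can be checked with a short sketch. The bounds check and the inverse helper are additions for illustration, and the names are hypothetical.

```python
DISPLAY_WIDTH = 320
DISPLAY_HEIGHT = 200


def pixel_addr(x, y):
    """Row-major address of pixel (x, y) in the display buffer (Eq. 5.1)."""
    if not (0 <= x < DISPLAY_WIDTH and 0 <= y < DISPLAY_HEIGHT):
        raise ValueError("pixel outside the 320x200 buffer")
    return x + y * DISPLAY_WIDTH


def pixel_coords(addr):
    """Inverse of pixel_addr: recover (x, y) from an address."""
    y, x = divmod(addr, DISPLAY_WIDTH)
    return x, y
```

The last pixel, (319, 199), maps to address 63999, so a 16 bit address field is enough to cover the buffer.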
5.10.2 Color Space Transformation
The color space transformation component is based on a YCbCr to RGB
component provided by Altera [26]. The component was modified to
make it match the YCbCr color conversion functions specified by the ITU [29]. The color space
conversion functions used are shown in Eqs. 5.2-5.4. The clip function is the same as defined in
Chapter 2 and makes the value of its argument saturate at 255 or 0 whenever it would
overflow or fall below zero, respectively.
R = clip(Y + 1.402 ∗ (Cr − 128)) (5.2)
G = clip(Y − 0.344136 ∗ (Cb − 128) − 0.714136 ∗ (Cr − 128)) (5.3)
B = clip(Y + 1.772 ∗ (Cb − 128)) (5.4)
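Eqs. 5.2 to 5.4 translate directly into a small floating point reference model, shown below. Rounding to the nearest integer before clipping is an assumption, since the equations do not specify it, and the hardware component presumably uses fixed point arithmetic instead. The function names are hypothetical.

```python
def clip(v):
    """Saturate to the range 0..255."""
    return max(0, min(255, v))


def ycbcr_to_rgb(y, cb, cr):
    """Convert one 8 bit YCbCr pixel to 8 bit RGB using Eqs. 5.2 to 5.4."""
    r = clip(round(y + 1.402 * (cr - 128)))
    g = clip(round(y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)))
    b = clip(round(y + 1.772 * (cb - 128)))
    return r, g, b
```

A neutral chroma value of 128 leaves the luma unchanged in all three channels, which gives an easy check for gray pixels.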
5.10.3 VGA Controller
The VGA controller component is an open source VGA controller [30]. This VGA
controller is parameterized, and the display node uses the parameter set for a 640x400 display.
Since the video display buffer is only 320x200, the VGA controller displays four copies of the
buffer each refresh cycle.
5.10.4 VGA Digital to Analog Converter
The VGA DAC is an open source hardware R-2R style VGA DAC originally designed for
use with Raspberry Pi boards [25]. This VGA DAC has 6 bits per color. The display controller
takes advantage of this limited precision to reduce the amount of on-chip memory required
by the display node, by saving only the 6 most significant bits of each 8 bit RGB value
produced by the color space conversion component.
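The memory saving can be estimated with simple arithmetic. The sketch assumes a single 320x200 display buffer holding three color channels per pixel; the actual memory organization of the display node is not described here, and the function names are hypothetical.

```python
def to_6bit(value):
    """Keep the 6 most significant bits of an 8 bit color value."""
    return value >> 2


def display_buffer_bits(width=320, height=200, bits_per_channel=6):
    """Bits needed to store one RGB frame at the given channel depth."""
    return width * height * 3 * bits_per_channel
```

Under these assumptions the buffer needs 1,152,000 bits at 6 bits per channel instead of 1,536,000 bits at 8 bits per channel, a reduction of 25 percent.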
5.11 Compilation for FPGA
The h.264 decoder described in this chapter was compiled for the Stratix IV FPGA on the
DE4 development board [31] using Quartus Prime 16.1 Standard Edition [32]. A summary of the
resource utilization report is given in Table 5.2.
Table 5.2: Resource utilization of the proposed 3x3 NoC based h.264 decoder.
Item                             Report
FPGA Device Family               Stratix IV GX
Device                           EP4SGX230C2
Logic Utilization                135,953 / 182,400 (75%)
Total Combinational Functions    87,002
Total Registers                  65,437
Dedicated Logic Registers        64,161 / 182,400 (35%)
Total Pins                       292 / 888 (33%)
Total Block Memory Bits          1,886,567 / 14,625,792 (13%)
DSP Block 18-bit Elements        224 / 1,288 (17%)
Total PLLs                       3 / 8 (38%)
CHAPTER 6
SCALABILITY
6.1 Introduction
One of the primary benefits of an NoC based design is improved scalability and
reusability of components. To demonstrate this, a scaled down version of the proposed NoC
based h.264 decoder is presented in this chapter. This scaled down version of the decoder targets
an SoC style FPGA in the Cyclone V-SoC family [33]. In terms of Adaptive Logic
Modules (ALMs), the smaller FPGA has only 32,070 ALMs, while the original target has 91,200.
One aspect of the smaller target which enables this port is its built-in Hard Processor System
(HPS). The HPS contains two ARM processors, which serve as the main computational devices for
the parser and buffer nodes in this port.
6.2 Scaling Design to Fit a Smaller Target FPGA
Three major changes were made to the original design to allow it to fit on a smaller
FPGA. First, intra prediction was moved onto the buffer node to be performed in software.
Second, the NoC topology was changed to use only four routers, with multiple ports on each
router, to accommodate the seven node design. Finally, instead of using Nios II cores for the
buffer and parser nodes, the built-in ARM cores available on the FPGA chip were used.
Additionally, two noteworthy compromises were made to allow this decoder to be implemented
on this smaller FPGA. First, the main memory used by the parser and buffer is no longer
physically isolated, as it is in the original design. However, the buffer and parser still use
logically isolated memory. Second, the IO used by each processor based node is physically
accessible to the other, although neither node ever accesses the IO ports used by the other node.
6.2.1 Porting Process
The process of porting begins with choosing a subset of the full design which will fit on
the smaller device. Intra prediction was chosen to be moved into software because, based on
profiling results, it is not utilized as much as inter prediction, but still uses a large amount of
area on the FPGA. The choice to reduce the number of routers, while increasing the number of ports
per router, was made because it greatly reduces resource usage and was not expected to
greatly degrade performance. The architecture chosen for this scaled down decoder is shown in
Fig. 6.1. Each node in this architecture is mapped to have the same address as it has in the
original design.
Figure 6.1: Scaled down version of the proposed NoC based h.264 decoder.
The porting process continues by modifying either the buffer or parser node to add any
algorithms eliminated from the hardware. In this case, the intra prediction algorithm was added
into the buffer node. Additionally, the software which controls the network interface needed to be
modified to support the IO interface of the target processor. After the network interface code is
rewritten for the new processor to hardware interface of the SoC style FPGA target, the hardware
and software can be compiled and tested on the new target platform.
6.2.2 Porting Results
The scaled down version of the architecture achieves performance comparable to, though
somewhat lower than, the original design, as shown in Chapter 7. This version has a considerably
higher overhead associated with interacting with the NoC, as measured from the buffer node. The
HPS processors are much faster than the NIOS II cores used in the original design, but the
results in Chapter 7 show that this is not enough to offset the higher overhead: the scaled down
version is somewhat slower than the full scale version on every test video.
Two factors could be causing the higher overhead. First, the interface between the FPGA fabric
and the HPS could simply be slower than the interface provided by the NIOS II processors.
Second, since fewer routers are used in the scaled down design, and the buffer depth per router is
the same as in the full scale design, the scaled down architecture has a much lower flit capacity,
which would cause increased waiting times when transmitting a large amount of data.
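The second factor can be illustrated with a back-of-the-envelope capacity estimate. The sketch below is only illustrative: it assumes every router input port has a flit buffer of the same depth, and the port counts and buffer depth used in the example are hypothetical values, not the parameters of the decoders described in this thesis.

```python
def flit_capacity(routers, ports_per_router, buffer_depth):
    """Total number of flits the network can hold in its router input buffers."""
    return routers * ports_per_router * buffer_depth


# Hypothetical parameters, chosen for illustration only.
DEPTH = 8                                  # flits per input buffer (assumed)
full_scale = flit_capacity(9, 5, DEPTH)    # nine routers, five ports each (assumed)
scaled_down = flit_capacity(4, 4, DEPTH)   # four routers, four ports each (assumed)

print(full_scale, scaled_down)
```

With equal buffer depth, the estimate shrinks with the number of router ports in the network, so a burst of data fills the smaller network sooner and the sender stalls earlier.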
6.3 Compilation for FPGA
The 2x2 NoC based h.264 video decoder described in this chapter is compiled for FPGA
using Quartus Prime 15.1 Lite Edition [32]. The target board used in this design is the DE1-SoC
board [33]. A summary of the compilation report is given in Table 6.1.
Table 6.1: Resource utilization of the proposed 2x2 NoC based h.264 decoder.
Item                       Report
FPGA Device Family         Cyclone V
Device                     5CSEMA5F31C6
Logic Utilization (ALMs)   24,708 / 32,070 (77 %)
Total Registers            30,978
Dedicated Logic Registers  31,102
Total Pins                 159 / 457 (35 %)
Total Block Memory Bits    1,152,000 / 4,065,280 (28 %)
Total DSP Blocks           87 / 87 (100 %)
Total PLLs                 1 / 6 (17 %)
CHAPTER 7
PERFORMANCE TESTING AND PROFILING
7.1 Introduction
This chapter presents the results of performance testing and profiling performed on both
versions of the NoC based h.264 decoder presented in Chapters 5 and 6. To perform this
profiling, a counter was added to the hardware component of the NIOS II nodes. This counter is
read by the software running on the NIOS II cores at a variety of locations in order to determine
what the limiting factors of system performance are. The same counter is also used to
measure total system performance for comparison against the software decoder on which the
parser and buffer are based, as well as against the USHA decoder [17], a hybrid NoC-Bus based
design. The USHA decoder is chosen for comparison because it is a partially NoC based system.
7.2 Test Videos
The test videos are from an online repository of YUV encoded video files [34]. Five test
sequences are used from this repository: “akiyo”, “foreman”, “highway”, “hall”, and
“paris”. The first three are QCIF format videos with a resolution of 176x144; the other two are CIF
format videos, which have twice the horizontal and vertical resolution of the QCIF videos
(352x288). An important note here is that although the video decoder itself can decode CIF, as
well as higher resolution videos, the display node only contains enough RAM to display 320x200
videos. Thus, for CIF videos the entire video is decoded, but only the top-left 320x200 pixels are
displayed. Each of the five test videos is encoded using the JM reference encoder [35]. The
encoding settings use a modified baseline profile which uses a single reference image and a
periodic intra prediction update to avoid accumulated error in the output video stream. The
encoder configuration file is available alongside the released source materials for this thesis [36].
Selected frames of these test videos being decoded by each of these implementations are shown in
Fig. 7.1 and Fig. 7.2.
Figure 7.1: 3x3 implementation decoding the “hall” test video sequence.
Figure 7.2: 2x2 implementation decoding the “akiyo” video sequence.
7.3 Buffer Node Profiling
Because the buffer node controls access to a resource used by nearly every algorithm in
the system, the profiling measurements are taken at this node. A diagram depicting the profiled
zones of the buffer node is included in Fig. 7.3. A total of 9 counters are maintained for profiling
purposes. The first counter starts when the buffer node receives an allocate frame command and
stops when the buffer receives a special packet indicating the video stream is done. The purpose
of this counter is to keep track of the total time it takes to decode the video sequence. One timer
each is maintained for deblocking, intra prediction, and inter prediction to keep track of the total
time taken to perform these algorithms. An additional timer for each of these three algorithms is also
maintained to keep track of the amount of time the buffer idles while waiting for a response from
the node associated with each of these algorithms. A timer is also used to keep track of the
amount of time a write to the display node takes. Another timer is used to determine the total
time the buffer spends idling after completing a command before it receives another one. The
IQIT algorithm is not profiled since this algorithm is essentially done by the time it reaches the
buffer node. The results for the 3x3 and 2x2 decoder implementations are shown in Table 7.1 and
Table 7.2 respectively. The average distribution for each decoder is shown in Fig. 7.4 for the 3x3
decoder and Fig. 7.5 for the 2x2 decoder.
Figure 7.3: Diagram of timer start/stop positions within the buffer node software.
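The start/stop accumulation described in this section can be sketched as follows. This is a minimal illustration, not the buffer node software itself: the class name, the zone names, and the `read_counter` callable (standing in for a read of the hardware counter) are all hypothetical, and counter wraparound is ignored.

```python
class ZoneTimers:
    """Accumulates elapsed counter ticks per profiled zone."""

    def __init__(self, read_counter):
        self._read = read_counter   # hypothetical stand-in for the hardware counter read
        self._started = {}          # zone name -> counter value at start
        self.totals = {}            # zone name -> accumulated ticks

    def start(self, zone):
        self._started[zone] = self._read()

    def stop(self, zone):
        elapsed = self._read() - self._started.pop(zone)
        self.totals[zone] = self.totals.get(zone, 0) + elapsed


# Example with a fake counter: an idle zone nested inside a total-time zone.
ticks = iter([0, 10, 25, 40])
timers = ZoneTimers(lambda: next(ticks))
timers.start("inter_total")
timers.start("inter_idle")    # buffer starts waiting for the inter node
timers.stop("inter_idle")     # response received
timers.stop("inter_total")
print(timers.totals)
```

Nesting an idle zone inside a total-time zone in this way is what allows the idle time to be compared against the total time reported for the same algorithm.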
7.3.1 Discussion of Profiling Results
The profiling results indicate that a high proportion of total time is spent in inter prediction and
deblocking relative to the time the buffer node spends idling while waiting for these algorithm
nodes to respond. This indicates that the buffer node is currently incapable of fully utilizing these
nodes. Since the code on the buffer node for dispatching either of these algorithms consists almost
entirely of reading from memory and writing to the NoC through the NoC interface, architectural
changes for improving performance in the future should focus on two things: first, improving the
total bandwidth to memory; second, reducing the amount of interaction required from the buffer
node CPU to perform NoC reads and writes. In addition, since the buffer node is simple enough
for a practical implementation as a hardware only node, this would be another area to investigate.
The parser node is also a good target for improvement, since the command wait time
takes up a significant amount of the buffer’s time. A more detailed discussion of future work is
included in Chapter 8.
Table 7.1: Profiling Results from the 3x3 NoC Based Decoder. Times indicated are in units of
seconds.
Video               akiyo    highway  foreman  paris    hall
Format              qcif     qcif     qcif     cif      cif
Frames              300      2000     300      1060     300
Total Decode Time   37.18    310.00   55.07    579.68   168.65
Total Intra Time    1.61     11.46    1.96     23.68    6.08
Total Inter Time    8.08     83.26    16.36    135.85   38.65
Total Deblock Time  16.14    107.62   16.14    242.31   68.26
Total Display Time  3.99     26.60    3.99     35.95    10.13
Intra Idle Time     0.54     3.80     0.65     7.92     2.02
Inter Idle Time     0.47     12.97    3.25     11.39    3.52
Deblock Idle Time   2.57     17.12    2.57     38.51    10.85
Command Wait Time   6.23     68.75    14.14    121.27   38.63
Table 7.2: Profiling Results from the 2x2 NoC Based Decoder. Times indicated are in units of
seconds.
Video               akiyo    highway  foreman  paris    hall
Format              qcif     qcif     qcif     cif      cif
Frames              300      2000     300      1060     300
Total Decode Time   43.05    430.34   83.79    615.21   184.04
Total Intra Time    0.04     0.27     0.04     0.60     0.16
Total Inter Time    6.26     157.34   39.00    143.15   43.96
Total Deblock Time  13.30    88.71    13.31    199.47   56.19
Total Display Time  16.45    109.63   16.45    147.52   41.55
Intra Idle Time     0.00     0.00     0.00     0.00     0.00
Inter Idle Time     1.65     46.00    11.56    39.88    12.36
Deblock Idle Time   8.95     59.69    8.95     134.20   37.80
Command Wait Time   6.28     67.29    13.48    112.80   38.14
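The average time distribution reported for the 3x3 decoder in Fig. 7.4 can be related back to Table 7.1 with a short calculation. The sketch below assumes that each slice is the mean, over the five test videos, of that zone's share of the total decode time. This is an assumption about how the figure was produced, although the values it yields agree with the figure to within rounding.

```python
# Rows of Table 7.1 (3x3 decoder), in seconds, one entry per test video
# in the order akiyo, highway, foreman, paris, hall.
TOTAL = [37.18, 310.00, 55.07, 579.68, 168.65]
ZONES = {
    "Inter":   [8.08, 83.26, 16.36, 135.85, 38.65],
    "Intra":   [1.61, 11.46, 1.96, 23.68, 6.08],
    "Deblock": [16.14, 107.62, 16.14, 242.31, 68.26],
    "Display": [3.99, 26.60, 3.99, 35.95, 10.13],
    "Command": [6.23, 68.75, 14.14, 121.27, 38.63],
}


def average_share(zone_times, total_times):
    """Mean over videos of the percentage of decode time spent in one zone."""
    shares = [100.0 * z / t for z, t in zip(zone_times, total_times)]
    return sum(shares) / len(shares)


shares = {name: average_share(times, TOTAL) for name, times in ZONES.items()}
shares["Other"] = 100.0 - sum(shares.values())  # time not covered by any zone

for name, pct in shares.items():
    print(f"{name:8s} {pct:5.1f} %")
```

The same calculation can be applied to the rows of Table 7.2 for the 2x2 decoder.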
7.4 Performance Comparisons
Performance comparisons against the open source implementation [27], which provided
the basis for the parser node and parts of the buffer node, are reported in Table 7.3. This software
decoder was modified to use the VGA display node for video output and to use the available
hardware timer to measure the total decoding time. Since reading
from the timer has a performance impact, all of the profiling timers were removed from the NoC
based decoders except for the total decoding time timer. Each of the decoders was tested using
the same five test videos used for profiling.
Figure 7.4: Average time spent in each section of the buffer node code for the 3x3 NoC based
decoder (Inter 24.9 %, Intra 3.9 %, Deblock 37.9 %, Display 7.8 %, Command 21.7 %, Other 3.8 %).
Figure 7.5: Average time spent in each section of the buffer node code for the 2x2 NoC based
decoder (Inter 29.0 %, Intra 0.1 %, Deblock 26.1 %, Display 26.0 %, Command 17.1 %, Other 1.8 %).
The NoC based decoders were also compared against the USHA decoder. In order to
make a comparison against the results provided in the USHA paper [17], the performance results
were converted from frames per second to macroblocks per second, since this is the performance
unit used in the USHA paper. The number of macroblocks per frame is calculated by dividing the
total number of pixels by 16 squared, since a macroblock is 16x16 pixels. The USHA decoder was
chosen as a comparison because it is a hybrid NoC-Bus based system. The USHA decoder has
four configurations: single threaded CPU (configuration 1), multi-processor
(configuration 2), hardware accelerated (configuration 3), and full hardware (configuration 4). The
hardware accelerated implementation of USHA has all algorithms except inter prediction and
deblocking running in software on separate tiles.
Table 7.3: Comparison of the NoC based decoders with an open source software based decoder
running on the NIOS II core and HPS core. All reported numbers are in units of frames per second.
                    3x3 Decoder  2x2 Decoder  NIOS II SW  HPS SW
akiyo (qcif)        8.26         7.12         4.86        11.57
highway (qcif)      6.74         4.73         3.55        11.03
foreman (qcif)      5.76         3.64         2.81        10.79
paris (cif)         1.96         1.76         0.82        4.23
hall (cif)          1.91         1.66         1.02        4.22
average fps (cif)   1.95         1.74         0.85        4.23
average fps (qcif)  6.75         4.75         3.56        11.06
Table 7.4: Comparison of USHA decoder and the 3x3 and 2x2 NoC Based Decoders
                    Frames per second  Macroblocks per frame  Macroblocks per second
3x3 decoder (qcif)  6.75               99                     668
2x2 decoder (qcif)  4.75               99                     470
3x3 decoder (cif)   1.95               396                    771
2x2 decoder (cif)   1.74               396                    688
USHA (1)            n/a                n/a                    2475
USHA (2)            n/a                n/a                    20250
USHA (3)            n/a                n/a                    108000
USHA (4)            n/a                n/a                    244800
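The conversion from frames per second to macroblocks per second can be written out as a short calculation. The function and constant names below are illustrative only. Rates computed from the rounded frame rates in Table 7.3 may differ slightly from the entries in Table 7.4, which may have been derived from unrounded measurements.

```python
MACROBLOCK_SIZE = 16  # a macroblock is 16x16 pixels


def macroblocks_per_frame(width, height):
    """Number of macroblocks in one frame of the given resolution."""
    return (width * height) // (MACROBLOCK_SIZE ** 2)


def macroblocks_per_second(fps, width, height):
    """Convert a frame rate into a macroblock rate."""
    return fps * macroblocks_per_frame(width, height)


QCIF = (176, 144)
CIF = (352, 288)

print(macroblocks_per_frame(*QCIF))                  # macroblocks in a QCIF frame
print(macroblocks_per_frame(*CIF))                   # macroblocks in a CIF frame
print(round(macroblocks_per_second(6.75, *QCIF)))    # 3x3 decoder, QCIF average
```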
7.4.1 Discussion of Performance Comparisons
As was expected, the NoC based implementations outperform the NIOS II soft-core
processor running the full software decoder. An unexpected result is the performance of the
software decoder running on the ARM core available on the HPS, which outperforms both NoC
based decoders on every test video. One important difference
between the software decoders and the NoC based decoders is that the software decoders do not
implement a deblocking filter. However, even after accounting for this, a large performance
discrepancy exists. Another noteworthy point is that the 2x2 decoder is slower than the 3x3
decoder, despite the fact that the software decoder is over three times faster on the HPS than
on the NIOS II core, and intra prediction is not a heavily utilized function. Based on the profiling
results, this appears to stem from the fact that the communication between the HPS and the FPGA
fabric on the SoC style FPGA is slower than the communication between the NIOS II cores and the
rest of the FPGA in the full scale design. Evidence of this can be seen in the relative time spent
writing to the display node for each decoder. The 2x2 decoder spends about 26% of its time writing
to the display node, whereas the 3x3 decoder only spends about 8% of its time engaged in the same
activity. This indicates a large difference in the communication overhead between the two
designs. Based on this information, efforts seeking to improve the NoC decoder implementations
to exceed the ARM processor’s performance should focus on improving the processor to FPGA
communication, or removing it altogether by creating a hardware only implementation of the
buffer node.
USHA [17] has much higher performance in each of its configurations compared to the NoC
decoders presented here. The fact that multiple configurations are presented gives some insight
into how similar performance could be achieved. First, the performance of USHA configuration 1
indicates that either the software based decoder used in this processor only configuration is much
more efficient than the software decoder tested on the ARM core here, or the PowerPC processor
used by USHA is much faster than the ARM processor used in this thesis. Configuration 2
suggests a higher level of computational parallelism is achieved by USHA. Despite the limitations
of bus communication schemes, USHA’s use of a common bus for reading and writing memory
may provide advantages in terms of task level parallelism. This advantage exists not because a
bus is included, but because the bus gives each node direct access to a shared
memory space. When such a shared memory space exists, the buffer node no longer plays such a
central role in the decoding process. This would allow the parser to send commands directly to
the intra and inter nodes instead of first sending them to the buffer node, greatly increasing the
achievable level of parallelism in the system. An NoC based shared memory space, potentially
implemented on a separate independent NoC, would be a worthwhile area of investigation for
improving performance.
CHAPTER 8
FURTHER OPTIMIZATIONS AND FUTURE WORK
8.1 Overview
This chapter gives recommendations for future work. These include both optimizations of
the current architecture and architectural modifications which are likely to improve
performance. Other areas of future work covered are comparisons against different
communication methods, general bug fixes, and feature expansion.
8.2 Future Work Targeting Performance
Each subsection below presents proposed future work targeting increased performance.
The proposals are presented in order, starting with the most similar to the current 3x3 NoC based
decoder, and generally build upon each other as the section proceeds.
8.2.1 Parser and Buffer Node Optimization
The profiling results in Chapter 7 show that increasing the performance of the
parser node would have a considerable impact on system performance. A number of approaches
could be taken to achieve this. The simplest approach would
be to enqueue outgoing packets in a FIFO whenever the parser is ready to send a packet
but has not yet received an acknowledgment from the buffer node. This queuing method would
improve performance by allowing the parser to get further ahead of the buffer when it is able.
Another approach to reducing the amount of time the buffer spends waiting for the parser would
be to add a second NIOS II core or a dedicated coprocessor to the buffer node, with its own
port and direct access to the main buffer node. The purpose of this coprocessor would be to
consume all incoming packets at a guaranteed rate, eliminating the need for a buffer to parser
acknowledgment. Neither of these approaches would improve the actual performance of the
parser node. Instead, they allow the parser node to stay as busy as possible, which results in
improved system performance. To improve the performance of the parser node itself, hardware
accelerators could be added for functions such as CAVLC and Exp-Golomb decoding.
Additionally, the Hadamard transform could be offloaded to the IQIT node, but this is not
expected to yield much improvement in performance since this transform is not used very often.
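The FIFO queuing approach can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the parser node code: the class name, its methods, and the `send` callable (standing in for a write to the NoC interface) are hypothetical, and a single outstanding unacknowledged packet is assumed.

```python
from collections import deque


class QueuedSender:
    """Queues outgoing packets while an acknowledgment is outstanding."""

    def __init__(self, send):
        self._send = send            # hypothetical stand-in for the NoC write
        self._fifo = deque()
        self._awaiting_ack = False

    def submit(self, packet):
        """Called by the parser whenever a packet is ready; never blocks."""
        self._fifo.append(packet)
        self._pump()

    def on_ack(self):
        """Called when the buffer node acknowledges the last packet sent."""
        self._awaiting_ack = False
        self._pump()

    def _pump(self):
        # Send the oldest queued packet only if nothing is awaiting an ack.
        if not self._awaiting_ack and self._fifo:
            self._send(self._fifo.popleft())
            self._awaiting_ack = True

    @property
    def pending(self):
        return len(self._fifo)


sent = []
sender = QueuedSender(sent.append)
for packet in ("cmd0", "cmd1", "cmd2"):
    sender.submit(packet)        # the parser keeps working instead of waiting
print(sent, sender.pending)
sender.on_ack()
print(sent, sender.pending)
```

In a real design the FIFO depth would be bounded, so the parser would still stall once the queue is full. The sketch leaves the queue unbounded for brevity.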
A better processor-network interface would benefit the buffer node, and to some extent
the parser node. A descriptor based NoC interface for the buffer node as described in [37] would
likely result in a large increase in performance. This type of NoC interface would have direct
access to the memory on the buffer node and could autonomously send packets based on
descriptors received from the main processor on the buffer node. This approach would be
especially powerful when combined with a separate port on the buffer for commands from the
parser and IQIT nodes because of the increased parallelism a multi-port buffer node would allow.
A slightly modified architecture based on proposed ideas in this section is shown in Fig. 8.1. In
this architecture, the algorithm dispatch node is the descriptor based NoC-interface coprocessor
previously mentioned.
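The idea of a descriptor based interface can be sketched as follows. This is an illustrative model only: the `Descriptor` fields, the `run_descriptor` function, and the payload size are assumptions made for the example, and do not describe the interface in [37] or the network interface used in this thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    dest: int     # destination node address on the NoC
    addr: int     # start offset of the data in buffer node memory
    length: int   # number of bytes to send


def run_descriptor(desc, memory, payload_bytes=8):
    """Yield (dest, payload) packets for one descriptor without CPU involvement.

    payload_bytes is a hypothetical per-packet payload size.
    """
    end = desc.addr + desc.length
    if desc.addr < 0 or desc.length < 0 or end > len(memory):
        raise ValueError("descriptor outside of memory")
    for offset in range(desc.addr, end, payload_bytes):
        yield desc.dest, bytes(memory[offset:min(offset + payload_bytes, end)])


memory = bytes(range(32))                       # stand-in for buffer node memory
packets = list(run_descriptor(Descriptor(dest=3, addr=4, length=20), memory))
print(len(packets))
```

The main processor only writes the three descriptor fields. The per-packet memory reads and NoC writes, which currently dominate the buffer node code, would be carried out by the interface.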
Figure 8.1: Architecture with modified buffer and parser node.
8.2.2 Further Partitioning
The profiling results from Chapter 7 indicate that memory access plays an important role
in the speed of the entire system. The use of a descriptor based NoC coprocessor would likely
reduce this bottleneck to some extent, however, further improvements may still be necessary. One
way to make improvements related to memory access would be to partition the buffer node in
such a way that the buffer commands and thus memory accesses are spread out over multiple
physical memories. One way to do this is to have a separate buffer node for each of the channels.
This is possible since none of the algorithms which operate on the samples from a given channel
ever has a data dependency on another channel. However, partitioning the buffer into three
nodes, one each for L, Cb, and Cr, is probably unnecessary. This is due to the sub-sampling
of the chroma channels. Each chroma channel contains only a quarter as many samples as the luma
channel, so even when combined into one node, a chroma buffer node would not be the limiting
factor in system performance. An example of a dual buffer node architecture is given in Fig. 8.2.
In order to allow this design to fit on the currently targeted FPGA, the Stratix IV, the parser node
has been modified to use on chip memory.
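The sample counts behind this argument can be checked with a short calculation. The sketch assumes chroma sub-sampling by a factor of two in each dimension, which is consistent with the quarter-size chroma channels described above. The function name is illustrative.

```python
def channel_samples(width, height):
    """Samples per frame in each channel, with chroma halved in both dimensions."""
    luma = width * height
    chroma = (width // 2) * (height // 2)
    return {"L": luma, "Cb": chroma, "Cr": chroma}


qcif = channel_samples(176, 144)
combined_chroma = qcif["Cb"] + qcif["Cr"]
print(qcif["L"], combined_chroma)
```

A combined chroma buffer node would therefore hold half as many samples per frame as the luma buffer node.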
Figure 8.2: A dual buffer node architecture of a 3x4 NoC based decoder. Note that in this
architecture, the NoC topology is increased from 3x3 to 3x4.
8.2.3 Combined Display and Deblocking
Currently, the decoder performs all of the deblocking filtering before sending the new
frame to the display node. One way of improving the efficiency of deblocking and displaying a
frame would be to write the entire frame to a combined display and deblocking node which
would then send back all of the updated macroblock edges after deblocking was performed.
Because this would also reduce the number of nodes, mapping nodes according to
communication patterns becomes an easier task, and therefore performance could improve.
8.2.4 Alternative Communication Pattern
Currently, the parser sends commands to the buffer which causes the buffer to send the
required data to one of the algorithm nodes, i.e., intra, inter, or deblock. An alternative approach
would be to have the parser send these commands to the algorithm node itself and have that node
request the data it needs from the buffer node. The buffer node, or one of the buffer nodes in a
multiple buffer node design, would then respond with the data, and finally, the algorithm node
would send a write request back to the buffer after the algorithm finishes. A comparison of the
current communication pattern and the alternative communication pattern presented in this
subsection is shown in Fig. 8.3. The alternative pattern potentially requires considerably more
communication per parser command than the current design, but may allow for higher levels of task
level parallelism by reducing the extent to which ordered execution is enforced in the system.
Figure 8.3: Diagram of the current communication pattern (left) and an alternative which may
allow for better parallelism (right).
8.2.5 Parallelization of Inter and Deblocking
Based on profiling results, as the parser node and memory to NoC speeds improve, the
most important algorithms to speed up are inter prediction and deblocking. Both inter
prediction and deblocking are good candidates for parallelization since both have minimal data
dependencies on the outputs of previous iterations. Inter prediction, for example, only has a
data dependency on commands from the parser and on the reference frame. Initial efforts
investigating the potential for parallelization of inter prediction indicated that the current design
of the buffer node and communication pattern can reliably perform inter prediction across two
luma nodes, with chroma done in software, or one luma and one chroma node, which is the
current design. However, when running inter prediction across two luma nodes and one chroma
node, response packets from the inter prediction nodes went missing. The alternative
communication pattern introduced in the previous subsection would likely improve the reliability
of multiple inter prediction nodes running in parallel.
8.3 Other Future Work
Currently, there are noticeable visual artifacts in the output video stream. Future work
seeking to reduce these artifacts would be valuable for increasing the usability of this decoder as a
component in a larger media system. Additionally, although the deblocking filter is capable of
applying the full set of deblocking rules, only normal mode deblocking is performed on luma
macroblock edges to maintain reasonable performance. Also, redesigning the display node to use
an off-chip RAM for storing the video would be useful as it would allow display formats larger
than 320x200. Finally, a study making use of the dynamic communication capabilities of the NoC
to develop a full codec for h.264 building off the current design would be interesting.
REFERENCES
[1] A. Agarwal, C.-D. Iskander, H. Kalva, and R. Shankar, “System-Level Modeling of a
NoC-Based H.264 Decoder,” 2008 2nd Annual IEEE Systems Conference, 2008.
[2] T. Wiegand, G. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video
coding standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13,
no. 7, pp. 560–576, July 2003.
[3] S. Nargundmath and A. Nandibewoor, “Entropy coding of H.264/AVC using Exp-Golomb
coding and CAVLC coding,” International Conference on Advanced Nanomaterials & Emerging
Engineering Technologies, 2013.
[4] Advanced video coding for generic audiovisual services, International Telecommunication Union,
Oct. 2016. [Online]. Available: https://www.itu.int/rec/T-REC-H.264
[5] J. F. Gatal, E. Raffinan, J. Imperial, and J. A. Hora, “FPGA-based H.264 Video Decoder in RTP
payload format,” 2015 International Conference on Humanoid, Nanotechnology, Information
Technology, Communication and Control, Environment and Management (HNICEM), 2015.
[6] Y. Pan, D. Zhou, and S. Goto, “An FPGA-based 4K UHDTV H.264/AVC video decoder,”
2013 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 2013.
[7] Q. Yang, T. Wang, X. Su, L. Wang, and X. Wang, “GALS architecture of H.264 video encoding
system on DN-DualV6-PCIe-4 FPGA platform,” 2012 IEEE 11th International Conference on
Signal Processing, 2012.
[8] V. Rosa, W. Staehler, A. Azevedo, B. Zatt, R. Porto, L. Agostini, S. Bampi, and A. Susin, “The
H.264 Video Coding Standard,” 18th IEEE/IFIP International Workshop on Rapid System
Prototyping, May 2007.
[9] J. Ru, Y. Yang, and Y. Yang, “Design of H.264 Video Decoding IP Core on FPGA,” 2014 Fourth
International Conference on Instrumentation and Measurement, Computer, Communication and
Control, 2014.
[10] L.-G. Wu, D.-L. Zhang, G.-M. Du, Y.-K. Song, and M.-L. Gao, “A 4x4 pipelined intra frame
decoder for H.264,” 2009 3rd International Conference on Anti-counterfeiting, Security, and
Identification in Communication, 2009.
[11] T. G. George and N. Malmurugan, “A New Fast Architecture for HD H.264 CAVLC
Multi-syntax Decoder and its FPGA Implementation,” International Conference on
Computational Intelligence and Multimedia Applications (ICCIMA 2007), 2007.
[12] J.-Y. Chang, W.-J. Kim, Y.-H. Bae, M.-Y. Lee, J.-Y. Kim, and H.-J. Cho, “Star-Mesh NoC based
multi-channel H.264 decoder design,” 2008 International SoC Design Conference, 2008.
[13] V.-D. Ngo, H.-N. Nguyen, and H.-W. Choi, “Realizing Network on Chip Design of H.264
Decoder Based on Throughput Aware Mapping,” 2006 First International Conference on
Communications and Electronics, 2006.
[14] J. Xu, W. Wolf, J. Henkel, and S. Chakradhar, “H.264 HDTV Decoder Using
Application-Specific Networks-On-Chip,” 2005 IEEE International Conference on Multimedia
and Expo, 2005.
[15] A. Luczak, P. Garstecki, O. Stankiewicz, and M. Stepniewska, “Network-on-chip based
architecture of H.264 video decoder,” 2008 International Conference on Signals and Electronic
Systems, 2008.
[16] M. Stepniewska, “Advanced video codecs implementation using Network-on-Chip in FPGA
devices,” PhD dissertation, Poznan University of Technology, 2012.
[17] A. Rao, S. K. Nandy, H. Nikolov, and E. F. Deprettere, “USHA: Unified software and
hardware architecture for video decoding,” 2011 IEEE 9th Symposium on Application Specific
Processors (SASP), 2011.
[18] H. Kalva, “The H.264 Video Coding Standard,” IEEE Transactions on Circuits and Systems for
Video Technology, vol. 13, no. 4, pp. 86–90, Oct.-Dec. 2006.
[19] I. Richardson, White Paper: H.264/AVC Context Adaptive Variable Length Coding, VCodex, Feb.
2002.
[20] H. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, “Low-complexity transform and
quantization in H.264/AVC,” IEEE Transactions on Circuits and Systems for Video Technology,
vol. 13, no. 7, pp. 598–603, 2003.
[21] W. J. Dally and B. P. Towles, “Network Interfaces,” in Principles and Practices of Interconnection
Networks. Morgan Kaufmann Publishers, 2004, ch. 22, pp. 427–448.
[22] ——, “Introduction to Interconnection Networks,” in Principles and Practices of Interconnection
Networks. Morgan Kaufmann Publishers, 2004, ch. 1, pp. 1–24.
[23] M. Papamichael, “CONNECT: CONfigurable NEtwork Creation Tool.” [Online]. Available:
http://users.ece.cmu.edu/~mpapamic/connect/
[24] M. K. Papamichael and J. C. Hoe, “CONNECT: Re-Examining Conventional Wisdom for
Designing NoCs in the Context of FPGAs,” Proceedings of the ACM/SIGDA international
symposium on Field Programmable Gate Arrays - FPGA ’12, 2012.
[25] “VGA666,” Fen Logic Ltd., Sep. 2014. [Online]. Available:
https://github.com/fenlogic/vga666
[26] “Advanced Synthesis Cookbook Design Files,” Altera Corporation. [Online]. Available:
https://www.altera.com/content/dam/altera-www/global/en_US/others/literature/
manual/cookbook.zip
[27] M. Fiedler, “Implementation of a basic H.264/AVC decoder,” June 2006. [Online]. Available:
https://keyj.emphy.de/projects/studies/
[28] Nios II Gen2 Software Developers Handbook, Altera Corporation, May 2015. [Online]. Available:
https://www.altera.com/en_US/pdfs/literature/hb/nios2/n2sw_nii5v2gen2.pdf
[29] T.871 : Information technology - Digital compression and coding of continuous-tone still images:
JPEG File Interchange Format (JFIF), International Telecommunication Union, May 2011.
[Online]. Available: http://www.itu.int/rec/T-REC-T.871
[30] S. Larson, “VGA Controller (VHDL) - Logic,” Aug. 2013. [Online]. Available:
https://eewiki.net/pages/viewpage.action?pageId=15925278
[31] “Altera DE4 Development and Education Board,” Terasic Technologies Inc. [Online].
Available: de4.terasic.com
[32] “Download Center,” Altera Corporation. [Online]. Available:
https://www.altera.com/downloads/download-center.html
[33] “DE1-SoC Board,” Terasic Technologies Inc. [Online]. Available: de1-soc.terasic.com
[34] M. Reisslein, L. J. Karam, P. Seeling, F. H. Fitzek, and T. K. Madsen, “YUV Video Sequences.”
[Online]. Available: http://trace.eas.asu.edu/yuv/
[35] K. Suehring, “H.264/AVC Software Coordination.” [Online]. Available:
http://iphome.hhi.de/suehring/tml/
[36] I. Barge, “NoC based h.264 decoder for FPGA,” Apr. 2017. [Online]. Available:
https://github.com/bargei/NoC264
[37] W. J. Dally and B. P. Towles, “Buses,” in Principles and Practices of Interconnection Networks.
Morgan Kaufmann Publishers, 2004, ch. 20, pp. 389–410.
