Exploitation of thread- and data-parallelism in video coding algorithms by Thomas R. Jacobs (7204421)

Exploitation of thread and data parallelism 
in video coding algorithms 
By 
Thomas Richard Jacobs MSc, BEng (Hons) 
A Doctoral Thesis submitted in partial fulfilment of the requirements for the 
award of Doctor of Philosophy of Loughborough University 
August 2007 
ABSTRACT 
MPEG-2, MPEG-4 and H.264 are currently the most popular video coding algorithms for 
consumer devices. The complexity and computational intensity of their respective 
encoding processes, and the associated power consumption, currently limit their full 
deployment in portable or cost-sensitive consumer devices. 
This thesis takes two approaches to addressing these performance issues. The first is the 
static partitioning of the applications' control-flow graphs, using thread-level parallelism to 
share the computational load between multiple processors in a System-on-Chip 
multi-processor configuration. In the second, two separate design methodologies, one founded 
in RTL and the other in SystemC, were applied to investigate dedicated vector 
architectures for accelerating video encoding through data-level parallel techniques. 
A vector datapath was implemented with each methodology, allowing the two to be 
compared. 
The key contributions of the work are summarised below: 
• Demonstration of the reduction in computational workload per processor by 
exploiting thread-level parallelism. 
• Static partitioning of three state-of-the-art video encoders, namely MPEG-2, 
MPEG-4 and H.264, to permit their execution on a multi-processor environment. 
• Design of a vector datapath to accelerate MPEG-4 video encoding by 
implementing data-level parallelism. 
• Comparative study of the potential of the ESL language SystemC as a design 
methodology, in comparison with RTL. 
ACKNOWLEDGEMENTS 
I would like to thank my two supervisors, Dr Vassilios Chouliaras and Dr David 
Mulvaney for their help and advice throughout my time undertaking this work. Their 
contribution to this thesis, in technical knowledge, guidance and support has been 
invaluable. 
I would also like to thank Nadia and my fellow researchers at Loughborough for their help and 
understanding through our years together, which have made the whole Ph.D. process an 
enjoyable one. In addition to my friends in Loughborough, I would also like to thank Tim, 
Simon and the numerous medics in Leicester who have made my time away from my 
research extremely fun and enjoyable. 
Finally, thanks go to my parents, who have always been there for me with help, support 
and love. 
TABLE OF CONTENTS 
List of Abbreviations 
List of Figures 
List of Tables 
CHAPTER 1: Introduction 
1.1. The digital world 
1.2. Problem formulation 
1.2.1. Video compression evolution 
1.2.2. Parallel computing 
1.3. Thesis structure 
1.4. Contributions 
1.5. References 
CHAPTER 2: Video Compression 
2.1. Lossless compression methods 
2.1.1. Run length encoding 
2.1.2. Huffman 
2.1.3. Lempel-Ziv 
2.2. Lossy compression methods 
2.2.1. Colour conversions 
2.2.2. Discrete Cosine Transform 
2.2.3. Quantisation 
2.2.4. Zig-zag scan order 
2.2.5. Fractals 
2.2.5.1. Iterated function systems 
2.2.5.2. Partitioned iterated function systems 
2.2.6. Wavelet 
2.2.7. Vector Quantisation 
2.2.8. Differential encoding 
2.2.9. Motion estimation / motion compensation 
2.3. Compression standards and schemes 
2.3.1. Standards bodies 
2.3.1.1. International Telecommunications Union (ITU) 
2.3.1.2. Motion Picture Experts Group (MPEG) 
2.3.1.3. International organisation for standardisation (ISO) 
2.3.2. H.261 
2.3.3. MPEG-1 
2.3.4. MPEG-2 / H.262 
2.3.5. H.263 
2.3.6. MPEG-4 
2.3.6.1. Part 2 Visual 
2.3.6.2. Part 10 Advanced video coding / H.264 
2.4. Conclusion 
2.5. References 
CHAPTER 3: Parallelism Techniques 
3.1. Parallelism 
3.2. Thread-Level Parallelism 
3.2.1. Control-flow-graph partitioning 
3.2.1.1. Dynamic CFG partitioning 
3.2.1.2. Static CFG partitioning 
3.2.2. TLP exploitation environments 
3.2.2.1. SimpleScalar 
3.2.2.2. OpenMP 
3.2.2.3. POSIX Threads 
3.2.3. TLP and video processing 
3.2.3.1. Coarse grain granularity 
3.2.3.2. Fine grain granularity 
3.3. Data-Level Parallelism 
3.3.1. SS_SPARC ASIC Processing Platform 
3.3.2. ARM 
3.3.2.1. ARMv6 Architecture 
3.3.2.2. ARM NEON Media Processing Engine 
3.3.3. x86 - MMX, SSE and 3DNOW 
3.3.4. AltiVec 
3.4. Conclusions 
3.5. References 
CHAPTER 4: Hardware Techniques for Exploiting TLP 
4.1. Thread-Level Parallelism 
4.2. The MPEG-2 video compression standard 
4.2.1. Test model 5 
4.2.1.1. Motion Estimation (function motion_estimation()) 
4.2.1.2. Transform (function transform()) 
4.2.1.3. Inverse transform (function itransform()) 
4.2.2. Fast ME algorithms 
4.3. The MPEG-4 Visual video compression standard 
4.3.1. FrameCodeI 
4.3.2. FrameCodeP 
4.3.2.1. Motion Estimation 
4.4. The H.264 video compression standard 
4.4.1. Threading granularity 
4.4.2. Slice encoding in x264 
4.5. Results 
4.5.1. Test sequences 
4.5.2. MPEG-2 
4.5.3. MPEG-4 
4.5.4. H.264 
4.6. Conclusion 
4.7. References 
CHAPTER 5: Hardware Techniques for Exploiting DLP 
5.1. Data-Level Parallelism 
5.1.1. Vectorising the XviD encoder 
5.1.2. Hardware implementations 
5.2. VHDL (RTL) Implementation 
5.2.1. Vector element unit 
5.2.1.1. Addition and subtraction unit 
5.2.1.2. Multiplication unit 
5.2.1.3. Shifter units 
5.2.1.4. Miscellaneous unit 
5.2.2. Pack/merge unit 
5.2.2.1. Pack and Unpack functions 
5.2.2.2. Merge functions 
5.2.3. Stage two 
5.3. SystemC Implementation 
5.3.1. Process: clock_proc() 
5.3.2. Process: trans_stage1_proc() 
5.3.3. Process: trans_stage2_proc() 
5.3.4. Process: bypass_proc() 
5.4. Power and Area Analysis 
5.5. Combining thread and data-level parallelism 
5.6. Conclusion 
5.7. References 
CHAPTER 6: Conclusions & Future Work 
6.1. Parallelisation of Video Encoders 
6.1.1. Thread-Level Parallelism 
6.1.2. Data-Level Parallelism 
6.2. Experimentation Findings 
6.2.1. Thread-Level Parallelism 
6.2.1.1. MPEG-2 
6.2.1.2. MPEG-4 
6.2.1.3. H.264 
6.2.2. Data-Level Parallelism 
6.3. Future work 
6.4. References 
Bibliography 
List of Publications 
LIST OF ABBREVIATIONS 
Abbreviation  Explanation 
16CIF         Sixteen times common intermediate format: 1408x1152 (PAL), 1408x960 (NTSC) 
4CIF          Four times common intermediate format: 704x576 (PAL), 704x480 (NTSC) 
AC-3          Adaptive transform coder 3 
ADC           Analogue to digital converters 
ALU           Arithmetic logic unit 
AMP           Asymmetric multiprocessors 
API           Application programme interface 
ASF           Advanced system format 
AVI           Audio video interleave 
CD            Compact disk 
CFG           Control-flow-graph 
CIF           Common Intermediate Format: 352x288 (PAL), 352x240 (NTSC) 
CMP           Chip-level multiprocessors 
CMP-MT        Multithreaded chip-level multiprocessor 
CODEC         Compressor/decompressor 
CRT           Cathode Ray Tube 
CWT           Continuous wavelet transform 
D1            Full standard definition video: 720x576 (DVD-Video PAL), 720x480 (DVD-Video NTSC) 
DC            Synopsys Design Compiler 
DCT           Discrete Cosine Transform 
DFT           Discrete Fourier transform 
DIC           Dynamic Instruction Count 
DLP           Data-Level Parallelism 
DV            Digital Video 
DVB           Digital video broadcast 
DVD           Digital Video Disc 
DWT           Discrete Wavelet Transform 
ESD           Electronic system design 
ESL           Electronic system level 
FE            Cadence First Encounter 
FLI           Foreign language interface 
FPS           Frames per Second 
FU            Functional unit 
GIF           Graphics interchange format 
GMC           Global motion compensation 
GNU           GNU's Not Unix 
GOB           Group of blocks 
GOP           Group of pictures 
GPL           GNU General Public License 
HD            High Definition 
HD1080i       HD 1920x1080 interlaced 
HD1080p       HD 1920x1080 progressive 
HD720i        HD 1280x720 interlaced 
HD720p        HD 1280x720 progressive 
HDL           Hardware description language 
HVS           Human Visual System 
IEC           International Electrotechnical Commission 
ILP           Instruction-Level Parallelism 
IP            Internet protocol 
IPC           Instruction per Cycle 
ISA           Instruction set architecture 
ISDN          Integrated Services Digital Network 
ISO           International Standards Organization 
ISS           Instruction Set Simulator 
IFS           Iterated function systems 
ITU           International Telecommunications Union 
JPEG          Joint Picture Expert Group 
LCD           Liquid Crystal Display 
MB            Macroblock 
MC            Motion compensation 
ME            Motion estimation 
MIMD          Multiple instruction multiple data 
MIPS          Microprocessor without Interlocked Pipeline Stages 
MIPS          Million instructions per second 
MISD          Multiple instruction single data 
MJPEG         Motion JPEG 
MP2           MPEG-1 audio layer 2 
MP3           MPEG-1 audio layer 3 
MPEG          Motion Picture Expert Group 
MSSG          MPEG software simulation group 
MV            Motion vector 
NOP           No Operation 
OoO           Out of Order 
OS            Operating system 
P&R           Place and Route 
PCM           Pulse code modulation 
PDP           Plasma Display Panel 
PE            Processing Engine 
PIFS          Partitioned iterated function system 
PISA          Portable instruction set architecture 
POSIX         Portable operating system interface for Unix 
PPE           Power processing element 
PPM           Portable Pixel Map 
PRAM          Parallel RAM 
PSTN          Public switched telephone network 
Pthreads      POSIX threads 
QCIF          Quarter Common Intermediate Format: 176x144 (PAL), 176x120 (NTSC) 
RGB           Red Green Blue 
RMSE          Root mean squared error 
RTL           Register Transfer Level 
SAC           Syntax-based arithmetic coding 
SAD           Sum of absolute differences 
SD            Standard definition, see D1 
SIMD          Single Instruction Multiple Data 
SISD          Single instruction single data 
SMP           Symmetric multiprocessors 
SMP-MT        Multithreaded symmetric multiprocessor 
SMT           Simultaneous multithreading 
SNR           Signal to noise ratio 
SoC           System on chip 
SPE           Synergistic Processing Elements 
SQCIF         Sub quarter common intermediate format: 128x96 
SSE           Streaming SIMD extension 
TIFF          Tagged image file format 
TLP           Thread-Level Parallelism 
TM            Test model 
TSMC          Taiwan Semiconductor Manufacturing Company 
UN            United Nations 
VCD           Video compact disk 
VCL           Video coding layer 
VDU           Video Display Unit 
VHDL          Very High Speed Integrated Circuit (VHSIC) Hardware Description Language 
VHS           Video Home System 
VLIW          Very Long Instruction Word 
VO            Video object 
VOL           Video object layer 
VOP           Video object plane 
VQ            Vector Quantisation 
VRF           Vector register file 
VS            Video session 
LIST OF FIGURES 
Figure 1-1 Three classifications of parallel operations based on the target granularity of each approach 
Figure 2-1 Illustration of the bitrate requirement for uncompressed video data 
Figure 2-2 Comparison between uncompressed and RLE bitstream 
Figure 2-3 Huffman probability graph illustrating how individual probabilities combine together to form a tree structure 
Figure 2-4 A rearrangement of the Huffman probability graph to form the Huffman tree. This tree assigns each branch a specific unique bit arrangement. 
Figure 2-5 RGB colour representation illustrating the separate red, green and blue colour channels that combine together to form a full colour image 
Figure 2-6 YUV colour representation illustrating the one luminance, Y, and two chrominance, U and V, colour channels that combine together to form a full colour image 
Figure 2-7 Example of four sub-sampling schemes, 4:4:4, 4:1:1 and 4:2:0, illustrating the sampled pixels used within each scheme 
Figure 2-8 Illustration of the mapping between the 8x8 spatial block and its corresponding 8x8 frequency coefficient block 
Figure 2-9 Four examples of forward DCT transformation of a 4x4 pixel block 
Figure 2-10 Illustration of the frequency intensity, both horizontally and vertically, for each frequency coefficient within an 8x8 DCT block 
Figure 2-11 Example of quantisation of 8x8 block, a) original data, b) quantisation coefficients and c) resultant block 
Figure 2-12 Scan patterns: a) raster, b) zig-zag, c) alternative zig-zag 
Figure 2-13 An example to illustrate the packing properties obtained through scanning coefficients in a) zig-zag order and b) raster scan order 
Figure 2-14 An example of fractal image generation taking a start object, in this example a rectangle; within three iterative steps a coarse representation of the Sierpinski gasket becomes apparent. 
Figure 2-15 Right-angled range partition schemes: a) fixed block, b) Quadtree, c) Horizontal-vertical 
Figure 2-16 An example of fractal decompression showing each iteration of the decoding process, with its corresponding RMSE value 
Figure 2-17 Illustration of the eight element one-dimensional Haar wavelet transform 
Figure 2-18 Three-level one-dimensional DWT decomposition tree illustrating the repetitive down-sampling of resolution. The blocks H and L indicate high and low pass filters respectively and ↓2 indicates downsampling by a factor of 2 
Figure 2-19 Two-level two-dimensional DWT decomposition tree illustrating the repetitive down-sampling of resolution for both dimensions 
Figure 2-20 A two-dimensional DWT decomposition illustrating the repetitive down-sampling of image resolution 
Figure 2-21 Example of a two-level DWT transformation of a Seagull image 
Figure 2-22 Example of a one-dimensional 2-bit vector quantisation representing any real number as one of four quantisation values 
Figure 2-23 A two-dimensional 4-bit vector quantisation of regions in a 6x6 block using 15 quantisation values 
Figure 2-24 Vector quantisation image compression is achieved by matching blocks from the original image with a block in the code book. Decompression is achieved by copying code book values to the appropriate corresponding location in the reconstructed image [36] 
Figure 2-25 Illustration of the image difference between two consecutive frames in the 'Susy' video sequence. Differential encoding compression is achieved by encoding this residue frame as opposed to frame 1. 
Figure 2-26 Illustration of the MVs observed between frames 0 and 1 of the 'Susy' sequence. Using these MVs dramatically reduces the information stored within the residue frame. 
Figure 2-27 Prediction pattern for frame sequence containing only I and P frames 
Figure 2-28 Illustration of the hierarchical structure present within the H.261 bitstream 
Figure 2-29 Prediction pattern for a frame sequence containing I, P and B frames 
Figure 2-30 Illustration of the differences between display and coding order of MPEG-1 stream 
Figure 2-31 Illustration of the hierarchical structure present in MPEG-1 video streams 
Figure 2-32 Distinction between the video coding layer and the network abstraction layer of the H.264 standard 
Figure 2-33 The nine 4x4 intra prediction modes in the H.264 standard 
Figure 2-34 The four 8x8 and 16x16 intra prediction modes in the H.264 standard 
Figure 2-35 16x16 MB partitions available in H.264 
Figure 2-36 8x8 MB partitions available in H.264 
Figure 2-37 H.264 profiles 
Figure 2-38 H.264 levels 
Figure 3-1 Memory topology of thread-level parallel systems: a) shared b) distributed and c) distributed shared 
Figure 3-2 Symmetric and asymmetric multi-processing configurations in a shared memory system 
Figure 3-3 Temporal multi-threading illustrating how, through switching active context, stall time can be reduced 
Figure 3-4 Super threading illustrates how dynamic switching of active contexts can reduce system stalls by switching to an available context as soon as there is a stall in the active context. 
Figure 3-5 Simultaneous multi-threading illustrating how one processor can make use of hardware context switching to execute multiple hardware contexts concurrently 
List ofFigures 
Figure 3-6 Schematic diagram and corresponding code example taken from the multi-threaded 
XviD encoder, to illustrate the transformation from a loop serial with a single context 
execution to a parallel loop with MAX_THREAD number of contexts executing 
xii 
concurrently .......................................................................................................................... 71 
Figure 3-7 Reduction of parallel loop iteration required to allow MAX_THREAD CPUs to 
executed the loop concurrently ............................................................................................. 71 
Figure 3-8 Reconstruction of loop iteration counter based on the current loop iteration, x_old, the 
maximum active CPUs, MAX_THREAD, and the contexts private ID, context .................. 72 
Figure 3-9 Pseudo declaration of shared memory array. Each array is declared in shared (static) 
memory space and contains one element for each active CPU in the system 
(MAX_ THREAD) ................................................................................................................. 73 
Figure 3-10 Pseudo serial loop illustrating the use of shared memory array. Initially, each context 
writes data to its private element within the array demonstrating the exclusive write 
properties of the array. Secondly, within a serial loop, a single context carries out read 
operations, demonstrating shared access to data ................................................................... 73 
Figure 3-11 Remainder (strip mining) code selection. Here an 'if statement is used to select 
individual contexts to execute the remainder code based on its context ID .......................... 74 
Figure 3-12 Sim-system's C definition of the control registers declared for each simulated CPU. 75 
Figure 3-13 Assembly macro used to obtain the processor ID number of the calling context ....... 75 
Figure 3-14 Low-level implementation of the instruction that copies the processor ID value stored 
in the CPUs control registers to a general purpose register RD_ADDR ............................... 76 
Figure 3-15 Assembly macro used to set the processors run state .................................................. 76 
Figure 3-16 Instruction to set processor state flag within the control register to the value located 
within the general purpose register RS1_ADDR .................................................................. 76 
Figure 3-17 C macro implementing the sleep stage of the barrier instruction. Entering contexts 
perform an atomic addition to the global semaphore, gsem, before either entering a sleep state
(context>0) or entering an empty loop controlled by the value of the semaphore ................ 77
Figure 3-18 C macro implementing the synchronous release stage of the barrier instruction. Here 
context zero cycles through all sleeping contexts and systematically alters their run state to 
RUN ...................................................................................................................................... 78 
Figure 3-19 C macro defining the completed barrier instruction comprising the sleep and
synchronous release stages .................................................................................................... 78 
Figure 3-20 Declaration of shared and private variables through the use of the static statement ... 78 
Figure 3-21 Creation of parallel threads using forks and joins present in the OpenMP API.. ........ 79 
Figure 3-22 OpenMP API syntax for thread creation and parallel operations ................................ 79 
Figure 3-23 Built-in OpenMP functions responsible for determining thread ID and total number of 
threads in the system respectively ......................................................................................... 80 
Figure 3-24 Syntax of the barrier instruction in OpenMP responsible for processor 
synchronisation ..................................................................................................................... 80 
Figure 3-25 POSIX thread creation and destruction using the pthread_create() and pthread_exit() 
functions ................................................................................................................................ 81 
Figure 3-26 Illustration of thread creation through repeated calls of pthread_create() function using 
a different thread handle, threads[], for each call. .................................................................. 81
Figure 3-27 Creation and initialisation of a mutually exclusive (mutex) variable in Pthreads ....... 81 
Figure 3-28 Illustration of locking mechanism in Pthreads through the use of 
pthread_mutex_lock() and pthread_mutex_unlock() to lock and unlock the mutex 
respectively ........................................................................................................................... 82 
Figure 3-29 SS_SPARC kernel demonstrating the multiple streaming vector accelerators ........... 84
Figure 3-30 Detailed schematic of the SS_SPARC super-scalar and vector datapaths .................... 85
Figure 3-31 V CORE vector datapath of the SS_SPARC designed to extend the DSP functionality
of the CPU ............................................................................................................................. 86 
Figure 4-1 Programme flow loop illustrating the five functions responsible for encoding one 
MPEG-2 frame using the TM5 reference encoder ................................................................ 98 
Figure 4-2 Profile data for the TM5 MPEG-2 encoder illustrating the percentage of execution time 
spent within each of the functional blocks .......................................................................... 100 
Figure 4-3 Double loop arrangement in the motion_estimation() function to navigate through a
frame in raster order to provide Cartesian pixel coordinates to each MB of the frame ....... 100 
Figure 4-4 Execution of motion estimation functionality through either frame or field specific 
functions .............................................................................................................................. 101
Figure 4-5 Macroblock structure of MPEG-2 TM5 reference encoder depicting all information 
stored per MB ...................................................................................................................... 101 
Figure 4-6 Dynamic creation of an MB structure during initialisation, allocating sufficient memory
to represent all MBs in one frame ....................................................................................... 102 
Figure 4-7 Creation of a static shared array to hold the pointers to the MB structure required by 
each available thread ........................................................................................................... 102 
Figure 4-8 Modified double loop in MPEG-2 TM5 encoder allowing for MAX_THREAD number 
of threads to execute the innermost loop ............................................................................ 103
Figure 4-9 Populating MB information pointer array with the correct location of the current MB 
and so allowing each thread to access a different section of the MB structure ................... 103 
Figure 4-10 Selection procedure, through use of the modulus operator, for remainder (stripmine)
execution ............................................................................................................................. 104
Figure 4-11 Realignment of mbi pointer to point to the location within the MB structure 
representing the first MB in the subsequent row ................................................................. 104 
Figure 4-12 Declaration of shared memory arrays for storing block information required by the 
sub_pred() and fdct() functions ........................................................................................... 105
Figure 4-13 Code segment representing context zero executing a serial loop in which
sub_pred() is called using each individual thread's privately stored data. The barrier 
instruction ensuring all write operations to shared variables required by sub_pred() are 
complete prior to the execution of the serial loop ............................................................... 106 
Figure 4-14 Parallel implementation of FDCT including the reallocation of block number based on
the loop iterator and context ID .......................................................................................... 106
Figure 4-15 Parallel implementation of inverse DCT within MPEG-2 encoder .......................... 107 
Figure 4-16 Code segment representing context zero executing a serial loop in which add_pred() is 
called using each individual thread's privately stored data. The barrier instruction ensuring 
all write operations to shared variables required by add_pred(), are complete prior to the 
execution of the serial loop ................................................................................................. 107 
Figure 4-17 A timeline view of both the XviD encoder and its rival encoder, Divx .................... 109 
Figure 4-18 Profiling data for the XviD encoder over three encoder quality settings; illustrating the 
percentage of execution time spent in executing each functional compression block ........ 110 
Figure 4-19 Command line options available within the xvid_encraw example programme. Each 
option specifies parameters that can be used in the encoding procedure ............................ 111
Figure 4-20 Encoding loop of xvid_encraw responsible for reading in one frame's pixel data, in 
either yuv or pgm picture format, and subsequently calling XviD's main encoding function
enc_main() to encode that frame ......................................................................................... 112 
Figure 4-21 Selection of frame encoding function (FrameCodeI() or FrameCodeP()) based on the
frame type flag, pFrame->intra ........................................................................................... 112 
Figure 4-22 Single threaded double loop responsible for identifying and accessing MBs within 
raster scan order .................................................................................................................. 113 
Figure 4-23 FrameCodeI() parallel double loop. Illustrating the calculation of the maximum loop
parameter for inner loop, the modified inner loop using this new maximum and the 
recreation of original loop index, i. ..................................................................................... 113 
Figure 4-24 Declaration of shared array of pointers storing MB locations for each thread .......... 114 
Figure 4-25 Allocation of each element within MB pointer array to their corresponding location 
within MB structure ............................................................................................................ 114 
Figure 4-26 Pseudo representation of the functions called within FrameCodeI() ......................... 114
Figure 4-27 A serial loop to calculate and store private quantisation factors for each available 
thread, based on whether lumimasking is being implemented ............................................ 115 
Figure 4-28 Modified MBTransQuantIntra to accept private MB quantisation value .................. 116
Figure 4-29 Selection of threads to process the remainder (stripmine) MBs through the use of the 
modulus operator ................................................................................................................. 116 
Figure 4-30 Graphic representation of MB encoding on a row of 11 MBs for two, three and four
CPU contexts. Remainder (stripmined) MBs are shown with a red border ........................... 117
Figure 4-31 Pseudo code segment representing the function calls in FrameCodeP(). Depending on 
the outcome of MotionEstimation(), two encoding techniques are implemented during the 
latter part of this function .................................................................................................... 118 
Figure 4-32 FrameCodeP() parallel double loop showing the calculation of the maximum loop 
parameter for the inner loop ................................................................................................ 118 
Figure 4-33 Asserting flags within shared memory array bIntra[] for a given thread depending on
the type of the MB being processed by that thread. Flagging a MB ensures that it is encoded 
using intra techniques .......................................................................................................... 119 
Figure 4-34 Section of MB encoding function based on an intra flag held in the bIntra shared
memory array ...................................................................................................................... 119 
Figure 4-35 Creation of shared memory arrays responsible for storing the pointer to the MB 
location in the MB structure, the modified prediction pattern results, and the intra frame 
encoding flag ....................................................................................................................... 120 
Figure 4-36 The combined prediction and search pairing repeatedly executed in ME. Predictions 
based on neighbouring MBs in the current frame are used as a start reference for motion 
searching using reference frames ........................................................................................ 120 
Figure 4-37 Prediction patterns: (a) ideal, (b) standard, (c) proposed, each based on the mean 
average MV from the selected MB within the current frame. The MB whose MV is being 
calculated is shown as a darkened block in the centre of the nine neighbouring MBs ........ 121 
Figure 4-38 Recalculation of displacement MVs by calculating motion prediction using the 
prediction pattern specified in the standard, and basing the new displacement on these 
predictions ........................................................................................................................... 122 
Figure 4-39 Accumulation of intra encoding flags and evaluating the total against predefined 
values in a serial loop, in order to find the total number of intra-encoded MBs in the frame. 
If the total is above the specified limit, each thread in parallel then exits the ME function with the
intra exit value ..................................................................................................................... 122 
Figure 4-40 Illustration of x264 relative bitrate compared to the reference JM encoder for image 
quality 34-40dB ................................................................................................................... 125
Figure 4-41 Comparison of bitrate relative to a single sliced frame for 2, 3, and 4 slices at image 
qualities ranging from 34 to 40dB ....................................................................................... 126 
Figure 4-42 Comparison of bitrate produced during a four sliced encoding relative to a single slice 
at four different frame resolution and for image quality range 32 to 42dB ......................... 127 
Figure 4-43 Assignment and execution of MBs in slice groups. In (a) the non-sliced encoder 
assigns all MBs in the frame to a single slice, whereas in (b) MB ranges are specified based
on the MB limits defined in the given slice's array in the encoder's data structure .............. 129
Figure 4-44 Allocation of MB ranges to each slice group based on an even distribution of MBs 
between slices ..................................................................................................................... 129
Figure 4-45 Single-threaded loop in the multi-sliced encoder responsible for calling 
x264_slice_write(). This parent function encapsulates all encoding functionality for the
given slice group ................................................................................................................. 130 
Figure 4-46 Multi-threaded multi-sliced x264_slice_write() execution loop. Modifications were
implemented to allow multiple threads to execute the loop in parallel. .............................. 131 
Figure 4-47 Relative distribution of instructions amongst CPU contexts for the number of contexts 
ranging from 1 to 32. This distribution is based on the ideal case of an equal division of the
workload ............................................................................................................................. 132 
Figure 4-48 A typical 3D view of practical distribution of instructions among available processors
illustrating the ripple effect observed due to the difficulty of precisely sub-dividing all 
instructions .......................................................................................................................... 133 
Figure 4-49 Relative distribution of instructions amongst CPU contexts for number of contexts
ranging from 1 to 24. The distribution is based on an ideal but non-equal division of the
workload ............................................................................................................................. 134 
Figure 4-50 Example frame, description and specific encoding challenges of a selection of test 
sequences used for each of the three video coding standards .............................................. 137 
Figure 4-51 Illustration of the relationship between search range used for ME and the instruction 
count observed while encoding the 'snow lane' test video on the TM5 reference encoder ... 137
Figure 4-52 Relative DIC observed when encoding the CIF 'Tennis' sequence on the modified 
multi-thread MPEG-2 encoder ............................................................................................ 138 
Figure 4-53 Reduction in relative DIC observed when encoding the 512x380 'snow lane' test 
sequence on the modified multi-thread MPEG-2 encoder .................................................. 139 
Figure 4-54 The relative DIC observed for each adaptive ME search method when using the full 
search techniques at a search range of 40x40pel... .............................................................. 141 
Figure 4-55 Reduction in relative DIC observed through exploiting TLP for twelve adaptive search 
methods when compared to single-threaded encoding of each method .............................. 142 
Figure 4-56 Reduction in relative DIC observed through exploiting TLP with the MPEG-4 encoder 
XviD using the CIF 'Tennis' sequence ............................................................................... 143 
Figure 4-57 Reduction in relative DIC observed through exploiting TLP using a scaled 'rush hour' 
sequence at QCIF, CIF and HD1080p resolutions .............................................................. 144
Figure 4-58 Reduction in relative DIC observed through encoding various CIF sequences with the
multi-threaded x264 encoder using 4 active slice groups .................................................... 145 
Figure 4-59 Frame from Coastguard sequence illustrating slice specific motion ......................... 146 
Figure 4-60 Reduction in relative DIC observed when encoding the HD1080p 'Rush hour' 
sequence using the multi-threaded x264 encoder with 32 active slice groups ..................... 146
Figure 5-1 Programmer's model used for modelling the vector environment illustrating the additional
vector and scalar registers required for vector operations ................................................... 151
Figure 5-2 C model of the vstateT structure that defines the vector register file (VRF) and vector 
length register that are required for vector operations as stated in the programmer's model.
............................................................................................................................................ 151 
Figure 5-3 C model of the sstateT structure that defines the scalar register file (RF) that is
required for vector operations as stated in the programmer's model ................................... 152
Figure 5-4 Reference sum of absolute difference (SAD) C code present within the XviD encoder . 
............................................................................................................................................ 153 
Figure 5-5 Assignment of the inputs of the function, in this case the current and reference frame 
pointers, to vector pointers representing vector registers .................................................... 153 
Figure 5-6 Clearing of scalar register 1 for future use within the SAD algorithm through the use of 
sclr() instruction .................................................................................................................. 153 
Figure 5-7 Vectorised inner loop of the SAD algorithm illustrating the use of vector loads (vldb) 
and the vector instruction vsad ............................................................................................ 154 
Figure 5-8 Remainder (stripmined) instruction calls being executed if the original maximum loop
parameter, 8, does not divide exactly into VLMAX. Note the setting of the vlen register to
specify the valid elements within the vector ....................................................................... 154 
Figure 5-9 Scalar store instruction to return the computed SAD value to the C variable sad ....... 155 
Figure 5-10 Vector instruction for sum of absolute difference. Executing SAD operations on all 
valid vector elements as defined by vlen ............................................................................. 155 
Figure 5-11 8x8 SAD calculation function seen with both the original C and the vectorised 
implementation .................................................................................................................... 156 
Figure 5-12 Reduction in relative DIC observed through vectorising the XviD MPEG-4 encoder
with a vector length (VLMAX) ranging between 8 and 248 ............................................... 157 
Figure 5-13 Schematic view of the RTL vector datapath illustrating the three main function logic 
blocks: velement, pack_merge and add_tree; as well as the 2 stage design ........................ 159
Figure 5-14 Illustration of the four replicated vector element units ............................................ 160 
Figure 5-15 Seven vector instructions and their accompanying description present within the 
add_sub entity ..................................................................................................................... 161 
Figure 5-16 Add_sub unit ............................................................................................................ 162 
Figure 5-17 Three vector instructions and their accompanying description present within the mult 
entity ................................................................................................................................... 162 
Figure 5-18 Mult unit. .................................................................................................................. 163 
Figure 5-19 Three vector instructions and their accompanying description present within the 
shifter entity ........................................................................................................................ 163 
Figure 5-20 Shifter unit. .............................................................................................................. 164 
Figure 5-21 Ten vector instructions and their accompanying description present within the shifter 
entity ................................................................................................................................... 165 
Figure 5-22 Misc unit .................................................................................................................. 166
Figure 5-23 Extract Functions ..................................................................................................... 166 
Figure 5-24 Clip Function ............................................................................................................ 166 
Figure 5-25 Compare Functions .................................................................................................. 167 
Figure 5-26 Sum of Absolute Differences Function .................................................................... 168 
Figure 5-27 Splat Functions ......................................................................................................... 168 
Figure 5-28 Pack Functions ......................................................................................................... 172 
Figure 5-29 Unpack Functions .................................................................................................... 173 
Figure 5-30 Merge Functions ...................................................................................................... 174 
Figure 5-31 Schematic view of stage one of the RTL based vector datapath illustrating the 
replicated velement process and the single pack_merge process ........................................ 175
Figure 5-32 Stage One Masking Process ..................................................................................... 176 
Figure 5-33 Schematic view of stage 2 of the RTL vector datapath illustrating the add_tree logic 
block .................................................................................................................................... 177 
Figure 5-34 Adder tree logic block composing of a number of adder components arranged into a 
tree formation with the number of adders per row decreasing as they travel down the tree. 
············································································································································ 178 
Figure 5-35 SystemC vector datapath implementation illustrating the three asynchronous and one 
synchronous process ............................................................................................................ 179 
Figure 5-36 A comparison between the IO interface declarations of a) SystemC and b) RTL
designs ................................................................................................................................. 180 
Figure 5-37 Declaration of four processes present within SystemC design .................................. 181
Figure 5-38 Use of read and write properties of SystemC signals to illustrate signal assignment 
present within proc_clock() process .................................................................................... 181 
Figure 5-39 Use of casting to read in a specific number of bits from a signal and assigning to a 
variable ................................................................................................................................ 182 
Figure 5-40 Assignment of a bit vector to a variable array by use of an intermediate integer signal.
............................................................................................................................................ 182
Figure 5-41 C macro representation of shift right 16 instruction .................................................. 182 
Figure 5-42 SystemC implementation of shift right 16 instruction .............................................. 183 
Figure 5-43 Function to mask elements in the array at locations greater than vlen ...................... 183 
Figure 5-44 Assignment of a variable array to a bit vector by use of an intermediate integer signal.
............................................................................................................................................ 184
Figure 5-45 Statistical power consumed on RTL design at various operating frequencies and for 
each VLMAX ...................................................................................................................... 186 
Figure 5-46 Statistical power consumed on SystemC design at various operating frequencies and 
for each VLMAX ................................................................................................................ 186 
Figure 5-47 Area of RTL design at varying operating frequencies and for each VLMAX .......... 187 
Figure 5-48 Area of SystemC design at varying operating frequencies and for each VLMAX ... 188 
Figure 5-49 Power and area of both RTL and SystemC design methodologies for vector length of 
32 ......................................................................................................................................... 189 
Figure 5-50 Reduction in DIC count obtained with the combined TLP and DLP MPEG-4 encoders 
for the Foreman test sequence ............................................................................................. 190 
Figure 5-51 Power consumption and synthesised area for the SS_SPARC with vectorised MPEG-4
co-processor at varying operating frequencies, vector lengths of 16 and 32, and with 1, 2, 4
and 8 processors .................................................................................................................. 192 
LIST OF TABLES 
Table 2-1 An example of Lempel-Ziv compression flow for input stream THETHREETREES ... 11
Table 2-2 Summary of the 25 divisions within the ITU's standards .......................................... 37
Table 2-3 The four video compression standards in subsection H of the ITU standard ................. 38
Table 2-4 Three standards described by MPEG ............................................................................. 38 
Table 2-5 Five sub-sections within MPEG-1 standard ................................................................... 40 
Table 2-6 The 11 sub-sections in the MPEG-2 standard ................................................................ 43 
Table 2-7 Summary of the encoding features of the six profiles defined by the MPEG-2 standard
.............................................................................................................................................. 45
Table 2-8 The four levels defined in the MPEG-2 standard and their maximum resolutions and 
bitrates ................................................................................................................................... 45 
Table 2-9 The subsections of the MPEG-4 standard ...................................................................... 47 
Table 2-10 The three profiles of MPEG-4 showing their individual levels .................................... 50 
Table 4-1 Test model (TM) stages in the MPEG-2 development. .................................................. 97 
Table 4-2 Table illustrating the mapping of compression functionality onto C functions within the 
TM5 encoder ......................................................................................................................... 99 
Table 4-3 Comparison of software H.264 encoders, both commercial and open-sourced ............ 124 
CHAPTER 1: 
INTRODUCTION 
1.1. The digital world 
As consumer demand for mobile video and multimedia-rich products continues to
grow apace, a number of lossy video compression standards have been developed in
order to make better use of limited storage capacity and to enable transmission over low
bandwidth networks. In tandem with ongoing work on the development of new
and ever more efficient video compression methods, the sophistication and complexity of
successive generations of embedded computer systems have continually increased. One
method of providing the necessary computational power to execute these increasingly
complex standards in consumer products has been to create dedicated System on Chip
(SoC) computer systems. One very potent way of leveraging SoC computational power is
to exploit the abundant parallelism found in such algorithms, the most
prominent forms being Thread-Level Parallelism (TLP) and Data-Level Parallelism
(DLP).
1.2. Problem formulation 
This work addresses the exploitation of TLP and DLP and quantifies the performance 
benefits (dynamic instruction count reduction) in three state-of-the-art video compression 
standards when executing on a parallel RAM (PRAM) model. It will be shown that a 
significant reduction in per-CPU instruction count (the number of instructions executed on
an individual CPU) can be achieved by exploiting these forms of parallelism.
There are two main background research areas that are relevant to the current work: video 
compression and parallel computing. 
1.2.1. Video compression evolution 
The use and subsequent advancements in image and video compression techniques have 
been driven by the ever growing consumer demand for higher quality, increased 
resolution and mobile video encoding/decoding capabilities. The usual aim is to
achieve higher quality compressed video (according to a defined metric) for a
given bandwidth requirement.
The roots of modern video compression lie in the static image compression domain.
Compression techniques such as the discrete cosine transform (DCT)[1], quantisation[2] and
variable length coding (VLC)[3, 4], found in the Joint Photographic Experts Group's (JPEG)[5]
static image compression standards, are also employed for intra-frame compression
within the Moving Picture Experts Group's (MPEG)[6] video compression standards,
MPEG-1[7], MPEG-2[8] and MPEG-4[9]. These techniques are augmented by additional 
compression methods such as motion estimation (ME) and motion compensation (MC) 
that have been developed to remove temporal redundancies found between frames. 
1.2.2. Parallel computing 
Parallelism in computer architecture has been available for a number of years. From the
super (vector) computers such as the Crays[10] of the 1970s, the RISC and RISC-like
superscalar processors of the late 1980s and SIMD architectural extensions such as
MMX[11] of the 1990s to the hyper-threaded multi-cores of the 2000s, various forms of
parallel computing have been at the forefront of computer innovation. Parallel computing
can be classified in various ways; among the most popular is Flynn's taxonomy[12], which
classifies computers based on the multiplicity of instructions and data-sets that can be
operated on in one cycle. Another classification defines each parallel computer based on
the granularity at which the parallelism is applied; the three classes are shown in Figure 1-1.
• Thread-level parallelism 
• Data-level parallelism 
• Instruction-level parallelism 
Figure 1-1 Three classifications of parallel operations based on the target granularity of each 
approach. 
Thread-level parallelism (TLP) focuses on functional regions in the control flow graph 
(CFG) of an application and their execution on independent CPU contexts of a multi-core 
system. These TLP-aware architectures, which allow the execution of multiple threads in 
parallel, can be classified based on their memory hierarchies (shared or distributed) and 
task/resource coverage across processing units (symmetric or asymmetric 
multiprocessing) or within a single core. 
Data-level parallelism (DLP) focuses on data and the data operations performed in the 
CFG. Here, data to which the same arithmetic operation is applied, can be concatenated to 
form a data vector. The arithmetic operation need then be only applied to the vector as a 
whole, therefore allowing a large dataset to be processed by a single instruction rather 
than requiring individual instructions for each data item. 
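As a rough software analogy of this idea (it is not the vector datapath developed later 
in this thesis), the sketch below concatenates four 8-bit values into one 32-bit word and 
adds all four lanes with a single sequence of word-wide operations. The function names 
and the four-lane, 8-bit layout are assumptions made purely for illustration. 

```python
LANE_LOW7 = 0x7F7F7F7F   # low seven bits of each 8-bit lane
LANE_MSB = 0x80808080    # top bit of each 8-bit lane


def pack_u8(lanes):
    """Concatenate four 8-bit values into one 32-bit word (lane 0 in the low byte)."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFF) << (8 * i)
    return word


def unpack_u8(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def packed_add_u8(a, b):
    """Add four 8-bit lanes at once, wrapping modulo 256 within each lane."""
    # Adding only the low seven bits guarantees no carry crosses a lane boundary.
    low7 = (a & LANE_LOW7) + (b & LANE_LOW7)
    # The top bit of each lane is then restored with an exclusive-or.
    return low7 ^ ((a ^ b) & LANE_MSB)


vector_sum = unpack_u8(packed_add_u8(pack_u8([1, 2, 3, 4]), pack_u8([10, 20, 30, 40])))
print(vector_sum)
```

A scalar implementation would need one addition per data item; here the same 
arithmetic is applied to the packed vector as a whole. 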
The final form of parallelism, instruction-level parallelism (ILP), focuses on the 
instruction as the unit to which parallelisation applies. With ILP, individual instructions 
of the dynamic instruction flow are executed on different functional units present within 
the processor's main pipeline. By implementing a range of techniques, such as instruction 
pipelining[13], register renaming[14, 15] and out-of-order (OoO) execution, 
dependencies between instructions can be eliminated or significantly reduced, yielding 
high utilisation of the parallel functional units. 
1.3. Thesis structure 
This thesis is divided into six chapters, with the first comprising this introduction to the 
subject matter of the thesis. The second chapter describes previous work on image and 
video compression and is divided into two distinct sections. The first section looks at 
different image compression techniques including those for compressing static images 
and the extensions introduced for motion. The second section shows how the methods of 
the first section have been combined to produce the international video compression 
standards ratified by organisations such as the International Telecommunication 
Union[16] and the Moving Picture Experts Group[6]. 
Chapter three concentrates on parallel computing and, in particular, discusses the 
motivation for employing parallelism to accelerate computationally intensive tasks. Firstly, 
TLP operations, architectures, coding techniques and development tools available to 
exploit TLP are discussed; secondly, four SIMD architectures are examined to 
illustrate approaches to DLP hardware design and the motivation for each. 
Chapter four describes the practical work undertaken in this thesis to re-write, in a 
thread-parallel fashion, three complex video encoders. Focusing on the MPEG-2, MPEG-4 and 
H.264 encoders in turn, the chapter systematically evaluates the modifications made to 
allow each to execute in parallel on a custom parallel simulator, and presents the 
results. 
The fifth chapter is concerned with data-level parallelism. This chapter continues work 
carried out by the ESD group at Loughborough University into instruction set design and 
acceleration of audio and video encoders through vectorisation[17-19]. Here, two vector 
datapaths for accelerating the MPEG-4 video encoder XviD were developed using a 
conventional hardware description language and the high-level electronic system 
language, SystemC. Both implementations were subsequently synthesised and placed and 
routed to allow speed, power and area measurements to be taken, so that the two designs 
could be compared and the overall feasibility of the proposed datapaths assessed. 
The final chapter presents the principal findings of this research and the future work 
that could continue the work and build on the findings obtained thus far. 
1.4. Contributions 
This work aims to research embedded CPU architecture and software algorithms in order 
to accelerate video encoding applications through exploiting both thread and data-level 
parallelism. The key contributions of the work are as follows: 
• Demonstration of the reduction in computational workload per processor by the 
exploitation of thread-level parallelism. 
• Static partitioning of three state-of-the-art video encoders, MPEG-2, MPEG-4 and 
H.264, to allow for execution on a multi-processor environment. 
• Design of a vector datapath for accelerating MPEG-4 video encoding through 
exploiting data-level parallelism. 
• Comparative study into the potential for using the ESL language, SystemC, as a 
complete design methodology as compared to traditional RTL. 
During the course of this research, its findings have been internationally recognised by 
being presented and published in 11 academic conferences and journals. 
1.5. References 
[1] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete Cosine Transform," 
Computers, IEEE Transactions on, vol. C-23, pp. 90-93, 1974. 
[2] S. J. Solari, Digital video and audio compression. New York: McGraw-Hill, 
1997. 
[3] F. Ercal, M. Allen, and H. Feng, "A systolic image difference algorithm for 
RLE-compressed images," Parallel and Distributed Systems, IEEE Transactions on, 
vol. 11, pp. 433-443, 2000. 
[4] D. A. Huffman, "A Method for the Construction of Minimum-Redundancy 
Codes," Proceedings of the IRE, vol. 40, pp. 1098-1101, 1952. 
[5] "Joint Photographic Experts Group," ITU-T T.81, ISO/IEC IS 10918-1, 1994. 
[6] "Motion Picture Expert Group," http://www.chiariglione.org/mpeg. 
[7] "Coding of moving pictures and associated audio for digital storage media at up 
to about 1,5 Mbit/s." vol. 11172: ISO/IEC, 1993. 
[8] "Generic coding of moving pictures and associated audio (MPEG-2)." vol. 
13818: ISO/IEC, 1995. 
[9] "Information technology -- Coding of audio-visual objects," ISO/IEC 14496, 
2000. 
[10] R. M. Russell, "The CRAY-1 computer system," Communications of the ACM, 
vol. 21, pp. 63-72, 1978. 
[11] "Intel Architecture Optimization Manual," 242816-003, Intel Corp. 1997. 
[12] M. J. Flynn, "Some Computer Organizations and Their Effectiveness," 
Computers, IEEE Transactions on, vol. C-21, pp. 948-960, 1972. 
[13] J. Crawford, "The execution pipeline of the Intel i486 CPU," in Compcon Spring 
'90. 'Intellectual Leverage', Thirty-Fifth IEEE Computer Society International 
Conference., 1990, pp. 254-258. 
[14] S. A. Mahlke, W. Y. Chen, J. C. Gyllenhaal, W. W. Hwu, P. P. Chang, and T. 
Kiyohara, "Compiler code transformations for superscalar-based high-
performance systems," 1992, pp. 808-817. 
[15] M. Moudgill, K. Pingali, and S. Vassiliadis, "Register renaming and dynamic 
speculation: an alternative approach," 1993, pp. 202-213. 
[16] "International Telecommunication Union," http://www.itu.int. 
[17] V. A. Chouliaras, J. L. Nunez-Yanez, and S. Agha, "Silicon Implementation of a 
Parametric Vector Datapath for Real-Time MPEG2 Encoding," in IASTED 
Conference on Signal and Image Processing, 2004, pp. 98-303. 
[18] T. R. Jacobs, V. A. Chouliaras, and J. L. Nunez-Yanez, "A thread and data-
parallel MPEG-4 video encoder for a system-on-chip multiprocessor," in 
Application-Specific Systems, Architecture Processors, 2005. ASAP 2005. 16th 
IEEE International Conference on, Samos, Greece, 2005, pp. 405-410. 
[19] K. Koutsomyti, S. R. Parr, V. A. Chouliaras, and J. Nunez, "Applying Data-
Parallel and Scalar Optimizations for the efficient implementation of the G.729A 
and G.723.1 Speech Coding Standards," in Signal and Image Processing, Seventh 
IASTED International Conference on (SIP 2005), Honolulu, Hawaii, 2005. 
CHAPTER 2: 
VIDEO COMPRESSION 
This chapter discusses the motivations for carrying out video compression and quantifies 
the computational demands such compression algorithms place on VLSI-based consumer 
electronic products. Video compression algorithms reduce the number of 
bits required to reproduce the video sequence while maintaining a desired quality level. 
To illustrate this, the calculation in Figure 2-1 shows that a 1 minute uncompressed clip 
at D1 (720x576) resolution occupies 1.7 GBytes. 
o Colour depth: 8 bit per component colour        8 bits/colour 
o 3 component colours per pixel (RGB)             3x8 bits/pixel 
o 414720 pixels @ D1 resolution (720x576)         720x576x3x8 bits/frame 
o 25 fps frame rate                               25x720x576x3x8 bits/s 
                                                  = 248832000 bit/s 
                                                  = 249 Mbit/s 
o 60 seconds per minute                           = 1.7 GB/min 
Figure 2-1 Illustration of the bitrate requirement for uncompressed video data. 
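The arithmetic in Figure 2-1 can be checked with a few lines of Python. This is only a 
sketch; the constant and function names are chosen here for illustration. Note that the 
1.7 figure corresponds to gigabytes of 2^30 bytes; in decimal units the same clip is 
roughly 1.87 x 10^9 bytes. 

```python
WIDTH, HEIGHT = 720, 576          # D1 resolution
BITS_PER_COMPONENT = 8            # colour depth per component
COMPONENTS_PER_PIXEL = 3          # R, G and B
FRAMES_PER_SECOND = 25


def uncompressed_bitrate():
    """Bits per second needed for raw D1 video."""
    bits_per_frame = WIDTH * HEIGHT * COMPONENTS_PER_PIXEL * BITS_PER_COMPONENT
    return bits_per_frame * FRAMES_PER_SECOND


def bytes_per_minute():
    """Storage needed for one minute of raw D1 video, in bytes."""
    return uncompressed_bitrate() * 60 // 8


print(uncompressed_bitrate(), "bit/s")
print(round(bytes_per_minute() / 2**30, 2), "GBytes per minute (2^30-byte units)")
```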
Due to the large memory capacity required for such video streams, some form of 
compression scheme is required to be able to handle, store and transmit this data. Within 
this chapter, a broad range of video and image compression methods are examined. These 
can largely be divided into two types, lossless and lossy. 
2.1. Lossless compression methods 
Lossless compression, as the name suggests, is the process of reducing the memory 
requirement of the video sequence without any loss of information content. An image 
where compression and subsequent decompression has been performed using a lossless 
codec would produce a frame pixel-identical to the original frame. Although this is the 
ideal in terms of image quality, lossless compression schemes are not very common in the 
consumer market due to their relatively poor compression ratio compared to those 
achieved by lossy compression, as illustrated in section 2.2. Lossless compression is more 
likely to be found in professional video editing and mastering applications[1], where the 
retention of all detail is necessary prior to the application of a lossy compression scheme 
for transmission or storage. In addition, lossless encoding is often most appropriate for 
medical imaging[2], where due to the very high-detail often necessary, lossy compression 
can potentially destroy vital pieces of information within the video sequence that are 
absolutely essential for a prompt and accurate diagnosis. With these exceptions where the 
retention of fine detail is of paramount importance, the majority of compression schemes 
employed for video sequences are lossy. This is mainly due to the fact that the 
compression ratios obtainable through lossless compression methods are relatively poor 
(around 2:1) when compared to 'good' quality lossy compression schemes (around 20:1) 
[3]. However, a number of lossless techniques can be found within lossy schemes and 
thus these compression techniques are described below. 
2.1.1. Run length encoding 
Run length encoding (RLE) compresses sequences containing identical consecutive data 
(runs), into a single data value and the number of consecutive values (the run length)[4]. 
Uncompressed stream: 
000000000001000000000000111000000000000000000000000100000 
RLE stream: (0,12) (1,1) (0,12) (1,3) (0,24) (1,1) (0,5) 
Figure 2-2 Comparison between uncompressed and RLE bitstream. 
Using the bitstream in Figure 2-2 as an example, it can be seen that the uncompressed 
data stream contains 52 digits, whereas the RLE encoded stream contains 14 digits; a 
theoretical compression ratio of 3.7:1. This compression scheme is very good when long 
runs of repetitive data are encountered, such as often found in black and white images 
(1-bit colour depth) where the probability of long runs is high. This can be seen by the 
implementation of RLE within the fax standard[5] along with Huffman coding (see 
below). 
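A minimal sketch of run length encoding is given below, assuming the input is a string 
of symbols and the output is a list of (value, run length) pairs as in Figure 2-2. The 
function names and the short test stream are illustrative and are not taken from any 
particular codec. 

```python
from itertools import groupby


def rle_encode(symbols):
    """Collapse each run of identical symbols into a (value, run length) pair."""
    return [(value, len(list(run))) for value, run in groupby(symbols)]


def rle_decode(pairs):
    """Expand (value, run length) pairs back into the original symbol string."""
    return "".join(value * length for value, length in pairs)


stream = "0000000111000001"
encoded = rle_encode(stream)
print(encoded)
print(rle_decode(encoded) == stream)
```

Because decoding reproduces the input exactly, the scheme is lossless; it only saves 
space when the runs are long compared with the cost of storing each pair. 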
2.1.2. Huffman 
Huffman coding is an entropy coding scheme that uses variable-length codewords to 
represent the source symbols[6]. By arranging the source symbols in decreasing order of 
probability and setting shorter codewords for symbols that occur with a higher 
probability, a symbol stream can be efficiently coded using this method. 
Character   Relative probability 
a           0.31   0.31   0.31   0.42   0.58   1.0 
b           0.27   0.27   0.27   0.31   0.42 
c           0.19   0.19   0.23   0.27 
d           0.11   0.12   0.19 
e           0.08   0.11 
f           0.04 
Figure 2-3 Huffman probability graph illustrating how individual probabilities combine 
together to form a tree structure. 
Figure 2-3 depicts how the individual probabilities of the characters found within a 
given input stream can be combined to produce a combined probability total of 
one. The symbols are ordered by probability, the probabilities of the two least 
probable symbols are summed, and the list is then reordered according to the new 
values. Each successive column therefore reduces the number of symbols by one until 
eventually the graph is complete[7]. 
1.0 
  0: 0.58 
       0: 0.31 (A=00) 
       1: 0.27 (B=01) 
  1: 0.42 
       0: 0.23 
            0: 0.12 
                 0: 0.08 (E=1000) 
                 1: 0.04 (F=1001) 
            1: 0.11 (D=101) 
       1: 0.19 (C=11) 
Figure 2-4 A rearrangement of the Huffman probability graph to form the Huffman tree. 
This tree assigns each branch with a specific unique bit arrangement. 
By redrawing the probability graph, the Huffman tree, shown in Figure 2-4, is produced. 
By allocating a unique binary digit at each node, a binary codeword is created for each 
symbol by following from the root node to each specific leaf node. The codewords 
produced by Huffman coding are prefix-free, meaning that no codeword is a prefix of any 
other codeword, so a received bit sequence cannot be decoded ambiguously. This can be 
seen since every branch within the tree produces a leaf whose code is its parent's code 
plus the branch bit. Since symbols are assigned only to leaf nodes, the path to one 
codeword never passes through another, and no codeword can contain another codeword as 
a prefix. 
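The construction described above can be sketched in Python as follows. This is an 
illustrative implementation using a priority queue, not the procedure of any particular 
standard; the function name is an assumption, and the 0/1 labelling of branches is 
arbitrary, so individual codewords may differ from Figure 2-4 even though the codeword 
lengths agree. 

```python
import heapq
from itertools import count


def huffman_code(probabilities):
    """Build a prefix-free code by repeatedly merging the two least probable nodes."""
    tiebreak = count()  # unique counter so the dictionaries are never compared
    heap = [(p, next(tiebreak), {symbol: ""}) for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)   # least probable node
        p1, _, codes1 = heapq.heappop(heap)   # second least probable node
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


probabilities = {"a": 0.31, "b": 0.27, "c": 0.19, "d": 0.11, "e": 0.08, "f": 0.04}
code = huffman_code(probabilities)
for symbol in sorted(code):
    print(symbol, code[symbol])

average_length = sum(p * len(code[s]) for s, p in probabilities.items())
print(round(average_length, 2), "bits per symbol on average")
```

For the probabilities of Figure 2-3 the codeword lengths are 2, 2, 2, 3, 4 and 4 bits, 
giving an average of 2.35 bits per symbol compared with the 3 bits a fixed-length code 
would need for six symbols. 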
2.1.3. Lempel-Ziv 
The Lempel-Ziv algorithm, and its variants[8, 9], use adaptive dictionary-based 
compression techniques. This compression scheme operates by scanning through the 
input symbol stream to build a dictionary of symbol strings. When a dictionary entry is 
seen in the input stream, it is replaced with a dictionary reference instead of the string 
itself. This method does not rely on knowledge of the source data or any predefined 
codewords, since the algorithm dynamically creates the dictionary during scanning, 
increasing the number of bits per character as necessary[10]. The size of the dictionary 
can have a dramatic effect on the compression ratio of a given sequence[11], since savings 
resulting from the use of codewords need to be greater than the memory overheads arising 
from the dictionary itself. Table 2-1 gives an example of how an input stream 
(THETHREETREES) would be encoded using the Lempel-Ziv algorithm. 
Table 2-1 An example of Lempel-Ziv compression flow for input stream 
THETHREETREES. 
Sequence: THETHREETREES 
Previous string   New           Dictionary   Meaning of                 Output   Meaning of 
or character      character     codeword     codeword                            output 
-                 T             -            -                          -        - 
T                 H             D1           TH                         T        T 
H                 E             D2           HE                         H        H 
E                 T             D3           ET                         E        E 
T                 H             -            TH already in dictionary   -        - 
TH                R             D4           D1+R=THR                   D1       TH 
R                 E             D5           RE                         R        R 
E                 E             D6           EE                         E        E 
E                 T             -            ET already in dictionary   -        - 
ET                R             D7           D3+R=ETR                   D3       ET 
R                 E             -            RE already in dictionary   -        - 
RE                E             D8           D5+E=REE                   D5       RE 
E                 S             D9           ES                         E        E 
S                 End of data   -            -                          S        S 
In Table 2-1, it is assumed that each uncompressed character is represented by 8-bits and 
the compressed symbol by 9-bits. It is seen that the compressed output uses 90-bits (10 
codewords each 9-bits long) compared to 104-bits (13 characters each 8-bit long); a 
compression ratio of 1.15:1 (15% improvement). 
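The flow of Table 2-1 can be reproduced with the short sketch below. It follows the 
table's simplified scheme, in which single characters are output literally and only 
multi-character strings receive dictionary codewords (D1, D2, ...); the function name 
and the output representation are assumptions made for illustration. 

```python
def lz_encode(text):
    """Encode text as in Table 2-1: literals plus references to learned strings."""
    dictionary = {}    # learned string -> codeword name, e.g. "TH" -> "D1"
    output = []
    previous = ""
    for character in text:
        candidate = previous + character
        if previous == "" or candidate in dictionary:
            previous = candidate          # keep extending the current match
            continue
        # Emit the longest match found so far (a literal or a codeword) ...
        output.append(dictionary.get(previous, previous))
        # ... and learn the new, longer string for later use.
        dictionary[candidate] = "D" + str(len(dictionary) + 1)
        previous = character
    if previous:
        output.append(dictionary.get(previous, previous))
    return output, dictionary


output, dictionary = lz_encode("THETHREETREES")
print(output)
print(len(output) * 9, "bits compressed,", len("THETHREETREES") * 8, "bits uncompressed")
```

The ten output symbols at 9 bits each give the 90 bits quoted above, against 104 bits 
for the thirteen 8-bit input characters. 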
Lempel-Ziv is found within the graphics interchange format (GIF) [12] and as a 
compression option within the tagged image file format (TIFF) [13]. This compression 
scheme has various strengths and weaknesses in the context of image compression. The 
dictionary-based scheme only achieves compression when repetitive patterns in the 
symbol stream are encountered. Within line diagrams or images with low variation this is 
a very common occurrence and results in a high compression ratio. Natural images, 
however, have high variations across the image and few repetitive structures and thus 
Lempel-Ziv-based schemes do not produce such large compression ratios. 
2.2. Lossy compression methods 
In contrast to lossless compression, lossy compression permanently removes part of the 
input symbol stream information. Image and video compression schemes make use of 
prior knowledge of human visual perception, achieving high compression ratios with little 
quality deterioration. In the following section a number of common lossy compression 
schemes are introduced. 
2.2.1. Colour conversions 
Historically colour images have been represented by three colour components. Within 
video display units (VDU), be it a cathode ray tube (CRT), a liquid crystal display (LCD) 
or a plasma display panel (PDP), it is red, green and blue dots (pixels) of varying 
brightness that are used to produce the illusion of colour. This red green blue (RGB) 
representation of colour is known as the colour space. 
Figure 2-5 RGB colour representation illustrating the red, green and blue colour 
channels that combine together to form a full colour image. 
Figure 2-5 shows how a colour image can be represented using the RGB colour space. By 
combining the three separate colour images the resultant true colour image is reproduced. 
The number of distinct colours that can be produced from the colour space is known as the 
colour depth of the image and is measured in terms of the total number of bits used to 
represent a single pixel. If 8 bits were chosen for each colour component, this 
corresponds to 256 distinct intensities for each colour and hence the number of possible 
colours that can be displayed would be 256x256x256 = 16.7 million (approximately) at a 
colour depth of 24 bits (8+8+8)[3]. It should be evident that the number of bits used to 
represent colour information has a great effect on the resultant bitrate of streaming 
video. The RGB representation is not the only colour space available and alternative 
colour spaces have been devised for other applications. The main colour space used within 
the video compression community is YUV, in which the information is separated into one 
luminance (brightness, Y) and two chrominance (colour, U and V) components[14]. 
Figure 2-6 YUV colour representation illustrating the one luminance, Y, and two 
chrominance, U and V, colour channels that combine together to form a full colour image. 
Figure 2-6 demonstrates how a colour image can be formed from its YUV components. 
To translate to and from the RGB colour space the following equations are used. 
Y = 0.299R + 0.587G + 0.114B (2-1) 
U = 0.492(B - Y) = -0.147R - 0.289G + 0.436B (2-2) 
V = 0.877(R - Y) = 0.615R - 0.515G - 0.100B (2-3) 
To produce the black and white luma component a weighted average of each RGB colour 
is made. Using this Y value, the U and V components are calculated by subtraction from 
the blue and red colours respectively. 
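Equations 2-1 to 2-3 translate directly into code. The sketch below applies them to a 
single pixel and also inverts them to recover RGB; the function names are illustrative, 
and the inverse is simply derived algebraically from the three forward equations rather 
than taken from a standard. 

```python
def rgb_to_yuv(r, g, b):
    """Forward conversion following Equations 2-1 to 2-3."""
    y = 0.299 * r + 0.587 * g + 0.114 * b   # weighted average (luma)
    u = 0.492 * (b - y)                      # scaled blue difference
    v = 0.877 * (r - y)                      # scaled red difference
    return y, u, v


def yuv_to_rgb(y, u, v):
    """Inverse conversion obtained by rearranging the forward equations."""
    r = y + v / 0.877
    b = y + u / 0.492
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return r, g, b


y, u, v = rgb_to_yuv(255, 0, 0)              # a pure red pixel
print(round(y, 2), round(u, 2), round(v, 2))
print([round(c, 6) for c in yuv_to_rgb(y, u, v)])
```

A grey pixel (R = G = B) yields U = V = 0, which reflects the fact that all of its 
information is carried by the luminance component. 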
Dividing the colour space into luma and chroma components better fits the model of the 
human visual system (HVS)[15]. The eye, as the main sensor in the HVS, has on the 
retina two different types of photoreceptors, known as rods and cones. The rods are 
responsible for detecting brightness and shape but no colour, whereas the cones are 
responsible for detecting colour. The eye as an imaging system is more sensitive to 
brightness compared to colour, containing approximately 120 million rods compared to 
6 to 7 million cones[16, 17]. Through knowledge of the HVS and a technique known as colour 
sub-sampling, colour space compression is possible. Using this compression scheme the 
colour depth of the image can be reduced without noticeably reducing the perceived 
colour representation. 
By virtue of the eye's reduced sensitivity to colour it is possible to digitally store less 
colour information and yet retain perceptually similar colour information. Sub-sampling 
is the process of sampling the chrominance information at a lower spatial frequency than 
the luminance component. There is a variety of sub-sampling schemes available, each 
representing different variations in the amount of colour data that is retained. Each 
scheme is represented by the ratio, x:y:z, of the three components in a YUV signal. The 
convention is to specify the number of 4x4 blocks sampled within an 8x8 block 
containing 64 pixel values. 
In 4:4:4 sampling, Figure 2-7a, the sampling rate of all three components is the same, 
thus none of the colour information is lost. This scheme is used in high-end film 
production to best maintain colour content after editing sequences. Within the 4:2:2 
scheme, Figure 2-7b, each of the chroma components is sampled at half the horizontal 
frequency while preserving the vertical rate. This sub-sampling standard tends to be used 
within the advanced consumer and professional video recording industries and is included 
in DV (50Mbps)[18]. The 4:1:1 sampling scheme, Figure 2-7c, further sub-samples the 
chroma data in the horizontal direction while leaving the vertical intact. The horizontal 
frequency is a quarter of the vertical. This sampling scheme can be found within portable 
consumer video recorders using the NTSC DV compression scheme. The 4:2:0 scheme, 
Figure 2-7d, reduces both the horizontal and vertical sampling rate. As with 4:2:2 the 
horizontal rate is halved, whereas each line carries an alternate chroma value in the 
vertical direction. This scheme can be seen as 4:2:0 on one line and then 4:0:2 on the 
next. The 4:2:0 sub-sampling scheme is the most commonly used within the consumer 
environment, being found in DVD, the digital TV standards[19] and in the PAL DV 
compression scheme. 
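The effect of each scheme on the amount of chroma data can be sketched as below. The 
sketch simply discards chroma samples (plain decimation); practical encoders commonly 
filter or average neighbouring samples instead, so this is an assumption made to keep 
the example short. The table of factors and the function names are likewise 
illustrative. 

```python
# (horizontal, vertical) chroma decimation factors for each scheme
SCHEMES = {"4:4:4": (1, 1), "4:2:2": (2, 1), "4:1:1": (4, 1), "4:2:0": (2, 2)}


def subsample(plane, scheme):
    """Keep every h-th chroma sample on every v-th line of a 2D chroma plane."""
    h_factor, v_factor = SCHEMES[scheme]
    return [row[::h_factor] for row in plane[::v_factor]]


def average_bits_per_pixel(scheme, bits_per_sample=8):
    """One luma sample per pixel plus two sub-sampled chroma planes."""
    h_factor, v_factor = SCHEMES[scheme]
    return bits_per_sample * (1 + 2 / (h_factor * v_factor))


chroma = [[row * 8 + col for col in range(8)] for row in range(4)]  # 4 lines of 8 samples
for scheme in SCHEMES:
    reduced = subsample(chroma, scheme)
    print(scheme, len(reduced), "x", len(reduced[0]), average_bits_per_pixel(scheme))
```

With 8-bit samples the average cost falls from 24 bits per pixel for 4:4:4 to 16 for 
4:2:2 and 12 for both 4:1:1 and 4:2:0. 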
Figure 2-7 Example of four sub-sampling schemes, a) 4:4:4, b) 4:2:2, c) 4:1:1 and 
d) 4:2:0, illustrating the sampled pixels used within each scheme for the Y, U and V 
components. 
2.2.2. Discrete Cosine Transform 
The discrete cosine transform (DCT)[20] is not a compression method in itself but it does 
transform data to a form which allows a range of further compression techniques to be 
applied. DCT is a mathematical method of transforming data within the spatial domain 
(pixel data) into the frequency domain, producing spatial frequency coefficients. These 
coefficients represent the degree to which a given spatial frequency is present within the 
pixel data. The transform is applied to a block of pixels which, in the case of MPEG 
compression schemes, is eight pixels by eight pixels. It produces a block, of the same 
size, filled with coefficients for given frequencies in both the vertical and horizontal 
direction. 
[Figure 2-8: the DCT maps an 8x8 pixel block, p_xy (horizontal distance x), to an 8x8 
coefficient block, g_ij (horizontal frequency i); the IDCT performs the reverse mapping.] 
Figure 2-8 Illustration of the mapping between the 8x8 spatial block and its corresponding 
8x8 frequency coefficient block. 
Figure 2-8 represents 8x8 blocks in both the spatial, p_xy, and frequency, g_ij, domains[21]. 
Equation 2-4 depicts the 2D forward DCT transformation for an 8x8 block. 
g_{ij} = \frac{1}{4} C_i C_j \sum_{x=0}^{7} \sum_{y=0}^{7} p_{xy} \cos\left(\frac{(2x+1)i\pi}{16}\right) \cos\left(\frac{(2y+1)j\pi}{16}\right) (2-4) 
where C_f = \begin{cases} 1/\sqrt{2}, & f = 0 \\ 1, & f > 0 \end{cases} 
Each coefficient value is formed by a summation of the pixel data weighted by sinusoids 
of the specific horizontal and vertical spatial frequencies. At i=j=0 the 
frequency under test in both directions is zero and forms the DC coefficient, since it 
is proportional to the mean of all the pixel values within the block. The remaining 63 
coefficients are referred to as AC coefficients, each relating to a given combination of 
horizontal and vertical spatial frequencies. The discrete cosine transform is a variation of 
the discrete Fourier transform (DFT)[22] with one major distinguishing difference 
that makes it better suited for use in image compression[23]. 
• DCT produces simpler results involving only real parts whereas DFT produces 
results with both imaginary as well as real parts. 
The examples that follow demonstrate how simple patterns of pixel data are transformed 
using the DCT. 
 8  0  0  0         8 15  0  0 
 0  0  0  0         0  0  0  0 
 0  0  0  0         0  0  0  0 
 0  0  0  0         0  0  0  0 
    (a)                (b) 
 8  0  0  0         8  0  0  0 
 0  0  0  0         0  0  0  0 
 0  0  0  0         0  0  0  0 
15  0  0  0         0  0  0 15 
    (c)                (d) 
Figure 2-9 Four examples of forward DCT transformation of a 4x4 pixel block. 
Figure 2-9 shows how a 4x4, 4-bit greyscale pixel block (left) can be represented as a 4x4 
DCT block (right). The upper left coefficient is located at g0,0 and corresponds to the DC 
component of the pixels. Figure 2-9a comprises a uniform distribution across the block 
with no variation in pixel value and thus there are no AC coefficients present in its DCT. 
Figure 2-9b represents a low frequency variation in the horizontal direction but a 
uniform distribution vertically. Since the vertical distribution is uniform, the coefficients 
are zero in rows 2 to 4. The variation in the horizontal direction could potentially give 
rise to non-zero coefficients in columns 2 to 4; however, as in this example there is only 
one low frequency component present in the horizontal plane, only the coefficient in 
column 2 is non-zero. A similar effect is seen in Figure 2-9c, where the direction of the 
pixel variation is now vertical and the frequency of the variation has increased, resulting 
also in an increase in the distance of the frequency coefficient away from DC. In Figure 
2-9d there is now equal variation in pixel data frequency in both the vertical and 
horizontal directions and hence the non-zero AC frequency coefficient is found along the 
diagonal in the transform domain. 
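The behaviour described for a uniform block can be checked numerically with a direct, 
unoptimised evaluation of Equation 2-4. The sketch below is written for clarity rather 
than speed; the function name, the choice of storing the block as p[x][y], and the 
uniform test value of 10 are assumptions made for illustration. 

```python
import math


def dct_8x8(p):
    """Direct evaluation of the 8x8 forward DCT of Equation 2-4 (p indexed as p[x][y])."""
    def c(f):
        return 1 / math.sqrt(2) if f == 0 else 1.0

    g = [[0.0] * 8 for _ in range(8)]
    for i in range(8):
        for j in range(8):
            total = 0.0
            for x in range(8):
                for y in range(8):
                    total += (p[x][y]
                              * math.cos((2 * x + 1) * i * math.pi / 16)
                              * math.cos((2 * y + 1) * j * math.pi / 16))
            g[i][j] = 0.25 * c(i) * c(j) * total
    return g


uniform = [[10] * 8 for _ in range(8)]       # no variation in pixel value
coefficients = dct_8x8(uniform)
print(round(coefficients[0][0], 6))           # DC coefficient
print(max(abs(coefficients[i][j])
          for i in range(8) for j in range(8) if (i, j) != (0, 0)))  # largest AC magnitude
```

For a uniform block the DC coefficient equals eight times the pixel value (the sum of the 
64 pixels divided by eight), and every AC coefficient is zero to within floating-point 
rounding error. 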
Figure 2-10 Illustration of the frequency intensity, both horizontally and vertically, for each 
frequency coefficient within an 8x8 DCT block. 
Figure 2-10 depicts the pattern against which each of the 64 coefficients within an 8x8 
block is matched. Although DCT does not on its own perform compression its 
transformation domain is appropriate for application of further compression techniques. 
2.2.3. Quantisation 
Quantisation is a method of representing an input data stream as a finite set of discrete 
points. Quantisation is performed within analogue to digital converters (ADCs) to 
transform continuous analogue data into discrete digital values. Quantisation can also be 
applied in the re-sampling of a discrete signal, at lower quantisation, to produce an 
approximation of the original signal requiring fewer bits[24]. Such re-sampling can 
provide compression of both images and videos. 
\text{quantised value} = \mathrm{round}\left(\frac{\text{input value}}{\text{quantisation coefficient}}\right) (2-5) 
Equation 2-5 illustrates how, by dividing the original input value by the quantisation 
coefficient and rounding, the new quantised value represents the approximate number of 
quantisation steps within the input value. The choice of quantisation coefficient 
therefore directly determines the degree of approximation: a small quantisation 
coefficient produces a better approximation than a large one. When 
presented with a block of frequency coefficients from the DCT, quantisation matrices can 
be used to achieve high compression ratios with little or no visible deterioration. It has 
been observed that the human eye is less sensitive to high frequencies than it is to low. 
Armed with this knowledge, it is possible to remove high-frequency components from an 
image and still retain a similar perceived quality. By selecting suitable quantisation 
parameters for the range of frequency components present, it is possible to approximate 
the high frequencies much more coarsely while preserving the finer detail at lower 
frequencies. 
a) Original data
-415  -33  -58   35   58  -51  -15  -12
   5  -34   47   18   27    1   -5    3
 -46   14   86  -35  -50   19    7  -18
 -53   21   34  -20    2   34   36   12
   9   -2    9   -5  -32  -15   45   37
  -8   15  -16    7   -8   11    4    7
  19  -28   -2  -26   -2    7  -44  -21
  18   25  -12  -44   35   48  -37   -3
b) Quantisation coefficients
  16   11   10   16   24   40   51   61
  12   12   14   19   26   58   60   55
  14   13   16   24   40   57   69   56
  14   17   22   29   51   87   80   62
  18   22   37   56   68  109  103   77
  24   35   55   64   81  104  113   92
  49   64   78   87  103  121  120  101
  72   92   95   98  112  100  103   99
c) Resultant block
 -26   -3   -6    2    2   -1    0    0
   0   -3   -4    1    1    0    0    0
  -3    1    5   -1   -1    0    0    0
  -4    1    2   -1    0    0    0    0
   1    0    0    0    0    0    0    0
   0    0    0    0    0    0    0    0
   0    0    0    0    0    0    0    0
   0    0    0    0    0    0    0    0
Figure 2-11 Example of quantisation of 8x8 block, a) original data, b) quantisation
coefficients and c) resultant block
Figure 2-11 illustrates how applying a specific quantisation matrix to the DCT
coefficients can produce a dramatic reduction in coefficients. By applying the
quantisation matrix in the above example it is seen that the high frequency components have
been reduced to zero, leaving only the low frequency coefficients in the top left corner of
the block. In addition to reducing the high frequency coefficients to zero, quantisation has
also reduced the magnitude of the lower frequency coefficients, thus reducing the number
of bits required to represent each coefficient[25].
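As a minimal sketch of equation 2-5 applied element by element, the following Python fragment quantises a small block and then rescales it. The function names and the 2x2 toy block are illustrative assumptions; the four coefficient/quantiser pairs are borrowed from positions (0,0), (0,3), (4,1) and (7,7) of Figure 2-11. Note that Python's round() rounds exact halves to the nearest even integer, which may differ from the rounding rule of a particular codec.

```python
def quantise(block, q_matrix):
    """Divide each coefficient by its quantisation coefficient and round (equation 2-5)."""
    return [[round(c / q) for c, q in zip(row, q_row)]
            for row, q_row in zip(block, q_matrix)]

def dequantise(levels, q_matrix):
    """Decoder side: multiply each level by its quantisation coefficient."""
    return [[lvl * q for lvl, q in zip(row, q_row)]
            for row, q_row in zip(levels, q_matrix)]

# Hypothetical 2x2 toy block (values taken from four positions of Figure 2-11).
coeffs = [[-415, 35],
          [-2, -3]]
q_matrix = [[16, 16],
            [22, 99]]

levels = quantise(coeffs, q_matrix)      # small integers, cheaper to encode
approx = dequantise(levels, q_matrix)    # approximation recovered by the decoder
print(levels, approx)
```

The two high-frequency entries collapse to zero, while the low-frequency entries survive with reduced magnitude and a small reconstruction error.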
2.2.4. Zig-zag scan order
The re-ordering of these 2D quantised blocks into a single stream is achieved by
sequentially scanning each of the values one by one.
Figure 2-12 Scan patterns: a) raster, b) zig-zag, c) alternative zig-zag.
The conventional method of scanning block-based data is raster scan, as shown in Figure
2-12a. This scan order scans each line from left to right, then from top to bottom, in the
same manner as text on a page. This scan order is less than ideal when examining the
quantised block of Figure 2-11c, as the low frequency coefficients are often non-zero and
the higher frequency coefficients are often zero. Scanning in frequency order would
improve the potential for compression techniques such as run-length encoding (section
2.1.1). Such a scanning order is achieved by the zig-zag pattern scan shown in Figure 2-12b,
which operates from the top left to the bottom right of the block.
-26 -3 0 -3 -3 -6 2 -4 1 -4 1 1 5 1 2 -1 1 -1 2 0 0 0 0 -1 -1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
a)
-26 -3 -6 2 2 -1 0 0 0 -3 -4 1 1 0 0 0 -3 1 5 -1 -1 0 0 0 -4 1 2 -1 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b)
Figure 2-13 An example to illustrate the packing properties obtained through scanning 
coefficients in a) zig-zag order and b) raster scan order. 
As can be seen in Figure 2-13, the zig-zag scan groups the non-zero coefficients
together better than the raster scan for the example in Figure 2-11. In addition to the
original zig-zag scan order an alternative scan (Figure 2-12c) has been developed. This
scan pattern has been devised with interlaced input sequences in mind.
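The zig-zag order of Figure 2-12b can be generated by walking the anti-diagonals of the block and alternating direction on each one. The sketch below is one way of doing this; the function names and the 3x3 toy block are illustrative assumptions rather than anything taken from a standard's reference code.

```python
def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zig-zag order."""
    order = []
    for s in range(2 * n - 1):                      # s = row + col picks one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:                              # even diagonals run bottom-left to top-right
            rows = reversed(rows)
        order.extend((r, s - r) for r in rows)
    return order

def scan(block, order):
    """Flatten a 2D block into a 1D list following the given scan order."""
    return [block[r][c] for r, c in order]

# Toy 3x3 quantised block: non-zero values sit in the low-frequency corner.
toy = [[9, 5, 0],
       [4, 0, 0],
       [1, 0, 0]]
raster = [v for row in toy for v in row]
zigzag = scan(toy, zigzag_order(3))
print(raster)
print(zigzag)
```

In the zig-zag output all the zeros form a single trailing run, which is the property that run-length encoding exploits.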
2.2.5. Fractals 
The compression techniques discussed so far are used for the block based compression
standards proposed by MPEG. Fractals can also be used for compression purposes but, as
yet, are not part of any standards.
A fractal is an image that when viewed at different levels of magnification will exhibit
self similarity[26]. Using this idea of self-similarities, methods like iterated function
systems (IFS)[27] have been developed to create fractals. Research into this area such as
partitioned iterated function systems (PIFS) is used to produce compression schemes
based on fractal creation.
2.2.5.1. Iterated function systems
The iterated function system (IFS) is a method of producing fractal images, in which
repetitive transformations are applied to an image. By applying these basic
transformations, the output image becomes perceptually clearer following each iteration.
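\[ w_i \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} a_i & b_i \\ c_i & d_i \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} e_i \\ f_i \end{bmatrix} \]
(2-6)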
Equation 2-6 shows a transform matrix for translation, specified by e_i and f_i, and scaling,
rotation and reflection, specified by a_i, b_i, c_i and d_i, of points x,y in the image for each
transform, w_i[28]. Through setting any of the above parameters the transformation of the
original image to its new location can be specified.
(2-7) 
Equation 2-7 shows three such transformation matrices used to generate the famous
fractal design, Sierpinski's gasket. For all three transformations, a_i and d_i are set to 0.5,
thus scaling the input by 0.5 in both the x and y dimensions. The transformations differ in
the translations of each of the three scaled inputs.
[Figure 2-14 panels: n=0, n=1, n=3]
Figure 2-14 An example of fractal image generation taking a start object, in this example a
rectangle; within three iterative steps a coarse representation of Sierpinski's gasket is
becoming apparent.
By applying the three transformations of equation 2-7 to the input image, n=0, it can be
seen in Figure 2-14 that after only a few iterations (n=3) the distinctive shape of
Sierpinski's gasket can be generated.
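The iteration can be sketched in a few lines of Python using the affine form of equation 2-6. Each transform is stored as a tuple (a, b, c, d, e, f). The scale factors of 0.5 follow the text; the three translations used here, (0, 0), (0.5, 0) and (0.25, 0.5), are an assumption chosen to give the usual gasket layout on the unit square and are not taken from equation 2-7.

```python
def apply_ifs(points, transforms):
    """One IFS iteration: the union of every transform applied to every point."""
    out = set()
    for a, b, c, d, e, f in transforms:
        for x, y in points:
            out.add((a * x + b * y + e, c * x + d * y + f))
    return out

# (a, b, c, d, e, f) per transform; translations are assumed values.
SIERPINSKI = [
    (0.5, 0.0, 0.0, 0.5, 0.0, 0.0),
    (0.5, 0.0, 0.0, 0.5, 0.5, 0.0),
    (0.5, 0.0, 0.0, 0.5, 0.25, 0.5),
]

# Start object: the four corners of a unit square (n = 0).
points = {(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)}
for n in range(3):                     # three iterations, as in Figure 2-14
    points = apply_ifs(points, SIERPINSKI)
print(len(points), "points after 3 iterations")
```

Plotting the resulting point set shows three half-size copies of the previous iteration arranged in a triangle, whatever start object is chosen.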
2.2.5.2. Partitioned iterated function systems 
A partitioned iterated function system (PIFS)[29] partitions the image into blocks, known
as range blocks, and specifies transformation matrices for each block with the purpose of
achieving compression. Clearly, in order to be viable as a compression method, an
inverse method to recover the original image from the fractal is required. As, during
compression, the desired output image to be produced by the de-compressor is known (it
is the input image), it is simply the transformation parameters needed to produce this
image from the fractal that are required[30].
Figure 2-15 Right-angled range partition schemes: a) fixed block, b) Quadtree, c) 
Horizontal-vertical. 
First, a method to partition a frame into blocks known as range blocks is required. There
are many alternative partitioning schemes available; Figure 2-15 demonstrates three
different schemes. The fixed block scheme is the simplest as it divides up the frame into
equal size non-overlapping blocks. Schemes such as quadtree and horizontal-vertical
have been developed to yield a greater number of blocks, allowing transformations to be
applied to smaller regions and capturing higher detail without the need to increase the
block resolution uniformly across the whole image[31]. As well as the range blocks, the
image is also partitioned into blocks of a greater size than range blocks, known as domain
blocks. Compression is achieved by producing a transformation matrix, equation 2-8, for
each range block which represents the transformation required to transform a domain
block to the range block. To find the domain that provides this 'best match' to a given
range, a method of measuring similarity is needed. The following examples and
equations assume a fixed block size partition for simplicity. In addition to geometric
transformations, additional parameters are required within the transform matrix for
grey-scale and colour images.
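\[ w_i \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} a_i & b_i & 0 \\ c_i & d_i & 0 \\ 0 & 0 & s_i \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} + \begin{bmatrix} e_i \\ f_i \\ o_i \end{bmatrix} \]
(2-8)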
To produce a transformation for a grey-scale image two additional massic, or pixel value,
transformations are applied to the pixel's grey-scale value z. The two extra parameters
affect the contrast (s_i) and the brightness (o_i) of the image[32].
d_ij = domain pixels
r_ij = range pixels
(2-9)
Equation 2-9 gives a quantitative measure of the similarity of any given domain/range
block combination and its massic transform, assuming that the range block under test has
been scaled up to the same size as the domain block for comparison. When selecting
combinations of range/domain blocks, the transform that yields the smallest value of R is
deemed to be the most similar.
\[ o = \bar{r} - s\,\bar{d} \]
\(\bar{d}\) = average domain pixel value
\(\bar{r}\) = average range pixel value
(2-10)
By combining equations 2-9 and 2-10 it is possible, after searching a set number of
domains for a given range block, to find the domain with the minimal R and generate the
corresponding transformation matrix.
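The domain search can be sketched as follows. The brightness o follows equation 2-10; the contrast s and the error R are computed here with the commonly used least-squares fit, R being the sum of (s*d + o - r)^2 over all pixel pairs, which is an assumption about the form of the measure rather than a transcription of equation 2-9. Blocks are assumed to be flattened lists of equal length, and all names are illustrative.

```python
def massic_fit(domain, rng):
    """Least-squares contrast s, brightness o and squared error R for one domain/range pair."""
    n = len(domain)
    d_mean = sum(domain) / n
    r_mean = sum(rng) / n
    var = sum((d - d_mean) ** 2 for d in domain)
    cov = sum((d - d_mean) * (r - r_mean) for d, r in zip(domain, rng))
    s = cov / var if var else 0.0          # flat domain block: contrast is undefined, use 0
    o = r_mean - s * d_mean                # brightness offset, as in equation 2-10
    err = sum((s * d + o - r) ** 2 for d, r in zip(domain, rng))
    return s, o, err

def best_domain(rng, domains):
    """Search every candidate domain and keep the one with the smallest error."""
    fits = [(massic_fit(d, rng), i) for i, d in enumerate(domains)]
    (s, o, err), idx = min(fits, key=lambda t: t[0][2])
    return idx, s, o, err

# Toy data: the second domain maps exactly onto the range with s = 5, o = 10.
range_block = [10, 20, 30, 40]
domains = [[1, 1, 2, 2], [0, 2, 4, 6]]
idx, s, o, err = best_domain(range_block, domains)
print(idx, s, o, err)
```

Only idx, s and o (plus any geometric parameters) would need to be stored for the range block; the pixel data itself is discarded.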
[Figure 2-16 panels: Original; n=2, RMSE 22.34; n=3, RMSE 4.8; n=4, RMSE 1.73]
Figure 2-16 An example of fractal decompression showing each iteration of the decoding
process, with its corresponding RMSE value.
Once all transform matrices have been produced and stored for the ranges in the image, it
is then possible to reproduce the original image from these transformations alone. Figure
2-16 illustrates the decompression process from an initial image, n=0, through each
iteration, n=1, 2, 3 and 4. In this example, the only geometric transformation used was
translation and hence only the e_i, f_i, s_i and o_i parameters of the matrix needed to be stored.
This reduces the accuracy of the matching process, since the inclusion of skew, rotation
and reflection provides a greater possibility of producing a solution with a lower R value
than that obtainable from translation alone. It is shown that, for the above example, after
only 4 iterations the root mean squared error (RMSE) is reduced to 1.73 with little visual
distortion.
2.2.6. Wavelet
Wavelets are mathematical functions that have the ability to represent a signal in both
space and time. The wavelet transform, in common with the DCT (section 2.2.2), does
not provide compression itself but instead transforms data into a state that lends itself
more readily to a range of compression techniques. Unlike a Fourier transform, which
analyses the frequency components within the signal through the use of sine and cosine
fundamental signals, wavelet theory studies the input signal in terms of the dilation (scale, s)
and the location (shift, τ) of a (mother) wavelet.
\[ T(\tau, s) = \frac{1}{\sqrt{s}} \int_{-\infty}^{+\infty} f(t)\, \psi^{*}\!\left( \frac{t - \tau}{s} \right) dt \]
(2-11)
Equation 2-11 shows the continuous wavelet transformation (CWT) equation, in which
the CWT, T(τ,s), of the input function f(t) is determined for a specific dilation, s, and
location, τ, of the mother wavelet, ψ. The magnitude of T(τ,s) is known as the Scalogram
[33] and can be considered as equivalent to the signal density in the Fourier transform.
[Figure 2-17 panels: a) 1x mean, b) 1x low frequency, c) 2x medium frequency, d) 4x high frequency]
Figure 2-17 Illustration of the eight-element one-dimensional Haar wavelet transform.
Figure 2-17 shows that, by altering s and τ, the Haar mother wavelet (Figure 2-17b)
can be dilated and shifted over a given time range[34]. Through doing so different aspects
of the input signal can be analysed. At small values of s, and consequently high
frequencies (Figure 2-17d), detailed spectral analysis can be performed at specific
locations in the waveform, whereas, for larger values of s, lower frequency analysis
over a larger time range is achieved.
Wavelet transforms can be defined as either continuous or discrete. Whereas CWT
coefficients are produced by analysing skewed and shifted variations of the
mother wavelet, the discrete wavelet transform (DWT) is often implemented using digital
filtering techniques.
Figure 2-18 Three-level one-dimensional DWT decomposition tree illustrating the repetitive
down-sampling of resolution. The blocks H and L indicate high and low pass filters
respectively and ↓2 indicates downsampling by a factor of 2.
Figure 2-18 illustrates how an input signal, x(n), can be decomposed into a series of
signals of decreasing resolution. At each stage in the process the high pass component,
d(n), is output and the low pass component is passed to the next stage for further
decomposition. At the leaf of the tree the final low pass filter generates the coarse
approximation information, a(n). To reproduce the original signal the tree is traversed in
reverse.
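The decomposition tree of Figure 2-18 can be sketched with the Haar filter pair. The choice of the un-normalised Haar filters (pairwise mean as the low pass output, pairwise half-difference as the high pass output) is an assumption made for simplicity; practical codecs generally use longer filters. The input length is assumed to be divisible by 2 for every level applied.

```python
def haar_step(x):
    """One analysis stage: low pass and high pass outputs, each downsampled by 2."""
    low = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
    return low, high

def haar_dwt(x, levels):
    """Repeatedly split the low pass branch; returns (a(n), [d(n) per level])."""
    approx, details = list(x), []
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)
    return approx, details

def haar_idwt(approx, details):
    """Traverse the tree in reverse to rebuild the original signal."""
    x = list(approx)
    for d in reversed(details):
        x = [v for a, h in zip(x, d) for v in (a + h, a - h)]
    return x

signal = [4, 6, 10, 12, 8, 6, 5, 5]
approx, details = haar_dwt(signal, levels=3)
restored = haar_idwt(approx, details)
print(approx, details, restored)
```

The total number of output samples (1 + 1 + 2 + 4) equals the number of input samples, so the transform itself provides no compression.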
[Figure 2-19 labels: input X(n); outputs HH, HL, LH, LLHH, LLHL]
Figure 2-19 Two-level two-dimensional DWT decomposition tree illustrating the repetitive
down-sampling of resolution for both dimensions.
To implement the decomposition of a two-dimensional image, additional filters are
required, as shown in Figure 2-19. By first filtering in the horizontal direction and then in
the vertical, image detail can be extracted from the original input image in both dimensions.
Figure 2-20 A two-dimensional DWT decomposition illustrating the repetitive
down-sampling of image resolution.
Figure 2-20 graphically shows the outputs from the filters in Figure 2-19. Since the DWT
is a transformation of data from one form to another, the size and resolution of the image
produced is identical to that of the input.
a) Original image b) DWT multi-resolution representation
Figure 2-21 Example of a two-level DWT transformation of a Seagull image.
As can be seen in Figure 2-21, the DWT produces a series of multi-resolution images. At
top left, a low-resolution approximation of the original image is found. Projecting from
this in horizontal, vertical and diagonal directions are difference images. By combining
these difference images with the low resolution approximation, a new higher resolution
approximation is produced. Repeating this process (for a two-level decomposition)
reconstructs the original image.
2.2.7. Vector Quantisation
Vector quantisation (VQ), like scalar quantisation, is a method of approximating data so
that it can be represented by fewer bits.
[Figure 2-22: upper scale of 2-bit codes 00, 01, 10, 11 above a lower continuous scale from -4 to 4]
Figure 2-22 Example of a one-dimensional 2-bit vector quantisation representing any real
number as one of four quantisation values.
Figure 2-22 shows an example of a one-dimensional VQ in which the continuous lower
scale is mapped onto the discrete upper four-point scale. In this example, the
range from -∞ to +∞ is mapped to four points (two bits). Using this VQ scheme, values
around zero receive a finer-grained quantisation than do other values. Clearly, the
location of the discrete values on the scale and the number of quantisation bits will affect
the accuracy with which the original continuous data can be reconstructed.
Figure 2-23 A two-dimensional 4-bit vector quantisation of regions in a 6x6 block using 16
quantisation values.
The two-dimensional 4-bit VQ in Figure 2-23 shows an example of a 6x6 square block
divided up into 16 regions, each with a unique codevector. The complete set of
codevectors is called the codebook, and in this case each codevector can be referenced
using 4 bits[35]. In VQ image compression this notion of representing a group of data with
a single codevector is implemented.
Figure 2-24 Vector quantisation image compression is achieved by matching blocks from the
original image with a block in the codebook. Decompression is achieved by copying
codebook values to the appropriate corresponding location in the reconstructed image[36].
Figure 2-24 illustrates how the original input image is divided into 4x4 pixel blocks. Each
is then matched to a block in a codebook of the recorded codevectors and a reference to
the selected codevector is stored for that given grid reference within the image. To decode
a vector quantised image, a method similar to painting by numbers is applied, in which
each stored reference is used to look up its codevector. The decoding process is
computationally straightforward and suitable for low power devices. To provide a valid
compression system, the codebooks of the encoder and decoder need to be identical; this
can be implemented by either a static or a dynamic codebook. In a static codebook system,
both the compression and decompression engines contain the same pre-designed codebook.
This method produces low bitrates since only the codevector references need to be
transmitted, but its performance deteriorates when good matches between image blocks and
codebook blocks are not available. In dynamic codebook systems, the codebook can be
changed from image to image or from scene to scene. The advantage over the static
codebook is its adaptation to new images, generally producing a codebook that better
matches the input image compared with the static approach. The disadvantage is that the
decoder requires a method of receiving this dynamic codebook, thereby increasing the
bandwidth requirement and reducing compression performance.
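A static-codebook VQ encoder and decoder can be sketched as below. The blocks are flattened lists of pixel values, the match criterion is the squared Euclidean distance (an assumed, though common, choice), and the four-entry codebook of 2x2 blocks is a made-up example; none of these values come from the figures.

```python
def nearest(block, codebook):
    """Index of the codevector with the smallest squared distance to the block."""
    return min(range(len(codebook)),
               key=lambda i: sum((b - c) ** 2 for b, c in zip(block, codebook[i])))

def vq_encode(blocks, codebook):
    """Encoder: replace each block by the index of its best-matching codevector."""
    return [nearest(b, codebook) for b in blocks]

def vq_decode(indices, codebook):
    """Decoder: 'painting by numbers', a plain table lookup per block."""
    return [list(codebook[i]) for i in indices]

# Hypothetical 2-bit codebook of flattened 2x2 blocks, shared by encoder and decoder.
CODEBOOK = [
    [0, 0, 0, 0],
    [255, 255, 255, 255],
    [0, 255, 0, 255],
    [128, 128, 128, 128],
]
blocks = [[10, 5, 0, 3], [250, 251, 255, 240], [120, 130, 125, 140]]
indices = vq_encode(blocks, CODEBOOK)      # two bits per block instead of four pixels
decoded = vq_decode(indices, CODEBOOK)
print(indices, decoded)
```

The asymmetry discussed above is visible here: encoding searches the whole codebook for every block, whereas decoding is a single lookup.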
2.2.8. Differential encoding
The compression techniques described above are designed to compress a single image. In
video coding, this type of compression is known as intra frame compression since the
methods only require data from individual frames. Two popular codecs that use intra
frame compression only are motion JPEG (MJPEG)[37] and Digital Video (DV). Since
only intra frame techniques are used, each frame can be decoded independently, allowing
for rapid random access of a frame for display or editing purposes[38]. The disadvantage
of using intra coding alone is the relatively high bitrate of the stream produced when
compared with compression schemes that compress between frames (DV uses a fixed
bitrate of 25Mbits/s). To reduce the bitrate, inter frame compression has been developed.
Since each frame in a sequence represents a snapshot of time 40 ms after the previous
frame (assuming 25 frames/s), the content of each frame is likely to correlate highly with
neighbouring frames. One of the simplest forms of inter frame compression is differential
encoding, in which only the differences between consecutive frames are encoded.
c) Residue frame
Figure 2-25 Illustration of the image difference between two consecutive frames in the 'Susy'
video sequence. Differential encoding compression is achieved by encoding this residue
frame as opposed to frame 1.
Figure 2-25 illustrates the dramatic reduction in image data required to represent the
residue image (Figure 2-25c) compared to the complete intra coded second frame (Figure
2-25b). The disadvantage of inter-frame compression is that a transmission error that
would normally only affect a single frame can now be propagated into a whole sequence
of following frames.
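Differential encoding reduces to a per-pixel subtraction at the encoder and an addition at the decoder, as the following sketch shows. The tiny 2x3 frames and the function names are illustrative assumptions.

```python
def residue(prev, cur):
    """Encoder: per-pixel difference between the current and the previous frame."""
    return [[c - p for p, c in zip(prow, crow)] for prow, crow in zip(prev, cur)]

def reconstruct(prev, res):
    """Decoder: add the residue back onto the previously decoded frame."""
    return [[p + r for p, r in zip(prow, rrow)] for prow, rrow in zip(prev, res)]

frame0 = [[10, 10, 10],
          [10, 50, 10]]
frame1 = [[10, 10, 10],
          [10, 10, 50]]      # the bright pixel has moved one position to the right

res = residue(frame0, frame1)
decoded = reconstruct(frame0, res)
print(res)
```

Most residue values are zero and therefore cheap to encode. Because reconstruct() depends on the previously decoded frame, any error in that frame carries forward into every later frame, which is the propagation problem noted above.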
2.2.9. Motion estimation / motion compensation
Motion estimation (ME) determines regions of similarity in the current frame and a
previously-encoded frame and produces a motion vector (MV) that describes the translation
required to transform the encoded frame to the current frame[39]. To achieve this each
frame is divided into a grid of macroblocks (MB), this term being coined since their areas
are generally larger than those of the blocks used in DCT-based schemes. Block matching
techniques are used to find the best matching MBs between the reference and the current
frame and hence the MV linking them together; these techniques only produce an
approximate match. Applying ME as the only method of image compression does not,
generally, result in sufficiently high quality results for practical use. To improve
performance, the difference between the reference and target MB is calculated and stored
using a method known as motion compensation (MC)[40]. MC uses the reference input frame
and the MVs produced by ME to produce a reconstructed frame containing the translated
MBs. By subtracting this from the original uncompressed frame a residue frame, containing
the difference information, is obtained and then transmitted as part of the compressed
video stream.
a) Motion found between frames b) Residue with ME
Figure 2-26 Illustration of MVs observed between frames 0 and 1 of the 'Susy' sequence.
When using these MVs it is seen how the information stored within the residue frame is
dramatically reduced.
Figure 2-26 illustrates the processes for both ME and MC. The original frames in Figure
2-25 exhibit substantial temporal similarity; the similarities are not always located in the
same blocks in consecutive frames, but ME is able to track these similar areas. For
illustration purposes Figure 2-26a includes the MVs produced by ME for each of the MBs
in the frame. The true power of ME as a compression technique can clearly be seen in the
residue frame produced by MC, Figure 2-26b. Compared with Figure 2-25c, the reduction
in the data content of the residue frame is dramatic. The efficiency of ME depends on the
MB matching algorithm, whose performance is normally measured in terms of the
absolute difference in pixel values between the MB and the reference MB. To determine a
measure for the frame as a whole the sum of absolute differences (SAD) between all pairs
of MBs within a search area is found. Matching MBs between the reference and the
current frame is a very computationally expensive process, requiring a large number of
numerical operations performed at video frame rates.
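A minimal sketch of SAD-based block matching is given below, using an exhaustive (full) search over a square window. The full-search strategy, the block size, the search range and the toy frames are all assumptions made for illustration; practical encoders typically use faster search patterns.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """Sum of absolute differences between an n x n block in cur and one in ref."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))

def full_search(cur, ref, cx, cy, n, search):
    """Test every displacement within +/-search pixels; return (SAD, dx, dy) of the best."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            if 0 <= rx <= w - n and 0 <= ry <= h - n:   # candidate must lie inside the frame
                cost = sad(cur, ref, cx, cy, rx, ry, n)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best

def frame_with_patch(w, h, x, y, patch):
    """Build a black frame containing a small patch at column x, row y."""
    frame = [[0] * w for _ in range(h)]
    for j, row in enumerate(patch):
        for i, v in enumerate(row):
            frame[y + j][x + i] = v
    return frame

patch = [[9, 8], [7, 6]]
ref = frame_with_patch(6, 6, 1, 1, patch)   # object position in the reference frame
cur = frame_with_patch(6, 6, 3, 2, patch)   # object has moved by (+2, +1)
best = full_search(cur, ref, cx=3, cy=2, n=2, search=2)
print(best)
```

Each block costs (2*search + 1)^2 candidate positions of n*n pixel comparisons, which illustrates why ME dominates the computational load of an encoder.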
2.3. Compression standards and schemes 
By prescribing the use of one or more of the compression methods described, a complete
codec (COmpressor/DECompressor) can be produced that takes raw uncompressed pixel
data and produces a compressed bitstream which can be used to recover the uncompressed
image data. A number of standards for such codecs have been developed for different
markets, each specifying different performance metrics. Such applications include video
conferencing systems operating at low bandwidth, in which the need to reduce latency
overrides the need to maintain quality. At the other extreme, broadcast systems require
both good quality and high compression rates. In the following section, some of the more
popular compression standards are described, ranging from early standards designed for
videoconferencing to advanced schemes providing high compression ratios for a large
range of bitrates.
2.3.1. Standards bodies 
There are a number of international bodies responsible for producing standards that cover
the field of digital video compression and transmission.
2.3.1.1. International Telecommunications Union (ITU)
The International Telecommunications Union (ITU)[41] is a United Nations (UN)
organisation for developing telecommunication standards, currently divided into 25
subsections each denoted by a letter, as can be seen in Table 2-2. The recommendations
that are of most interest to this research are found within the 'H' range of standards,
where four video compression schemes are found, some of which are aligned with those
specified by other standards organisations, Table 2-3.
Table 2-2 Summary of the 25 divisions within the ITU's standards.
Subsection  Responsibility
A           Organization of the work of ITU-T
B           Means of expression: definitions, symbols, classification
C           General telecommunication statistics
D           General tariff principles
E           Overall network operation, telephone service, service operation and human factors
F           Non-telephone telecommunication services
G           Transmission systems and media, digital systems and networks
H           Audiovisual and multimedia systems
I           Integrated services digital network (ISDN)
J           Transmission of television, sound programme and other multimedia signals
K           Protection against interference
L           Construction, installation and protection of cables and other elements of outside plant
M           TMN and network maintenance: international transmission systems, telephone circuits, telegraphy, facsimile and leased circuits
N           Maintenance: international sound programme and television transmission circuits
O           Specifications of measuring equipment
P           Telephone transmission quality, telephone installations, local line networks
Q           Switching and signalling
R           Telegraph transmission
S           Telegraph services terminal equipment
T           Terminals for telematic services
U           Telegraph switching
V           Data communication over the telephone network
X           Data networks and open system communication
Y           Global information infrastructure and Internet protocol aspects
Z           Languages and general software aspects for telecommunication systems
Table 2-3 The four video compression standards in subsection H of the ITU standard.
Standard  Year of ratification  Links to other standards bodies
H.261     1991                  -
H.262     1994                  MPEG-2 Part 2, ISO/IEC 13818
H.263     1995                  -
H.264     2003                  MPEG-4 Part 10, ISO/IEC 14496
2.3.1.2. Motion Picture Experts Group (MPEG)
The Motion Picture Experts Group (MPEG)[42], or ISO/IEC JTC1/SC29 WG11, is an ISO/IEC
working group responsible for developing standards for video and audio encoding, many
of which can be found in modern consumer and professional products, Table 2-4.
Table 2-4 Three standards described by MPEG.
Standard name  Standard number  Video compression sub-section
MPEG-1         ISO/IEC 11172    Part 2: Video
MPEG-2         ISO/IEC 13818    Part 2: Video
MPEG-4         ISO/IEC 14496    Part 2: Visual
                                Part 10: Advanced Video Coding
2.3.1.3. International organisation for standardisation (ISO)
The International organisation for standardisation (ISO)[43] is the parent organisation of
MPEG but also produces standards that span a very wide range of categories beyond
video. The three standards bodies discussed often ratify the same standards in their own
form, giving their own seal of approval to a given standard. For example, this can be seen
in the MPEG-2 video compression standard, which is equivalent to both ISO/IEC 13818 and
ITU-T H.262.
2.3.2. H.261 
H.261 was one of the first digital video compression schemes developed and forms the
basis of many modern standards. H.261 was ratified by the ITU in 1990 and was designed
specifically for video conferencing applications running over Integrated Services Digital
Network (ISDN)[44] lines. A single ISDN B channel has a bandwidth of 64kbits/s and
multiple lines can be bundled together to increase the system bandwidth in steps of
64kbits/s[45]. With this in mind the H.261 standard is set to operate at data rates in the
range 40kbits/s to 2Mbits/s. The input picture format is strictly defined, with the frame
sequences having to be progressive, as opposed to interlaced, and to be one of two
resolutions, namely either common intermediate format (CIF) at 352x288 or quarter CIF
(QCIF) at 176x144.
[Figure 2-27 legend: forward prediction P frame]
Figure 2-27 Prediction pattern for a frame sequence containing only I and P frames.
H.261 employs both intra and inter frame compression techniques and defines two
different frame types to accomplish this. The first type is the intra frame, or I frame,
which, as the name suggests, is compressed using intra frame techniques. The second
type of frame is the predicted frame, or P frame, that is compressed using inter frame
techniques that complement the intra compression techniques. Each frame type has
specific properties and functions within the frame sequence structure. The I frames, being
intra-compressed, do not require information from other (previous) frames to be
decompressed, permitting quick frame seeking and error recovery. The DCT transforms 8x8
pixel block data in I frames and these are processed by quantisation, zig-zag scan and
RLE. P frames are compressed using both inter and intra frame techniques and
consequently require information from other frames within the frame sequence.
Predictions made using ME are obtained only from previously decoded frames, as shown
in Figure 2-27. Once the ME has produced the MVs, MC is used to create a residue image
containing the error between the original frame and the one produced by ME alone. The
residue is then transformed, quantised, scanned and RLE is carried out as for the intra
frame compression. The structure of the compressed frame found in an H.261 bitstream is
divided up into four layers as illustrated in Figure 2-28.
• Picture
  o Group of blocks
    • Macroblocks
      • Blocks
Figure 2-28 Illustration of the hierarchical structure present within the H.261 bitstream.
The picture layer represents one complete frame within the stream, whether it be an I
frame or a P frame. The group of blocks (GOB) contains a fixed number of blocks from a
frame, with each GOB containing 132 blocks, or 1/12 of a CIF frame and 1/3 of a QCIF
frame. The next layer is the 16x16 MB of pixels used for ME and the lowest layer is the
8x8 block of luma pixels used in the calculation of the DCT.
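The GOB fractions quoted above can be checked with a short calculation. The sketch assumes the H.261 arrangement of 33 macroblocks (11 x 3) per GOB, each macroblock holding four 8x8 luma blocks, which gives the 132 blocks stated in the text; the function and constant names are illustrative.

```python
MB_SIZE = 16              # macroblock edge length in luma pixels
MBS_PER_GOB = 33          # assumed H.261 layout: 11 x 3 macroblocks per GOB
LUMA_BLOCKS_PER_MB = 4    # a 16x16 macroblock contains four 8x8 luma blocks

def gob_layout(width, height):
    """Number of macroblocks and GOBs in a frame of the given luma resolution."""
    mbs = (width // MB_SIZE) * (height // MB_SIZE)
    return {
        "macroblocks": mbs,
        "gobs": mbs // MBS_PER_GOB,
        "luma_blocks_per_gob": MBS_PER_GOB * LUMA_BLOCKS_PER_MB,
    }

cif = gob_layout(352, 288)
qcif = gob_layout(176, 144)
print(cif, qcif)
```

A CIF frame therefore holds 12 GOBs and a QCIF frame 3, so one GOB is 1/12 and 1/3 of the frame respectively.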
2.3.3. MPEG-1
MPEG-1 is designed for coding video and its associated audio at up to 1.5Mbits/s[46].
Table 2-5 Five sub-sections within MPEG-1 standard.
MPEG-1 sub-section  Sub-section responsibilities
1                   Systems
2                   Video
3                   Audio
4                   Conformance testing
5                   Software simulation
As can be seen from Table 2-5, the standard is not devoted solely to video compression.
MPEG-1 combines separate audio and video compression into a system structure that
accommodates both types of streams. The maximum bitrate of MPEG-1 has been
designed specifically with consumer storage media in mind. The compact disk (CD),
which at that time was the digital storage platform of choice for videos, had a read speed
of just over 1.5Mbits/s. The video CD (VCD) specification combines an MPEG-1 video
bitstream at a constant bitrate of 1150kbits/s with an MPEG-1 audio layer 2 (MP2) stream
at 224kbits/s to produce a digital disk based video/audio system with image quality
comparable to the Video Home System (VHS) video format. The audio sub-section of
MPEG-1 is subdivided into three layers. MP2 is the second layer and is part of the
European digital video disk (DVD) specification, alongside Pulse Code Modulation
(PCM) and Dolby's adaptive transform coder 3 (AC-3). The third layer in the MPEG-1
audio standard (MP3) experienced great popularity on the internet and in peer to peer
(P2P) networks due to its small file size and reasonable quality stereo audio.
[Figure 2-29 legend: forward prediction B frame, backward prediction B frame, forward prediction P frame]
Figure 2-29 Prediction pattern for a frame sequence containing I, P and B frames.
MPEG-1 video uses the same block-based video compression techniques as H.261, but
with some of the constraints removed and the addition of extra features. The compression
scheme adopted by MPEG-1 is based on the same principles as H.261, with intra-frame
coding being performed by DCT, quantisation, zig-zag re-ordering and RL-encoding, and
inter-frame prediction being used in addition through ME/MC. MPEG-1 adds the
bi-directional or B-frame, which, like the P-frame, is employed to achieve inter frame
compression. Whereas P frames can only produce MVs from previously decoded frames,
B frames can produce MVs from both past and future I or P frames, as shown in Figure
2-29. This ability to predict from future frames makes it important to consider display
order and coding order, see Figure 2-30.
Display order:  I B B B P B B B P
Frame number:   1 2 3 4 5 6 7 8 9
Coding order:   I P B B B P B B B
Frame number:   1 5 2 3 4 9 6 7 8
Figure 2-30 Illustration of the differences between display and coding order of an MPEG-1
stream.
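The re-ordering shown in Figure 2-30 can be sketched as follows: each I or P frame is emitted before the B frames that precede it in display order, since those B frames depend on it. The function name is illustrative, and appending any trailing B frames that have no later reference frame is a simplifying assumption.

```python
def coding_order(display):
    """Map frame types in display order to 1-based frame numbers in coding order."""
    order, pending_b = [], []
    for number, frame_type in enumerate(display, start=1):
        if frame_type == "B":
            pending_b.append(number)       # wait until the next reference frame is coded
        else:
            order.append(number)           # I or P frame is coded first
            order.extend(pending_b)        # then the B frames that predict from it
            pending_b = []
    order.extend(pending_b)                # trailing B frames (simplifying assumption)
    return order

display = "IBBBPBBBP"
order = coding_order(display)
print(order)
print("".join(display[n - 1] for n in order))
```

For the nine-frame sequence of Figure 2-30 this yields frame numbers 1 5 2 3 4 9 6 7 8, which also shows that the decoder has to buffer reference frames until the intervening B frames have been displayed.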
For B frames to predict from future I or P frames, those frames need to have been coded
previously. By altering the coding order so that the I or P frames needed by B frames are
encoded first, it is possible for the B frames to predict from the future pre-encoded frames.
In addition to the extra frame type, MPEG-1 improves the accuracy of the MVs produced by
ME, thereby decreasing the residue information left to encode. This is achieved by
producing MVs with real-valued components that specify sub-pixel precision. When
sub-pixel MVs are received by MC, an averaging between the pixel values is carried out
when producing the residue image. In addition, the entropy coding of quantised DCT
coefficients has been modified in order to produce higher compression ratios. By
employing Huffman techniques to encode the non-zero coefficients and RL encoding for the
zero coefficients, each set of coefficients can be compressed with greater efficiency. The
structure of the compressed bitstream is once again hierarchical and shown in Figure 2-31.
• Group of pictures
  o Picture
    • Slice
      • Macroblock
        o Block
Figure 2-31 Illustration of the hierarchical structure present in MPEG-1 video streams.
As in H.261, the lower two layers describe how the pixels are arranged in blocks and
MBs for DCT and ME respectively. The GOB within H.261 is renamed slice in MPEG
compression, and the requirement to include a specific number of blocks in each slice is
removed. The group of pictures (GOP) layer expresses how frames are grouped. A GOP
contains a set number of frames that can be encoded and decoded independently, with the
first frame being an I frame. By grouping frames in this way, it is possible to specify
individual encoding and decoding segments in the whole stream.
2.3.4. MPEG-2 / H.262
MPEG-2 / H.262 was jointly developed and published by both the MPEG and ITU
standards bodies. In this thesis the standard will be referred to as MPEG-2[47]. MPEG-2
has become the de facto standard for quality video storage and transmission, being used in
both the DVD and digital video broadcast (DVB) standards. As with MPEG-1, the
standard is not devoted to video compression alone but is divided up into parts as
described in Table 2-6.
Table 2-6 The 11 sub-sections in the MPEG-2 standard.
MPEG-2 sub-section   Sub-section's responsibility
1                    Systems
2                    Video
3                    Audio
4                    Conformance testing
5                    Software simulation
6                    System extensions - DSM-CC
7                    Advanced Audio Coding
8                    VOID - (withdrawn)
9                    System extension RTI
10                   Conformance extension - DSM-CC
11                   IPMP on MPEG-2 Systems
The MPEG-2 video sub-section was developed as a general purpose compression scheme
for bitrates in the range 1.5 to 15 Mbits/s. One of the main reasons for the increase in the
allowed bitrate in comparison with MPEG-1 is the relaxation of the maximum input
resolution, increasing from CIF in MPEG-1 to high definition (HD, 1920x1080 pixels) in
MPEG-2. To accommodate the increased resolution and bitrate available, the VLC tables
have been both extended and enhanced. As in earlier video compression standards,
MPEG-2 incorporates both ME/MC and DCT to perform compression in intra and inter
frames. MPEG-2 is a superset of MPEG-1 and MPEG-2 decoders can decode MPEG-1
streams. The major advance in MPEG-2 is the support for interlaced video in addition to
the support for progressive video as found in earlier encoders. In progressive video each
whole frame follows another, whereas in interlaced streams the frame is split up into
fields, each containing an alternate set of scan lines; one field including all the odd lines
and the other including all the even lines. This method of displaying moving images
originates from cathode ray tube (CRT) displays. Due to the limited number of lines a
CRT can scan for a given vertical frequency, it was found that scanning alternate lines
would give twice the vertical resolution while preserving the appearance of continuous
motion without flicker. The introduction of interlaced video processing required that the
algorithms for the base compression methods (DCT and ME/MC) be adapted to take into
consideration two fields as inputs instead of one frame. MPEG-2 introduces two methods
of scalability, with the enhancements being seen in both the spatial resolution and the
signal to noise ratio (SNR) of the output. SNR scalability provides the ability to store and
transmit two different sets of DCT coefficients. In the base, or default, level normal
compression is performed with the DCT coefficients and MVs stored in the bitstream,
while the enhanced layer incorporates a refined set of coefficients produced by
determining the quantisation error produced using the base quantisation coefficients, this
error then being zig-zag re-ordered and RL encoded. If a decoder receives only the base
layer, all the information that is required to produce the image is still available. If the
decoder receives the base and enhanced layer coefficients then the additional information
found in the latter is added to the information within the base layer to produce a result
with an improved SNR compared to the base alone. MPEG-2 also introduces spatial
scalability, which is the ability to transmit two different resolutions within the same
bitstream. In the base layer, as with SNR scalability, the normal process of encoding is
carried out with a downscaled resolution image so that a non-scalable decoder can decode
this low-resolution stream. At the MC stage of the base layer encoder, the predicted frame
is upscaled and passed to the enhanced layer encoder in which the original
high-resolution input stream is encoded, giving the option of utilising the upscaled frame in ME.
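The SNR scalability described above can be illustrated with a small numerical sketch. The coefficient values and the two quantiser step sizes below are assumptions chosen for illustration, and a plain uniform quantiser stands in for the quantisation tables of the standard.

```python
def quantise(values, step):
    """Uniform quantiser: map each coefficient to the index of the nearest multiple of step."""
    return [round(v / step) for v in values]


def dequantise(levels, step):
    """Reconstruct coefficient values from quantiser indices."""
    return [q * step for q in levels]


coefficients = [103.0, -47.0, 22.0, -9.0]   # illustrative DCT coefficients
BASE_STEP, ENH_STEP = 16, 4                 # assumed quantiser step sizes

# Base layer: coarse quantisation, decodable on its own.
base_levels = quantise(coefficients, BASE_STEP)
base_recon = dequantise(base_levels, BASE_STEP)

# Enhancement layer: the base-layer quantisation error, quantised more finely.
error = [c - r for c, r in zip(coefficients, base_recon)]
enh_levels = quantise(error, ENH_STEP)
enh_recon = [b + e for b, e in zip(base_recon, dequantise(enh_levels, ENH_STEP))]

base_err = max(abs(c - r) for c, r in zip(coefficients, base_recon))
enh_err = max(abs(c - r) for c, r in zip(coefficients, enh_recon))
print("worst-case error, base only:", base_err)
print("worst-case error, base + enhancement:", enh_err)
```

A decoder that receives only base_levels still reconstructs every coefficient, while one that also receives enh_levels adds the refinement and obtains a smaller reconstruction error.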
Due to the large number of optional features specified by the MPEG-2 standard, decoders
are not required to be able to decode every possible valid MPEG-2 bitstream, but rather
need to be able to decode one or more of the profiles from a range specified in the
standard. Each of the profiles stipulates a set of encoding features that must be supported,
as shown in Table 2-7.
Table 2-7 Summary of the encoding features of the six profiles defined by the MPEG-2
standard.
Name               Frames    Colour sub-sampling   Streams   Features
Simple             I, P      4:2:0                 1         Max res SD, no interlace
Main               I, P, B   4:2:0                 1         Simple + interlace + B frames + no max res
SNR scalable       I, P, B   4:2:0                 1-2       Main + SNR scalability
Spatial scalable   I, P, B   4:2:0                 1-2       Main + spatial scalability
High               I, P, B   4:2:2                 1-3       Spatial scalable + 4:2:2
4:2:2              I, P, B   4:2:2                 1         Main + 4:2:2
In addition to profiles, levels are defined that indicate the maximum resolution and bitrate
that can be present in the output stream. The combination of profile and level is normally
indicated in the following format: profile@level. The four levels defined in the MPEG-2
standard can be seen in Table 2-8.
Table 2-8 The four levels defined in the MPEG-2 standard and their maximum resolutions
and bitrates.
Name        Resolution   Max bit rate, Mbits/s
Low         352x288      4
Main        720x576      15
High 1440   1440x1152    60
High        1920x1152    80
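As an illustration of how the profile@level notation might be used, the following sketch checks a stream against the limits of Table 2-8. The dictionary MPEG2_LEVELS and the function fits_level are hypothetical names introduced here, and only the level part of the string is examined.

```python
# Level limits copied from Table 2-8: (max width, max height, max bitrate in Mbits/s).
MPEG2_LEVELS = {
    "Low": (352, 288, 4),
    "Main": (720, 576, 15),
    "High 1440": (1440, 1152, 60),
    "High": (1920, 1152, 80),
}


def fits_level(profile_at_level, width, height, mbits_per_s):
    """Check a stream against the level named in a 'profile@level' string."""
    _profile, level = (part.strip() for part in profile_at_level.split("@"))
    max_width, max_height, max_rate = MPEG2_LEVELS[level]
    return width <= max_width and height <= max_height and mbits_per_s <= max_rate


print(fits_level("Main@Main", 720, 576, 6))   # within the Main level limits
print(fits_level("Main@Low", 720, 576, 3))    # resolution exceeds the Low level
```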
2.3.5. H.263
The H.263 standard is a low bitrate video encoder developed by the ITU as a replacement
for H.261. H.263 can be viewed as an updated version of H.261, having additional
features but still being aimed at the video conferencing market[45]. Its more advanced
features have made H.263 the codec of choice not only for the public switched telephone
networks (PSTN) and ISDN based systems for which it was originally designed, but also
for video conferencing products operating over a range of network specifications. The
basic compression techniques have not changed as H.261 has matured into H.263, with
ME/MC and DCT still being the methods of choice. MC precision has increased from a
full to a half pixel, producing higher resolution MVs and thereby reducing the residue to be
encoded. The latest version of H.263 is known as H.263v2 or H.263+, and describes a
range of optional advanced features. In addition to VLC for DCT coefficient
compression, H.263 specifies the option to use a syntax-based arithmetic coding scheme
(SAC)[48]. In addition to the B frame found in MPEG-2, H.263 also defines the PB
frame, which is a combination of the P and B frames in that MVs are produced from both
the previous and future frames. As for MPEG-2, there are the options of scalability in
both the SNR and spatial domains. The rather restrictive limitation in input resolution has
also been lifted, allowing for three extra resolutions to be used: SQCIF, 4CIF and 16CIF,
corresponding to sub-QCIF (128x96), four times CIF (704x576) and sixteen times CIF
(1408x1152).
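The half-pixel MC precision mentioned above relies on interpolating sample values between real pixels. A minimal sketch follows, assuming a plain average of the neighbouring pixels; the function name sample_half_pel is hypothetical, no boundary checks are made, and the exact integer rounding rules of the standard are not reproduced.

```python
def sample_half_pel(frame, x2, y2):
    """Sample a frame at position (x2 / 2, y2 / 2).

    Coordinates are given in half-pixel units, so even values land on
    real pixels and odd values fall between two (or four) pixels.
    """
    x0, y0 = x2 // 2, y2 // 2
    x1 = x0 + (x2 % 2)          # neighbour to the right, if needed
    y1 = y0 + (y2 % 2)          # neighbour below, if needed
    total = frame[y0][x0] + frame[y0][x1] + frame[y1][x0] + frame[y1][x1]
    return total / 4            # plain average of the contributing pixels


frame = [[10, 20],
         [30, 40]]
print(sample_half_pel(frame, 0, 0))  # on a real pixel
print(sample_half_pel(frame, 1, 0))  # halfway between two horizontal neighbours
print(sample_half_pel(frame, 1, 1))  # centre of the four pixels
```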
2.3.6. MPEG-4 
MPEG-4 is actually the third video standard produced by MPEG, with MPEG-3 being
abandoned since its goals were fulfilled by MPEG-2. MPEG-4 is a multi-media
object-oriented standard and is divided into 22 parts, as shown in Table 2-9[49].
Table 2-9 The subsections of the MPEG-4 standard.
MPEG-4 sub-section   Responsibility
1                    Systems
2                    Visual
3                    Audio
4                    Conformance testing
5                    Reference Software
6                    Delivery Multimedia Integration Framework
7                    Optimised software for MPEG-4 tools
8                    4 on IP framework
9                    Reference Hardware Description
10                   Advanced Video Coding
11                   Scene Description and Application Engine
12                   ISO Base Media File Format
13                   IPMP Extensions
14                   MP4 File Format
15                   AVC File Format
16                   Animation Framework eXtension (AFX)
17                   Streaming Text Format
18                   Font compression and streaming
19                   Synthesized Texture Stream
20                   Lightweight Application Scene Representation
21                   MPEG-J Extension for rendering
22                   Open Font Format
There are two parts in the standard that specify video compression schemes, namely part
2 (visual) and part 10 (advanced video coding). These two are examined individually
below. The MPEG-4 syntax is that of an object-oriented system, with the video
encoder adhering to a strict hierarchy. At the top of the hierarchy is the video session
(VS) containing various video objects (VO). The VOs need not be rectangular collections
of pixels, but can be collections of arbitrary shapes such as text or channel ident
overlays. A VO can consist of a number of video object layers (VOL), each layer storing
either temporal or spatial scalability information regarding the given VO. The bottom
level is the video object plane (VOP), that is the frame itself.
2.3.6.1. Part 2 Visual
MPEG-4 part 2 (often simply referred to as MPEG-4) is the video compression scheme
responsible for creating VOs. The scheme is based on ME/MC and DCT techniques and
is designed to produce good compression and quality video at bitrates from 5 kbits/s to
more than 1 Gbit/s, and resolutions ranging from SQCIF to studio, greater than 4000 x
4000 pixels[50]. As in previous generations of video codecs from MPEG, MPEG-4
implements new features in an effort to increase the compression ratios.
MPEG-4 specifies three different types of scalability in the VOLs: spatial, temporal and
fidelity. As in MPEG-2, spatial scalability provides the ability to transmit different
resolution images in one bitstream. Temporal scalability allows different frame rates to be
used, permitting faster decoders to decode the stream at a higher frame rate than, say, a
mobile decoder operating with a limited power budget. Fidelity scalability uses the same
principles as SNR scalability defined in earlier standards. With these three different
scalabilities, the codec is designed to have the capability of producing a set of streams
suitable for either decoder. Where there is partial loss of data in transmission, a scalability
scheme is able to deliver a gradual degradation of image quality as opposed to the
catastrophic breakdown that would occur if only a single stream were present.
To aid recovery when there is packet or signal loss during transmission, the
MPEG-4 standard incorporates error recovery and resynchronisation features. To help recover the
decoding process in a frame, resynchronisation markers are inserted into frame slices,
where a slice is a group of MBs arranged in scan order. The MBs in any given slice are
decoded independently of MBs in other slices, allowing slices to be decoded even if block
information outside that slice is missing. Such independence, combined with the
resynchronisation markers, permits full decoding to recommence at the slice boundary
following the transmission error. In addition to this, methods are also employed to enable
recovery from errors found in a slice. In an attempt to alleviate the problems produced by
burst errors, MV and DCT coefficients are interleaved, allowing the more visually crucial
MV data to be spread among the less sensitive DCT coefficients. Finally, inserting a
resynchronisation marker and reversing the order of the VLC coefficients immediately
following that marker increases the probability of recovery from errors in the VLC data.
In order to produce a codec that is not only suitable for low-bandwidth mobile video
clips, but also for studio quality film storage, additional features and refinements have
been implemented. In early MPEG standards, the only colour scheme available was 4:2:0,
whereas MPEG-4 allows two additional schemes, namely 4:2:2 and 4:4:4, that not only store
colour information at an improved resolution but also provide a higher correlation
between the input colour and its sub-sampled counterpart. As the DCT coefficients
exhibit different properties in low compared to high bitrate sequences, MPEG-4 makes
available a range of different VLC tables suited to the compression of specific bitrate
sequences. As seen in previous standards, increasing the accuracy of ME predictions
makes it possible to produce a better match between MBs and hence reduce the residue
(error) data. To achieve this effect the pixel resolution of the MVs has been increased
from half to a quarter of a pixel. During scenes that involve panning, the motion of large
areas of the frame can be seen as identical, and MPEG-4 includes a method known as
global motion compensation (GMC) that defines MVs for the whole frame. MVs can be
computed with respect to this global offset, reducing the magnitude that individual MVs
need to record and hence the bitrate required.
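The benefit of coding MVs relative to a global offset can be seen in a deliberately simplified sketch. It assumes a purely translational global motion vector and hypothetical helper names (to_differential, from_differential); it is not intended to represent the GMC syntax of the standard.

```python
def to_differential(mvs, global_mv):
    """Express each MB motion vector relative to the global motion vector."""
    gx, gy = global_mv
    return [(x - gx, y - gy) for x, y in mvs]


def from_differential(diffs, global_mv):
    """Recover the absolute motion vectors at the decoder."""
    gx, gy = global_mv
    return [(dx + gx, dy + gy) for dx, dy in diffs]


# Illustrative MVs for four MBs during a camera pan to the right.
panning_mvs = [(8, 0), (8, 0), (9, 0), (8, 1)]
global_mv = (8, 0)   # assumed global offset for the whole frame

diffs = to_differential(panning_mvs, global_mv)
print(diffs)  # [(0, 0), (0, 0), (1, 0), (0, 1)]
```

The differences are mostly zero or close to zero, so a variable length code can represent them with fewer bits than the original vectors.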
To allow decoders to conform to a given set of features, profiles and levels are defined as
in MPEG-2. Three profiles are defined, namely simple, advanced simple and main, each
containing a number of levels, as shown in Table 2-10.
Table 2-10 The three profiles of MPEG-4 showing their individual levels.
Profile           Level   Typical resolution   Max number of objects   Max bitrate (kbits/s)
Simple            L1      176x144              4x simple               64
Simple            L2      352x288              4x simple               128
Simple            L3      352x288              4x simple               384
Advanced simple   L1      176x144              4x AS or simple         128
Advanced simple   L2      352x288              4x AS or simple         384
Advanced simple   L3      352x288              4x AS or simple         768
Advanced simple   L4      352x576              4x AS or simple         3000
Advanced simple   L5      720x576              4x AS or simple         8000
Main              L2      352x288              16x simple or main      2000
Main              L3      720x564              32x simple or main      15000
Main              L4      1920x1088            32x simple or main      38400
Features by profile:
Simple: I-VOP and P-VOP only; 4-MV; unrestricted MV; only rectangular VOP.
Advanced simple (AS): Simple plus adaptive quantisation; interlace encoding; quarter pixel; global motion compensation; B-VOP.
Main: AS plus semi-transparent VOP; sprite VOP.
2.3.6.2. Part 10 Advanced video coding / H.264
MPEG-4 part 10, also known as H.264, is the result of a partnership between MPEG and ITU, known
as the Joint Video Team (JVT), to produce an advanced video compression scheme[51, 52].
H.264 has been designed to be a multi-purpose encoding standard achieving good results
in both high definition work and low bitrate environments. To ensure flexibility and
compatibility for such a wide range of formats, the standard divides the workload into
two layers, as shown in Figure 2-32.
[Figure: block diagram showing the video coding layer, coded MB and coded slice]
Figure 2-32 Distinction between the video coding layer and the network abstraction layer of the
H.264 standard.
The video coding layer (VCL) is responsible for producing a true representation of the
video sequence. This is done, as in earlier MPEG standards, through the implementation
of ME/MC and transformation. The network abstraction layer (NAL) is responsible for
formatting the data received from the VCL and providing header information to allow for
successful distribution to transport layers or storage media[53].
Adopting the optional entropy coding scheme presented in H.263+, H.264 defines two
schemes, namely context-adaptive variable length coding (CAVLC) and context-adaptive
binary arithmetic coding (CABAC)[54]. CAVLC is a form of VLC where a number of
VLC tables can be defined, and the appropriate selection of a table can be made during
transmission, resulting in an improvement in compression compared with a single-table
VLC. CABAC uses arithmetic coding, a lossless compression scheme that is more
efficient than CAVLC. When operating in conjunction with context adaptation it gives a
reduction in bitrate of 5 to 15%[55], but at the cost of higher computational
complexity[56]. In arithmetic coding, the assignment of a non-integer number of bits per
symbol is possible.
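The advantage of assigning a non-integer number of bits per symbol can be illustrated numerically. The symbol probabilities below are assumed for illustration; the ideal cost of a symbol with probability p is -log2(p) bits, which an arithmetic coder can approach, whereas a prefix code must spend a whole number of bits on each symbol.

```python
import math

probabilities = {"a": 0.9, "b": 0.05, "c": 0.05}   # illustrative symbol statistics

# Ideal cost per symbol: -log2(p) bits, generally not an integer.
ideal_bits = {s: -math.log2(p) for s, p in probabilities.items()}
entropy = sum(p * ideal_bits[s] for s, p in probabilities.items())

# A prefix code must use whole bits: here 'a' -> 0, 'b' -> 10, 'c' -> 11.
prefix_bits = {"a": 1, "b": 2, "c": 2}
prefix_average = sum(p * prefix_bits[s] for s, p in probabilities.items())

print("ideal bits for 'a':", round(ideal_bits["a"], 3))
print("entropy, bits per symbol:", round(entropy, 3))
print("prefix code, bits per symbol:", round(prefix_average, 3))
```

With these statistics the most probable symbol ideally costs well under one bit, so a coder that is restricted to whole bits per symbol cannot reach the entropy.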
H.264 introduces intra frame prediction in which encoded MBs in the current frame are
used to reduce the quantity of data to be transformed and quantised. Intra MB prediction
can be applied to blocks of size 4x4, 8x8 or 16x16. There are nine different methods of
prediction for a 4x4 block, as shown in Figure 2-33, and four for both 8x8 and 16x16
blocks, as shown in Figure 2-34.
[Figure: a 4x4 block with neighbouring samples A-H above, I-L to the left and M at the top-left corner, drawn once for each mode: 0, vertical; 1, horizontal; 2, DC (mean of A-D and I-L); 3, diagonal down-left; 4, diagonal down-right; 5, vertical-right; 6, horizontal-down; 7, vertical-left; 8, horizontal-up]
Figure 2-33 The nine 4x4 intra prediction modes in the H.264 standard.
[Figure: a block with neighbouring samples H and V, drawn once for each mode: 0, vertical; 1, horizontal; 2, DC (mean of H and V); 3, plane]
Figure 2-34 The four 8x8 and 16x16 intra prediction modes in the H.264 standard.
To reduce the bit rate further, a method of encoding which prediction method to use is
also employed. By analysing the prediction methods employed in neighbouring blocks, a
"most probable" prediction method can be computed. To indicate if this specific
prediction is used a single bit in the bitstream is set.
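Three of the 4x4 intra prediction modes of Figure 2-33 (vertical, horizontal and DC) are simple enough to sketch directly. In the sketch below the names predict_4x4, sad and best_mode are hypothetical, the directional modes 3 to 8 are omitted, and choosing the mode by the sum of absolute differences (SAD) is an assumed encoder strategy rather than something mandated by the standard.

```python
def predict_4x4(mode, above, left):
    """Predict a 4x4 block from the four samples above (A-D) and to the left (I-L)."""
    if mode == 0:    # vertical: copy the row above down each column
        return [list(above) for _ in range(4)]
    if mode == 1:    # horizontal: copy the left column across each row
        return [[left[r]] * 4 for r in range(4)]
    if mode == 2:    # DC: every sample is the rounded mean of A-D and I-L
        dc = (sum(above) + sum(left) + 4) // 8
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("only modes 0, 1 and 2 are sketched here")


def sad(block, prediction):
    """Sum of absolute differences between a block and its prediction."""
    return sum(abs(b - p) for rb, rp in zip(block, prediction) for b, p in zip(rb, rp))


def best_mode(block, above, left):
    """Pick the sketched mode with the smallest residue (assumed selection rule)."""
    return min((0, 1, 2), key=lambda m: sad(block, predict_4x4(m, above, left)))


above = [10, 20, 30, 40]
left = [5, 5, 5, 5]
striped = [[10, 20, 30, 40] for _ in range(4)]   # vertical stripes
print(best_mode(striped, above, left))           # 0, the vertical mode
```

Only the residue between the block and its prediction then needs to be transformed and quantised.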
In the inter-frame encoding in H.264 motion estimation is used, but a range of block sizes
is available in order to represent regions of high detail with smaller block sizes and larger
block sizes for homogeneous areas.
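[Figure: a 16x16 MB shown as a) one 16x16 partition, b) two 8x16 partitions, c) two 16x8 partitions and d) four 8x8 partitions, with the partitions numbered from 0]
Figure 2-35 16x16 MB partitions available in H.264
[Figure: an 8x8 block shown as a) one 8x8 partition, b) two 4x8 partitions, c) two 8x4 partitions and d) four 4x4 partitions, with the partitions numbered from 0]
Figure 2-36 8x8 MB partitions available in H.264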
Figure 2-35[57] indicates how a 16x16 MB can be partitioned in four different ways.
When the MB is divided into four 8x8 blocks, each can be sub-divided further as shown in
Figure 2-36. The choice of which partition pattern to apply to each MB is made to
produce the fewest bits per MB.
In contrast with earlier standards, where the reference frames used in ME were limited to
one per P frame and two per B frame, the H.264 standard allows a larger number of
reference frames to be considered. In addition, H.264 allows B frames to act as reference
frames as well as I and P frames. In general these improvements allow the calculation of
MVs that result in smaller residuals during MC, but at the expense of computational
complexity and extra storage.
Previous codecs have used the complicated DCT computation to transform their pixel
data into the frequency domain. The nature of this transform is that each implementation
has a different approximation to the ideal, meaning that an exact match between the
original data and the inversely transformed data at the decoder is not achieved. This
variation arises from the fact that floating point arithmetic is required to perform the DCT,
and hence the precision of the system performing the transform affects the accuracy of
the result. In H.264, a transform that uses integer arithmetic is defined, which ensures that
different implementations produce the same results[58].
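The integer-only nature of such a transform can be demonstrated with a short sketch. The matrix CF below is the 4x4 core transform matrix commonly quoted for H.264; it is not listed in the text above, so it should be read as an assumption of this example, and the scaling that the standard folds into quantisation is omitted.

```python
# 4x4 integer core transform matrix (assumed here; scaling is omitted).
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward_transform(block):
    """Compute CF * block * CF^T using integer arithmetic only."""
    return matmul(matmul(CF, block), transpose(CF))


flat = [[1] * 4 for _ in range(4)]   # a block of constant value
coefficients = forward_transform(flat)
print(coefficients[0])               # all energy falls in the first (DC) coefficient
```

Because every operation is an integer addition, subtraction or multiplication by a small constant, two implementations cannot disagree through rounding, unlike a floating point DCT.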
H.264 introduces two new frame types, SP and SI, in addition to I, P and B. The SP frame
type uses MC prediction in the same manner as in P frames, but allows for duplicate
predictions from different source reference frames to allow bitstream switching, such as
changing between resolutions, splicing (inserting advertising breaks into a television
programme) and error recovery should the reference frame for one of the SP frames be
missing[59]. The SI frame type employs only intra frame coding techniques, but multiple
I frames are encoded for each physical frame. Since decoding is independent of other
frames, the additional SI frames can be used for random access and error recovery
purposes.
A known product of block-based video compression schemes is the presence of blocking
artefacts. These occur when there are insufficient bits available to encode the
frame without visual degradation, so that the block structure used to encode the
frame becomes visible. In an attempt to counteract
this problem, H.264 includes an adaptive de-blocking filter. By placing the filter in the
MC loop, a filtered frame can be used as a reference frame. The difference in pixel values
at block boundaries is determined, and a relatively large absolute difference is used to
trigger the de-blocking filter. However, if the difference is so large that it cannot be
explained by the coarseness of quantisation, a boundary in the image is assumed and no
smoothing is applied.
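The decision logic described above can be sketched for a single pair of samples either side of a block boundary. The function filter_edge, its two thresholds and the simple averaging used for smoothing are all hypothetical; the filter in the standard derives its thresholds from the quantisation parameter and examines more samples on each side.

```python
def filter_edge(p0, q0, low_threshold, high_threshold):
    """Decide whether to smooth across a block boundary between samples p0 | q0."""
    step = abs(p0 - q0)
    if step < low_threshold:
        return p0, q0            # negligible step: nothing to repair
    if step >= high_threshold:
        return p0, q0            # too large for quantisation noise: assume a real edge
    average = (p0 + q0) / 2
    # Pull both samples halfway towards their average to soften the step.
    return (p0 + average) / 2, (q0 + average) / 2


print(filter_edge(100, 110, 4, 40))   # moderate step: smoothed
print(filter_edge(100, 200, 4, 40))   # large step: left untouched as an image edge
```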
As in other video coding standards, certain conformance points are specified to achieve
compatibility between encoders and decoders. As in earlier standards, H.264 uses profiles
to specify certain algorithmic and coding methods used in the bitstream, and levels are
used to apply limitations to the parameters used in the encoder. There are seven profiles
specified in the H.264 standard, as shown in Figure 2-37.
Profile Features 
Baseline • Only I and P type slices allowed 
• Progressive stream only 
• Only CAVLC 
• Restriction on the number of slices and the maximum 
H.264 level (Figure 2-38) 
Extended Baseline plus: 
• All slice types allowed 
• Interlaced streams allowed 
• Restriction on number of slices and the maximum H.264 
level (Figure 2-38) 
Main • Only I, P and B-type slices 
• Interlaced streams allowed 
• CABAC allowed 
High Main plus: 
• 8x8 transformations allowed 
• 8x8 (filtered) intra prediction modes allowed
High 10 High plus:
• 10 bit pixel data allowed
High 4:2:2 High 10 plus:
• 4:2:2 colour sub-sampling allowed
High 4:4:4 High 4:2:2 plus:
• 4:4:4 colour sub-sampling allowed
• 12 bit pixel data allowed
• Transform bypass allowing lossless encoding
Figure 2-37 H.264 profiles
To restrict the rate of information present in a stream, five major levels and an additional
10 sub-levels are defined. The level restrictions relate to the MB rate, the number of MBs
per frame (resolution), the frame buffer (restriction of multi frame prediction) and the vertical
range of the MVs. Figure 2-38 shows all fifteen levels with the corresponding bitrate for
the baseline, extended and main profiles, along with a corresponding approximate
maximum resolution that the bitrate would allow.
Level   Max bitrate for baseline,     Approx. max
        extended and main profiles    resolution at 30fps
1       64 Kbits/s                    128x96
1.1     192 Kbits/s                   176x144
1.2     384 Kbits/s                   320x240
1.3     768 Kbits/s                   352x288
2       2 Mbits/s                     352x288
2.1     4 Mbits/s                     352x480
2.2     4 Mbits/s                     352x480
3       10 Mbits/s                    720x480
3.1     14 Mbits/s                    1280x720
3.2     20 Mbits/s                    1280x1024
4       20 Mbits/s                    1920x1088
4.1     50 Mbits/s                    2048x1024
4.2     50 Mbits/s                    2048x1088
5       135 Mbits/s                   2560x1920
5.1     240 Mbits/s                   4096x2048
Figure 2-38 H.264 levels
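Two of the quantities restricted by a level, the number of MBs per frame and the MB rate, follow directly from the resolution and frame rate. The sketch below assumes 16x16 MBs and uses hypothetical function names; it does not reproduce the numerical limits of any level.

```python
import math


def macroblocks_per_frame(width, height):
    """Number of 16x16 MBs needed to cover a frame, rounding partial MBs up."""
    return math.ceil(width / 16) * math.ceil(height / 16)


def macroblock_rate(width, height, frames_per_second):
    """MBs that must be processed each second at the given frame rate."""
    return macroblocks_per_frame(width, height) * frames_per_second


print(macroblocks_per_frame(176, 144))    # 99
print(macroblock_rate(176, 144, 30))      # 2970
print(macroblock_rate(1920, 1088, 30))    # 244800
```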
2.4. Conclusion 
This review chapter was split into two distinct sections. The first section examined image
compression techniques and the second section the video standards
that use these techniques.
The compression schemes range from lossless approaches that reproduce exactly the
original information to lossy compression methods that are able to dramatically reduce
the required number of bits for storage or transmission by stripping away information that
is undetectable to the human eye. Additionally, techniques can be applied in the moving
image domain, where the use of inter frame encoding techniques was further seen to
reduce the bitrate of a given sequence.
A range of video compression standards have been examined. These include the first
scheme proposed by the ITU, namely H.261 in 1990, followed by a number of more
advanced approaches instigated by both the ITU and MPEG, culminating in the most
recent standard developed jointly by ITU and MPEG, namely H.264. As the encoders
have evolved through a number of generations, the standards bodies have introduced new
techniques to improve compression performance but at the expense of processing power
requirement. Particularly in portable systems, where power consumption is a consideration
of great importance, it is vital that operations performed on video streams consume as
little power as possible. As discussed in chapter 3, a parallel solution is likely to consume
less power than a serial one and, considering the popularity of video in recent consumer
electronic devices, this approach is particularly apt in the implementation of modern
video standards.
Using the knowledge the author has gained from his survey of video compression 
techniques, later chapters will show that parallel techniques can be brought to bear on the 
block-based video encoder in order to improve throughput. 
2.5. References 
[1] A. A. Kassim, Y. Pingkun, L. Wei Siong, and K. Sengupta, "Motion
compensated lossy-to-lossless compression of 4-D medical images using integer
wavelet transforms," Information Technology in Biomedicine, IEEE Transactions
on, vol. 9, pp. 132-138, 2005.
[2] C. Zuo-Dian, C. Ruey-Feng, and K. Wen-Jia, "Adaptive predictive multiplicative
autoregressive model for medical image compression," Medical Imaging, IEEE
Transactions on, vol. 18, pp. 181-184, 1999.
[3] J. C. Russ, The image processing handbook. Boca Raton, Fla.; London: CRC,
2002.
[4] F. Ercal, M. Allen, and H. Feng, "A systolic image difference algorithm for
RLE-compressed images," Parallel and Distributed Systems, IEEE Transactions on,
vol. 11, pp. 433-443, 2000.
[5] "Standardization of Group 3 facsimile terminals for document transmission,"
ITU-T Recommendation T.4, 2003.
[6] D. A. Huffman, "A Method for the Construction of Minimum-Redundancy
Codes," Proceedings of the IRE, vol. 40, pp. 1098-1101, 1952.
[7] D. Salomon, Data compression: the complete reference. New York; London:
Springer, 2000.
[8] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate
coding," Information Theory, IEEE Transactions on, vol. 24, pp. 530-536, Sep
1978.
[9] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data
Compression," Information Theory, IEEE Transactions on, vol. 23, pp. 337-343,
1977.
[10] M. Kjelso, M. Gooch, and S. Jones, "Design and performance of a main memory
hardware data compressor," in EUROMICRO 96. 'Beyond 2000: Hardware and
Software Design Strategies', Proceedings of the 22nd EUROMICRO Conference,
Prague, Czech Republic, 1996, pp. 423-430.
[11] T. A. Welch, "A Technique for High-Performance Data Compression," IEEE
Computer, vol. 17, pp. 8-19, June 1984.
[12] "GIF - Graphics Interchange Format: A standard defining a mechanism for the
storage and transmission of raster-based graphics information," CompuServe
Incorporated, 1987.
[13] "TIFF Revision 6.0," Adobe Developers Association, 1992.
[14] M. D. Fairchild, Color appearance models. Chichester: John Wiley, 2005.
[15] B. E. Schmitz and R. L. Stevenson, "Enhancement of sub-sampled chrominance
image data," in 38th Midwest Symposium on Circuits and Systems, 1995,
pp. 133-136 vol. 1.
[16] A. PK, K. H, and P. R, "Identification of a subtype of cone photoreceptor, likely
to be blue sensitive, in the human retina," The Journal of comparative neurology,
vol. 225, pp. 18-34, 1987.
[17] R. M. Berne and M. N. Levy, Principles of Physiology, 3rd ed.: Mosby
Publishers, 1996.
[18] "Specifications of Consumer-Use Digital VCRs using 6.3mm magnetic tape," in
HD Digital VCR Conference, 1994.
[19] "Digital Video Broadcasting (DVB); Framing structure, channel coding and
modulation for digital terrestrial television," vol. ETSI EN 300 744 V1.5.1:
European Telecommunications Standards Institute, 2004.
[20] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete Cosine Transform,"
Computers, IEEE Transactions on, vol. C-23, pp. 90-93, 1974.
[21] J. Watkinson, The Engineer's guide to compression. Petersfield: Snell & Wilcox,
1996.
[22] A. Bateman and I. Paterson-Stephens, The DSP handbook: algorithms,
applications and design techniques: Prentice Hall, 2001.
[23] S. A. Martucci, "Symmetric convolution and the discrete sine and cosine
transforms," Signal Processing, IEEE Transactions on, vol. 42, pp. 1038-1051,
1994.
[24] S. J. Solari, Digital video and audio compression. New York: McGraw-Hill,
1997.
[25] J. Ziv, "Coding theorems for individual sequences," Information Theory, IEEE
Transactions on, vol. 24, pp. 405-412, 1978.
[26] N. Lu, Fractal imaging. San Diego; London: Academic Press, 1997.
[27] J. C. Hart, "Fractal image compression and recurrent iterated function systems,"
Computer Graphics and Applications, IEEE, vol. 16, pp. 25-33, 1996.
[28] J. Mukherjee, P. Kumar, and S. K. Ghosh, "A graph-theoretic approach for
studying the convergence of fractal encoding algorithm," Image Processing,
IEEE Transactions on, vol. 9, pp. 366-377, 2000.
[29] G. Lu and T. L. Yew, "Image compression using partitioned iterated function
systems," in Image and Video Compression, Proceedings of SPIE, 1994,
pp. 122-133.
[30] A. E. Jacquin, "Fractal image coding: a review," Proceedings of the IEEE, vol.
81, pp. 1451-1465, 1993.
[31] B. Wohlberg and G. De Jager, "A review of the fractal image coding literature,"
Image Processing, IEEE Transactions on, vol. 8, pp. 1716-1729, 1999.
[32] J. Cardinal, "Fast fractal compression of greyscale images," Image Processing,
IEEE Transactions on, vol. 10, pp. 159-164, 2001.
[33] O. Rioul and M. Vetterli, "Wavelets and signal processing," Signal Processing
Magazine, IEEE, vol. 8, pp. 14-38, 1991.
[34] E. J. Stollnitz, A. D. DeRose, and D. H. Salesin, "Wavelets for computer
graphics: a primer," Computer Graphics and Applications, IEEE, vol. 15,
pp. 76-84, 1995.
[35] N. M. Nasrabadi and R. A. King, "Image coding using vector quantization: a
review," Communications, IEEE Transactions on, vol. 36, pp. 957-971, 1988.
[36] A. Nakada, T. Shibata, M. Konda, T. Morimoto, and T. Ohmi, "A fully parallel
vector-quantization processor for real-time motion-picture compression,"
Solid-State Circuits, IEEE Journal of, vol. 34, pp. 822-830, 1999.
[37] C. Freek, J. M. M. Sousa, W. Hentschel, and W. Merzkirch, "On the accuracy of
a MJPEG-based digital image compression PIV-system," Experiments in Fluids,
vol. 27, pp. 310-320, 1999.
[38] A. C. Luther, Principles of digital audio and video. Norwood, Mass.; London:
Artech House, 1997.
[39] D. Le Gall, "MPEG: A video compression standard for multimedia applications,"
Communications of the ACM, vol. 34, pp. 46-58, 1991.
[40] G. Cote and L. Winger, "Recent Advances in Video Compression Standards,"
IEEE Canadian Review, pp. 21-24, 2002.
[41] R. Li, B. Zeng, and M. L. Liou, "A new three-step search algorithm for block
motion estimation," Circuits and Systems for Video Technology, IEEE
Transactions on, vol. 4, pp. 438-442, 1994.
[42] P. N. Tudor, "MPEG-2 video compression," IEE Electronics & Communication
Engineering Journal, vol. 7, pp. 257-264, 1995.
[43] "Video Codec for Audiovisual Services at p x 64 kbits/s," ITU-T
Recommendation H.261, 1993.
[44] "International Telecommunication Union," http://www.itu.int.
[45] "Video Coding for Low Bitrate Communication," ITU-T Recommendation
H.263, 1996.
[46] "Coding of moving pictures and associated audio for digital storage media at up
to about 1,5 Mbit/s," vol. 11172: ISO/IEC, 1993.
[47] "Generic coding of moving pictures and associated audio (MPEG-2)," vol.
13818: ISO/IEC, 1995.
[48] G. Cote, B. Erol, M. Gallant, and F. Kossentini, "H.263+: video coding at low bit
rates," Circuits and Systems for Video Technology, IEEE Transactions on, vol. 8,
pp. 849-866, 1998.
2. Video Compression 
[49] "Information technology- Cod ing o f audio-visual objects.'' ISO/IEC 14496, 
2000. 
[50] "ln formation technology- Cod ing of audio-visual objects-- Part 2: V isual," 
JSO/IEC 14496-2, 2004. 
6/ 
[511 "Advanced video coding for generic audiovisual services ": ITU-T 
Recomme ndation H.264, 2005. 
1521 "In formation technology- Coding of audio-vi sual objects-- Part 10: Advanced 
Video Coding," ISO/IEC 14496-10, 2005. 
[531 "Advanced Video Coding." vol. I 1496-10: IT U-T Rec. H.264 I ISO/lEC 2002. 
[54j J. L. Nunez-Yanez, V. A. Chouliaras, D. Alfonso, and F.S.Rovati. "Hardware 
Assisted Rate Distortion Optimization with Embedded CABAC Accele rator for 
the H.264 Advanced Video Codec," Consumer Electronics, IEEE Transactions 
on vol. 52, pp. 590-597, 2006. 
1551 D. Marpe, H. Schwarz. and T. Wiegand, "Context-based adaptive binary 
arithmetic coding in the H.264/AVC video compression standard, " Circuits and 
Systems j01· Video Technology, IEEE Transactions on, vol. 13. pp. 620-636, 
2003. 
1561 J. L. Nunez and V. A. Chouliaras. "High-performance arithmetic coding VLS I 
macro for the H264 video compression standard," Consumer Electronics, IEEE 
Transactions 011, vol. 5l. pp. 144-151 ,2005. 
f571 1. E. G . Ric hardson. "H264 White Papers," Video & Image Compression 
Resources and Research 2002. 
158] T . Wiegand, G. J. Sullivan. G. Bjntegaard, and A. Luthra "Overview of the 
H .264/ A VC video coding standard." Circuits and Systems for Video Technology, 
IEEE Transactions on, vol. 13, pp. 560-576, 2003. 
[591 M . Karczewicz and R. Kurcere n, "The SP- and Sl-frames design for 
H.264/A VC," Circuits and Systems for Video Technology, IEEE Transactions on, 
vol. 13, pp. 637-644. 2003. 
CHAPTER 3: 
PARALLELISM TECHNIQUES 
3.1. Parallelism 
The purpose of exploiting parallelism is to improve the execution performance of an
application. Parallel techniques yield greatest benefits when implemented with due
consideration of the nature of the targeted algorithm and its associated memory.
In order to effectively map computational workloads to computing platforms, Michael
Flynn classified (Flynn's taxonomy) computer architectures in terms of their methods of
executing instructions and processing data into four possible groups[1].
• Single instruction single data (SISD)
• Multiple instruction single data (MISD)
• Single instruction multiple data (SIMD)
• Multiple instruction multiple data (MIMD)
A scalar uni-processor would, according to Flynn's taxonomy, be classified as a SISD
system, as only one instruction is issued at any one time and that instruction operates on
one specific piece of data. MISD machines do not improve the throughput of a system
since they issue multiple instructions on the same dataset, but in doing so they allow
a degree of redundancy to be introduced, which is vital for safety-critical systems. The
final two classifications specify different approaches to exploiting parallelism by either
issuing one (SIMD) or more (MIMD) instructions or operations on multiple data values
concurrently.
In these latter two categories, three different forms of parallelism are identified, namely
thread-level parallelism (TLP), data-level parallelism (DLP) and instruction-level
parallelism (ILP). TLP is a subset of MIMD, involving multiple processors operating in
parallel, and can be separate instruction streams executing on separate functional units
(processor contexts), on separate datasets (multi-programming) or the same datasets
(multi-threading). This form of parallelism is efficient when applied at function-level
granularity in which distinct sections of the control flow graph (CFG) are allocated to
separate processor contexts. DLP applies in the case where multiple items of data,
typically within an inner loop, can be processed in parallel. This is generally achieved in a
vector machine, as vector architectures are the most efficient means for exploiting this
type of parallelism. In ILP the parallelism quantum is at the individual instruction level
where multiple instructions are issued from a single instruction stream at a rate greater
than one per cycle. ILP microarchitectures can be differentiated by their instruction
scheduling techniques, examples being Very Long Instruction Word (VLIW) machines[2]
and SuperScalar microarchitectures.
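As a minimal sketch of the kind of inner loop that DLP targets, the function below computes a
sum of absolute differences (SAD) over one row of 16 pixels, a common building block of
block-matching motion estimation. The name sad_row16 and the row width of 16 are illustrative
assumptions and are not taken from any particular encoder. Every per-pixel difference is
independent of the others, so a vector machine could in principle evaluate several of them with
one instruction, leaving only the final accumulation as a reduction.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: SAD over one 16-pixel row. */
unsigned sad_row16(const uint8_t *cur, const uint8_t *ref)
{
    unsigned sum = 0;
    for (int i = 0; i < 16; i++) {
        /* Each difference depends only on element i, so the iterations
         * are candidates for data-level parallel execution. */
        sum += (unsigned)abs((int)cur[i] - (int)ref[i]);
    }
    return sum;
}
```

The scalar loop performs 16 subtractions in sequence; the same work maps naturally onto vector
lanes because no iteration reads a value written by another.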
This chapter examines two of these forms of parallelism in the context of the video
coding workloads described in chapter 2. It will be shown how these forms of parallelism
can be extracted and how the application code can be re-structured to make parallel
exploitation possible.
3.2. Thread-Level Parallelism 
TLP refers to the concurrent execution of disjoint areas of the CFG on multiple
processor contexts in the absence of data dependencies[3, 4]. To effectively exploit TLP,
a range of architectures have been developed in order to execute multiple threads in
parallel. TLP-capable systems fall into two distinct classes according to the method by
which the threads communicate, namely shared and distributed (message passing)
memory implementations. A third class has been proposed that combines these two forms
of parallel systems.
Figure 3-1 Memory topology of thread-level parallel systems: a) shared, b) distributed and
c) distributed shared.
A shared memory system is depicted in Figure 3-1a. It consists of a single addressable
memory space, accessible to all threads. Since all processing elements (PEs) operate on
the same memory space, ensuring coherency between their individual caches is essential.
In a distributed system on the other hand, as depicted in Figure 3-1b, each PE has its own
privately-addressable memory space. In this topology, the PEs work individually on their
private data and, through the use of special load/store instructions, are able to transfer
data from one memory space to the other and achieve synchronisation. The distributed
shared memory system of Figure 3-1c combines these topologies to provide
distributed private memory for each PE in addition to a shared memory resource
accessible to all PEs.
Due to the large volume of data that is processed within a multiprocessing system, the
memory sub-system's design and configuration will play an important role in determining
the achievable saving that can be obtained through exploiting thread-level parallelism.
This is affected not only by the memory topology, Figure 3-1, but also at the architectural
level by the memory's available bandwidth to service each node with the required data, at
the configuration level through the use of appropriately sized and configured caches at both
L1 and L2, and finally at the application level, where additional exploitation can be
achieved through knowledge of the memory configuration and subsequently
accessing data from one processor which has been locally cached by another. Within
this work, however, the focus will be on examining the potential savings obtained through
exploiting TLP techniques only and thus not studying the effect of the memory system
on such results.
Figure 3-2 Symmetric and asymmetric multi-processing configurations in a shared memory
system.
Shared memory multi-processing systems can be classified into symmetric multi-processing
(SMP) and asymmetric multi-processing (AMP) based on the scope of the
software application that each CPU executes, as shown in Figure 3-2. In an SMP system,
all CPUs execute a single program, which could either be a stand-alone multi-threaded
embedded application or a multi-processor aware operating system (OS)[5]. In an AMP
system, each CPU executes its own set of applications, which could be utilised to meet
real-time requirements by dedicating a single CPU to a time-sensitive task while leaving the
remaining CPUs to execute non-time-sensitive tasks[6]. SMP and AMP are non-exclusive
configurations and can be combined to produce a system that comprises both. This
could result in a number of CPUs executing a single application/OS in an SMP
configuration, which itself acts as one AMP system with regard to other AMP CPUs.
In addition to the topological configurations described above, there are a number of
different architectural approaches available to execute multiple threads in parallel. In a
conventional scalar uni-processor there is only one execution path and hence only one
instruction can be executed at one time. In such a system, threads are employed to ensure
the processor continues operating during a long latency case such as a load or cache miss
event.
[Figure: cycle-by-cycle timeline (cycles 0 to 20) of hardware contexts HC0 and HC1, with the
following system annotations]
• Active context 0, system active
• Context 0 miss, not at switching boundary, therefore can't switch context, system stalls
• Context 0 still stalled, switching boundary, therefore system switches to context 1, system
active
• Context 0 receives data, enters idle state
• Context remains active, no switch necessary
• Context 1 miss, not at switching boundary, therefore can't switch context, system stalls
• Context 1 still stalled, switching boundary, therefore system switches to context 0, system
active
Figure 3-3 Temporal multi-threading illustrating how, through switching active context, stall
time can be reduced.
In temporal multi-threading, one execution path can switch between two or more
hardware contexts in an attempt to reduce system idle time. This is achieved by switching
out stalled contexts, at distinct time periods, in favour of unstalled contexts. Figure 3-3
illustrates a temporal multi-threaded system with two hardware contexts and a six-cycle
temporal period. This thread management system cannot produce an instructions per cycle
(IPC) figure of more than one since there is still only one execution path; however, through
context switching, it is possible to reduce the idle time of the processor[7].
[Figure: cycle-by-cycle timeline (cycles 0 to 20) of hardware contexts HC0 and HC1, with the
following system annotations]
• Active context 0, system active
• Context 0 miss, switch to context 1, system active
• Context 1 miss, both contexts stalled, system stalls
• Context 0 receives data and becomes active, system active
• Context 1 receives data, enters idle state
• Context 0 miss, switch to context 1, system active
Figure 3-4 Super-threading illustrating how dynamic switching of active context can reduce
system stalls by switching to an available context as soon as there is a stall in the active
context.
An improvement to temporal multi-threading is super-threading[8], as shown in Figure
3-4, which allows switching between threads to occur every cycle, allowing unused
cycles from a given thread's time-slot to be dynamically allocated to a competing thread.
As with temporal multi-threading, super-threading only executes a single instruction per
cycle, but by allowing context switching every cycle the idle time resulting from long
latency instructions is further reduced.
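The difference between the two switching policies can be sketched with a toy model. The code
below is an illustrative assumption rather than a reproduction of Figures 3-3 and 3-4: two
hardware contexts share one execution path, every miss_every-th instruction of a context is
assumed to miss, and a miss keeps that context stalled for miss_latency further cycles. A period
of 6 stands in for temporal multi-threading with a six-cycle switching boundary, and a period of
1 stands in for super-threading. The names simulate, Context, miss_every and miss_latency are
hypothetical.

```c
/* Toy model, not a cycle-accurate simulator. */
typedef struct {
    int ready_at;   /* first cycle at which the context can issue again */
    int issued;     /* instructions issued so far by this context */
} Context;

/* Returns the number of instructions issued in `cycles` cycles.
 * `period` must be at least 1. */
int simulate(int cycles, int period, int miss_every, int miss_latency)
{
    Context ctx[2] = { {0, 0}, {0, 0} };
    int active = 0;
    int total = 0;

    for (int t = 0; t < cycles; t++) {
        int other = 1 - active;

        /* A stalled context is only swapped out on a switching boundary,
         * and only if the other context is able to issue. */
        if (t % period == 0 && ctx[active].ready_at > t && ctx[other].ready_at <= t)
            active = other;

        if (ctx[active].ready_at <= t) {
            ctx[active].issued++;
            total++;
            /* Assumed miss pattern: every miss_every-th instruction misses. */
            if (miss_every > 0 && ctx[active].issued % miss_every == 0)
                ctx[active].ready_at = t + 1 + miss_latency;
        }
        /* Otherwise the single execution path stalls for this cycle. */
    }
    return total;
}
```

Under these assumptions, simulate(24, 6, 3, 4) issues 12 instructions while simulate(24, 1, 3, 4)
issues 21. Neither figure can exceed the cycle count, which reflects the limit of one instruction
per cycle that applies to both schemes.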
[Figure: cycle-by-cycle timeline (cycles 0 to 16) of hardware contexts HC0, HC1 and HC2, and
the system threads issued to functional units FU0 and FU1]
Figure 3-5 Simultaneous multi-threading illustrating how one processor can make use of
hardware context switching to execute multiple hardware contexts concurrently.
Modern superscalar microarchitectures with multiple functional units allow the processor
to execute multiple instructions in the same cycle, when allowed by dependences[9]. In
the example shown in Figure 3-5, there are two functional units (FU0 and FU1) each
capable of executing any of the available hardware contexts (HC0, HC1 or HC2). This
process of issuing instructions from more than one thread is known as simultaneous
multithreading (SMT) and is a particularly effective architectural design concept to
exploit TLP[10-14].
Another technique for executing instructions from multiple threads on every cycle is to
include multiple processors within the system. By arranging multiple processors in a
shared memory configuration, each processor is capable of independently executing any
thread of the application while accessing data from the shared memory space. Multiple
processors simultaneously executing within a shared memory system call for a means of
preserving cache consistency. Numerous cache coherence protocols have been designed
and can be classified as either directory-based [15-17], where a central directory of
allocated cache blocks is maintained, or snooping-based [18-20], where each cache
monitors access traffic for write operations across the system interconnection and, where
necessary, invalidates its own private copy. Optimisations such as snarfing allow the
monitoring cache to update its own copy of the data by directly tapping on to transactions
on the system interconnect[21].
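The invalidation step of a snooping scheme can be sketched as follows. This is a deliberately
reduced model under stated assumptions: a single cache line per processor, write-through to
shared memory, and only a valid flag as line state. It does not correspond to any specific
protocol from the cited literature, and the names SnoopBus, cache_read, cache_write and
NUM_CACHES are hypothetical.

```c
#include <stdbool.h>

#define NUM_CACHES 4   /* hypothetical number of processors */

/* One cache line per processor, tracked only by a valid flag and a value. */
typedef struct {
    bool valid[NUM_CACHES];
    int  value[NUM_CACHES];
    int  memory;            /* backing shared memory location */
} SnoopBus;

/* Read: a miss fetches the line from shared memory into the local cache. */
int cache_read(SnoopBus *bus, int cpu)
{
    if (!bus->valid[cpu]) {
        bus->value[cpu] = bus->memory;
        bus->valid[cpu] = true;
    }
    return bus->value[cpu];
}

/* Write-through with invalidation: every other cache observes the write on
 * the interconnect and drops its private copy. */
void cache_write(SnoopBus *bus, int cpu, int v)
{
    bus->memory = v;
    bus->value[cpu] = v;
    bus->valid[cpu] = true;
    for (int other = 0; other < NUM_CACHES; other++)
        if (other != cpu)
            bus->valid[other] = false;
}
```

A snarfing variant would, instead of clearing the valid flag, copy v into the other caches that
already hold the line.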
A more integrated form of the above multi-processor configuration is known as chip-level
multi-processing (CMP). In this configuration, rather than having separate processing
units in the computing system, the individual processors (CPU cores) are integrated in a
single high performance chip[22-25]. This configuration has several advantages over
multiple single-core systems due to the close physical proximity of the processing cores.
This allows for both traditional (bus-based) interconnects as well as modern architectures
that provide greater execution bandwidth, such as circuit-switched or packet-based
interconnect, either synchronous or asynchronous.
The above three organisations can also be combined to produce a multithreaded chip-level
multiprocessor (CMP-MT) and multithreaded multiprocessor (MP-MT). This is the
domain of the SS_SPARC ASIC processing engine, under development in the Electronic
System Design Group. With the hardware architectures capable of executing multiple
threads in parallel discussed, the remainder of this TLP section will examine the software
modifications required to implement parallel execution on these architectures.
3.2.1. Control-flow-graph partitioning
Having decided on a multi-processor architecture, the next stage in exploiting TLP is to
extract multiple, non-overlapping sections of the CFG (subject to lack of data
dependencies) and assign them to the distinct processor contexts. This can be done either
dynamically (at runtime) or statically during program creation. These two
methodologies are elaborated below.
3.2.1.1. Dynamic CFG partitioning
Dynamic partitioning of the control-flow graph (CFG) refers to the process of extracting
and distributing software threads to available CPU contexts during execution. The
challenge of dynamically partitioning the CFG is due to the unknown unfolding of the
instruction stream. Through a process known as thread speculation, a multi-processor
system attempts to predict the future threads within an instruction stream and execute
these speculatively on available CPU contexts. The sequential code is run in parallel
threads until a true dependency occurs. One such multi-processor based system that
executes dynamic thread partitioning is Stanford Hydra[26]. This system has a CMP
organisation containing four MIPS-based processors, with individual L1 and shared L2
caches on a single chip.
3.2.1.2. Static CFG partitioning
Static partitioning is the process of manually identifying regions for parallel execution
(by adding special instructions into the application source code) and indicating these
either directly to the hardware or to the operating system[27, 28]. Depending on which of
these mechanisms is being used, the options available to the programmer differ. Three
different approaches to static CFG partitioning are discussed in section 3.2.2, these being
the sim-system infrastructure based on the SimpleScalar toolset[29], OpenMP[30] and
Pthreads[31].
All of these mechanisms take the same underlying approach to parallelising the CFG.
There are two main types of code that are prime candidates for parallel implementation,
namely independent regions and loops. The former are specific, separate, unrelated
sections of code that can be executed in parallel as they have no data dependencies. Due
to their unrelated nature these regions rarely each take the same number of CPU cycles to
execute and this results in load imbalance in the multi-processor system. This, of course, is
not likely to be a problem if the processing node would otherwise be idle. The latter
involves the unrolling of high level functional loops that can be partitioned and
distributed across processors[32]. Since each node executes a different iteration of the
loop, the work balance of the system is likely to be reasonably uniform.
Serial loop:
for (x = 0; x < iWcount; x++)
{
    // perform motion estimation
}
Parallel loop:
iteration = iWcount / MAX_THREAD;
for (x_old = 0; x_old < iteration; x_old++)
{
    x = context + (MAX_THREAD * x_old);
    // perform motion estimation
}
Figure 3-6 Schematic diagram and corresponding code example taken from the multi-threaded
XviD encoder, to illustrate the transformation from a serial loop with a single
context executing to a parallel loop with MAX_THREAD contexts executing
concurrently.
Figure 3-6 illustrates loop unrolling, where each iteration of the original serial loop is
distributed to the available processors in the parallel environment. Using the example
code from the motion estimation loop in XviD, Figure 3-6, a four-step approach to
threading loops is undertaken.
As shown, the original serial loop contains iWcount iterations. By threading
this loop in a system with MAX_THREAD active CPUs, the loop is reduced to iWcount /
MAX_THREAD concurrent iterations.
iteration = iWcount / MAX_THREAD;
Figure 3-7 Reduction of parallel loop iterations required to allow MAX_THREAD CPUs to
execute the loop concurrently.
By dividing the maximum number of iterations in the serial loop by the number of
available nodes, the number of iterations in the parallel loop is equal to the quotient. Any
remainder in the calculation needs to be properly dealt with and this is discussed below.
Parallelising a loop directly affects the loop iteration variable. This variable identifies to
each parallel node the subset of the original serial loop iterations to execute. Since the
number of parallel loop iterations has been reduced, the parallel loop's counter (x_old in
Figure 3-6) cannot be used directly.
x = context + (MAX_THREAD * x_old);
Figure 3-8 Reconstruction of loop iteration counter based on the current loop iteration,
x_old, the maximum active CPUs, MAX_THREAD, and the context's private ID, context.
As seen from Figure 3-8, the value of x is now found by using both the individual node's
ID number, context, and the maximum number of active CPUs, MAX_THREAD, as well
as the current parallel loop's iterator, x_old.
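The effect of the index reconstruction can be checked with a short sketch. The function
below is an illustrative assumption (assign_iterations and the owner array are hypothetical
names): it emulates the MAX_THREAD contexts one after another, applies the expressions of
Figures 3-7 and 3-8, and records which context would execute each serial iteration x. In the real
system the outer loop does not exist, since every context runs the inner loop concurrently with
its own value of context.

```c
#define MAX_THREAD 4   /* number of active CPUs; the value 4 is an assumption */

/* Records in owner[x] the context that executes serial iteration x.
 * Returns the number of serial iterations covered by the parallel loop. */
int assign_iterations(int iWcount, int *owner)
{
    int iteration = iWcount / MAX_THREAD;             /* Figure 3-7 */
    int covered = 0;

    /* Sequential emulation of the contexts. */
    for (int context = 0; context < MAX_THREAD; context++) {
        for (int x_old = 0; x_old < iteration; x_old++) {
            int x = context + (MAX_THREAD * x_old);   /* Figure 3-8 */
            owner[x] = context;
            covered++;
        }
    }
    return covered;
}
```

With iWcount = 10 and four contexts, the parallel loop covers iterations 0 to 7, interleaved so
that context 1 receives iterations 1 and 5. Iterations 8 and 9 are left over for the remainder
handling.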
So far, parallelising the loop has concentrated on loop manipulation and not on the actual
functionality within the loop. The next stage is to examine the memory access patterns
within the loop, considering both reads and writes, and to perform the necessary
modifications to ensure memory exclusivity between processing nodes. In a distributed
memory architecture, initially it may seem that exclusivity is automatically preserved as
each node will always work on its own private data. This is indeed the case when data
from each node is processed only in that one node; however, when data are transferred to
another node, exclusivity issues arise. In a shared memory architecture these issues are
always present. When statically partitioning the CFG, each variable or memory location
that is altered during the execution of a loop needs special attention to avoid more than
one node writing data back to the same location. This can be done through the use of
private variables, limiting the execution of code sections to specific nodes only and by
using semaphores[21]. A private variable is one in the shared memory space that only one
node can access. For example, x in Figure 3-8 would be declared as a private variable,
allowing each node to have a different value for x but still be referenced using a common
variable name. Node selection code is used to specify those sections that are to be
executed by a single or by a specific number of nodes. This can limit the number of nodes
executing certain functionality or allow a single processing node to execute an exclusive
code section. Finally, by dividing each iteration into sections and separating them with
barrier instructions, a form of functional pipelining is achieved that ensures variables
written in one section/pipeline will be available in the next.
shared array[MAX_THREAD];
Figure 3-9 Pseudo declaration of shared memory array. Each array is declared in shared
(static) memory space and contains one element for each active CPU in the system
(MAX_THREAD).
To maintain variable access exclusivity within a given section, shared memory arrays are
used. These are standard arrays declared in shared memory with the number of elements
equal to the number of CPUs in the system. For example, in Figure 3-9, the array will be
allocated in shared memory space. This array has two particular properties that allow it to
act as both a shared and a private variable. By declaring the array with an element
allocated to each of the nodes, each has exclusive write access to an array element.
Similarly, as the array is in shared memory (static) space, each node has read access to all
elements of the array.
variable[context] = 10 + i         // parallel computation
barrier                            // context synchronisation
if context ID = 1                  // serial
    loop x from 0 to MAX_THREAD    // only context 1 executes
        total += variable[x]       // this section
barrier                            // sync with other contexts
Figure 3-10 Pseudo serial loop illustrating the use of shared memory array. Initially, each
context writes data to its private element within the array, demonstrating the exclusive write
properties of the array. Secondly, within a serial loop, a single context carries out read
operations, demonstrating shared access to data.
Figure 3-10 illustrates a possible configuration where the use of shared memory arrays
could be advantageous. Here a serial cumulative operation is performed after the parallel
computations, which end at the first barrier. By inserting barriers between the parallel
and serial sections, all parallel writes to the array are guaranteed to complete before the
serial reads.
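The access pattern of Figure 3-10 can be sketched in plain C. The code below is a
single-threaded emulation under stated assumptions, not the multi-threaded implementation
itself: the contexts are run one after another, so completing the first loop before calling
serial_section() stands in for the barrier. The function names and the choice of i equal to the
context ID are hypothetical.

```c
#define MAX_THREAD 4   /* number of active CPUs; the value 4 is an assumption */

/* Static storage: one element per context, readable by every context. */
static int variable[MAX_THREAD];

/* Parallel section: each context writes only its own element. */
void parallel_section(int context, int i)
{
    variable[context] = 10 + i;
}

/* Serial section: a single context reads every element. */
int serial_section(void)
{
    int total = 0;
    for (int x = 0; x < MAX_THREAD; x++)
        total += variable[x];
    return total;
}

/* Sequential emulation: the end of the first loop plays the role of the
 * barrier that separates the parallel writes from the serial reads. */
int run_all_contexts(void)
{
    for (int context = 0; context < MAX_THREAD; context++)
        parallel_section(context, context);   /* i = context, an arbitrary choice */
    return serial_section();
}
```

Because no two contexts share an array element, the writes need no lock. Only the ordering
between the write phase and the read phase has to be enforced.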
The division in Figure 3-7 may produce a remainder. Since the parallel loop only
executes with a full set of nodes, extra remainder code (strip mining) is needed to process
these remaining serial iterations.
if (context < (iWcount % MAX_THREAD))
{
    // remainder code
}
Figure 3-11 Remainder (strip mining) code selection. Here an 'if' statement is used to select
individual contexts to execute the remainder code based on their context ID.
Figure 3-11 illustrates one method to select nodes to run the remainder code based on the
node ID. Using the modulus (%) operator the remainder from Figure 3-7 is calculated.
The code in the remainder section includes the same operations as those in the
parallel loop and with the same provision for exclusivity.
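A short sketch can confirm that the parallel loop and the remainder code together cover
every serial iteration exactly once. The function run_partitioned and the hits array are
hypothetical names, the contexts are emulated sequentially, and the index used inside the
remainder branch is an assumption, since the text gives the selection test but not the
iteration that each selected context processes.

```c
#define MAX_THREAD 4   /* number of active CPUs; the value 4 is an assumption */

/* Increments hits[x] each time serial iteration x is executed. A correct
 * partition leaves every entry in [0, iWcount) equal to one. */
void run_partitioned(int iWcount, int *hits)
{
    int iteration = iWcount / MAX_THREAD;

    /* Sequential emulation of the contexts. */
    for (int context = 0; context < MAX_THREAD; context++) {
        /* Parallel loop, as in Figure 3-6. */
        for (int x_old = 0; x_old < iteration; x_old++)
            hits[context + (MAX_THREAD * x_old)]++;

        /* Remainder code, selected as in Figure 3-11. The index below is an
         * assumption: the first iteration past the parallel loop, offset by
         * the context ID. */
        if (context < (iWcount % MAX_THREAD))
            hits[context + (MAX_THREAD * iteration)]++;
    }
}
```

For iWcount = 10 and four contexts, the quotient is 2 and the remainder is 2, so contexts 0 and 1
each execute one extra iteration (8 and 9 respectively) while contexts 2 and 3 skip the branch.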
3.2.2. TLP exploitation environments 
In this section three different environments will be examined and their strengths and
weaknesses in allowing source code manipulation for TLP exploitation identified. These
three tools are: the SimpleScalar toolset, OpenMP and POSIX threads (Pthreads).
3.2.2.1. SimpleScalar
The SimpleScalar toolset is an open-source computer architecture research toolset. The
toolset provides the full range of compilers, linkers, assemblers, libraries, utilities and
simulators for the virtual SimpleScalar architecture. The SimpleScalar architecture is
known as the portable instruction set architecture (PISA), which in turn is based on the
MIPS and DLX ISAs. PISA has certain features additional to MIPS, including the
extension of the opcode from 32 to 64 bits, allowing greater experimentation freedom for
the purpose of ISA research. The toolset provides five different simulation packages
permitting the exploration of trade-offs between execution speed and system detail. At the
two extremes are the simulators known as sim-fast and sim-outoforder. As its name
suggests, sim-fast has the fastest execution speed of the two, running at more than nine
million instructions per second (MIPS) on a modern x86 Linux-based workstation. To
achieve this, all notion of timing has been removed and only modelling at the instruction
level of abstraction is carried out. At the other extreme, sim-outoforder executes at the
cycle-accurate level, allowing customisation of many aspects of the processor's
microarchitecture and cache hierarchy, including out-of-order (OoO) instruction issuing,
branch prediction and multiple functional units.
The original SimpleScalar toolset was designed to work as a uni-processor system and, in
order to emulate a parallel RAM (PRAM) system, the sim-fast simulator needed
substantial modification to realise the multi-processor system (sim-system)[33] and to
provide the framework for user customisation of registers and instructions for each
processor. Sim-system can thus be classified as a configurable, extensible PRAM.
typedef struct {
    sword_t hi, lo;      /* multiplier HI/LO result registers */
    int fcc;             /* floating point condition codes */
    unsigned int PRID;   /* Processor ID register */
    int PSTATER;         /* Processor state */
} md_ctrl_t;
Figure 3-12 Sim-system's C definition of the control registers declared for each simulated
CPU.
Figure 3-12 illustrates the extended processor state of each context of the sim-system
PRAM. The PISA architecture state has been extended with the addition of two new
registers, PRID and PSTATER: the (private) processor ID number, which is a unique
number used to identify the processor, and the processor state, which can be either RUN or
SLEEP. When in the SLEEP state, the processor does not execute any instructions until an
external authority returns it to the RUN state.
#define GET_PRIDR(var) \
({ \
    asm volatile (".word 0x00010000"); \
    asm volatile (".word \
        1 << 29 |   /* EXT_OPCODE */ \
        2 << 25 |   /* CATEGORY */ \
        15 << 20 |  /* OPCODE */ \
        12 << 15" ::: "$12"); \
    asm volatile ("addu %0, $12, $0" : "=r" (var)); \
});
Figure 3-13 Assembly macro used to obtain the processor ID number of the calling context.
To allow internal access to the extended PRAM state, assembly macros, as shown in
Figure 3-13, and statements such as those shown in Figure 3-14, are provided. In the code
in Figure 3-13, the macro GET_PRIDR() is accessible to all PRAM contexts and sets the
integer variable passed into it to the context number from which it was executed. The first
two assembly instructions in the macro specify the extended instruction under the NOP
opcode annotation.
SET_GPR(RD_ADDR, (xcregs.xcregs[context].regs_C.PRID));
Figure 3-14 Low-level implementation of the instruction that copies the processor ID value
stored in the CPU's control registers to a general purpose register RD_ADDR.
Figure 3-14 indicates the low-level implementation of the instruction that transfers the
value of the ID register to a general purpose register (GPR).
As with the processor ID, the process of setting the processor state register takes place
both within the simulator and from the (programmer-visible) application source
executing on the simulator.
#define CSTATE(context, state) \
({ \
    asm volatile ("addu $10,%0,$0" :: "r" (context) : "$10"); \
    asm volatile (".word 0x00010000"); \
    asm volatile (".word \
        4 << 29 |   /* EXT_OPCODE */ \
        2 << 25 |   /* CATEGORY */ \
        15 << 20 |  /* OPCODE */ \
        10 << 15 | \
        " #state " << 10" ::: "$10"); \
});
Figure 3-15 Assembly macro used to set the processor's run state.
Figure 3-15 shows an assembly macro which takes as inputs both the processor ID 
(context) and the desired new processor state. Being able to specify both the processor ID 
and desired state, it is possible for one processor to control the state of another. Here the 
context value is passed to register $10 using the addu instruction along with the register 
zero before performing the actual assignment. 
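The instruction words that the assembler builds from the shift expressions in Figures 3-13 and
3-15 can be worked out with a small helper. The function encode_ext below is a hypothetical
illustration and not part of sim-system. The field names follow the comments in the macros
(EXT_OPCODE, CATEGORY, OPCODE); the register field at bit 15 and the state field at bit 10
follow the shifts shown there. The field widths are not stated in the text, so no masking is
attempted.

```c
#include <stdint.h>

/* Hypothetical helper: assembles the extended-instruction word from its
 * fields using the shift amounts shown in Figures 3-13 and 3-15. */
uint32_t encode_ext(uint32_t ext_opcode, uint32_t category,
                    uint32_t opcode, uint32_t reg, uint32_t state)
{
    return (ext_opcode << 29) | (category << 25) | (opcode << 20)
         | (reg << 15) | (state << 10);
}
```

With these shifts, the word emitted by GET_PRIDR, which has no state field, is
encode_ext(1, 2, 15, 12, 0) = 0x24F60000, and the word emitted by CSTATE with a state value
of 1 is encode_ext(4, 2, 15, 10, 1) = 0x84F50400.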
xcregs.xcregs[GPR(RD_ADDR)].regs_C.PSTATER = RS1_ADDR;
Figure 3-16 Instruction to set the processor state flag within the control register to the value
located within the general purpose register RS1_ADDR.
The code in Figure 3-16 sets the PSTATER register with the value of the state passed
using RS1_ADDR. The use of this macro can be seen when examining part of the barrier
mechanism, as shown in Figure 3-17.
#define BARRIER1_SLEEP \
({ \
    extern int gsem; \
    gsem++; \
    if (!context) \
        while (gsem < XC_MAX) \
            ; \
    if (!context) \
        gsem = 0; \
    if (context) \
        CSTATE(context,1); \
});
Figure 3-17 C macro implementing the sleep stage of the barrier instruction. Entering
contexts perform an atomic addition to the global semaphore, gsem, before either entering a
sleep state (context>0) or entering an empty loop controlled by the value of the semaphore.
The barrier process is in two parts: the sleep stage and the synchronous release stage.
First, as each context reaches the barrier, it enters a sleep state, except context zero, which
is isolated from the others and is used as a master context to control the state of the other
slave contexts while in the barrier. Once all contexts have entered the barrier and are in the
sleep state, a synchronous awakening of each context takes place and all contexts are
released from the barrier concurrently. As can be seen from Figure 3-17, the first section
in this macro is responsible for incrementing the global semaphore gsem. This
semaphore can only be accessed in atomic operations, thus the increment operation is also
performed atomically and each write by the individual threads is exclusive, ensured by
the memory system serialisation and coherency logic. After this initial atomic
incrementing of the semaphore, the remaining operations in the macro are split between
the master and slave contexts. The slave contexts, by using the CSTATE macro, are put
into the sleep state, whereas the master context is placed in an empty while loop, being
released only when all slave contexts have incremented the semaphore, whereupon it
resets the semaphore.
#define START_ALL \
({ \
    int i; \
    if (context == 0) \
        for (i=1; i<XC_MAX; i++) \
            CSTATE(i,0) \
});
Figure 3-18 C macro implementing the synchronous release stage of the barrier instruction.
Here context zero cycles through all sleeping contexts and systematically alters their run
state to RUN.
The START_ALL macro is responsible for awakening the slave contexts from the sleep
state, as shown in Figure 3-18. This is achieved by the master context (context zero)
entering a loop that calls CSTATE for each processor.
#define BARRIER BARRIER1_SLEEP;START_ALL
Figure 3-19 C macro defining the completed barrier instruction, comprising the sleep and
synchronous release stages.
The barrier macro is the combination of the above macros, as shown in Figure 3-19. The
macro sends all slave contexts to sleep and holds the master in a while loop. When all
contexts have entered the barrier, the master is released and it enters START_ALL, where it
awakens the slave contexts.
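The two-stage behaviour of the barrier can be mimicked on a conventional host with POSIX
threads. The sketch below is not the sim-system implementation: CSTATE has no host
equivalent, so the sleep state is modelled with a condition variable, and the names barrier,
generation, arrived, wake, worker and run_barrier_demo, as well as the value chosen for
XC_MAX, are assumptions made for illustration.

```c
#include <pthread.h>

#define XC_MAX 4                 /* assumed number of contexts */

static pthread_mutex_t lock    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  arrived = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  wake    = PTHREAD_COND_INITIALIZER;
static int gsem       = 0;       /* contexts that have reached the barrier */
static int generation = 0;       /* incremented once per release */

static int partial[XC_MAX];      /* written before the barrier */
static int total[XC_MAX];        /* written after the barrier */

static void barrier(int context)
{
    pthread_mutex_lock(&lock);
    int my_generation = generation;
    gsem++;                                  /* count this arrival */
    if (context == 0) {
        /* Master: wait for every context, reset the count, release all. */
        while (gsem < XC_MAX)
            pthread_cond_wait(&arrived, &lock);
        gsem = 0;
        generation++;
        pthread_cond_broadcast(&wake);
    } else {
        /* Slave: tell the master if it is the last arrival, then sleep. */
        if (gsem == XC_MAX)
            pthread_cond_signal(&arrived);
        while (generation == my_generation)
            pthread_cond_wait(&wake, &lock);
    }
    pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
    int context = *(int *)arg;
    partial[context] = context + 1;          /* stage before the barrier */
    barrier(context);
    int sum = 0;                             /* stage after the barrier */
    for (int i = 0; i < XC_MAX; i++)
        sum += partial[i];
    total[context] = sum;
    return NULL;
}

/* Returns 1 if every context saw all pre-barrier results, 0 otherwise. */
int run_barrier_demo(void)
{
    pthread_t threads[XC_MAX];
    int ids[XC_MAX];
    for (int i = 0; i < XC_MAX; i++) {
        ids[i] = i;
        if (pthread_create(&threads[i], NULL, worker, &ids[i]) != 0)
            return 0;
    }
    for (int i = 0; i < XC_MAX; i++)
        pthread_join(threads[i], NULL);
    for (int i = 0; i < XC_MAX; i++)
        if (total[i] != XC_MAX * (XC_MAX + 1) / 2)
            return 0;
    return 1;
}
```

As in the macros, context zero alone resets the semaphore and performs the release, while
the generation counter lets the same barrier be reused without a late sleeper missing its
wake-up.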
int a;          // private variable
static int b;   // shared variable
Figure 3-20 Declaration of shared and private variables through the use of the static 
statement. 
The final task required of the programmer is to specify, during the variable declaration
stage, whether a variable is to be created as shared or private, as shown in Figure 3-20.
When a variable is declared as private, each processor has its own unique instance,
allowing variables of the same name to have different values on individual
processors. This type of variable creates no race issues, as each instance resides in a
separate location of the shared memory. Shared variables, declared as static, are
fully accessible by all processors, and care is required when multiple contexts
attempt to write to the same variable.
3.2.2.2. OpenMP
OpenMP is an Application Programme Interface (API) that consists of runtime libraries,
compiler directives and environment variables to allow the software developer to
emulate the run-time behaviour of a multi-processor, shared memory system. OpenMP is
accessed by special pre-processor instructions in C or Fortran source code that specify
and manipulate the parallel threads in the programme[34].
[Diagram: the master thread forks worker threads at #pragma omp parallel {, the threads
execute the parallel workload, and the workers then join back into the master thread.]
Figure 3-21 Creation of parallel threads using fork and joins present in the OpenMP API.
Unlike the SimpleScalar-based model, in which all threads of the application are present
in either an active (run) or suspended (sleep) state, OpenMP creates additional threads as
required to execute user-specified parallel regions. Figure 3-21 depicts how one master
thread operates in the serial sections of the code and generates (forks) additional threads
when a parallel section is reached. This parallel section of code is executed by all created
threads, and on completion the additional threads are destroyed (joined), leaving the
master thread to continue the (serial) programme flow.
#pragma omp parallel private(var1, var2) shared(var3)
{
    // Parallel section executed by all threads
}
Figure 3-22 OpenMP API syntax for thread creation and parallel operations.
Figure 3-22 shows the definition of the forking and joining processes in the OpenMP API
syntax. The parallel section is declared using the #pragma omp parallel construct,
in which the master thread generates a number of threads to execute in parallel the code
section enclosed by the braces that follow. In the example in Figure 3-22 it can be seen
how extra keywords (private, shared) are used to define the scope of the variables of the
parallel section.
tid = omp_get_thread_num(); 
nthreads = omp_get_num_threads(); 
Figure 3-23 Built-in OpenMP functions responsible for determining thread ID and total
number of threads in the system respectively.
Two of the most important functions in the OpenMP API are shown in
Figure 3-23: those obtaining the individual thread ID number and the total number of
threads present in the system, respectively[35].
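The scoping clauses of Figure 3-22 and the two query functions of Figure 3-23 can be
combined in one small routine. This is a minimal sketch rather than code from the thesis:
the function name count_team_members, the array seen and the request for four threads are
assumptions, and the _OPENMP guards only let the same file build serially when the compiler
is not run with OpenMP enabled.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define MAX_THREADS 64   /* assumed upper bound on the team size */

/* Returns the number of threads that executed the parallel region,
   or -1 if that count disagrees with omp_get_num_threads(). */
int count_team_members(void)
{
    int seen[MAX_THREADS] = {0};  /* shared: one slot per thread ID */
    int nthreads = 1;             /* shared: written by thread 0 only */
    int tid = 0;                  /* private: one copy per thread */

    #pragma omp parallel private(tid) shared(seen, nthreads) num_threads(4)
    {
#ifdef _OPENMP
        tid = omp_get_thread_num();
        if (tid == 0)
            nthreads = omp_get_num_threads();
#else
        tid = 0;                  /* serial build: a single thread */
#endif
        if (tid < MAX_THREADS)
            seen[tid] = 1;        /* each thread marks only its own slot */
    }

    int count = 0;
    for (int i = 0; i < MAX_THREADS; i++)
        count += seen[i];
    return count == nthreads ? count : -1;
}
```

Because tid is private, the threads cannot overwrite each other's ID, whereas seen and
nthreads are shared and are therefore only ever written at locations owned by one thread.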
#pragma omp barrier 
Figure 3-24 Syntax of the barrier instruction in OpenMP responsible for processor
synchronisation.
OpenMP operates a barrier mechanism for synchronising threads across parallel sections.
Using the directive in Figure 3-24, each thread waits at the barrier location until all
threads are present, at which point the barrier releases all the threads in parallel. These
commands allow the static partitioning of program flow for multi-processors[27].
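A barrier is typically placed between two stages where the second stage reads what every
thread wrote in the first. The following sketch illustrates that pattern under stated
assumptions: the names barrier_demo, stage1, stage2 and TEAM are hypothetical, and the
_OPENMP guards allow a serial build in which the directive is simply ignored.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define TEAM 4   /* requested team size, an assumption for this sketch */

/* Returns 1 if every thread saw all first-stage results after the barrier. */
int barrier_demo(void)
{
    int stage1[TEAM] = {0};
    int stage2[TEAM] = {0};
    int nthreads = 1;

    #pragma omp parallel num_threads(TEAM) shared(stage1, stage2, nthreads)
    {
        int tid = 0;              /* declared inside the region: private */
#ifdef _OPENMP
        tid = omp_get_thread_num();
        if (tid == 0)
            nthreads = omp_get_num_threads();
#endif
        /* First stage: each thread publishes one value. */
        stage1[tid] = tid + 1;

        /* No thread proceeds until all values have been published. */
        #pragma omp barrier

        /* Second stage: each thread reads every published value. */
        int sum = 0;
        for (int i = 0; i < nthreads; i++)
            sum += stage1[i];
        stage2[tid] = sum;
    }

    for (int i = 0; i < nthreads; i++)
        if (stage2[i] != nthreads * (nthreads + 1) / 2)
            return 0;
    return 1;
}
```

Without the barrier a fast thread could read stage1 entries that slower threads have not
yet written, which is the race the directive is there to prevent.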
3.2.2.3. POSIX Threads 
MT applications in a shared memory environment can also be implemented in POSIX
Threads. POSIX (portable operating system interface for Unix) is the colloquial name
given to the collection of documents of the IEEE 1003[31] and ISO/IEC 9945[28]
standards. Although the standards were designed for use on UNIX based operating
systems (OSs), POSIX has been implemented on other OSs including Microsoft
Windows NT. The POSIX standard specifies an API with methods and their interaction
with a POSIX compatible OS. Section 1 of the standard, IEEE 1003.1[36], specifies the
facilities that must be available in the OS to manage and control threads. These POSIX
threads or Pthreads, like OpenMP, can be accessed through an API that can be used by an
application's source code to control the creation and operation of threads.
[Diagram: the master thread calls pthread_create() once per worker thread; each worker
thread executes its parallel workload, calls pthread_exit() and is collected by the master
thread through pthread_join().]
Figure 3-25 POSIX thread creation and destruction using the pthread_create() and
pthread_exit() functions.
As can be seen from Figure 3-25, the method of thread creation is on a thread-by-thread
basis. Once spawned (pthread_create()), these threads execute a given function
and, on completion, exit (pthread_exit()) and rejoin the master thread.
pthread_t threads[NUM_THREADS];
for (t = 0; t < NUM_THREADS; t++) {
    pthread_create(&threads[t], NULL, function, function_arguments);
}
Figure 3-26 Illustration of thread creation through repeated calls of the pthread_create()
function using a different thread handle, threads[], for each call.
The syntax for thread creation is illustrated in Figure 3-26, where NUM_THREADS
threads (pthread_t threads[]) are declared by the master thread. The thread
handles are used to identify each thread created using the pthread_create()
command, which has four arguments: the thread handle, the thread attributes, the function
that the thread is to commence executing in parallel and any arguments required for the
function. It is interesting to note that both OpenMP and Pthreads have function-level
granularity whereas SimpleScalar operates at individual C statement granularity, allowing
much finer TLP specification and exploitation control.
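The creation loop of Figure 3-26 becomes a complete program once a thread function and a
matching join loop are supplied. The sketch below is one possible completion, not the
thesis code: the struct job, the function square and the wrapper sum_of_squares are
hypothetical names chosen for the example.

```c
#include <pthread.h>

#define NUM_THREADS 4

/* Hypothetical per-thread argument block: one input and one result slot. */
struct job {
    int input;
    int output;
};

/* Thread function: the signature is fixed by pthread_create(). */
static void *square(void *arg)
{
    struct job *j = arg;
    j->output = j->input * j->input;
    return NULL;   /* returning has the same effect as pthread_exit(NULL) */
}

/* Spawns NUM_THREADS workers, waits for them and sums their results.
   Returns -1 if a thread could not be created or joined. */
int sum_of_squares(void)
{
    pthread_t threads[NUM_THREADS];
    struct job jobs[NUM_THREADS];
    int sum = 0;

    for (int t = 0; t < NUM_THREADS; t++) {
        jobs[t].input = t + 1;
        jobs[t].output = 0;
        /* Arguments: handle, attributes (NULL = defaults), function, argument. */
        if (pthread_create(&threads[t], NULL, square, &jobs[t]) != 0)
            return -1;
    }
    for (int t = 0; t < NUM_THREADS; t++) {
        if (pthread_join(threads[t], NULL) != 0)
            return -1;
        sum += jobs[t].output;   /* safe to read once the thread has joined */
    }
    return sum;
}
```

Each worker receives a pointer to its own job structure, so no locking is needed here; the
results are only read by the master after the corresponding pthread_join() has returned.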
pthread_mutex_t mymutex;
pthread_mutex_init(&mymutex, NULL);
Figure 3-27 Creation and initialisation of a mutually exclusive (mutex) variable in Pthreads.
Within the Pthread standard, a special variable is defined that is used for locking
resources to a given thread and which can also be used for synchronisation of threads.
Mutually exclusive variables act as locks, permitting only one thread to obtain the lock at
any one time. Once the owning thread unlocks the mutex, another thread can then lock it
and take control of the specific resource being controlled by that mutex. Before mutex
variables can be used they must be declared and initialised, as illustrated in Figure 3-27.
pthread_mutex_lock (&mymutex); 
global_sum += thread_sum; 
pthread_mutex_unlock (&mymutex); 
Figure 3-28 Illustration of the locking mechanism in Pthreads through the use of
pthread_mutex_lock() and pthread_mutex_unlock() to lock and unlock the mutex
respectively.
By using the initialised mutex from Figure 3-27, Figure 3-28 shows how this lock can be
used to form a critical section of code and preserve exclusivity of specific resources of the
system. As shown for sim-system in Figure 3-17, a similar technique of lock acquisition
can be implemented by the barrier mechanism.
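The critical section of Figure 3-28 can be exercised with a small reduction in which several
threads add private partial sums into one shared total. This is a sketch under assumptions:
the names partial_sum and locked_sum, the thread count and the range 1..N are chosen for
illustration only.

```c
#include <pthread.h>

#define NUM_THREADS 4
#define N 1000                    /* the values 1..N are summed */

static pthread_mutex_t mymutex;
static long global_sum;           /* shared resource guarded by mymutex */

static void *partial_sum(void *arg)
{
    int id = *(int *)arg;
    long thread_sum = 0;          /* private to this thread */

    /* Thread id handles the values id+1, id+1+NUM_THREADS, ... */
    for (int i = id + 1; i <= N; i += NUM_THREADS)
        thread_sum += i;

    pthread_mutex_lock(&mymutex);     /* enter the critical section */
    global_sum += thread_sum;
    pthread_mutex_unlock(&mymutex);   /* leave the critical section */
    return NULL;
}

/* Returns the sum 1 + 2 + ... + N computed by NUM_THREADS threads,
   or -1 if a thread could not be created. */
long locked_sum(void)
{
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];

    global_sum = 0;
    pthread_mutex_init(&mymutex, NULL);
    for (int t = 0; t < NUM_THREADS; t++) {
        ids[t] = t;
        if (pthread_create(&threads[t], NULL, partial_sum, &ids[t]) != 0)
            return -1;
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    pthread_mutex_destroy(&mymutex);
    return global_sum;
}
```

Accumulating into the private thread_sum first keeps the locked region down to a single
addition per thread, which limits the time other threads spend waiting for the mutex.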
3.2.3. TLP and video processing 
When considering the use of TLP for video encoding, one very important aspect of the
implementation is the granularity at which TLP is applied. The level at which the codec is
threaded directly impacts the underlying silicon architecture and the efficiency and
real-time capability of the combined software/hardware solution. As discussed in chapter 2,
compressed video streams are defined in a hierarchical manner. For MPEG-based
compression these consist of groups of pictures (GOPs), frames, macroblocks (MBs) and
blocks. Each of these describes the video at an ever finer granularity.
3.2.3.1. Coarse grain granularity
A number of researchers have carried out investigations into parallel systems for either
video encoding or decoding. The majority of this research has focused on
coarse-granularity TLP exploitation, with the distribution of the workload most commonly being
carried out at the GOP level. GOPs form control- and data-independent sections, thus
eliminating data dependencies between processing nodes. Little inter-node
communication is needed at this level. As minimal inter-node communication is
preferable for implementation, this level of abstraction is particularly well suited to
implementation in distributed systems. Shen, Rowe and Delp[37] implemented a version
of the Berkeley parallel MPEG-1 encoder[38] at the GOP level, using master-server nodes to
control the transfer of data both to and from computational nodes. Bozoki, Westen,
Lagendijk and Biemond[39] performed a comparative study of distributed
implementations of MPEG-1 operating at both GOP and slice levels and showed that, due
to the dominance of communication traffic over the network, the finer the granularity the
less effective the distributed schemes became.
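Because GOPs are independent, distributing them statically reduces to splitting an index
range. The helper below is a hypothetical sketch of one such block distribution; it is not
the scheme used in the cited implementations, and the name gop_range and its parameters are
assumptions.

```c
/* Hypothetical helper: block-distributes num_gops GOPs over num_nodes
   processing nodes. On return, *first is the index of the first GOP given
   to `node` and *count is how many consecutive GOPs it receives. When the
   division is uneven, the lowest-numbered nodes take one extra GOP each. */
void gop_range(int num_gops, int num_nodes, int node, int *first, int *count)
{
    int base  = num_gops / num_nodes;   /* GOPs every node receives */
    int extra = num_gops % num_nodes;   /* leftover GOPs to hand out */

    *count = base + (node < extra ? 1 : 0);
    *first = node * base + (node < extra ? node : extra);
}
```

For ten GOPs on four nodes this gives the ranges 0-2, 3-5, 6-7 and 8-9, so each node can
encode its GOPs without exchanging data with the others until the output is collected.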
3.2.3.2. Fine grain granularity
Bilas, Fritts and Singh[40] evaluated the performance of both GOP-level and slice-level
parallel MPEG-2 decoders using a shared memory system and tightly coupled processors.
They found that, at both levels, the shared memory network produced a near-linear
speedup with the number of processor nodes, but each addition of a further processor
resulted in a threefold increase in the memory requirements. There were two principal
drawbacks with the GOP approach compared with the slice-level implementation: firstly,
load imbalance and random-access issues arose when the GOPs were relatively large, and
secondly the memory requirements were greater for the GOPs since the total number of
frames that need to be stored in memory is larger by a factor equal to the number of
distributed processors. Akramullah, Ahmad and Liou[41] investigated MPEG-2 encoding
using the massively parallel processing Intel Paragon XP system, a distributed memory
system with a highly scalable number of processors. This implementation parallelised
the encoder at the macroblock (MB) level, and inter-processor communication was
reduced by allowing each processor to have a local memory copy of all image data that
may be needed for motion estimation (ME).
3.3. Data-Level Parallelism 
Data-level parallelism (DLP) focuses on the parallel execution of arithmetic and data
movement operations performed on the data used by a programme. Using Flynn's
taxonomy, DLP is exploited by SIMD systems. Unlike TLP capable machines which
issue multiple instructions per cycle, in a DLP system a single instruction is issued per
cycle but this instruction operates on multiple data sets. This ability to process a large
quantity of data in a single cycle is often achieved by implementing special vector
instructions and architectures. This is of particular interest to the multimedia application
domain due to the large number of data values that need to be processed in the same or in
a very similar manner. It is clear that multimedia applications are well suited to efficient
implementation on SIMD architectures. In the remainder of this section, SIMD techniques
adopted by microprocessor vendors and research institutions are examined, illustrating the
design approaches developed to exploit parallelism at the data-level.
3.3.1. SS_SPARC ASIC Processing Platform
The SS_SPARC processing platform has been uniquely architected to exploit
instruction-level, data-level and thread-level parallelism[42], achieved through careful design of the
kernel, the processor and the vector unit.
[Diagram labels: configurable number of SS_SPARC SMT cores; streaming standalone
accelerators; banked L2 cache; one system memory port.]
Figure 3-29 SS_SPARC kernel demonstrating the multiple streaming vector accelerators. 
Figure 3-29 is a high-level view of a generic SS_SPARC processing kernel, which
consists of a configurable number of superscalar SMT processor cores (themselves highly
parameterized), a configurable number of loosely-coupled, streaming coprocessors
(accelerators), a switch matrix and a multi-banked, level-2 data cache which
communicates with the remainder of the SoC via the (embedded) DRAM interface.
The memory interface was initially designed to implement ARM's AHB bus protocol[43]; a
generic, transaction-level pipelined memory interface is also available and can be easily
adapted to connect to other bus protocols such as ARM's next-generation AXI standard[44].
The CPUs follow a shared memory programming model and full cache-coherency is supported
across the internal, level 1 data caches of the processors and the banked level 2 data cache
via a simple, directory-based, 3-state modified/exclusive/shared (MESI) protocol[45].
The switch matrix is segmented into separate coherent and non-coherent 'channels',
assigned to the processors and to the streaming accelerators respectively. At this level, the
platform can exploit TLP via distinct software threads running in different CPUs. The
core processor is a highly-parameterized, five wide, in-order issue, out-of-order commit
(with in-order exception resolution), multi-threaded (SMT), SPARC V8-compliant
microarchitecture.
Figure 3-30 Detailed schematic of the SS_SPARC super-scalar and vector datapaths.
As shown in Figure 3-30, the CPU is segmented into four major sections: the processor
instruction front-end (IFE), the core datapaths for scalar operations (SCORE), the
configurable, extensible vector unit (VCORE) and the high bandwidth load/store unit
(LSU). VCORE is responsible for the extensive DSP capabilities of the CPU and is the
primary mechanism provided for exploiting DLP. It consists of a single-issue,
configurable vector datapath, supported by an architected vector register file. VCORE is
uniquely designed to connect to arbitrary, 'plug-in' datapaths at RTL level, and exposes a
highly consistent yet straightforward interface to the external system designer, as depicted
in Figure 3-31.
Figure 3-31 VCORE vector datapath of the SS_SPARC designed to extend the DSP 
functionality of the CPU. 
As illustrated in Figure 3-31, the vector datapath of the SS_SPARC consists of three
distinct sections. VCORE provides a number of architectural states, one per SMT thread,
containing the vector register files (VRF) and bypass logic for each. Despatch from these
multiple architectural states to the single issue vector execution stage is governed by the
thread selection logic (CCU), which, using a specified arbitration algorithm, selects
between available (no register or resource dependencies) threads. The logic engineered
into the vector execution stage is not specified by SS_SPARC but instead four interface
ports are defined: despatch, bypass, load/store unit (LSU) return path and write-back. By
specifying only the interface, the SS_SPARC architecture allows a highly customised
and targeted vector ISA to be implemented in order to maximise performance per unit
area as a result of exploiting DLP.
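The text does not say which arbitration algorithm the thread selection logic uses. Purely
as an illustration, the sketch below models a round-robin policy, which is an assumption and
not a description of the CCU; the function name select_thread and the ready-mask encoding
are likewise hypothetical.

```c
#include <stdint.h>

/* Illustrative round-robin arbiter (assumed policy, not the CCU's).
   ready_mask has bit i set when SMT thread i has no outstanding register
   or resource dependency; last is the thread selected previously.
   Returns the next ready thread after `last`, wrapping around, or -1 if
   no thread is ready. num_threads must be between 1 and 32. */
int select_thread(uint32_t ready_mask, int last, int num_threads)
{
    for (int step = 1; step <= num_threads; step++) {
        int candidate = (last + step) % num_threads;
        if (ready_mask & (1u << candidate))
            return candidate;
    }
    return -1;
}
```

Starting the search one position after the previous winner means a thread that has just
issued is considered last, so no ready thread can be starved by a busier neighbour.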
3.3.2. ARM 
ARM[46] is a leading designer of RISC processing cores for the mobile market. ARM
has produced two different methods of adding SIMD execution to their line of cores.
Firstly, in the v6 version of the ARM instruction set SIMD instructions were incorporated
into the main processing core; secondly, a media processing engine, NEON, was
developed to assist in SIMD operations, removing the load from the main processing
core.
3.3.2.1. ARMv6 Architecture 
The SIMD instructions of the ARMv6 instruction set fall into 3 groups: addition and
subtraction, multiplication, and sum of absolute differences. No additional registers are
provided within the ARMv6 architecture to support the SIMD instructions, and they make
use only of the existing 32-bit wide register file[47].
The addition and subtraction instruction group consists of six instructions: halfword
addition and subtraction (ADD16 and SUB16), byte addition and subtraction (ADD8 and
SUB8), combined 16-bit addition and subtraction (QADDSUBX) and combined 16-bit
subtraction and addition (QSUBADDX). Each of these instructions can take one of six
data-processing prefixes, such as signed/unsigned and saturation. As indicated by the instruction
names, these instructions operate at either 8 or 16-bit granularity[48].
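What distinguishes a packed halfword addition from an ordinary 32-bit addition is that no
carry crosses the lane boundary. The scalar C sketch below emulates that behaviour for the
non-saturating case; the function name add16 is hypothetical and the code models the
arithmetic only, not the flags or the prefixed variants of the real instructions.

```c
#include <stdint.h>

/* Emulates a packed 16-bit addition on one 32-bit register: the upper and
   lower halfwords are added independently, each wrapping modulo 2^16, so a
   carry out of the lower lane never reaches the upper lane. */
uint32_t add16(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;          /* lower lane, carry dropped */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;  /* upper lane, carry shifted out */
    return hi | lo;
}
```

For example, add16(0x0001FFFF, 0x00010001) yields 0x00020000, whereas a plain 32-bit
addition of the same operands gives 0x00030000 because the carry from the lower halfword
leaks into the upper one.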
The multiplication instruction group has four main instructions: signed dual multiply
addition and subtraction (SMUAD and SMUSD), signed dual multiply accumulate
(SMLAD) and signed dual multiply subtract accumulate (SMLSD). These multiplications
operate at 16-bit granularity[49].
The sum of absolute difference instruction group contains just one instruction, ABSDIFF.
This instruction works at 8-bit granularity and performs the SAD computation that is
found in many media algorithms such as motion estimation[50-55].
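The operation performed by an 8-bit SAD instruction on a pair of 32-bit registers can be
written out in scalar C (the ARM Architecture Reference Manual lists the ARMv6 form of this
operation as USAD8). The sketch below is a functional model only; the names sad8 and
sad_row16 are hypothetical, and packing a 16-pixel row into four words is an assumption
made to show how the instruction would be used in motion estimation.

```c
#include <stdint.h>

/* Sum of absolute differences of the four unsigned bytes packed in a and b. */
uint32_t sad8(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        int x = (int)((a >> (8 * lane)) & 0xFFu);
        int y = (int)((b >> (8 * lane)) & 0xFFu);
        sum += (uint32_t)(x > y ? x - y : y - x);
    }
    return sum;
}

/* SAD of one 16-pixel row, consumed four pixels (one word) at a time. */
uint32_t sad_row16(const uint32_t cur[4], const uint32_t ref[4])
{
    uint32_t sum = 0;
    for (int w = 0; w < 4; w++)
        sum += sad8(cur[w], ref[w]);
    return sum;
}
```

The scalar model needs a loop of four subtract, compare and add steps per word, which is
the work a single SIMD instruction replaces in the inner loop of a block-matching search.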
3.3.2.2. ARM NEON Media Processing Engine 
In addition to the SIMD functionality present in ARMv6, a media processing engine
known as NEON has been developed. The NEON engine operates as a co-processor
to ARM's high performance application processor, the Cortex-A8[56], which implements the
ARMv7-A architecture.
The NEON coprocessor's pipeline has 10 stages, 4 for decode and 6 for execution. In
these 6 execution stages are 4 datapaths responsible for integer, NEON floating point
(NFP), load/store and vector floating point (VFP) operations. The NFP datapath is
optimised for multimedia FP operations[57] and as a consequence is not IEEE 754
compliant, and thus cannot perform the full range of floating point operations required of a
floating point unit. To ensure that NEON is compliant with IEEE 754, the second,
conventional VFP datapath is also provided.
NEON is a hybrid SIMD system, since the length of the vector processed can be either 64
or 128 bits[58]. The vector length is a run-time parameter and can be selected on a
per-instruction basis. A dedicated 256-byte register file is provided to store the vectors. This
provides either 16 separately addressable 128-bit registers, or 32 individually addressable
64-bit registers. Each of these registers can be sub-divided into vector elements of
length 8, 16, 32 or 64 bits, with the SIMD instructions being able to operate on each
element of the register concurrently[59].
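The register and element counts quoted above follow directly from the 256-byte register
file size. A short sketch of the arithmetic, with hypothetical helper names register_count
and lane_count:

```c
/* Number of registers of reg_bits each that a register file of
   file_bytes bytes can hold. */
int register_count(int file_bytes, int reg_bits)
{
    return file_bytes * 8 / reg_bits;
}

/* Number of elements of elem_bits each that fit in one register. */
int lane_count(int reg_bits, int elem_bits)
{
    return reg_bits / elem_bits;
}
```

With a 256-byte file, register_count gives 16 registers at 128 bits and 32 at 64 bits, and
lane_count(128, 8) gives sixteen 8-bit elements per 128-bit register, which is the widest
degree of parallelism a single instruction can reach in this scheme.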
3.3.3. x86 - MMX, SSE and 3DNow!
In 1997, Intel introduced a new range of Pentium class x86 microprocessors that included
a SIMD implementation known as MMX[60]. The architecture specified 57 new integer
instructions and eight 64-bit registers. In practice these registers were also mapped to the
floating point stack registers and hence MMX and floating point instructions could not be
issued simultaneously. In 1999, the first streaming SIMD extension (SSE1) added eight
128-bit registers and 70 new instructions for floating point SIMD operations. The SSE2
architecture was introduced in 2001 and implemented the integer MMX instructions
using the wider registers of SSE1[61]. SSE3 was released in 2004 and included horizontal
operations within vectors (intra-vector) in addition to the existing vertical vector to vector
(inter-vector) operations. Intel's main rival in the x86 microprocessor market, AMD, also
released a series of SIMD architectural enhancements. In the K6 series of
microprocessors in 1997, an MMX compatible set of 57 SIMD instructions and eight
64-bit registers were introduced[62]. 1998 saw the launch of the successor to K6, the K6-2,
along with AMD's own extension to MMX, 3DNow!, which extended the integer only
MMX instructions by implementing a further 21 floating point SIMD instructions[63]. In
contrast with Intel, who added extra registers in its floating point enhancement (SSE),
AMD restricted 3DNow! to using the eight 64-bit registers defined in MMX. Doing so
eliminated latency due to state switching between integer (MMX) and floating point
(3DNow!) SIMD instructions[64], but hindered compiler efficiency because of the
limited number of registers. The enhanced 3DNow! technology that was introduced in
the Athlon processors in 1999 implemented an additional 19 integer (12 arithmetic and 7
load/store) and 4 floating point SIMD instructions. Enhanced 3DNow!, as in the original,
uses the same eight 64-bit registers defined in MMX[65]. In 2001, 3DNow! Professional
was integrated into the new Athlon XP microprocessors and provided an additional 51
instructions to provide full support for Intel's SSE SIMD instruction set[66].
3.3.4. AltiVec 
AltiVec is a SIMD instruction set extension developed by the AIM alliance (a joint
venture of Apple[67], IBM[68] and Motorola[69]), as an extension to their RISC Power
processors[70]. The AltiVec name is trademarked by Motorola and the technology has
been rebranded by the other partners, namely VMX by IBM and Velocity Engine by
Apple. From here on the name AltiVec will be used to describe the technology rather than
just Motorola-specific products.
AltiVec introduced 162 additional vector instructions to the base ISA, and these fall into
five main groups: intra-element arithmetic, intra-element non-arithmetic, inter-element
arithmetic, inter-element non-arithmetic and load/store instructions. As the names suggest,
these instructions can either work vertically between vectors or horizontally within the
same register, and can also be arithmetic, such as add/sub and multiplication, or non-arithmetic,
such as pack/unpack and shift[71]. As AltiVec is an extension to the RISC Power
architecture, each instruction's operands are vector registers, apart from load/store
instructions that transfer to and from main memory. Consequently, in this model, vector
instructions cannot use data stored in scalar registers directly; rather, the required data has
to be first stored to memory by the main scalar processor and then loaded into a vector
register by the AltiVec unit[72].
The AltiVec extension provides an additional 32 x 128-bit registers for storing and
operating on vector data. These 128-bit registers are then sub-divided into sixteen 8-bit,
eight 16-bit or four 32-bit elements, depending on the required data type[73]. This
number of vector registers is in stark contrast to its x86 SIMD counterpart SSE, which only
accommodates eight vector registers.
3.4. Conclusions
In this chapter, the two major forms of parallelism and the architectures required for their
exploitation have been discussed. The TLP inherent in video coding algorithms can be
statically extracted by partitioning the CFG into disjoint sub-trees. A variety of
multiprocessor architectures were discussed, including chip multiprocessors (CMP), where
multiple processors are contained in a single chip, and simultaneous multithreading (SMT),
with multiple contexts contained in a single pipelined core. Other approaches to exploiting
TLP were also discussed, and these include a generic partitioning methodology, the
modified multi-processor SimpleScalar ISS simulator sim-system, OpenMP and POSIX
tools.
A number of vector architectures were also discussed, in order to illustrate current
hardware design techniques employed to exploit DLP. The speed and performance of
video compression/decompression and other multimedia applications have been dramatically
improved through exploiting DLP, which accelerates the repetitive arithmetic
operations performed on different sets of data.
With the abundance of multimedia-rich consumer applications, a large number of parallel
architectures that dramatically increase throughput have been developed to exploit both
their inherent TLP and DLP, each having its own set of advantages and design
challenges.
3.5. References 
[1] M. J. Flynn, "Some Computer Organizations and Their Effectiveness,"
Computers, IEEE Transactions on, vol. C-21, pp. 948-960, 1972.
[2] V. Kathail, M. S. Schlansker, and B. R. Rau, "HPL-PD Architecture
Specification: Version 1.1," Hewlett Packard, 2000.
[3] K. Sankaralingam, S. W. Keckler, W. R. Mark, and D. Burger, "Universal
mechanisms for data-parallel architectures," in 36th Annual IEEE/ACM
International Symposium on Microarchitecture, 2003, p. 303.
[4] J. F. Martínez and J. Torrellas, "Speculative synchronization: applying
thread-level speculation to explicitly parallel applications," in 10th international
conference on Architectural support for programming languages and operating
systems, San Jose, California, 2002, pp. 18-29.
[5] C. Cascaval, J. G. Castanos, L. Ceze, M. Denneau, M. Gupta, D. Lieber, J. E.
Moreira, K. Strauss, and H. S. Warren, Jr., "Evaluation of a multithreaded
architecture for cellular computing," in High-Performance Computer
Architecture, 2002. Proceedings. Eighth International Symposium on, 2002, pp.
311-321.
[6] T. Y. Morad, U. C. Weiser, A. Kolodny, M. Valero, and E. Ayguade,
"Performance, power efficiency and scalability of asymmetric cluster chip
multiprocessors," Computer Architecture Letters, IEEE, vol. 5, pp. 14-17, 2006.
[7] M. K. Farrens and A. R. Pleszkun, "Strategies for Achieving Improved Processor
Throughput," in Computer Architecture, 1991. The 18th Annual International
Symposium on, 1991, pp. 362-369.
[8] T. Ungerer, B. Robic, and J. Silc, "Multithreaded Processors," The Computer
Journal, vol. 45, 2002.
[9] D. Burger and J. R. Goodman, "Billion-Transistor Architectures," IEEE
Computer, vol. 30, pp. 46-49, 1997.
[10] D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous multithreading:
Maximizing on-chip parallelism," in Computer Architecture, 1995. Proceedings.
22nd Annual International Symposium on, 1995, pp. 392-403.
[11] C. Acosta, A. Falcon, A. Ramirez, and M. Valero, "A Complexity-Effective
Simultaneous Multithreading Architecture," in Parallel Processing, 2005. ICPP
2005. International Conference on, 2005, pp. 157-164.
[12] D. Madon, E. Sanchez, and S. Monnier, "A Study of a Simultaneous
Multithreaded Processor Implementation," in Proceedings of the 5th
International Euro-Par Conference on Parallel Processing, 1999, pp. 716-726.
[13] D. Babu, M. I. Babu, M. Saravana, S. Govindan, and R. Parthasarathi,
"Functional Unit Usage Based Thread Selection in a Simultaneous Multithreaded
Processor," in High Performance Computing, International Conference On, 2001.
[14] H. Oehring, U. Sigmund, and T. Ungerer, "Simultaneous Multithreading and
Multimedia," in Multi-Threaded Execution, Architecture and Compilation
(MTEAC 99), Workshop on, 1999.
[15] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy, "The
directory-based cache coherence protocol for the DASH multiprocessor," in
Computer Architecture, 1990. Proceedings. 17th Annual International
Symposium on, 1990, pp. 148-159.
[16] M. M. Michael and A. K. Nanda, "Design and performance of directory caches
for scalable shared memory multiprocessors," in High-Performance Computer
Architecture, 1999. Proceedings. Fifth International Symposium On, 1999, pp.
142-151.
[17] A. Agarwal, R. Simoni, J. Hennessy, and M. Horowitz, "An evaluation of
directory schemes for cache coherence," in Computer Architecture, 1988.
Conference Proceedings. 15th Annual International Symposium on, 1988, pp.
280-289.
[18] C. Saldanha and M. H. Lipasti, "Power Efficient Cache Coherence," in Workshop
on Memory Performance Issues, 2001.
[19] M. Ekman, F. Dahlgren, and P. Stenstrom, "Evaluation of Snoop-Energy
Reduction Techniques for Chip-Multiprocessors," in Workshop on Duplicating,
Deconstructing, and Debunking WDDD-1, In Proceedings of, 2002.
[20] D. J. Sorin, M. Plakal, A. E. Condon, M. D. Hill, M. M. K. Martin, and D. A.
Wood, "Specifying and verifying a broadcast and a multicast snooping cache
coherence protocol," Parallel and Distributed Systems, IEEE Transactions on,
vol. 13, pp. 556-578, 2002.
[21] J. L. Hennessy and D. A. Patterson, Computer architecture: a quantitative
approach, 2003.
[22] L. Spracklen and S. G. Abraham, "Chip Multithreading: Opportunities and
Challenges," in High-Performance Computer Architecture, Proceedings of the
11th International Symposium on, 2005, pp. 248-252.
[23] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, "The Case
for a Single-Chip Multiprocessor," in Architectural support for programming
languages and operating systems, ASPLOS-VII: Proceedings of the seventh
international conference on, 1996, pp. 2-11.
[24] Y. Chung, K. Park, W. Hahn, N. Park, and V. K. Prasanna, "Performance of
On-Chip Multiprocessors for Vision Tasks," in 15 IPDPS 2000 Workshops on
Parallel and Distributed Processing, Proceedings of the, 2000, pp. 242-249.
[25] C. McNairy and R. Bhatia, "Montecito: a dual-core, dual-thread Itanium
processor," Micro, IEEE, vol. 25, pp. 10-20, 2005.
[26] L. Hammond, B. A. Hubbert, M. Siu, M. K. Prabhu, M. Chen, and K. Olukotun,
"The Stanford Hydra CMP," Micro, IEEE, vol. 20, pp. 71-84, 2000.
[27] R. van der Pas, "An Introduction into OpenMP," in First International Workshop
on OpenMP, IWOMP 2005, Eugene, Oregon, USA, 2005.
[28] "Information technology -- Portable Operating System Interface (POSIX),"
ISO/IEC 9945, 2003.
[29] "SimpleScalar," SimpleScalar LLC.
[30] "OpenMP": OpenMP Architecture Review Board.
[31] "Information Technology - Portable Operating System Interface (POSIX)," IEEE
1003. 1998.
[32] I. Ahmad, Y. He, and M. L. Liou, "Video compression with parallel processing,"
Parallel computing in image and video processing, vol. 28, pp. 1039-1078,
2002.
[33] T. R. Jacobs, V. A. Chouliaras, and J. L. Nunez-Yanez, "A thread and data-parallel
MPEG-4 video encoder for a system-on-chip multiprocessor," in
Application-Specific Systems, Architecture Processors, 2005. ASAP 2005. 16th
IEEE International Conference on, Samos, Greece, 2005, pp. 405-410.
[34] R. Chandra, L. Dagum, D. Kohr, D. Maydan, J. McDonald, and R. Menon,
Parallel Programming in OpenMP, 2001.
[35] B. Barney, "Introduction to OpenMP," Livermore Computing, 2006.
[36] "Portable Operating System Interface for Computer Environments," IEEE
1003.1, 2002.
[37] K. Shen, L. A. Rowe, and E. J. Delp, "A parallel implementation of an MPEG1
encoder: Faster than real-time!," in Digital Video Compression: Algorithms and
Technologies, Proceedings of SPIE Conference on, San Jose, 2005.
[38] "Berkeley Parallel MPEG1 Video Encoder," Plateau Research Group, Computer
Science Division, University of California, Berkeley, California.
[39] S. Bozoki, S. J. P. Westen, R. L. Lagendijk, and J. Biemond, "Parallel algorithms
for MPEG video compression with PVM," in EUROSIM 1996, 1996, pp. 315-326.
[40] A. Bilas, J. Fritts, and J. P. Singh, "Real-time parallel MPEG-2 decoding in
software," 1997, pp. 197-203.
[41] S. M. Akramullah, I. Ahmad, and M. L. Liou, "A data-parallel approach for real-time
MPEG-2 video encoding," Journal of Parallel and Distributed Computing,
vol. 30, pp. 129-146, 1995.
[42] V. A. Chouliaras, K. Koutsomyti, T. Jacobs, S. Parr, D. Mulvaney, and R.
Thomson, "SystemC-defined SIMD instructions for high performance SoC
architectures," in Electronics, Circuits and Systems, 13th IEEE International
Conference on, Nice, France, 2006.
[43] "AMBA™ Specification (Rev 2.0)," ARM IHI 0011A, ARM Ltd, 1999.
[44] "AMBA™ 3 AXI System Components Data Sheet," ARM DOI 0194-1/09.04,
ARM Ltd, 2004.
[45] D. E. Culler, J. P. Singh, and Anoop Gupta, Parallel Computer Architecture: A
Hardware/Software Approach, 1998.
[46] ARM Ltd: www.arm.com.
[47] "ARM1176JZF-S™ Revision: r0p2 Technical Reference Manual," ARM DDI
0301D, ARM Ltd, 2006.
[48] D. Seal, ARM Architecture Reference Manual, 2nd edition: Addison-Wesley
Longman Publishing Co., Inc., 2000.
[49] D. Brash, "The ARM Architecture Version 6 (ARMv6)," ARM Ltd, 2002.
[50] "Coding of moving pictures and associated audio for digital storage media at up
to about 1,5 Mbit/s," vol. 11172: ISO/IEC, 1993.
[51] "Generic coding of moving pictures and associated audio (MPEG-2)," vol.
13818: ISO/IEC, 1995.
[52] "Information technology -- Coding of audio-visual objects -- Part 2: Visual,"
ISO/IEC 14496-2, 2004.
[53] "Information technology -- Coding of audio-visual objects -- Part 10: Advanced
Video Coding," ISO/IEC 14496-10, 2005.
[54] "Video Codec for Audiovisual Services at p x 64 kbits/s," ITU-T
Recommendation H.261, 1993.
[55] "Video Coding for Low Bitrate Communication," ITU-T Recommendation
H.263, 1996.
[56] "Architecture and Implementation of the ARM® Cortex™-A8 Microprocessor,"
White paper, ARM Ltd, 2005.
[57] "ARM Architecture Reference Manual - ARMv7-A and ARMv7-R edition,"
ARM DDI 0406A, ARM Ltd, 2007.
[58] "Cortex™-A8 Revision: r1p0 Technical Reference Manual," ARM DDI 0344A,
ARM Ltd, 2006.
[59] "NEON Technology Data Sheet," ARM DOI 0192-1/09.04(5), ARM Ltd, 2004.
[60] "Intel Architecture Optimization Manual," 242816-003, Intel Corp, 1997.
[61] "Desktop Performance and Optimization for Intel Pentium 4 Processor," 249438-01,
Intel Corp, 2001.
[62] "AMD-K6 MMX Enhanced Processor," Advanced Micro Devices, Inc.,
207260/0, 1996.
[63] "3DNow! Technical Manual," Advanced Micro Devices, Inc., 219280/0, 1998.
[64] "AMD-K6-2 Processor Data Sheet," Advanced Micro Devices, Inc., 21850J/0,
2000.
[65] "Enhanced 3DNow!™ Technology for the AMD Athlon™ Processor," Advanced
Micro Devices, Inc., 2000.
[66] "The AMD Athlon™ XP Processor with 512KB L2 Cache," Advanced Micro
Devices, Inc., 2003.
[67] "Velocity Engine," Apple Computer, Inc.:
http://developer.apple.com/hardwaredrivers/ve/index.html.
[68] "VMX," IBM.
[69] "AltiVec," Freescale Semiconductors.
[70] "AltiVec Technology," ALTIVECFACT Rev. 4, Freescale Semiconductor, 2004.
[71] "Power ISA™ Version 2.03," IBM, 2006.
[72] "AltiVec Technology Programming Interface Manual," ALTIVECPIM/D,
Freescale Semiconductors, 1999.
[73] S. Fuller, "Motorola's AltiVec™ Technology," ALTIVECWP/D, Motorola Inc.,
1999.
CHAPTER 4: 
HARDWARE TECHNIQUES FOR EXPLOITING TLP 
4.1. Thread-Level Parallelism 
As described in section 3.2, the exploitation of TLP is achieved by executing sections of
the CFG (subject to the absence of data dependencies) on separate CPU contexts, where a
CPU context can reside in the same physical core (SMT organisation), multiple processor
cores (CMP organisation) or a combination thereof (CMP/SMT hybrid organisation).
This chapter illustrates the methodology developed to exploit the thread-level parallelism
in three state-of-the-art video codecs. The encoders, as opposed to the decoders, in these
three codecs were chosen for threading due to their high computational complexity and
the asymmetric nature of the codecs.
The three encoders chosen to be threaded were MPEG-2[1], MPEG-4[2] and H.264[3].
These were chosen for their popularity, wide range of potential uses and advanced
functionality respectively. To illustrate this, the chapter has been divided up into four
sections, one devoted to each video encoder and a final section to show the relative
success of each implementation. In this chapter the threading methodology and
development tool used is the modified multi-threaded simulator from SimpleScalar,
sim-system. Using this simulator allows full manual partitioning of the control flow graph
whilst using a minimal number of special functions: barrier and context IDs.
4.2. The MPEG-2 video compression standard
As discussed in section 2.4.4, MPEG-2 is the second compression scheme developed by
the Motion Picture Expert Group (MPEG)[4] and ratified into the ISO/IEC[5] 13818
standard and ITU[6] H.262 Recommendation. It is one of the most widely used
compression schemes in consumer devices due to its widespread adoption in the
DVD and DVB standards. It is therefore a good first-choice algorithm for TLP extraction
with the developed methodology.
4. Hardware Techniques for Exploiting TLP 97
During the development programme of the MPEG-2 visual component, a number of test
encoders were developed to explore the success and viability of each proposed addition to
the developing standard. These encoders, known as Test Models, are shown in Table 4-1.
Table 4-1 Test model (TM) stages in the MPEG-2 development.
Version   Meeting location   Date     MPEG doc. no.    ITU-T SG15 doc. no.
TM 0      Singapore          Jan 92   -                -
TM 1      Haifa, Israel      Mar 92   MPEG 92/160      AVC-260
TM 2      Rio, Brazil        Jul 92   MPEG 92/245      AVC-323
TM 2.2    Tarrytown, NY      Oct 92   MPEG 92/535      AVC-356
TM 3      London             Dec 92   WG11 92/328      AVC-400
TM 4      Rome               Feb 93   MPEG 93/225b     AVC-445b
TM 5      Sydney             Apr 93   MPEG 93/457      AVC-491
TM 5b     Sydney             Apr 93   WG11 93/400      AVC-491b
"TM6"     New York City      Jul 93   -                -
The above table indicates the development of the standard along with each test encoder
(test model (TM)). After the TM5 meeting in Sydney, the main profile syntax of the
standard was frozen; additional information on temporal scalability was later included in
the semi-official TM6 release. With the syntax of the main profile frozen at the TM5 stage
and the encoder source being made publicly available by the MPEG Software Simulation
Group (MSSG)[7], focus was centred on MPEG-2 TLP exploitation of this test model,
which is also regarded as the primary reference implementation of the standard by the
video coding community. The following section analyses the structure of the standard in
terms of the TM5 source code and illustrates any modifications carried out during the
static partitioning process of the control flow graph (CFG) in order to achieve parallel
sections.
4.2.1. Test model 5 
The TM5 implementation's command line arguments are a command file, the location of a
parameter (*.par) file and the output bitstream file. The parameter file contains all the
information, paths and settings mandated by the standard to produce the compressed
output file. The sections within the parameter file that are of particular interest to this
project are the input files, the bitrate and the search ranges used in motion estimation. The
uncompressed input frames can be supplied in one of three formats. The first uses separate
files for each colour component (Y, U and V), with a separate set of component files for
each frame. The second consists of a single file containing all YUV colour
components for each frame. Finally, a third format stores each frame's YUV data within a
portable pixel map (PPM) file. The target bitrate used in this work was kept at a constant
4 Mbit/s to allow for fair comparisons between this implementation and the non-threaded
TM5 code.
The first stage in the encoding process is the parsing of the parameter file and the
initialisation of all the data with values extracted from the parameter file. Upon
completion of the initialisation stage the frames are reordered to convert the input
stream from frame (display) order to encoding order, e.g. IBBPBBP to IPBBPBB. When the
frames are in the appropriate order, they are processed one by one by the loop depicted in
Figure 4-1. The structure is similar for all frame types (intra, predicted and bi-predicted)
and contains five main functions.
do {
    motion_estimation();
    transform();
    predict();
    iquant();
    itransform();
} while (there are more frames to process);
Figure 4-1 Programme flow loop illustrating the five functions responsible for encoding one
MPEG-2 frame using the TM5 reference encoder.
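The frame reordering that precedes this loop can be illustrated with a short sketch. The fragment below is not taken from the TM5 source; it assumes that the frame types are supplied as a string in display order and that each run of B frames is followed by the anchor (I or P) frame it depends on. The function name display_to_coding_order is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Reorder frame types from display order to coding order: each anchor
   frame (I or P) is emitted ahead of the B frames that precede it in
   display order. The output buffer must hold strlen(display) + 1 bytes. */
static void display_to_coding_order(const char *display, char *coding)
{
    size_t n, out = 0, pending = 0, i;

    assert(display != NULL && coding != NULL);
    n = strlen(display);

    for (i = 0; i < n; i++) {
        if (display[i] != 'B') {
            coding[out++] = display[i];          /* anchor frame first */
            while (pending < i)                  /* then the waiting B frames */
                coding[out++] = display[pending++];
            pending = i + 1;
        }
    }
    /* Trailing B frames with no following anchor are copied unchanged. */
    while (pending < n)
        coding[out++] = display[pending++];
    coding[out] = '\0';
}
```

Applied to the string IBBPBBP, the sketch yields IPBBPBB, matching the example ordering quoted above.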
The functions in Figure 4-1 can be directly mapped to the compression functionality
specified in the MPEG-2 standard. Table 4-2 below depicts this mapping and provides an
additional description of each functional block.
Table 4-2 Table illustrating the mapping of compression functionality onto C functions
within the TM5 encoder.
Loop function         Functionality            Description
motion_estimation()   Motion estimation        Creation of motion vectors through searching
                                               techniques to locate areas of similarity within
                                               the current and reference frames.
transform()           Transformation           DCT transformation converting pixel data from
                                               the spatial to the frequency domain.
predict()             Motion compensation      Subtraction of reference frame created from
                                               motion vectors produced by ME from original
                                               frame to produce the residue frame.
iquant()              Quantisation             Quantisation of residue frame data and inverse
                                               quantisation for use in reference frame creation.
itransform()          Inverse transformation   Inverse DCT transformation of frequency-domain
                                               data into the spatial domain for reference frame
                                               recreation.
Profiling of the non-threaded (MPEG-2 TM5 reference implementation) encoder was
carried out using a custom ISS originally based on the SimpleScalar toolset[8], and the
number of instructions executed by each functional unit was recorded, as depicted in
Figure 4-2. By classifying each function into a major functional block, it becomes clear
where the majority of computational time is spent and thus where applying threading
techniques could potentially produce the greatest benefit in terms of reducing the
execution time of the encoding process. Using the profiling results and knowledge of the
MPEG TM5 reference implementation, TLP was exploited in the following three
functions, since they account for 85% of the dynamic instructions:
motion_estimation(), transform() and itransform().
[Figure 4-2: chart of functional blocks (Motion estimation; Transformation / Inverse
transformation; Motion compensation; Others). Legible value labels: 53%, 10%.]
Figure 4-2 Profile data for the TM5 MPEG-2 encoder illustrating the percentage of
execution time spent within each of the functional blocks.
4.2.1.1. Motion Estimation (function motion_estimation())
This is a high-level function that is responsible for searching for correlation between
frames and producing the motion vectors for all macroblocks on a frame-by-frame basis for
P and B frames. This function is simply passed through without performing any searching
for I frames. It processes each MB in the current frame in raster order (top left to bottom
right, row by row), selecting the appropriate ME function depending on whether the
picture is interlaced or not. This is depicted in Figure 4-3.
for (j=0; j<height2; j+=16)
{
    for (i=0; i<width; i+=16)
    {
        /* execution of ME functionality */
    }
}
Figure 4-3 Double loop arrangement in the motion_estimation() function to navigate through
a frame in raster order to provide Cartesian pixel coordinates to each MB of the frame.
Figure 4-3 illustrates how, through the use of two embedded for loops, one for the
columns and one for the rows, it is possible to track each MB within the frame using the
loop indices as Cartesian coordinates, with the origin being the MB in the top left of the
frame. For progressive frame video, frame_ME() is called for each MB, whereas
field_ME() is called if the input source is interlaced. This is shown in Figure 4-4.
if (pict_struct==FRAME_PICTURE)
    frame_ME(...);
else
    field_ME(...);
Figure 4-4 Execution of motion estimation functionality through either frame or field specific 
functions. 
Both these ME functions work in a similar manner, with field_ME() having three
additional parameters. At this level in motion_estimation(), the actual methods by
which the MVs are calculated are of no direct importance to the threading process as this
is the responsibility of the lower-level functions. In addition, use of an orthogonal form of
parallelism, data-level parallelism, in the lower-level functions has resulted in their
efficient implementation. What is of concern, however, is how the newly created MVs are
stored.
/* macroblock information */
struct mbinfo {
    int mb_type;            /* intra/forward/backward/interpolated */
    int motion_type;        /* frame/field/16x8/dual_prime */
    int dct_type;           /* field/frame DCT */
    int mquant;             /* quantization parameter */
    int cbp;                /* coded block pattern */
    int skipped;            /* skipped macroblock */
    int MV[2][2][2];        /* motion vectors */
    int mv_field_sel[2][2]; /* motion vertical field select */
    int dmvector[2];        /* dual prime vectors */
    double act;             /* activity measure */
    int var;                /* for debugging */
};
Figure 4-5 Macroblock structure of MPEG-2 TM5 reference encoder depicting all
information stored per MB.
The results from both ME functions, frame_ME() and field_ME(), are stored in the
MB information structure, mbinfo. Figure 4-5 depicts this structure and illustrates the
information that is stored within.
mbinfo = (struct mbinfo *)malloc(mb_width*mb_height2*sizeof(struct mbinfo));
Figure 4-6 Dynamic creation of an MB structure during initialisation, allocating sufficient
memory to represent all MBs in one frame.
The mbinfo structure is allocated in the initialisation function (init()) and has enough
storage for all the MBs in a single frame. Once this structure has been created, access
is through a pointer; each increment of the pointer gives access to the subsequent MB in
the frame.
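The relationship between the per-frame allocation and the pointer-based raster walk can be sketched as follows. This is an illustrative fragment rather than TM5 code: the structure mbinfo_sketch is a reduced, hypothetical stand-in for mbinfo, and the helper names alloc_mbinfo, mb_raster_index and walk_frame are assumptions introduced for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Reduced, hypothetical stand-in for the TM5 mbinfo structure. */
struct mbinfo_sketch {
    int mb_type;
    int MV[2][2][2];
};

/* Allocate one zeroed entry per macroblock of a frame (NULL on failure). */
static struct mbinfo_sketch *alloc_mbinfo(int mb_width, int mb_height2)
{
    assert(mb_width > 0 && mb_height2 > 0);
    return (struct mbinfo_sketch *)calloc((size_t)mb_width * (size_t)mb_height2,
                                          sizeof(struct mbinfo_sketch));
}

/* Raster-scan number of the MB whose top-left pixel is (i, j). */
static int mb_raster_index(int i, int j, int width)
{
    return (j / 16) * (width / 16) + (i / 16);
}

/* Walk the frame with the double loop of Figure 4-3, advancing the pointer
   once per MB and tagging each entry with its visit order.
   Returns the number of MBs visited. */
static int walk_frame(struct mbinfo_sketch *mbi, int width, int height2)
{
    int i, j, count = 0;

    for (j = 0; j < height2; j += 16) {
        for (i = 0; i < width; i += 16) {
            mbi->mb_type = count++;   /* stands in for the ME results */
            mbi++;                    /* next MB in raster order */
        }
    }
    return count;
}
```

In this sketch the entry reached by incrementing the pointer is the same entry that mb_raster_index() selects from the pixel coordinates, which is the property the threaded version relies on when each thread computes its own offset.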
When modifying motion_estimation() in order to exploit the TLP, certain issues
need to be addressed. This primarily involves maintaining exclusivity of all variables
during the parallel sections of the code. As the current frame and the reference frame are
pre-computed, are both available to all active threads and are not altered during the ME
processes, they can be accessed as a shared resource without raising any concerns of write
order. The only write operation that needs special handling within
motion_estimation() is that to update the MB information structure; a method was
therefore required to prevent two threads writing to the same location.
static struct mbinfo *mbi_array[MAX_THREAD];
Figure 4-7 Creation of a static shared array to hold the pointers to the MB structure 
required by each available thread. 
As shown in Figure 4-7, instead of having one pointer for the mbinfo structure, an array
of pointers is created. The array is of sufficient size that each thread is allocated its own
private pointer. This allows each thread's private pointer to point to the correct position
within the single mbinfo for the specific MB that the thread is processing. In addition to
having a separate pointer allocated to each thread, it is essential for each thread to know
the location of its MB. This location can be specified using either the Cartesian
coordinates or the number of the MB in raster scan order.
Having eliminated the data dependency issue within the motion_estimation()
function, the next step is to modify the loops in Figure 4-3 to take into account the
number of threads executing the code simultaneously.
iteration = width / MAX_THREAD / 16;
for (j=0; j<height2; j+=16)
{
    for (i=0; i<iteration; i++)
    {
        /* each of the MAX_THREAD contexts encodes one MB here */
    }
}
Figure 4-8 Modified double loop in MPEG-2 TM5 encoder allowing for MAX_THREAD
number of threads to execute the innermost loop
This is achieved by (a) reducing the inner loop iteration limit and (b) reducing the
increment from 16 to 1. The original limits, height2 and width, specify the maximum
number of pixels in each dimension of the frame, and thus the loop indices are both
incremented by 16 to identify the pixel location of each MB. The new limit for the inner
loop, iteration, is calculated by dividing the original limit (width) by the number of
available processors in the system (MAX_THREAD) and by 16 (the number of pixels
horizontally per MB). This division, along with the modified increment, changes the
variable i from representing the location within a row of pixels to representing the index
of a group of MAX_THREAD MBs. Modifying the loop in this manner allows MAX_THREAD
threads to encode MAX_THREAD MBs per loop iteration.
mbi_array[context] = mbi+context+(i*MAX_THREAD);
Figure 4-9 Populating MB information pointer array with the correct location of the current
MB and so allowing each thread to access a different section of the MB structure.
The first task of the loop is to fill mbi_array[]. Figure 4-9 illustrates how, by
offsetting the pointer to the first MB of the row (mbi) using each thread's
private ID number (context), the loop index (i) and the number of threads
(MAX_THREAD), the location in mbinfo where the current MB information is to be
stored is calculated and stored.
Now that each thread writes to the correct location of the single mbinfo, the final issue is
to ensure that each MB in a row is encoded. Since the number of processors is not
guaranteed to divide exactly into the number of MBs per row (Figure 4-8), a remainder
(strip-mined) loop is necessary to ensure that these final MBs are properly encoded.
if (context < (width/16) % MAX_THREAD) {
    mbi_array[context] = mbi+context+(iteration*MAX_THREAD);
    /* ME call as in the main loop */
}
Figure 4-10 Selection procedure, through use of the modulus operator, for remainder
(strip-mine) execution.
Figure 4-10 shows how specific threads are selected to encode the remaining MBs in the
row by using the modulus operator. The array is populated in the same manner as before
but using the constant iteration instead of the variable i. The calls to the specific ME
functions are identical to those of the main loop, the only difference being which threads
are active and hence which MBs are encoded.
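The combined effect of the main loop and the remainder section can be checked with a small sketch. The fragment below is not part of the threaded encoder; it simulates the contexts sequentially and counts how often each MB of a row is selected by the index expressions of Figure 4-9 and Figure 4-10. The names count_row_coverage, row_fully_covered_once and MAX_MBS_PER_ROW are hypothetical, and the number of threads is passed as a parameter rather than taken from MAX_THREAD.

```c
#include <assert.h>
#include <string.h>

#define MAX_MBS_PER_ROW 256   /* assumed upper bound for this sketch */

/* Count how many times each MB of one row is selected when max_thread
   contexts follow the main loop and the remainder test. The contexts are
   simulated one after another; no real threads are created. */
static void count_row_coverage(int width, int max_thread, int *hits)
{
    int mbs_per_row = width / 16;
    int iteration = width / max_thread / 16;
    int i, context;

    assert(max_thread > 0 && mbs_per_row <= MAX_MBS_PER_ROW);
    memset(hits, 0, sizeof(int) * (size_t)mbs_per_row);

    /* main loop: one group of max_thread MBs per iteration */
    for (i = 0; i < iteration; i++)
        for (context = 0; context < max_thread; context++)
            hits[context + i * max_thread]++;

    /* remainder: only the lowest-numbered contexts are active */
    for (context = 0; context < max_thread; context++)
        if (context < mbs_per_row % max_thread)
            hits[context + iteration * max_thread]++;
}

/* Returns 1 if every MB of the row is selected exactly once. */
static int row_fully_covered_once(int width, int max_thread)
{
    int hits[MAX_MBS_PER_ROW];
    int m;

    count_row_coverage(width, max_thread, hits);
    for (m = 0; m < width / 16; m++)
        if (hits[m] != 1)
            return 0;
    return 1;
}
```

For a 720-pixel-wide frame and four threads, for example, iteration is 11, the main loop covers MBs 0 to 43 and the remainder test activates context zero alone to encode MB 44.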
mbi = mbi + (width / 16);
Figure 4-11 Realignment of mbi pointer to point to the location within the MB structure
representing the first MB in the subsequent row
Once a full row has been encoded and the MVs for each MB computed and stored in the
correct location of the MB structure, the next iteration of the outer loop is carried out,
incrementing j (Figure 4-8) by 16 pixels and moving down to the next row of MBs.
Before this can happen the original mbinfo pointer (mbi) is moved to the first MB of the
new row as shown in Figure 4-11. Before discussing the threading of the transformation
function, a brief description of the non-threaded prediction function is given. Within
predict(), the MVs produced by ME are combined with the reference frame to produce
the predicted frame. This frame, for P and B frames, is based solely on prediction from
ME. For I frames, where no ME is carried out, the prediction frame is set to pixel value
128.
4.2.1.2. Transform (function transform())
The transform function is responsible for taking the current frame and the predicted frame
produced by predict(), producing a residue error from these two frames and
performing a DCT transformation on each block of this error frame. The function takes as
its inputs the current and predicted frames along with a pointer to the mbinfo structure,
and the transform is implemented as a three-level loop; these loops raster-scan each MB
and then, within each MB, scan the blocks.
In the innermost loop, at the block level, the code can be divided into three sections. The
first examines the frame type, progressive or interlaced, and the colour sub-sampling of
the frame, thus identifying the number of each block type in the MB and the type of the
block under test. The second includes the function sub_pred() and goes through all 64
pixels in both the current block and the predicted block, subtracting one from the other
and saving the difference (error) in a simple structure known as block. Thirdly, once this
error block has been computed, the final section of code in this inner loop performs the
forward DCT transformation. Here, fdct() takes as its input the corresponding error
already computed in block, performs the DCT on the data and saves the output
back to the same location.
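The second section, the computation of the error block, can be sketched as below. This is a minimal illustration written for this text rather than a copy of the TM5 routine: the function name residual_8x8 is hypothetical, and it is assumed that the frames are stored as 8-bit samples with lx bytes between vertically adjacent pixels and that the error block is held as 64 consecutive 16-bit values.

```c
#include <assert.h>

/* Residual of one 8x8 block: blk = cur - pred, pixel by pixel.
   lx is the distance in bytes between vertically adjacent pixels. */
static void residual_8x8(const unsigned char *pred, const unsigned char *cur,
                         int lx, short *blk)
{
    int i, j;

    assert(lx >= 8);
    for (j = 0; j < 8; j++) {
        for (i = 0; i < 8; i++)
            blk[i] = (short)(cur[i] - pred[i]);
        blk += 8;     /* next row of the error block */
        cur += lx;    /* next row of the current frame */
        pred += lx;   /* next row of the predicted frame */
    }
}
```

With an I frame, where the prediction is fixed at 128, the residual is simply the current block shifted to a signed range centred on zero.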
To allow the transform to be threaded, the variables used in the parallel section need to be
examined. All the calculated values in section one are private to each block and thus can
be calculated in parallel, the results of these calculations being saved in a shared array, as
depicted in Figure 4-12.
static int cc[MAX_THREAD]; 
static int offs[MAX_THREAD]; 
static int lx[MAX_THREAD]; 
Figure 4-12 Declaration of shared memory arrays for storing block information required by 
the sub_pred() and fdct() functions. 
Due to the internal structure of the sub_pred() function, it is executed serially by a
single thread. To allow a single thread to call sub_pred() using the information
gathered by each of the other processors in the first section, the data obtained needs
to be stored in an accessible form. This is achieved by the use of shared data arrays as
defined in Figure 4-12.
The arrays are created in the shared memory space using the static keyword, so that
they are visible to all processors. To allow for write-exclusivity, each thread is only able
to write to its specified index of the array. By specifying a single separate location for
each of the threads to write to within the shared memory space, private data is made
readable by the other threads.
To ensure data consistency in these shared memory arrays, a method is needed to guarantee
that all the write operations have been completed before any read is performed. This is
achieved by the use of a barrier instruction. The barrier allows all the processors to
finish writing to the array before the data is read back.
BARRIER
if (!context)
    for (x=0; x<MAX_THREAD; x++)
        sub_pred(pred[cc[x]]+offs[x], cur[cc[x]]+offs[x], lx[x],
                 blocks[k*block_count+(x+(n*MAX_THREAD))]);
Figure 4-13 Code segment representing context zero executing a serial loop in which
sub_pred() is called using each individual thread's privately stored data. The barrier
instruction ensures all write operations to shared variables required by sub_pred() are
complete prior to the execution of the serial loop.
Figure 4-13 depicts how a single thread (context zero) executes sub_pred()
sequentially by accessing the private data stored in the shared memory arrays, following
the synchronising of all threads at the barrier point.
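The write, barrier and serial-read pattern can be sketched outside the simulator. The fragment below is an assumption-laden illustration: it substitutes POSIX threads and pthread_barrier_wait() for the sim-system contexts and BARRIER instruction used in this work, the values written to the shared arrays are arbitrary placeholders, and the names worker, run_barrier_demo and serial_sum are hypothetical.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <pthread.h>

#define MAX_THREAD 4

static int cc[MAX_THREAD];        /* one slot per context, as in Figure 4-12 */
static int offs[MAX_THREAD];
static int serial_sum;            /* written by context zero only */
static pthread_barrier_t barrier; /* stand-in for the BARRIER instruction */

static void *worker(void *arg)
{
    int context = *(int *)arg;

    /* parallel section: each context writes only to its own index */
    cc[context] = context % 3;
    offs[context] = 8 * context;

    /* all writes are complete before any context continues */
    pthread_barrier_wait(&barrier);

    /* serial section: context zero reads every context's slot */
    if (!context) {
        int x;
        for (x = 0; x < MAX_THREAD; x++)
            serial_sum += cc[x] + offs[x];
    }
    return NULL;
}

/* Runs MAX_THREAD workers and returns the value accumulated by context zero,
   or -1 if the barrier could not be initialised. */
static int run_barrier_demo(void)
{
    pthread_t tid[MAX_THREAD];
    int id[MAX_THREAD];
    int t, rc;

    serial_sum = 0;
    if (pthread_barrier_init(&barrier, NULL, MAX_THREAD) != 0)
        return -1;
    for (t = 0; t < MAX_THREAD; t++) {
        id[t] = t;
        rc = pthread_create(&tid[t], NULL, worker, &id[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (t = 0; t < MAX_THREAD; t++)
        pthread_join(tid[t], NULL);
    pthread_barrier_destroy(&barrier);
    return serial_sum;
}
```

Without the barrier, context zero could begin its serial loop before the other contexts had written their slots, which is the hazard the barrier in Figure 4-13 is there to remove.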
fdct(blocks[k*block_count+(context+(n*MAX_THREAD))]);
Figure 4-14 Parallel implementation of FDCT including the reallocation of block number
based on the loop iteration and context ID.
Once all the predicted errors have been calculated, the data can be transformed by
fdct(). The transform function accesses independent data sets and thus can be executed
in parallel with no exclusivity issues. As with motion_estimation(), a remainder
section is added after the main threaded loop. Using the same approach as earlier, the
modulus operator selects the threads and blocks to use in this case.
4.2.1.3. Inverse transform (function itransform()) 
The inverse transform function is responsible for reconstructing the frames used as
references in motion compensation. To preserve the quality of MV predictions, the reference
frames in both the encoder and decoder on which these predictions are based need to be
identical. Using frames obtained directly from the input source bitstream as references
breaks this rule, since the decoder does not have access to these original frames.
However, the frames that the decoder can access are those previously decoded. To ensure
that the encoder uses only such frames, it also has to decode the transmitted bitstream and
this is achieved by converting the quantised DCT coefficients into pixel data. It is these
quantised DCT coefficients, with the addition of the MVs produced by ME, that are used
by itransform() to produce pixel data. As in the forward transform function,
itransform() scans the MBs row by row and block by block within each MB. The
structure of the itransform() function mirrors that of the transform() function in its
three sections. As before, the first section calculates the exact type and location of the
current block. These calculations are stored in a shared memory array, allowing for
exclusive write access from each thread in addition to allowing communal read access.
The first major difference between the transform and its inverse is the order in which the
latter two sections are processed. Due to these functions being the complement of one
another, the order of the calculations has to be reversed to allow for correct information
retrieval.
idct(blocks[k*block_count+(context+(n*MAX_THREAD))]);
Figure 4-15 Parallel implementation of inverse DCT within MPEG-2 encoder 
The function that performs the inverse transformation is idct(). This function takes as
its input the quantised DCT coefficients, stored in the structure block, performs an inverse
DCT on this data and writes the residual pixel data back into the same location in block.
This function can be executed in parallel due to each call operating on exclusive memory
locations. To ensure that each thread has carried out its allotted transformation before the
encoder moves on to the final section, a barrier instruction is inserted.
BARRIER
if (!context)
    for (x=0; x<MAX_THREAD; x++)
        add_pred(pred[cc[x]]+offs[x], cur[cc[x]]+offs[x], lx[x],
                 blocks[k*block_count+(x+(n*MAX_THREAD))]);
Figure 4-16 Code segment representing context zero executing a serial loop in which
add_pred() is called using each individual thread's privately stored data. The barrier
instruction ensures all write operations to shared variables required by add_pred() are
complete prior to the execution of the serial loop.
As shown in Figure 4-16, sub_pred() is replaced by add_pred() in itransform().
This function is responsible for reconstructing the current frame using the freshly
transformed pixel data containing the quantisation errors. Using this pixel data, along
with the MVs and the reference frame on which the MVs were based, add_pred()
produces the approximation of the current frame that the standalone decoder will
reproduce. On completion, predictions can be based on this frame. A remainder section
completes the threading of itransform().
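The reconstruction step can be sketched in the same style as the residual computation. The fragment below is an illustration and not the TM5 routine: the names clip_pixel and reconstruct_8x8 are hypothetical, and it is assumed that the reconstructed samples are clipped to the 8-bit range 0 to 255 before being written back to the frame.

```c
#include <assert.h>

/* Clamp a reconstructed sample to the 8-bit pixel range. */
static unsigned char clip_pixel(int v)
{
    return (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Reconstruct one 8x8 block: cur = clip(blk + pred), pixel by pixel.
   lx is the distance in bytes between vertically adjacent pixels. */
static void reconstruct_8x8(const unsigned char *pred, unsigned char *cur,
                            int lx, const short *blk)
{
    int i, j;

    assert(lx >= 8);
    for (j = 0; j < 8; j++) {
        for (i = 0; i < 8; i++)
            cur[i] = clip_pixel(blk[i] + pred[i]);
        blk += 8;     /* next row of the residual block */
        cur += lx;    /* next row of the reconstructed frame */
        pred += lx;   /* next row of the predicted frame */
    }
}
```

Because the residual passed in has been through quantisation and inverse quantisation, the reconstructed block generally differs slightly from the original, which is why the encoder bases later predictions on this frame rather than on the source frame.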
4.2.2. Fast ME algorithms
The Test Model 5 encoder described in section 4.2.1 employs a basic and computationally
expensive method of computing MVs. The encoder uses a method known as full search,
which analyses every MB within a predefined area by calculating the sum of absolute
difference (SAD) values and producing MVs relative to the MB with the lowest error
value. The drawback of this approach is the number of SAD calculations required.
Various adaptive search methods have been developed to reduce the number of MB tests
while still producing MVs of comparable quality to full search. What follows is a
description of twelve adaptive ME search methods implemented for MPEG-2 and the
respective methods by which all these methods were threaded.
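The cost of full search is easiest to see in code. The following sketch is not the TM5 implementation: it assumes 16x16 luminance MBs, integer-pixel displacements only, frames stored row by row with a stride equal to the frame width, and candidates outside the reference frame simply being skipped. The names sad_16x16 and full_search are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Sum of absolute differences between two 16x16 blocks; lx is the row stride. */
static int sad_16x16(const unsigned char *a, const unsigned char *b, int lx)
{
    int i, j, s = 0;

    for (j = 0; j < 16; j++) {
        for (i = 0; i < 16; i++)
            s += abs(a[i] - b[i]);
        a += lx;
        b += lx;
    }
    return s;
}

/* Exhaustive search over every displacement within +/-range of the MB whose
   top-left pixel is (x0, y0). Returns the lowest SAD found and stores the
   corresponding displacement in (*mvx, *mvy). */
static int full_search(const unsigned char *ref, const unsigned char *cur,
                       int width, int height, int x0, int y0, int range,
                       int *mvx, int *mvy)
{
    int dx, dy, best = INT_MAX;

    assert(range >= 0);
    *mvx = 0;
    *mvy = 0;
    for (dy = -range; dy <= range; dy++) {
        for (dx = -range; dx <= range; dx++) {
            int x = x0 + dx, y = y0 + dy, s;

            if (x < 0 || y < 0 || x + 16 > width || y + 16 > height)
                continue;   /* candidate lies partly outside the frame */
            s = sad_16x16(ref + y * width + x, cur + y0 * width + x0, width);
            if (s < best) {
                best = s;
                *mvx = dx;
                *mvy = dy;
            }
        }
    }
    return best;
}
```

A search range of +/-r tests up to (2r+1)^2 candidate positions per MB, each requiring 256 absolute differences, which is the workload the adaptive methods described next aim to reduce.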
T hree-stepf91 search is very popular ME method because of its simplicity, robusmess and 
near-optima l pe rformance. lL searches for the best motion vectors in a coarse-to- fine 
square search pauem. Four-step search[ LO] is based on ma ny rea l world image sequences· 
characteristic of centre-biased motion. T he computational complex ity of the four-step 
searc h is less than that of the three step search. while the performance, in terms o f quality 
is as good. 20 loga rithmic search[ 111 yields similar results to the three-step searc h. It 
uses a additi on (+) pattem for testing MB instead of a square pauern used three-set 
search. Orthogona l search[l 2j is a hybrid method based on the three-step and the 20 
logarithmic searches, whic h searc hes first in the vertical direction and then the horizonta l 
direc tion for each iteration of the search. Large d ia mond search[l3] uses a large nine-
block dia mond pattern followed by a sma ller fi ve block diamond for fine grain 
adjustment once the coarser search has centred on a block. Modified diamond searc h uses 
a sma ll di amond as a search window and continua ll y evaluates and moves until a target is 
acquired. Centre-biased d iamond search uses a dia mond o f inc reasing size (one, two and 
four step s izes) to locate the best blocks. Hie rarchical d iamond search uses a similar 
technique to the centre-biased diamond searc h, but begins with a step size of four and 
4. 1-/ardiVare Techniques for E.xploiring TLP /09 
subsequently reduces the step size. Conjugate direction search[14] is similar to
orthogonal search but carries out all horizontal searches before the vertical MBs are
examined. In the cross search algorithm a logarithmic step search is carried out; however,
the search locations picked are the end points of an x pattern rather than a + pattern[15].
The spiral search approach combines elements from both logarithmic and cross search by
using both an x and a + pattern with a gradually decreasing step size[16]. The New Three
Step search[17] is a modified version of the three-step search with additional tests in the
first step to quickly identify zero or near-zero motion. Because each of these methods is
implemented in frameME(), the threading process for each is no different from that for
full search. Since frameME() is a high-level function, it is independent of the search
method used to calculate the MV.
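As an illustration of the coarse-to-fine square pattern described for the three-step search, the following is a minimal sketch rather than the code used in this work. The names sad16, mv_t and three_step_search, the 16x16 block size and the step sizes of 4, 2 and 1 are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdlib.h>

#define BLOCK 16   /* assumed MB size in pixels */

typedef struct { int x, y; } mv_t;   /* hypothetical motion vector type */

/* Sum of absolute differences between the BLOCK x BLOCK block of cur at
 * (cx, cy) and the block of ref at (rx, ry). */
static unsigned sad16(const unsigned char *cur, const unsigned char *ref,
                      int stride, int cx, int cy, int rx, int ry)
{
    unsigned sum = 0;
    for (int j = 0; j < BLOCK; j++)
        for (int i = 0; i < BLOCK; i++)
            sum += (unsigned)abs((int)cur[(cy + j) * stride + cx + i] -
                                 (int)ref[(ry + j) * stride + rx + i]);
    return sum;
}

/* Three-step search: test the nine points of a square around the current
 * centre, move to the best one, halve the step and repeat. */
mv_t three_step_search(const unsigned char *cur, const unsigned char *ref,
                       int width, int height, int cx, int cy)
{
    assert(cx >= 0 && cy >= 0 && cx + BLOCK <= width && cy + BLOCK <= height);
    mv_t best = {0, 0};
    unsigned best_sad = sad16(cur, ref, width, cx, cy, cx, cy);

    for (int step = 4; step >= 1; step /= 2) {
        mv_t centre = best;                      /* centre is fixed for this step */
        for (int dy = -step; dy <= step; dy += step) {
            for (int dx = -step; dx <= step; dx += step) {
                int rx = cx + centre.x + dx;
                int ry = cy + centre.y + dy;
                if (rx < 0 || ry < 0 || rx + BLOCK > width || ry + BLOCK > height)
                    continue;                    /* candidate falls outside the frame */
                unsigned s = sad16(cur, ref, width, cx, cy, rx, ry);
                if (s < best_sad) {
                    best_sad = s;
                    best.x = centre.x + dx;
                    best.y = centre.y + dy;
                }
            }
        }
    }
    return best;
}
```

Because the function only reads the two frames and returns a vector for one MB, it could be swapped for any of the other search methods without changing how the calling loop is threaded, which is the point made above.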
4.3. The MPEG-4 Visual video compression standard 
The second video coding standard addressed in this work is MPEG-4[2]. There is a large
number of software codecs that produce MPEG-4 Visual compatible bitstreams[18-23].
One of these, XviD, is released under the Free Software Foundation's General Public
License (GPL)[24]. This license ensures that the source code is freely available and that
any product that uses it also becomes open source. The XviD encoder is compatible with the
advanced simple profile of the MPEG-4 Visual standard.
Year                      Product         Owner
1998 - 2002               Divx ;) 3.11a   Modified version of Microsoft MPEG-4 Version 3 encoder
Jan 2001 - July 2001      OpenDivx        DivxNetworks, Inc
July 2001 - March 2002    Divx4           DivxNetworks, Inc
July 2001 - Present       XviD            Open source
March 2002 - July 2005    Divx5           Divx, Inc (formerly DivxNetworks)
July 2005 - Present       Divx6           Divx, Inc
Figure 4-17 A timeline view of both the XviD encoder and its rival encoder, Divx.
The origins of XviD date back to 1998 when Jerome Rota reverse engineered and
modified Microsoft's MPEG-4 version 3 video encoder, allowing it to output to container
file formats such as audio video interleave (AVI) instead of Microsoft's more restrictive
advanced system format (ASF). This compression scheme, known as Divx ;) 3.11a, gained
huge popularity on file sharing networks on the internet as it provided free, high quality
video at a much lower file size compared to other codecs available at the time. Due to
legal issues with using this modified codec, in January 2001 DivxNetworks set up an
open source project, known as OpenDivx, to create an MPEG-4 encoder free from
ownership issues. Later DivxNetworks decided to close OpenDivx and use the code as the
basis to develop a closed source commercial codec known as Divx4; however, the open
source community kept developing its own codec into what is known as XviD (Divx spelt
backwards).
The version of XviD used in this project was 0.7, which includes only the 'simple profile'
and not the 'advanced simple profile' available in the current version (1.1).
[Figure 4-18: bar chart. Horizontal axis: Quality Settings (Q1, Q3, Q5). Vertical axis ticks:
0 to 90. Legend: Quantisation, DCT, ME, MC, Total threaded, Total non-threaded.]
Figure 4-18 Profiling data for the XviD encoder over three encoder quality settings,
illustrating the percentage of execution time spent in executing each functional compression
block.
To examine the functions that would potentially benefit from exploiting TLP, a similar
methodology to that in the MPEG-2 development was followed. Here the non-threaded
encoder was set to encode a test sequence at various quality settings, enabling a number of
features in the encoder used to achieve compression of the input frame. The lowest
quality setting is Q1 and implements very basic operations in which a single MV is
produced for each MB, with MC precision restricted to full pixel resolution. Quality
setting Q3 increases the number of potential compression methods available to the
encoder, allowing for up to four MVs to be produced for each MB, as well as increasing
the MC precision to half pixel. Quality Q5 increases the number of available methods still
further with the addition of chroma ME and the introduction of trellis quantisation. Figure
4-18 shows the percentage of instructions that are associated with the four highest-level
functional blocks, namely quantisation, DCT, ME and MC. The bars depicting the total
threaded percentage represent the instructions that would be executed in parallel if these
functional blocks were threaded successfully. If this is indeed successful, over 70% of the
instructions required by the encoder would be executed in a parallel environment, leading
to potentially substantial performance improvements.
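To give a feel for what a 70% threaded share could mean, the sketch below applies Amdahl's law. Amdahl's law is not invoked in the text and is brought in here only as a familiar upper bound. It assumes that the instruction share approximates the share of execution time and ignores the serial sections and synchronisation that the threaded code still contains. The function name amdahl_speedup is hypothetical.

```c
#include <assert.h>

/* Upper bound on speedup when a fraction p of the work is spread evenly
 * over n processor contexts and the rest stays serial (Amdahl's law). */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

With p = 0.7 this bound is roughly 1.54 for two contexts and 2.11 for four, and it can never exceed 1/0.3, about 3.33, however many contexts are added.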
Usage : xvid_encraw [OPTIONS]
Options
 -asm            use assembly code
 -w integer      frame width ([1..2048])
 -h integer      frame height ([1..2048])
 -b integer      target bitrate (>0 | default=900kbit)
 -f float        target framerate (>0)
 -i string       input filename (default=stdin)
 -t integer      input data type (yuv=0, pgm=1)
 -n integer      number of frames to encode
 -q integer      quality ([0..5])
 -m boolean      save mpeg4 raw stream (0 False*, !=0 True)
 -o string       output container filename (only usefull
                 when -m 1 is used) :
                   When this option is not used :
                     one file per encoded frame
                   When this option is used :
                     + stream.m4v with -mt 0
                     + stream.mp4u with -mt 1
 -mt integer     output type (m4v=0, mp4u=1)
 -mv integer     Use motion vector hints (no hints=0,
                 get hints=1, set hints=2)
 -help           prints this help message
 -quant integer  fixed quantizer (disables -b setting)
 (* means default)
Figure 4-19 Command line options available within the xvid_encraw example programme.
Each option specifies parameters that can be used in the encoding procedure.
Unlike the TM5 MPEG-2 encoder, the XviD codec employs both a dynamically and a
statically linked library. To illustrate the access mechanism available, an additional
executable programme, xvid_encraw, is also provided. This programme's encoding
parameters are passed to the XviD library via command-line switches, as shown in Figure
4-19. Once all these switches have been validated and the encoder's internal variables and
structures initialised, the main loop in the programme is executed.
do {
    read_yuvdata();   /* read one frame of pixel data */
    enc_main();       /* encode that frame */
} while (filenr < ARG_MAXFRAMENR);
Figure 4-20 Encoding loop of xvid_encraw responsible for reading in one frame's pixel data,
in either yuv or pgm picture format, and subsequently calling XviD's main encoding function
enc_main() to encode that frame.
Figure 4-20 shows the outermost loop of the XviD encoder. Here each frame is read in
from file and is encoded by enc_main(). In enc_main(), the programme flow forks
depending on the current frame type. The main functions responsible for each frame type
are FrameCodeI() and FrameCodeP() for I frames and P frames respectively.
if (pFrame->intra == 1) {
    pFrame->intra = FrameCodeI(pEnc, &bs, &bits);
} else {
    pFrame->intra = FrameCodeP(pEnc, &bs, &bits, 1,
                               write_vol_header);
}
Figure 4-21 Selection of frame encoding function (FrameCodeI() or FrameCodeP()) based
on the frame type flag, pFrame->intra.
If the pFrame->intra flag is set, an I frame is encoded. This occurs when either the internal
scene detection algorithm detects that a scene change has occurred or when the automatic
keyframe (I frame) insertion code detects that the number of frames since the last I frame
has passed a preset limit. This guarantees that there is a maximum separation between I
frames to aid seeking and error correction.
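The two triggers for an I frame can be summarised in a few lines. This is only a sketch of the decision as described above. The function and parameter names are hypothetical, and treating the limit as "greater than or equal" is an assumption.

```c
#include <assert.h>

/* Returns 1 when the next frame should be coded as an I frame, 0 for a P frame.
 * scene_change:        non-zero if a scene change was detected
 * frames_since_last_i: frames encoded since the previous I frame
 * max_key_interval:    preset limit on the separation between I frames */
int choose_intra(int scene_change, int frames_since_last_i, int max_key_interval)
{
    assert(frames_since_last_i >= 0 && max_key_interval > 0);
    if (scene_change)
        return 1;
    return frames_since_last_i >= max_key_interval;
}
```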
4.3.1. FrameCodeI
As in its parent function enc_main(), FrameCodeI() is responsible for encoding one
frame per call. After writing frame header information to the bitstream, this function
descends from the frame level to the MB level for encoding.
for (y = 0; y < pEnc->mbParam.mb_height; y++)
    for (x = 0; x < pEnc->mbParam.mb_width; x++)
Figure 4-22 Single-threaded double loop responsible for identifying and accessing MBs
in raster scan order.
To access the individual MBs of the frame, two loops are used, as shown in Figure 4-22.
These loops locate each MB with a set of Cartesian coordinates in raster scan order. It
was the inner loop, responsible for scanning MBs in a given row, which was parallelised.
In this threaded implementation of the XviD encoder the available processor contexts
execute parallel iterations through the inner loop.
iteration = pEnc->mbParam.mb_width / MAX_THREAD;
for (y = 0; y < pEnc->mbParam.mb_height; y++)
{
    for (x_new = 0; x_new < iteration; x_new++)
    {
        x = context + (x_new * MAX_THREAD);
Figure 4-23 FrameCodeI() parallel double loop, illustrating the calculation of the maximum
loop parameter for the inner loop, the modified inner loop using this new maximum and the
recreation of the original loop index, x.
To achieve the parallel implementation, the maximum iteration loop parameter is reduced
to take into account the additional processors (threads) executing in parallel per loop
iteration, as shown in Figure 4-23. This newly reduced loop has also been assigned a new
loop index, x_new, since doing so allows the private variable, x, to be recreated for
each thread, so that the correct location within the row is known.
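The index arithmetic in Figure 4-23 can be isolated as two small helpers. This is a sketch with hypothetical function names. It restates the expressions from the figure rather than adding anything to them.

```c
#include <assert.h>

/* Number of passes through the reduced inner loop for a row of mb_width MBs. */
int full_passes(int mb_width, int max_thread)
{
    assert(mb_width >= 0 && max_thread > 0);
    return mb_width / max_thread;   /* integer division drops the remainder */
}

/* Column (original loop index x) handled by a context on pass x_new. */
int mb_column(int context, int x_new, int max_thread)
{
    assert(max_thread > 0 && context >= 0 && context < max_thread && x_new >= 0);
    return context + x_new * max_thread;
}
```

For a CIF frame, which is 22 MBs wide, four contexts give five full passes. On the third pass (x_new = 2) context 1 works on column 9, while its neighbours work on columns 8, 10 and 11.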
The data structure that contains information for each MB in the frame is
pEnc->current->mbs[], and is pre-allocated space in memory during the initialisation
function. In the single-threaded implementation of the encoder, a single pointer, pMB, is
used during each iteration of the loop. This allows all the data acquired, such as MVs and
quantisation values, to be stored in the correct memory location for the given MB under
test. In this multi-threaded implementation, having just a single pointer is not practical
since each processor has to access the MB structure simultaneously and modify that
pointer.
static MACROBLOCK *pMB[MAX_THREAD];
Figure 4-24 Declaration of shared array of pointers storing MB locations for each thread.
The shared array of pointers, as shown in Figure 4-24, allows one element of the array for
each processor context of the system. In the threaded version, a shared array was used
instead of private pointers for each context because it allows serial sections within
the parallel loop to use the location of each thread's MB data.
pMB[context] =
    &pEnc->current->mbs[x + y * pEnc->mbParam.mb_width];
Figure 4-25 Allocation of each element within the MB pointer array to its corresponding
location within the MB structure.
Using the Cartesian coordinates for MB location, y from the outer loop index and x
recreated for each thread, the location of the MB data structure is found and assigned to
the correct index of the shared array. With the access mechanism to the MB structure
constructed, the remainder of the loop executes four functions in order to encode the MB
(Figure 4-26).
for each row in the frame
{
    for each MB in the row
    {
        CodeIntraMB()
        MBTransQuantIntra()
        MBPrediction()
        MBCoding()
    }
}
Figure 4-26 Pseudocode representation of the functions called within FrameCodeI().
CodeIntraMB() is responsible for initialising MBs for intra encoding and this is carried
out in two separate processes. Firstly, all MVs of the MB are set to zero as MVs are not
required for intra encoded frames. Secondly, a special adaptive quantisation technique
known as lumimasking is performed, where MBs with either very high or very low
luminance values are quantised more coarsely, and hence have their quality reduced, since
the human eye is not sensitive to such regions. This in turn allows more bits to be available
for other areas of the frame. When lumimasking is selected, the MBs that have been
identified by adaptive_quantisation() are permitted to alter the global frame
quantisation value. Since every MB uses this shared quantisation value, any changes
caused by lumimasking affect the remaining MBs in the frame. Because of this, in the
threaded case, individual copies of the shared quantisation factor need to be stored for
each MB. To achieve this, a serial section is created to calculate the quantisation values
along with any change arising due to lumimasking and to store this value in each MB
structure, as shown in Figure 4-27.
for (serial = 0; serial < MAX_THREAD; serial++) {
    if ((pEnc->current->global_flags & XVID_LUMIMASKING)) {
        if (pMB[serial]->dquant != NO_CHANGE) {
            pMB[serial]->mode = MODE_INTRA_Q;
            pEnc->current->quant += DQtab[pMB[serial]->dquant];
            if (pEnc->current->quant > 31)
                pEnc->current->quant = 31;
            if (pEnc->current->quant < 1)
                pEnc->current->quant = 1;
        }
    }
    /* store a private copy of the frame quantiser in this thread's MB */
    pMB[serial]->quant = pEnc->current->quant;
}
Figure 4-27 A serial loop to calculate and store private quantisation factors for each
available thread, based on whether lumimasking is being implemented.
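The effect of the serial section can be checked in isolation with a reduced model of the MB structure. The sketch below is self-contained and makes several assumptions: the type mb_t, the function store_private_quant, the value of NO_CHANGE and the contents of DQTAB are placeholders for illustration and are not taken from the XviD source.

```c
#include <assert.h>

#define MAX_THREAD 4
#define NO_CHANGE  64          /* hypothetical sentinel: leave the quantiser alone */

/* Hypothetical quantiser deltas indexed by dquant. */
static const int DQTAB[4] = { -1, -2, 1, 2 };

typedef struct {
    int dquant;   /* index into DQTAB, or NO_CHANGE */
    int quant;    /* private copy of the frame quantiser for this MB */
} mb_t;

/* Visits the MBs of one parallel pass in context order, as a single thread
 * would, so each MB records the quantiser value it would have seen serially.
 * Returns the updated frame quantiser. */
int store_private_quant(mb_t *mb[MAX_THREAD], int frame_quant, int lumimasking)
{
    assert(frame_quant >= 1 && frame_quant <= 31);
    for (int serial = 0; serial < MAX_THREAD; serial++) {
        if (lumimasking && mb[serial]->dquant != NO_CHANGE) {
            frame_quant += DQTAB[mb[serial]->dquant];
            if (frame_quant > 31) frame_quant = 31;   /* clamp to the legal range */
            if (frame_quant < 1)  frame_quant = 1;
        }
        mb[serial]->quant = frame_quant;
    }
    return frame_quant;
}
```

Because the loop runs in context order, the private values match those a single-threaded encoder would have produced for the same four MBs, which is what allows the later transform stage to run in parallel.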
Since the quantisation factor is now stored for each MB in the MB structure itself and not
as a shared variable, functions that use this value have to be altered to read the value from
the correct location. This can be seen in MBTransQuantIntra(), which carries out the
main functionality in the loop. In MBTransQuantIntra(), pixel data is transformed
from the spatial domain to the frequency domain, quantised, and then dequantised and
inversely transformed back into the spatial domain for use as reference frames.
MBTransQuantIntraMT(
    &pEnc->mbParam, pEnc->current, pMB[context], x, y,
    dct_codes[context], qcoeff[context], pMB[context]->quant);
Figure 4-28 Modified MBTransQuantIntra to accept a private MB quantisation value.
This function, now called MBTransQuantIntraMT(), has been modified to accept the
additional private MB quantisation factor. The modification involves discarding the
global quantisation value and using the one passed as a private value instead. With the
creation of this new function, the data dependencies in MBTransQuant() are removed,
allowing it to be executed in parallel. This modification permits the main computationally
expensive function that encodes intra frames to be executed in parallel. The remaining
functions in the loop are MBPrediction() and MBCoding(), which are responsible for
the variable length coding of the resultant transformed and quantised data and their
storage in the bitstream buffer. Due to the nature of this write operation, these functions
are executed serially.
if (context < (pEnc->mbParam.mb_width % MAX_THREAD))
{
    CodeIntraMB()
    MBTransQuantIntra()
    MBPrediction()
    MBCoding()
}
Figure 4-29 Selection of threads to process the remainder (stripmine) MBs through the use of
the modulus operator.
As in the threading of MPEG-2, to ensure that every MB in a row is encoded, a remainder
section is included that uses the modulus operator to select the threads to process those
MBs not already processed.
Maximum contexts = 2
  Iteration   1     2     3     4     5     R
  Context     0 1   0 1   0 1   0 1   0 1   0
Maximum contexts = 3
  Iteration   1       2       3       R
  Context     0 1 2   0 1 2   0 1 2   0 1
Maximum contexts = 4
  Iteration   1         2         R
  Context     0 1 2 3   0 1 2 3   0 1 2
(R = remainder section)
Figure 4-30 Graphic representation of MB encoding on a row of 11 MBs for two, three and
four CPU contexts. Remainder (stripmined) MBs are shown with a red border.
As an aid to understanding the MB processing and the remainder (strip mining) operation,
Figure 4-30 shows the encoding of an 11 MB wide row for two contexts, three contexts and
four contexts. It can be seen that for two contexts there are five iterations of the main loop
with only context zero executing the remainder section. For a maximum of three contexts,
however, there are three iterations of the main loop and now both context zero and
context one have to process MBs in the remainder section. Finally, for four contexts there
are only two iterations of the main loop, but now contexts zero to two need to process
additional MBs in the remainder section.
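The row coverage in Figure 4-30 can be reproduced with a short routine that records which context encodes each column. This is a sketch with a hypothetical function name. It walks the contexts one after another for clarity, whereas in the encoder each context computes only its own columns. The column used in the remainder section, context + iterations * max_thread, is an assumption consistent with the figure.

```c
#include <assert.h>

/* Fills owner[x] with the context that encodes column x of one row.
 * Returns the number of full iterations of the main loop; contexts whose
 * index is below *remainder also execute the remainder section. */
int assign_row(int mb_width, int max_thread, int *owner, int *remainder)
{
    assert(mb_width > 0 && max_thread > 0);
    int iterations = mb_width / max_thread;
    *remainder = mb_width % max_thread;

    /* main loop: every context takes one MB per iteration */
    for (int x_new = 0; x_new < iterations; x_new++)
        for (int context = 0; context < max_thread; context++)
            owner[context + x_new * max_thread] = context;

    /* remainder section: only the lowest-numbered contexts take one more MB */
    for (int context = 0; context < *remainder; context++)
        owner[context + iterations * max_thread] = context;

    return iterations;
}
```

For an 11 MB row this gives five, three and two full iterations for two, three and four contexts, with one, two and three contexts respectively entering the remainder section, matching the figure.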
4.3.2. FrameCodeP
P frame encoding by the XviD encoder is carried out in the FrameCodeP() function
which, like FrameCodeI(), processes one frame per call. The function can be split into
two main sections.
MotionEstimation()
for each row in the frame
{
    for each MB in the row
    {
        MotionCompensation()    or    MBTransQuantIntra()
        MBTransQuantInter()     or    MBPredict()
        MBCoding()
    }
}
Figure 4-31 Pseudocode segment representing the function calls in FrameCodeP().
Depending on the outcome of MotionEstimation(), two encoding techniques are implemented
during the latter part of this function.
As can be seen in Figure 4-31, the main functional blocks of FrameCodeP() can be
executed in one of two distinct ways, depending on whether the current MB is to be encoded
using intra or inter methods. Firstly, MotionEstimation() is carried out on the frame as a
whole. The internal ME process is discussed in section 4.3.2.1. If certain criteria are met,
the whole frame may be encoded as an I frame by calling FrameCodeI() directly after
ME and hence forfeiting the remainder of FrameCodeP().
The main loop in FrameCodeP(), like that in FrameCodeI(), is responsible for transform,
quantisation and variable length coding. This is executed on a per MB basis, row by row.
iteration = pEnc->mbParam.mb_width / MAX_THREAD;
for (y = 0; y < pEnc->mbParam.mb_height; y++)
{
    for (x_new = 0; x_new < iteration; x_new++)
    {
Figure 4-32 FrameCodeP() parallel double loop showing the calculation of the maximum
loop parameter for the inner loop.
The reduction of loop indices (Figure 4-32), the recalculation of x and the setting up of
pMB[] are carried out as in the previous descriptions of the parallel implementation. In an
inter frame it is permitted that an MB be entirely encoded using intra techniques.
Consequently, two alternative program flows are needed in the loop, containing the
inter and the intra encoding functions respectively.
bIntra[context] = (pMB[context]->mode == MODE_INTRA) ||
                  (pMB[context]->mode == MODE_INTRA_Q);
Figure 4-33 Asserting a flag within the shared memory array bIntra[] for a given thread
depending on the type of the MB being processed by that thread. Flagging an MB ensures that
it is encoded using intra techniques.
Figure 4-33 shows how, by using a shared memory array, bIntra[], each thread in the
row flags whether or not its specific MB is to be encoded using intra techniques.
For inter MBs, before transformation can take place, the residue image needs to be
created through MC. Here MBMotionCompensation() takes the MVs created by ME
and applies these to the reference frame to produce an approximation of the current
frame. This is then subtracted from the current frame to produce the residue (error) frame.
There are no data dependencies associated with this function and thus it is executed in
parallel. Following this, a serial section is inserted for lumimasking corrections. Here, as
with FrameCodeI(), the shared quantisation values are changed and stored depending
on the luminosity of the MB.
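The residue calculation for a single MB can be sketched as follows. This is an illustration only. It assumes full-pixel MVs, a 16x16 luminance block and a caller that keeps the displaced block inside the reference frame, and the function name mb_residue is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MB_SIZE 16

/* residue = current MB - motion-compensated prediction.
 * (x, y) is the top-left pixel of the MB; (mvx, mvy) is its full-pixel MV. */
void mb_residue(const uint8_t *cur, const uint8_t *ref, int stride,
                int x, int y, int mvx, int mvy,
                int16_t residue[MB_SIZE * MB_SIZE])
{
    assert(stride >= MB_SIZE);
    for (int j = 0; j < MB_SIZE; j++) {
        for (int i = 0; i < MB_SIZE; i++) {
            int c = cur[(y + j) * stride + (x + i)];
            int p = ref[(y + mvy + j) * stride + (x + mvx + i)];
            residue[j * MB_SIZE + i] = (int16_t)(c - p);
        }
    }
}
```

The routine reads both frames and writes only to the residue buffer of its own MB, which is why no dependency arises between contexts working on different MBs.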
if (!bIntra[context]) {
    pMB[context]->field_pred = 0;
    pMB[context]->cbp = MBTransQuantInterMT(&pEnc->mbParam,
        pEnc->current, pMB[context], x, y, dct_codes[context],
        qcoeff[context], pMB[context]->quant);
}
else
    MBTransQuantIntraMT(&pEnc->mbParam, pEnc->current,
        pMB[context], x, y, dct_codes[context], qcoeff[context],
        pMB[context]->quant);
Figure 4-34 Selection of MB encoding function based on an intra flag held in the bIntra
shared memory array.
The inter transformation function has been modified in a similar manner to
MBTransQuantIntraMT() to allow the use of private quantisation values. With this
modification, the functions can be executed in parallel, but the final bitstream coding is
executed in a serial loop.
4.3.2.1. Motion Estimation
There are two primary tasks for motion_estimation(). The first is the calculation,
via searching, of MVs for every MB in the current frame and the second is to flag for intra
frame coding if the frame meets specific criteria. There are three variables for which data
dependencies need to be investigated (Figure 4-35), and all three require modification to
allow access to shared memory arrays in order to enable exclusivity of write data access
between contexts (MB prediction storage, and flags for use in intra frame insertion).
static MACROBLOCK *const pMB[MAX_THREAD];
static VECTOR predMV_mod[MAX_THREAD][4];
static int iIntra_array[MAX_THREAD];
Figure 4-35 Creation of shared memory arrays responsible for storing the pointer to the MB
location in the MB structure, the modified prediction pattern results, and the intra frame
encoding flag.
As with the two previous loops, in ME each frame is searched MB by MB, row by row.
Loop index reduction, together with the recalculation of x and the repointing of pMB, is
carried out.
predMV_mod[context][0] = get_pmv2_mod(...);
pMB[context]->sad16 = SEARCH16(...);
Figure 4-36 The combined prediction and search pairing repeatedly executed in ME.
Predictions based on neighbouring MBs in the current frame are used as a start reference
for motion searching using reference frames.
The main process by which MVs are produced is a repetitive sequence of motion
prediction and searching. Figure 4-36 illustrates how a pattern of predictions
(get_pmv2_mod()) and searches (SEARCH16()) is used to produce MVs. This
prediction and search cycle is executed in motion_estimation() and repeated a
number of times dependent on the success of successive searches. After each
predict/search cycle (in addition to detecting if further searches are required), a decision
on whether the MB is to be encoded as an intra MB is taken. If this happens, it is flagged
in a private element of the shared array iIntra_array[]. A private flag is used instead
of a shared count because of the write conflicts that arise if more than one thread attempts
to increment the count at the same time. This eliminates the need for semaphores.
In the original single-threaded code, get_pmv2() is called to produce MV predictions using
data from the surrounding MBs in the current frame. Using these predictions as a starting
point for the search process, the search function calculates the MVs using the data from
the reference frame.
(a) (b) (c) 
Figure 4-37 Prediction patterns: (a) ideal, (b) standard, (c) proposed, each based on the
mean average MV from the selected MBs within the current frame. The MB whose MV is
being calculated is shown as a darkened block in the centre of the nine neighbouring MBs.
To estimate the most accurate prediction of the motion for a given MB, the mean of all
neighbouring MBs' MVs needs to be calculated, as shown in Figure 4-37a. Due to the
raster scan processing order of MBs, when the MV of a given MB is required not all
the MVs of the neighbouring MBs will have been determined. The 'next best solution' is
to use the top, top-right and left MBs since, in raster scan order, these MBs have already
been processed, as shown in Figure 4-37b. This pattern is specified in the MPEG-4
standard. In a multi-processor implementation, however, this prediction pattern is not
practical since the left MB will be processed simultaneously. To overcome this, a new
prediction pattern is proposed in the current work that replaces the left MB with the
top-left MB, so that no data dependency will exist between MBs on the same row, but with
only a minimal increase in bitrate, as shown in Figure 4-37c. This has been implemented in
a modified version of get_pmv2() called get_pmv2_mod(), which stores these modified
vectors in a shared memory array predMV_mod[][].
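The difference between the standard and the proposed patterns comes down to which three neighbours are consulted. The sketch below only selects the neighbour positions. It leaves out frame-edge handling and the step that combines the three MVs, and the names mb_pos_t and prediction_neighbours are hypothetical.

```c
#include <assert.h>

typedef struct { int x, y; } mb_pos_t;

/* Writes the three neighbour positions used to predict the MV of MB (x, y).
 * use_proposed == 0: standard pattern (left, top, top-right)
 * use_proposed != 0: proposed pattern (top-left, top, top-right) */
void prediction_neighbours(int x, int y, int use_proposed, mb_pos_t out[3])
{
    assert(x >= 0 && y >= 0);
    out[0].x = x - 1;
    out[0].y = use_proposed ? y - 1 : y;   /* left becomes top-left */
    out[1].x = x;
    out[1].y = y - 1;                      /* top */
    out[2].x = x + 1;
    out[2].y = y - 1;                      /* top-right */
}
```

With the proposed pattern every neighbour lies on the row above, so all MBs of the current row can be predicted at the same time once the previous row is complete.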
The MPEG-4 standard specifies that the displacement between the calculated and
predicted MVs is stored in the bitstream and not the MV itself. This displacement value is
calculated by subtracting the predicted MV found from get_pmv2() from the MV. This
is a problem for the modified prediction pattern, as decoders will not be aware of
such a pattern and hence will not be able to recover the correct MV from the displacement
values.
if (pMB[serial]->mode == MODE_INTER)
{
    predMV[0] = get_pmv2(pMBs, pParam->mb_width, 0, x, y, 0);
    pMB[serial]->pmvs[0].x = pMB[serial]->pmvs[0].x
                             + predMV_mod[serial][0].x
                             - predMV[0].x;
    pMB[serial]->pmvs[0].y = pMB[serial]->pmvs[0].y
                             + predMV_mod[serial][0].y
                             - predMV[0].y;
}
Figure 4-38 Recalculation of displacement MVs by calculating motion prediction using the
prediction pattern specified in the standard, and basing the new displacement on these
predictions.
To overcome this issue, the method illustrated in Figure 4-38 is proposed. By calling the
unmodified prediction function, get_pmv2(), serially after all the searches have completed,
'correct predictions' according to the standard can be made. Knowing the correct
predictions as well as the modified ones allows the displacement to be recalculated.
After this calculation, any decoder will be able to retrieve the correct MV from the
displacement.
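The recalculation follows from the definition of the displacement. If the search stored d_mod = MV - pred_mod, then the displacement the decoder expects is d_std = MV - pred_std = d_mod + pred_mod - pred_std, which is the expression in Figure 4-38. A self-contained sketch of this step, with hypothetical type and function names, is:

```c
#include <assert.h>

typedef struct { int x, y; } vec_t;

/* Serial pass over the contexts: converts each displacement taken against the
 * modified predictor into one taken against the standard predictor, in place. */
void recalc_displacements(vec_t *diff, const vec_t *pred_mod,
                          const vec_t *pred_std, int count)
{
    assert(diff && pred_mod && pred_std && count >= 0);
    for (int serial = 0; serial < count; serial++) {
        diff[serial].x += pred_mod[serial].x - pred_std[serial].x;
        diff[serial].y += pred_mod[serial].y - pred_std[serial].y;
    }
}
```

A decoder that adds the standard prediction to the corrected displacement then recovers the MV that the search actually found.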
/*****serial******/
iIntra += iIntra_array[serial];
if (iIntra >= iLimit)
    iIntra_flag = 1;
/*****parallel*****/
if (iIntra_flag == 1)
    return 1;
Figure 4-39 Accumulation of intra encoding flags and evaluating the total against predefined
values in a serial loop, in order to find the total number of intra-encoded MBs in the frame.
If the total is above the specified limit, each thread, in parallel, then exits the ME function
with the intra exit value.
The final piece of code added to ME deals with intra coding. In the same serial section
concerned with recalculations of MV predictions, all iIntra_array[] flags are added
to a shared total. As in the single-threaded version, this value, iIntra, is evaluated against
a set limit which indicates whether the frame will be encoded as an intra frame or will
remain an inter frame. The decision as to which coding technique is to be used is indicated
in the ME function by the value it returns to its calling function. If intra-frame coding is
to be used, a value of one is returned, whereas, if the P frame is to be encoded using
inter-frame techniques, a zero is returned. When iIntra reaches iLimit then, to allow
all active threads to exit the function while returning the correct return value, a shared
flag, iIntra_flag, is set in the serial section so that the exit status is
indicated to all threads.
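The reduction of the private flags into a shared total can be modelled without any locking, because only the serial section touches the total. The following is a sketch with hypothetical names (intra_state_t, accumulate_intra). It assumes each context writes 0 or 1 into its own array element on every pass.

```c
#include <assert.h>

typedef struct {
    int total;       /* intra MBs counted so far in this frame */
    int exit_flag;   /* set in the serial section, read by every context */
} intra_state_t;

/* Serial section: add each context's private flag to the shared total and
 * raise the exit flag once the total reaches the limit. */
void accumulate_intra(intra_state_t *st, const int *flags, int contexts, int limit)
{
    assert(st && flags && contexts > 0);
    for (int serial = 0; serial < contexts; serial++)
        st->total += flags[serial];
    if (st->total >= limit)
        st->exit_flag = 1;
}
```

After the serial section each context simply tests exit_flag and returns 1 if it is set, so all contexts leave the ME function with the same return value.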
4.4. The H.264 video compression standard 
The final video coding standard parallelised in this work is H.264[3, 25]. There are
numerous commercial software H.264 encoders[23, 26-29] and a small number of open
source encoders in development, with the two most prominent being JM[22] and x264[30],
as shown in Table 4-3. The JM encoder is the official reference encoder produced by the
ITU and is designed to demonstrate a range of features and to illustrate valid bitstreams,
whereas x264 is an open-source community-developed encoder designed from scratch by
volunteers.
Table 4-3 Comparison of software H.264 encoders, both commercial and open-sourced.
Encoder               Creator       License      Features
H.264 Encoder V2      Main Concept  Commercial   Baseline and Main profiles up to level 5.1
JM Reference Encoder  ITU           Open source  Baseline, Main and Extended profiles
QuickTime 7           Apple, Inc.   Commercial   Blocksize: 8x8 P-Frame, 8x8 B-Frame, 4x4 I-Frames.
                                                 No CABAC, max 1 B-frame, no multiple reference
                                                 frames, no weighted prediction
x264                  -             Open source  Blocksize: 4x4 P-Frame, 4x4 B-Frame, 4x4 I-Frames.
                                                 CABAC, multiple B-Frames, multiple reference
                                                 frames, weighted prediction, lossless mode
Tests were carried out on the two open-source encoders to evaluate which would be the
more suitable for use in the current research. For encoding time on the x86 architecture,
even with the fast motion estimation switch in JM enabled, x264 was of the order of 100
times faster for the majority of sources. This was probably due to the optimisations (both
platform dependent and independent) in the x264 encoder and because the JM encoder is
intended as a proof-of-concept encoder to illustrate the standard's functionality rather than
to be a viable encoding solution. A fast encoder is of limited use if the quality of the output
it produces is low, so the relative bitrates of each encoder were examined with respect to
image quality using a number of different test sequences.
[Figure 4-40: line plot. Horizontal axis: Image quality in PSNR [dB] (34 to 41). Vertical axis
(relative bitrate) ticks: 98 to 105. Curves: JM 9.6, 1 reference frame; x264 rev 228, 1
reference frame.]
Figure 4-40 Illustration of x264 relative bitrate compared to the reference JM encoder for
image quality 34-40dB.
Figure 4-40 shows one test result. When encoding a 25 frame CIF video sequence,
"mobile" (see Figure 4-50), the bitrate produced by x264 at low image quality (<35dB)
was around 5% worse than that of the JM reference encoder. This then drops to around
3% at qualities greater than 36dB. As the bitrate required to perform encoding using x264
is only marginally higher, the dramatic difference in encoding time clearly supports the
use of x264 for this research.
4.4.1. Threading granularity
The initial plan for threading H.264 was to use a similar approach to that used for the
MPEG-2 and MPEG-4 implementations and to operate at the MB level. However, after
further examination of the processes involved, a number of difficulties emerged. The
main area of concern was that the specification permitted variation in the dimensions of
the MB in an encoding stream. By allowing an MB to be subdivided into smaller MBs,
the workload to encode a single standard 16x16 MB would vary considerably. This
unpredictable variation in workload, and hence encoding time, would require a different
method for the allocation of threads to MBs. As with MPEG-4, the use of predictions based
on the current frame data becomes an issue, but in H.264 this is further compounded by
the range of MBs specified. Consequently, implementing the threaded version of H.264 in
the same manner as that used for MPEG-4 could require the identification and
modification of a large number of prediction patterns, risking a significant worsening of
the encoder's output quality. Alternative methods of threading the encoder were
examined, one being to change the granularity of the threading from the MB level to the
slice level.
A slice is a group of MBs in a frame that can be encoded independently and in parallel.
To achieve this independence, predictions and MVs are not allowed to cross slice
boundaries. Parallelising H.264 at the slice level yields memory latency performance
advantages, without the need to modify prediction patterns.
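One simple way to form slices for threading is to cut the frame into contiguous bands of whole MB rows, one band per thread. The sketch below shows such a partition. It is an illustrative assumption about slice shape, not a description of how x264 divides a frame, and the function name partition_rows is hypothetical.

```c
#include <assert.h>

/* Splits mb_height rows of MBs into `slices` contiguous bands that differ in
 * size by at most one row. Band s covers rows first_row[s] up to, but not
 * including, first_row[s + 1]; first_row must hold slices + 1 entries. */
void partition_rows(int mb_height, int slices, int *first_row)
{
    assert(mb_height > 0 && slices > 0 && slices <= mb_height);
    int base = mb_height / slices;
    int extra = mb_height % slices;    /* the first `extra` bands get one more row */
    int row = 0;
    for (int s = 0; s < slices; s++) {
        first_row[s] = row;
        row += base + (s < extra ? 1 : 0);
    }
    first_row[slices] = row;
}
```

A CIF frame has 18 rows of MBs, so four slices would contain 5, 5, 4 and 4 rows. Keeping the bands nearly equal helps balance the workload between threads, although the actual encoding time of each slice still depends on its content.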
[Figure 4-41: line plot. Horizontal axis: Image quality in PSNR [dB] (34 to 41). Vertical axis
(relative bitrate) ticks: 98 to 104. Legend (Number of Slices): 1, 2, 3, 4.]
Figure 4-41 Comparison of bitrate relative to a single sliced frame for 2, 3, and 4 slices at
image qualities ranging from 34 to 40dB.
With the introduction of slice groups, restrictions are imposed on the encoder. To create
these independent processing groups, all MVs and predictions have to be kept within the
slice boundaries. By restricting the potential locations to which MVs can refer,
the quality of the MVs can potentially be reduced. Figure 4-41 illustrates how the bitrate
is affected by the introduction of slices. The same set of test video sequences as used to
generate the results in Figure 4-40 was used to produce those shown in Figure 4-41. In the
4. Hardware Tech11iques (or £mloiti11g TLP 127 
figure . the actua l bitrates produced by the e ncoder were scaled re lative to the single slice 
g roup. It can be seen that for low quality images the increase in bitrate for four slice 
g roups is approximate ly 4%. but is less than I% for image qual ities greater than 40dB. 
The reason the re lative bitrate decreases as the image qua lity increases, is due to the 
decreased pro portion of the to tal number of bits within the frame, that represent the MVs 
compared to the transform coeffic ients. To e ncode hig h quality images. the total bitrate is 
high than for low quality images. This increase in bitrate equates to an increases in the 
numbe r of bits available for representing the quamised coefficients. as the bit required to 
re present the MY are essentiall y constant. With these additiona l bits it is possible to 
compensate for the potential lower quality MY that resu lt from the increase in the number 
of slices. 
[Figure: line plot; x-axis: Image quality in PSNR [dB] (32 to 42); y-axis ticks 90 to 130; legend: HD 1920x1088, CIF 352x288, QCIF 176x144]
Figure 4-42 Comparison of bitrate produced during a four sliced encoding relative to a single
slice at three different frame resolutions and for the image quality range 32 to 42dB.
Figure 4-42 shows the effect of resolution on the bitrate for a fixed number of slices.
Here, the 'mobile' sequence has been scaled to three different resolutions, HD 1088p, CIF
and QCIF, encoded after first dividing each frame into four slices and plotted relative to
the corresponding single slice for that resolution. The bitrate increase is smaller as image
quality improves, again due to the larger number of bits available to encode the
sequence. There is a significant difference between the results obtained at the three
resolutions. At an image quality of 32 dB, four slices in the QCIF image increase the
bitrate by a quarter; in the CIF image the increase is less than 10%, and for the HD
image the increase is around 2%. The reason for such a large variation is the
number of MBs present in each slice. The QCIF slice only contains 24 MBs, compared to
99 for CIF and 2040 for HD. In those slice groups containing larger numbers of MBs, the
reduction in MV quality will only be seen in the relatively small number of MBs that are
located at the edge of the slice group, and it is these that sustain the greatest reduction in
MV search range. When the number of MBs within the slice is reduced, the number of
MBs affected by these edge effects increases in comparison to the total number of MBs
and hence the number of bits required to produce a specific image quality increases.
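The MB counts per slice quoted above follow directly from the frame dimensions. The short C sketch below reproduces that arithmetic; the function names are illustrative and not taken from any encoder, and the even split by integer division is an assumption made only for this illustration.

```c
#include <assert.h>

/* Width and height of a macroblock in luma samples. */
#define MB_SIZE 16

/* Number of macroblocks covering a frame. Dimensions are assumed to be
   multiples of 16, as for the 176x144, 352x288 and 1920x1088 frames
   quoted in the text. */
int mbs_per_frame(int width, int height)
{
    assert(width % MB_SIZE == 0 && height % MB_SIZE == 0);
    return (width / MB_SIZE) * (height / MB_SIZE);
}

/* Approximate MBs per slice for an even split. Integer division drops
   any remainder, which is acceptable for this illustration only. */
int mbs_per_slice(int width, int height, int n_slices)
{
    assert(n_slices > 0);
    return mbs_per_frame(width, height) / n_slices;
}
```

With four slices this gives 24 MBs per slice for QCIF, 99 for CIF and 2040 for 1920x1088, matching the figures in the text.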
4.4.2. Slice encoding in x264
Initial work on the x264 codec was carried out at a time when the available version of
the encoder (r190) did not support slices. After discussions with the developers on the
development mailing list it was found that there was a modified version (r183-slice)
which crudely enabled multiple slices per frame. This version was modified by Champ
Yen in collaboration with other authors and was the starting point for our work.
To achieve encoding at the slice level, specific entries in the main encoder's data
structure were separated and multiple sub-structures of these data were produced, one per
slice. Doing so allowed separate MB and bitstream data to be stored for each slice and thus
allowed multiple slices to be encoded concurrently.
for( mb_xy = 0, i_skip = 0;
     mb_xy < h->sps->i_mb_width * h->sps->i_mb_height;
     mb_xy++ )
(a)
for( mb_xy = h->sh.i_thread_first_mb[i], i_skip = 0;
     mb_xy < h->sh.i_thread_last_mb[i]; mb_xy++ )
(b)
Figure 4-43 Assignment and execution of MBs in slice groups. In (a) the non-sliced encoder
assigns all MBs in the frame to a single slice, whereas in (b) the specified MB range is based
on MB limits defined in the given slice's array in the encoder's data structure.
The two loop statements in Figure 4-43 show the original loop with no slicing and the
modified slice loop responsible for cycling through each MB and performing analysis
and encoding for a given slice group. In the non-sliced code, all MBs are encoded in a
single slice in scan line order starting at the top left. In the multi-sliced implementation,
the start and end locations of the MB slice are determined in advance and stored in the
arrays i_thread_first_mb[] and i_thread_last_mb[] respectively, with each
slice group's range specified in a different element of each array.
for( i = 0; i < i_slice_num; i++ )
{
    h->sh.i_thread_first_mb[i] =
        ((h->sps->i_mb_height * i) / i_slice_num) * h->sps->i_mb_width;
    h->sh.i_thread_last_mb[i] =
        ((h->sps->i_mb_height * (i+1)) / i_slice_num) *
        h->sps->i_mb_width;
}
h->sh.i_thread_last_mb[i_slice_num-1] =
    h->sps->i_mb_height * h->sps->i_mb_width;
Figure 4-44 Allocation of MB range to each slice group based on an even distribution of
MBs between slices.
When determining the MB range to allocate to each slice, the total number of MB rows
within the frame is distributed evenly between the available slices. In some cases, the
slice range is extended to account for non-integral results when dividing the number of
rows by the number of slices.
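As a rough illustration of the row-based allocation described above, the following self-contained C sketch applies the same arithmetic as Figure 4-44 outside the encoder. The frame_dims_t structure and the slice_ranges() function are hypothetical stand-ins for the encoder's own data structure, introduced only for this example.

```c
#include <assert.h>

/* Hypothetical stand-in for the frame dimensions (in MBs) that the
   encoder keeps in its sequence parameter set. */
typedef struct {
    int i_mb_width;   /* MBs per row      */
    int i_mb_height;  /* MB rows in frame */
} frame_dims_t;

/* Fill first_mb[i] and last_mb[i] (exclusive end) for each slice, giving
   every slice a whole number of MB rows, as in Figure 4-44. */
void slice_ranges(const frame_dims_t *d, int n_slices,
                  int *first_mb, int *last_mb)
{
    assert(n_slices > 0 && n_slices <= d->i_mb_height);
    for (int i = 0; i < n_slices; i++) {
        first_mb[i] = ((d->i_mb_height * i) / n_slices) * d->i_mb_width;
        last_mb[i]  = ((d->i_mb_height * (i + 1)) / n_slices) * d->i_mb_width;
    }
    /* The final slice always ends at the last MB of the frame. */
    last_mb[n_slices - 1] = d->i_mb_height * d->i_mb_width;
}
```

For a CIF frame (22x18 MBs) and four slices, the 18 rows do not divide evenly, so the slices receive 4, 5, 4 and 5 rows, with MB ranges starting at 0, 88, 198 and 286.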
The x264 implementation of a multi-sliced encoder used in this work had known issues in
its rate control and scene detection methods. The statistical information regarding how
many bits were allocated to each slice was statically determined and bore no relation to
the information needs of each slice. Using this multi-sliced code, Yen implemented the
encoder in parallel using the POSIX thread (pthread) library, so that each pthread was
allocated to a slice group.
From version r240, Yen's ideas of a multi-sliced encoder were formally committed to the
main x264 encoder tree. This led to great improvements as far as the current work is
concerned. Firstly, a number of bugs have been fixed and features added to the tree since
Yen wrote his implementation. This includes the removal of constant bitrate (CBR)
encoding and its replacement by average bitrate (ABR), allowing the partitioning of
available bits across the slices to be implemented successfully. Secondly, the partitioning
of the data structure became more structured, producing more natural looking code.
Porting r240 for use on the custom simulator used in the current work was reasonably
straightforward since all the data dependences had been taken care of. All the
initialisation had been written for the multi-slice environment and so allocating this work
to one thread initialised the system for all threads. The functionality that is required to be
performed in parallel was all implemented inside x264_slice_write() and its
sub-functions.
for( i = 0; i < h->param.i_threads; i++ )
    x264_slice_write( h->thread[i] );
Figure 4-45 Single-threaded loop in the multi-sliced encoder responsible for calling
x264_slice_write(). This parent function encapsulates all encoding functionality for the given
slice group.
Figure 4-45 above shows the multi-sliced non-threaded calling of
x264_slice_write(). The conversion from a serial loop to a parallel one was
implemented in a similar manner to that used for MPEG-2 and MPEG-4.
iteration = h->param.i_threads / MAX_THREAD;
for( i_new = 0; i_new < iteration; i_new++ )
{
    i = context + (i_new * MAX_THREAD);
    x264_slice_write( h->thread[i] );
}
Figure 4-46 Multi-threaded multi-sliced x264_slice_write() execution loop. Modifications
were implemented to allow multiple threads to execute the loop in parallel.
By implementing the parallel loop in this manner the number of threads and the number
of slices are kept separate. This allows the number of slices to be determined between
encoder sessions.
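The strided indexing of Figure 4-46 can be illustrated in isolation. The sketch below lists which slice indices a given processor context would handle; slices_for_context() is a hypothetical helper, not part of x264. It also replaces the precomputed iteration count with a bound check so that slice counts that are not a multiple of the context count are still covered, which is an assumption beyond what the figure shows.

```c
#include <assert.h>

/* Hypothetical helper: write into out[] the slice indices handled by one
   processor context, using the stride i = context + k * max_thread from
   Figure 4-46. Returns the number of slices assigned to this context. */
int slices_for_context(int context, int n_slices, int max_thread, int *out)
{
    assert(max_thread > 0 && context >= 0 && context < max_thread);
    int count = 0;
    /* Bound check instead of a fixed iteration count (an assumption made
       here so that remainder slices are not skipped). */
    for (int i = context; i < n_slices; i += max_thread)
        out[count++] = i;
    return count;
}
```

With eight slices and four contexts, context 1 handles slices 1 and 5. With ten slices, contexts 0 and 1 each take a third slice while contexts 2 and 3 take two, which corresponds to the remainder section discussed in the results.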
4.5. Results
As described in section 3.2, the idea behind thread-level parallelism is to take the
workload from a single processor and to partition and distribute it across separate
independent processing elements (processor contexts). With this in mind, a measure of
how effective threading is to the execution of a specific program is required. Because this
work focuses primarily on the savings obtained through exploiting parallel techniques at
the application/processor level, traditional measures, such as operational frequency and
cycle count, are not appropriate since these also take into account the total system
configuration and performance. As stated previously in chapter 3, the ability of the
memory sub-system to service the parallel executing cores with data will affect the
system's performance, and hence any results obtained using a metric that incorporates this.
One approach that is not affected by these factors is the number of instructions
executed on each of these individual processors. Additionally, by comparing this
instruction count to that of the single non-threaded environment a proportional change in
executed instructions is obtainable. By executing each workload on our custom ISS
(section 3.2.3.1), with its 'perfect' memory model, it is possible to obtain a dynamic
instruction count (DIC) for each thread running in the simulator.
[Figure: plot; x-axis: Number of Processors (0 to 32); y-axis ticks 0 to 100]
Figure 4-47 Relative distribution of instructions amongst CPU contexts for the number of
contexts ranging from 1 to 32. This distribution is based on the ideal case of an equal division
of the workload.
Figure 4-47 above shows an example of the distribution of workload across a continuous
range. This distribution depicts an ideal (inverse relationship, 1/N), since it assumes
each instruction can be divided into sections that can be recursively divided into smaller
sub-sections of equal execution time. In practice, this is not possible since an instruction
cannot be sub-divided indefinitely. Due to this inability to sub-divide instructions an
uneven distribution of instructions will be seen across the available processors, as shown
in Figure 4-48.
[Figure: 3D surface plot; axes: Maximum Number of Processors, Processor Number]
Figure 4-48 A typical 3D view of the practical distribution of instructions among available
processors, illustrating the ripple effect observed due to the difficulty of precisely
sub-dividing all instructions.
In the process of allocating threads to MBs and slices as described in this chapter, the
more practical distribution of instructions to available processors is depicted in Figure
4-48. Another influence on the pattern shown is due to the remainder section in which a
number of processors, fewer than the total available, is put to use. Although Figure 4-48
provides a useful visual representation of the likely effect of the threading on the
instructions per processor, a proper study requires a quantitative investigation of practical
video sequences.
\[ \text{Relative dynamic instruction count} = \frac{\max[\text{instruction count per thread}]}{\text{Single threaded instruction count}} \times 100 \qquad \text{(4-12)} \]
The proposed metric for representing the threading process as a whole is the relative DIC,
equation (4-12). The number of instructions of each of the processors in the system is
determined, their maximum found, and the ratio between this maximum value and the
DIC for the single threaded encoder is taken.
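Equation (4-12) is simple enough to express directly in code. The C sketch below is a minimal illustration, assuming the per-thread instruction counts have already been collected from the simulator; the function name and argument layout are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Relative DIC, equation (4-12): the largest per-thread instruction count
   expressed as a percentage of the single-threaded instruction count. */
double relative_dic(const unsigned long long *per_thread, size_t n_threads,
                    unsigned long long single_threaded)
{
    assert(n_threads > 0 && single_threaded > 0);
    unsigned long long max = per_thread[0];
    for (size_t i = 1; i < n_threads; i++) {
        if (per_thread[i] > max)
            max = per_thread[i];
    }
    return 100.0 * (double)max / (double)single_threaded;
}
```

For example, with illustrative (not measured) counts of 40, 25, 25 and 20 instructions on four threads and a single-threaded count of 100, the relative DIC is 40%, since the busiest thread determines the result.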
[Figure: plot; x-axis: Number of Processors (0 to 32); y-axis ticks 0 to 100]
Figure 4-49 Relative distribution of instructions amongst CPU contexts for the number of
contexts ranging from 1 to 32. The distribution is based on an ideal but non-equal division of
the workload.
Using this method of calculating the relative DIC, Figure 4-49 represents the ideal
distribution of instructions amongst 32 processors. The processor with the highest
instruction count is usually thread zero since it is responsible for initialisation and
bitstream control, but as shown below for H.264 this is not always true. By taking a
relative measure of the instruction count change, it is possible to analyse and evaluate
results obtained from test encodings for a range of different parameters.
4.5.1. Test sequences
Each encoder was evaluated with 15 different test input sequences. These sequences are
standard video sequences used to quantify the ability of an encoder to handle a specific
compression challenge. Figure 4-50 shows a selection of the test sequences used in this
research, showing a frame from each accompanied by a description of the sequence, along
with the specific attribute of that sequence that is of use in the testing of encoders.
Football 
The football sequence depicts an 
American football match. This is used to 
test the encoders' ability to cope with fast 
moving objects. 
Tennis 
The tennis sequence shows a match of 
table tennis. The difficulty within the
sequence is for the encoder to
successfully track the fast moving small
white ball.
Mobile 
The mobile sequence is a slow panning 
shot through a colourful child's room with 
toys moving in the foreground. This 
sequence is useful to test intra MB 
selection due to the slow movement. 
Container 
The container sequence shows a 
container ship sailing through a river 
estuary. This sequence is of interest since 
the majority of the movement is 
concentrated within only the top third of 
the image. 
Snow lane 
Snow lane depicts a snowy road junction 
scene. The movement of cars and 
pedestrians is observed along with snow 
fall. This requires the encoders to track 
objects with the added interference of the 
falling snow. 
Rotating city 
Rotating city is a still shot of the 
Manhattan skyline performing a full 360 
rotation. Rotating an object makes 
demands on both the ME and DCT 
components of an encoder. 
Rush hour 
Unlike the above sequences that are
available in sub D1 resolutions, rush hour
is full HD1080p resolution. It depicts a
street scene at rush hour in Germany.
This is one of three HD1080p sequences
tested.
Figure 4-50 Example frame, description and specific encoding challenges of a selection of test
sequences used for each of the three video coding standards.
4.5.2. MPEG-2
To analyse and validate the threaded TM5 MPEG-2 encoder, a substantial testing
procedure was undertaken. This involved encoding a selection of twelve test sequences of
varying resolutions, changing different input parameters to the encoder. The first
parameter that was investigated was the search range in the reference frame used by ME.
[Figure: dual-axis plot; x-axis: Search Range (3x3, 7x7, 12x12, 15x15, 23x23, 31x31, 40x40); series: Instruction Count, Number of MB within search range]
Figure 4-51 Illustration of the relationship between the search range used for ME and the
instruction count observed while encoding the 'snow lane' test video on the TM5 reference
encoder.
As the search range within the full search ME algorithm is increased, the number of MBs
located within the area also increases. This increase in MBs in the search area and the
corresponding rise in instruction count are depicted in Figure 4-51, for the 'snow lane'
test sequence encoded using the standard single threaded TM5 encoder. The figure
demonstrates that the instruction count has a close correlation to the number of MBs in
the search area (and hence the search range). This can be easily explained from a study of
the search method employed by the encoder, which is to find MVs using a full search in which
every MB in the search area is tested. For example, for a CIF frame, the increase in the number
of SAD operations required using a 40x40 search range rather than 3x3 is 1620%.
Evaluating the relative DIC allows direct comparison between search ranges. To ensure
that the threaded encoder does not alter the generated bitstream, a binary comparison was
carried out after each encoding run between the reference bitstream (unmodified,
single-thread encoder output) and the parallelised encoder bitstream.
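To make the cost of full search concrete, the following C sketch shows an exhaustive SAD search for one 16x16 MB. It is a generic illustration and not the TM5 source: the function names, the symmetric +/-range window and the frame-edge clipping are all assumptions made for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define MB_SIZE 16

/* Sum of absolute differences between the 16x16 block of cur at (cx, cy)
   and the 16x16 block of ref at (rx, ry). Both frames share one stride. */
static unsigned sad16x16(const unsigned char *cur, const unsigned char *ref,
                         int stride, int cx, int cy, int rx, int ry)
{
    unsigned sad = 0;
    for (int y = 0; y < MB_SIZE; y++)
        for (int x = 0; x < MB_SIZE; x++)
            sad += (unsigned)abs(cur[(cy + y) * stride + cx + x] -
                                 ref[(ry + y) * stride + rx + x]);
    return sad;
}

/* Exhaustive (full) search: every displacement in [-range, +range] that
   keeps the candidate block inside the frame is tested. Returns the best
   SAD and writes the corresponding motion vector to mv_x and mv_y. */
unsigned full_search(const unsigned char *cur, const unsigned char *ref,
                     int width, int height, int mb_x, int mb_y, int range,
                     int *mv_x, int *mv_y)
{
    assert(range >= 0);
    unsigned best = UINT_MAX;
    *mv_x = 0;
    *mv_y = 0;
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int rx = mb_x + dx, ry = mb_y + dy;
            if (rx < 0 || ry < 0 ||
                rx + MB_SIZE > width || ry + MB_SIZE > height)
                continue;  /* candidate falls outside the reference frame */
            unsigned sad = sad16x16(cur, ref, width, mb_x, mb_y, rx, ry);
            if (sad < best) {
                best = sad;
                *mv_x = dx;
                *mv_y = dy;
            }
        }
    }
    return best;
}
```

The number of SAD evaluations grows with the square of the search range, which is consistent with the close correlation between instruction count and search area described above. Each candidate is independent of the others, which is what makes this loop attractive for threading.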
[Figure: plot of relative DIC; x-axis: Number of Processors (0 to 56); one curve per search range: 3x3, 7x7, 12x12, 15x15, 23x23, 31x31, 40x40]
Figure 4-52 Relative DIC observed when encoding the CIF 'Tennis' sequence on the
modified multi-thread MPEG-2 encoder.
Figure 4-52 shows the relative DIC observed when encoding 25 frames of the Tennis test
sequence with different numbers of processors. The family of curves is produced by
generating results for different search ranges. A number of observations can be made.
Firstly, for larger search ranges, more SAD operations are carried out and the proportion
of the DIC being executed in parallel increases. Due to this increased proportion, the
relative DIC executed on all the processors decreases. Secondly, as the number of
processors is increased the relative DIC improves (decreases) dramatically until, at the point
where eight threads have been introduced, the slope becomes less steep. At 22 threads the
relative DIC falls to a lower value, which then remains constant for larger
numbers of processors. The dramatic fall at the lower numbers of threads is evidence that
the workload of the encoder is relatively evenly divided between the available
processors, closely following the pattern seen for the ideal distribution shown in Figure
4-47. As discussed previously, the distribution of the workload to the processors produces
a distribution graph such as that shown in Figure 4-49. It can be seen that the results
obtained when encoding the 'tennis' sequence using the multi-threaded MPEG-2 encoder
do indeed follow this pattern.
[Figure: plot of relative DIC (y-axis ticks 0 to 70); x-axis: Number of Processors (0 to 64); one curve per search range: 3x3, 7x7, 12x12, 15x15, 23x23, 31x31, 40x40]
Figure 4-53 Reduction in relative DIC observed when encoding the 512x380 'snow lane' test
sequence on the modified multi-thread MPEG-2 encoder.
To explain the location of the step fall, the results from the encoding of the other video
sequences were examined. Figure 4-53 shows the encoding of the snow lane sequence
using the same parameters as used to generate Figure 4-52. In this new sequence a very
similarly shaped graph can be seen, with the only major difference being the location of the
step change in relative DIC. In the higher resolution snow lane sequence the final relative
DIC is reached at the 32 processor mark, whereas for the tennis sequence the
corresponding figure was 22. This can be explained when the threaded loops are
examined, and principally the maximum iteration value of the original loop, which indicates
how many times the loop would originally need to be run. To produce the optimum
parallel performance using the allocation methods described previously, each of these
iterations should be executed by a separate processor, allowing the loop as a whole to be
executed in one parallel run. At this point, if additional processors are added they will
remain unused since each iteration of the original loop will already have been allocated to a
processor. For both the transformation functions the loop was threaded at the block level,
so that each block in the MB could be encoded in parallel. The maximum number of
processors that can be brought to bear simultaneously is related to the colour
sub-sampling scheme employed in the input sequence being encoded. The optimal number of
processors to execute the transformation functions in parallel ranges from 6 for the
4:2:0 sub-sampling scheme to 12 for the 4:4:4 scheme. As seen in the profiling results in
Figure 4-2, the majority of the execution time of the encoder is spent on ME and hence
improvements in its implementation will have a great effect on performance. During ME
the loop is threaded at the MB level, where the MBs in a row are distributed among the
available processors. To achieve the best use of these processors their number would be the
same as the number of MBs in a row. It is this relationship that gives rise to the step
change in performance in the DIC results obtained with the tennis
(352x288) and snow lane (512x380) sequences. By dividing the horizontal pixel
dimension by 16, the width of an MB, the optimal number of processors can be calculated,
these figures being 22 for the tennis sequence and 32 for the snow lane video. As can be
seen from both Figure 4-52 and Figure 4-53 these are the points in the graph beyond
which further increases in the number of processors had no additional effect on the DIC.
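The relationships in the paragraph above can be sketched in a few lines of C. The helper names are hypothetical, and the pass-count model (the busiest processor executes the ceiling of iterations divided by processors) is an assumption used here to illustrate why the reduction arrives in steps; it is not taken from the encoder source.

```c
#include <assert.h>

#define MB_SIZE 16

/* Colour sub-sampling schemes; 4:2:2 is included for completeness. */
typedef enum { CHROMA_420, CHROMA_422, CHROMA_444 } chroma_format_t;

/* MBs in one row, i.e. the iteration count of the MB-level ME loop. */
int mbs_per_row(int width)
{
    return width / MB_SIZE;
}

/* 8x8 blocks per MB: four luma blocks plus two chroma components of
   one, two or four blocks each. */
int blocks_per_mb(chroma_format_t fmt)
{
    switch (fmt) {
    case CHROMA_420: return 4 + 2 * 1;
    case CHROMA_422: return 4 + 2 * 2;
    default:         return 4 + 2 * 4;  /* CHROMA_444 */
    }
}

/* Parallel passes needed when n_iter independent iterations are shared
   among n_proc processors (ceiling division). */
int parallel_passes(int n_iter, int n_proc)
{
    assert(n_iter > 0 && n_proc > 0);
    return (n_iter + n_proc - 1) / n_proc;
}
```

Under this model a 22-iteration loop needs three passes on 8 processors, two passes on anything from 11 to 21 processors, and a single pass on 22 or more. This would account for a flat region followed by a step at 22 processors for the 352-pixel-wide tennis sequence, and at 32 for the 512-pixel-wide snow lane sequence.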
[Figure: chart of relative DIC per search method; y-axis ticks approximately 5 to 8]
Figure 4-54 The relative DIC observed for each adaptive ME search method when using the
full search technique at a search range of 40x40pel for the single-threaded MPEG-2
encoder.
Figure 4-54 illustrates the DIC reduction for twelve search methods compared to full
search, both executing in the single-threaded MPEG-2 encoder. The results were obtained
by encoding 25 frames of the tennis sequence with a search range of 40x40pel for full
search. It is clear that the selection of search method can directly affect the DIC, as in each
search (apart from full search) only certain MBs are selected for test, in stark contrast to the
full search which, for a 40x40pel search range, evaluates all 1600 MBs. A
comparison of the threaded versions of the search method encoders with respect to the
original full search encoder would not give a meaningful evaluation of the effect of
threading in these encoders, and instead the threaded version of the search for each
encoder was evaluated against its corresponding single threaded version.
[Figure: two plots of relative DIC (y-axis ticks 20 to 70); x-axis: Number of Processors (2 to 32); legend: Three step, Four step, Diamond, Hierarchical diamond, Centre diamond, Orthogonal, Large diamond, Conjugate, Cross, 2D_log, Spiral, New three step, Gradient]
Figure 4-55 Reduction in relative DIC observed through exploiting TLP for twelve adaptive
search methods when compared to single-threaded encoding of each method.
These results are shown in Figure 4-55 and demonstrate the same underlying curve shape
for each search method. Also, a smaller reduction in the relative DIC is observed for these
alternative search methods when compared to the full search results. This is due to the
reduction in the number of MBs being evaluated, meaning the proportion of the encoder's
instructions that are being executed in parallel sections is reduced. It can be seen that
when 16 processors are involved, the relative DIC has reduced to less than 35%,
demonstrating a substantial improvement in performance.
4.5.3. MPEG-4 
As was the case for MPEG-2, the verification and testing process for MPEG-4 was based
on results obtained when encoding a large selection of input sequences and observing the
change to the DIC when varying one parameter of the encoder. Unlike MPEG-2, where
the algorithm remained unaltered during the threading process, in MPEG-4 the initial
prediction pattern was changed to allow for threaded encoding and hence a binary
comparison to the unaltered code is not possible. To perform validation, a comparison
with the threaded encoder was carried out after each test involving the new prediction
pattern implemented.
[Figure: plot of relative DIC (y-axis ticks 10 to 50); x-axis: Number of Processors (2 to 32); legend: Full pixel resolution ME; Half pixel resolution ME; Half pixel resolution ME, 4 MV per MB; Half pixel resolution ME, 4 MV per MB, chroma ME; Half pixel resolution ME, 4 MV per MB, chroma ME, trellis quantisation]
Figure 4-56 Reduction in relative DIC observed through exploiting TLP with the MPEG-4
encoder XviD using the CIF 'Tennis' sequence.
The quality setting of the XviD encoder can be selected to be in the range 1 to 5 inclusive,
with higher values turning on additional encoder features. As is shown in Figure
4-56 all five quality settings show a similar curve shape, with the percentage difference in
reduction between the curves being less than 5%. A similar DIC reduction to MPEG-2 was
obtained for MPEG-4 except that the DIC reduction was smooth up to 22 processors
rather than dropping in stages. This can be explained by considering the maximum loop
iteration of each of the threaded loops. Unlike MPEG-2, where the transformation
functions are threaded at the block level and ME at the MB level, MPEG-4 operates
entirely at the MB level. Consequently, the most suitable number of processors for each
loop is directly related to the resolution of the input sequence and, in the CIF tennis
sequence used to generate the results in Figure 4-56, this number of processors is 22. This
is indeed confirmed by the findings of the test procedure. Even with all the threaded loops
having the same and most suitable number of processors, a step reduction in DIC would
be expected, so a further test was performed to investigate this apparent absence.
[Figure: plot of relative DIC (y-axis ticks 0 to 50); x-axis: Number of Processors (0 to 128); legend: QCIF, CIF, HD]
Figure 4-57 Reduction in relative DIC observed through exploiting TLP using a scaled 'rush
hour' sequence at QCIF, CIF and HD1080p resolutions.
In Figure 4-57, the 'rush hour' test sequence was re-sampled down from HD1080p to CIF
and QCIF resolutions. By studying the area around the DIC step points it can be seen
that the reduction occurs in discrete steps and not as a continuous process. This is in
accordance with the earlier assumption illustrated in Figure 4-49. What can be seen
by examining different resolutions is that by increasing the resolution, and hence increasing
the optimal number of processors, the relative DIC continues to reduce before it settles at
its optimal value. Figure 4-57 shows how the relative DIC can be reduced to only 6% of
that of the single threaded encoder when using 120 processors for HD1080p. For a more
practical number of processors of 10, the DIC is reduced to 22%, 16% and 12% for QCIF,
CIF and HD1080p respectively.
4.5.4. H.264
The final video codec evaluated was H.264. As with the other two codecs, H.264 was
studied using a selection of test sequences and the relative DIC of each whilst varying the
encoder parameters. In addition to this, due to the level of threading granularity chosen
(slice-level), the encoder was studied using the same parameters but varying the input
sequence. For all the tests the encoder was set to use 1 reference frame, no loop filter and
to use CAVLC instead of CABAC for entropy encoding. To push the encoder, all MB
sizes were analysed and sub-pixel ME was set to its 'best' setting, allowing the encoder to
select the precision that produced the highest quality MVs.
Figure 4-58 Reduction in relative DIC observed through encoding various CIF sequences with
the multi-threaded x264 encoder using 4 active slice groups.
Figure 4-58 illustrates the reduction in relative DIC for seven test sequences all encoded
with four slices. When calculating the relative DIC the thread with the maximum number
of instructions is used for the calculations. In previous codecs, where the encoder was
threaded at the block or MB level, this thread tended to be thread zero since it is this
thread that is responsible for initiating the encoder and performing the serial sections. For
H.264 however this is not necessarily the case. Since this encoder is threaded at the slice
level the workload associated with each slice is not guaranteed to be equal. This can be
seen clearly in Figure 4-58 with the container sequence. If we examine the sequence
itself the reason for this relatively poor performance due to threading can be seen.
Figure 4-59 Frame from Coastguard sequence illustrating slice specific motion.
This test sequence shows a view of a river estuary with a container ship sailing in the top
third of the image. Since the movement within this sequence is located primarily in the
upper third of the frame, it will be enclosed within the first slice and hence encoded by
one thread alone. To allow for a more even distribution of the workload between slices
the number of these slice groups can be increased; however, as shown in Figure 4-41,
increasing the number of slices decreases the efficiency of the encoder. To overcome
this, increasing the resolution of the input sequence acts to minimise the effect of these
additional slice groups (Figure 4-42). Using this idea of increasing the resolution, the next
set of tests was carried out using the HD1080p test sequence rush hour.
[Figure: plot of relative DIC (y-axis ticks 0 to 60); x-axis: Number of Processors (0 to 32); legend: 20, 22, 24, 26, 28]
Figure 4-60 Reduction in relative DIC observed when encoding the HD1080p 'Rush hour'
sequence using the multi-threaded x264 encoder with 32 active slice groups.
Figure 4-60 shows this HD1080p sequence's relative DIC when encoded with 32
slices. It is clear that all available processors contribute to the performance. During this
test the input sequence was encoded with five different quantisation factors. These factors
are used to specify the number of quantisation steps available to the encoder when
performing quantisation on the transformation coefficients. From Figure 4-60 the usual
relative DIC reduction curve is observed. It is clear that the quantisation selected for each
encoding does not have an influence on the ability to distribute the workload evenly
amongst the available processors.
Using the results obtained from all three TLP encoders, it is clear that the potential gains
achieved by parallelising these computationally expensive loops in order to
exploit TLP are huge.
4.6. Conclusion
In this chapter it has been shown that through the use of thread-level parallel techniques it
is possible to produce a dramatic reduction in the dynamic instruction count of three complex
video compression standards. The three video standards chosen were one established,
MPEG-2, one upcoming, MPEG-4, and one advanced standard, H.264. The chapter was
split into three sections following the methodology and obtained results of all three
compression schemes as each was manually threaded. The process of threading the
codecs was achieved by statically partitioning the control flow graph of each encoder in
order to distribute the workload of the highly complex functions within each encoder to the
available processor contexts within the simulated system. The approach taken for both
MPEG-2 and MPEG-4 was to thread the encoders at the MB level of abstraction, whereas
for H.264 the slightly higher level of the slice was chosen. Through threading each of these
encoders a dramatic reduction in DIC was observed. The results obtained from our custom
MT-ISS show that a reduction in DIC of 80-96% can be achieved with 32 processor
contexts. These substantial savings demonstrate the potential of using thread-level
techniques in accelerating MPEG based video encoding.
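To put the 80-96% figure in context, the arithmetic can be sketched as follows. The sketch assumes that relative DIC is the per-processor instruction count expressed as a percentage of the single-processor count (the metric is defined earlier in the thesis, outside this section), and that an ideal partition divides the work perfectly with no threading overhead. Both function names are hypothetical.

```c
#include <assert.h>

/* Relative DIC in percent, under the assumed definition:
 * per-processor DIC divided by the single-processor DIC. */
static double relative_dic_percent(double dic_per_processor, double dic_single)
{
    assert(dic_single > 0.0);
    return 100.0 * dic_per_processor / dic_single;
}

/* Upper bound on the DIC reduction (percent) when the workload divides
 * perfectly over n processor contexts with no overhead. */
static double ideal_reduction_percent(int n)
{
    assert(n > 0);
    return 100.0 - 100.0 / (double)n;
}
```

With 32 contexts the ideal bound is 96.875%, so a measured reduction of up to 96% suggests that, under these assumptions, the best-threaded encoder is close to a perfectly balanced partition.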
4.7. References 
[1] "Generic coding of moving pictures and associated audio (MPEG-2)," vol.
13818: ISO/IEC, 1995.
[2] "Information technology -- Coding of audio-visual objects -- Part 2: Visual,"
ISO/IEC 14496-2, 2004.
[3] "Information technology -- Coding of audio-visual objects -- Part 10: Advanced
Video Coding," ISO/IEC 14496-10, 2005.
[4] "Motion Picture Expert Group," http://www.chiariglione.org/mpeg.
[5] "International organisation of standardisation," http://www.iso.org.
[6] R. Li, B. Zeng, and M. L. Liou, "A new three-step search algorithm for block
motion estimation," Circuits and Systems for Video Technology, IEEE
Transactions on, vol. 4, pp. 438-442, 1994.
[7] "MPEG Software Simulation Group," www.mpeg.org/MPEG/MSSG/.
[8] I. E. G. Richardson, "H.264 White Papers," Video & Image Compression
Resources and Research, 2002.
[9] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro, "Motion compensated
interframe coding for video conferencing," in IEEE NTC'81, 1981, pp. 531-534.
[10] L.-M. Po and W.-C. Ma, "A novel four step search algorithm for fast block
matching," Circuits and Systems for Video Technology, IEEE Transactions on, vol.
6, pp. 313-317, 1996.
[11] J. Jain and A. Jain, "Displacement Measurement and Its Application in Interframe
Image Coding," Communications, IEEE Transactions on, vol. 29, pp. 1799-1808,
1981.
[12] A. Puri, H.-M. Hang, and D. L. Schilling, "An efficient blockmatching algorithm
for motion compensated coding," in Acoustics, Speech, and Signal, IEEE
International Conference on, 1987, pp. 25.4.1-25.4.4.
[13] Z. Shan and M. Kai-Kuang, "A new diamond search algorithm for fast
block-matching motion estimation," Image Processing, IEEE Transactions on, vol. 9,
pp. 287-290, 2000.
[14] R. Srinivasan and K. Rao, "Predictive Coding Based on Efficient Motion
Estimation," Communications, IEEE Transactions on, vol. 33, pp. 888-896, 1985.
[15] M. Ghanbari, "The cross-search algorithm for motion estimation,"
Communications, IEEE Transactions on, vol. 38, pp. 950-953, 1990.
[16] T. Zahariadis and D. Kalivas, "Fast algorithms for the estimation of block motion
vectors," in Electronics, Circuits, and Systems, ICECS '96, Third IEEE
International Conference on, 1996, pp. 716-719.
[17] R. Li, B. Zeng, and M. L. Liou, "A new three-step search algorithm for block
motion estimation," Circuits and Systems for Video Technology, IEEE
Transactions on, vol. 4, pp. 438-442, 1994.
[18] "3ivx D4 4.5.1," 3ivx, Inc., www.3ivx.com, 2006.
[19] J. Rota, "DivX ;) 3.11a," 1998.
[20] "DivX 6.3," DivX, Inc., www.divx.com, 2006.
[21] F. Bellard, "FFMPEG," http://ffmpeg.mplayerhq.hu, 2006.
[22] "HDX4," Jomigo Visual Technology GmbH, www.hdx4.com, 2006.
[23] C. Lampert, M. Militzer, P. Ross, E. Gomez, and R. Czyz, "XviD MPEG4 Core,"
www.xvid.org, 2006.
[24] "GNU General Public License," Free Software Foundation, Inc., 1991.
[25] "Advanced Video Coding," vol. 14496-10: ITU-T Rec. H.264 | ISO/IEC, 2002.
[26] R. Lee, "Subword permutation instructions for two-dimensional multimedia
processing in MicroSIMD architectures," in Application-Specific Systems,
Architectures, and Processors, 2000, pp. 3-14.
[27] V. A. Chouliaras, J. L. Nunez-Yanez, and S. Agha, "Silicon Implementation of a
Parametric Vector Datapath for Real-Time MPEG2 Encoding," in IASTED
Conference on Signal and Image Processing, 2004, pp. 98-303.
[28] K. Sankaralingam, R. Nagarajan, L. Haiming, K. Changkyu, H. Jaehyuk, D.
Burger, S. W. Keckler, and C. Moore, "Exploiting ILP, TLP, and DLP with the
polymorphous TRIPS architecture," Micro, IEEE, vol. 23, pp. 46-51, 2003.
[29] M. Gschwind and D. Maurer, "An extendible MIPS-I processor kernel in VHDL
for hardware/software co-design," in European Design Automation Conference,
Geneva, 1996, pp. 548-553.
[30] L. Aimar, L. Merritt, E. Petit, M. Chen, J. Clay, M. Rullgard, R. Czyz, C. Heine,
A. Izvorski, and A. Wright, "x264," http://developers.videolan.org, 2006.
CHAPTER 5: 
HARDWARE TECHNIQUES FOR EXPLOITING DLP 
5.1. Data-Level Parallelism
This chapter addresses data-level parallelism (DLP, section 3.3) in the context of the 
video coding workloads of this work. 
The DLP section is split up into four sub-sections. Section 5.1.1 gives a brief description
of the process undertaken for vectorising the MPEG-4 encoder, XviD [1]. Here it will be
shown how the programme is analysed, and how new vector instructions are created and
inserted into our custom ISS, which is used to quantify the performance benefits in terms of
relative dynamic instruction count (DIC) obtained through vectorising the encoder. The second
and third sections, 5.2 and 5.3, present two different design methodologies, one based on
traditional RTL coding and the other using novel ESL methodologies, to design and
implement the vector datapath which executes these new vector instructions. Section
5.4 evaluates the power/frequency and area/frequency performance characteristics of
each of these datapaths, illustrating the strengths and weaknesses of each design process.
As described in chapter 3, exploitation of data-level parallelism (DLP) is of paramount
importance in video coding as it allows large quantities of data to be processed in
parallel, in a SIMD fashion, dramatically reducing the instruction count compared to
scalar processors. To be able to exploit DLP, a vector architecture is required; this typically
consists of a vector datapath, a vector register file, and a memory sub-system capable of
handling the large vector data. This vector architecture can either take the form of a
vector co-processor that sits in parallel to the execution datapath of the main scalar
processor, or be encapsulated within that datapath.
5.1.1. Vectorising the XviD encoder
The application chosen for DLP exploitation was the XviD encoder. The vectorisation work
undertaken is based on previous studies by the ESD group at Loughborough University
[2-7]. The steps required to vectorise the encoder are as follows. Firstly, the vector ISA
and state need to be modelled within the C reference code. This C model allows the
system to be modelled quickly at the functional level without the need to specify any
underlying technology or architecture. This level is designed to test the functionality of
the proposed vector instructions, verifying their accuracy by ensuring that the output
of the vector encoder directly corresponds to that of the original scalar encoder.
Assuming this process is successful, the system is simulated on a custom vector ISS
based on the SimpleScalar toolset.
As stated, the vector ISA and state need to be modelled and validated before the
vectorisation of the encoder can take place. This model includes the additional registers
(state) that are required for the vector unit to operate and the extra SIMD instructions.
[Diagram: vector register file of VECTOR_REGS registers, each holding VLMAX 8-bit elements;
scalar register file of 32-bit registers RF[0] to RF[INT_REGS-1]; 32-bit vector length
register VLEN.]
Figure 5-1 Programmers' model used for modelling the vector environment, illustrating the
additional vector and scalar registers required for vector operations.
Figure 5-1 illustrates the vector programmers' model. This is implemented in the C
programming language as follows:
typedef struct
{
    // Vector length register
    int VLEN;
    // Vector register file
    uint8_t VRF[VECTOR_REGS][VLMAX];
} vstateT;
Figure 5-2 C model of the vstateT structure that defines the vector register file (VRF) and
vector length register that are required for vector operations as stated in the programmers'
model.
Figure 5-2 illustrates how the vector registers of the programmers' model are
implemented in C. The major architectural features are the VLMAX parameter and the VRF and
VLEN registers. The parameter VLMAX is a system-wide constant that represents the
maximum width of the vector registers. In the proposed architecture, each vector element
is specified to be 8 bits in length and therefore a vector register is VLMAX*8 bits wide.
The vector register file, VRF, represents the working register file on which the vector
ALU operates. A single 32-bit register is used to store the current vector length, VLEN.
This register specifies the length, in 8-bit blocks, of the portion of the vector register that
contains valid data and is operated on. This system-wide register is of great
importance to the vector instructions since it specifies which of the data items within the
VRF are valid and hence should be operated on and written to.
typedef struct
{
    // Scalar integer registers
    signed int RF[INT_REGS];
} sstateT;
Figure 5-3 C model of the sstateT structure that defines the scalar register file (RF) that is
required for vector operations as stated in the programmers' model.
In addition to the vector registers, the C model of the vector architecture requires a scalar
register file, RF, which contains INT_REGS 32-bit registers as depicted in Figure 5-3.
Custom vector instructions are modelled by specifying C macros that implement their
exact functionality. The first instructions to be defined are the vector load/store instructions
that allow vector operands to be loaded from main memory into the registers and
stored from the VRF to memory. Vector ALU instructions are defined once the vector
load/store instructions have been specified.
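The thesis does not list the load/store macros themselves, so the following is only a minimal sketch of how such instructions might be modelled on the vstateT structure. It assumes that a load or store moves exactly VLEN bytes; the register counts, the value of VLMAX and the store name vstb are hypothetical, and plain functions are used in place of macros for readability.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VECTOR_REGS 32   /* assumed number of vector registers */
#define VLMAX       16   /* assumed maximum vector length in bytes */

typedef struct {
    int     VLEN;                      /* number of valid 8-bit elements */
    uint8_t VRF[VECTOR_REGS][VLMAX];   /* vector register file */
} vstateT;

static vstateT vstate;

/* Load the vector length register. */
static void ldvlen(int len)
{
    assert(len >= 0 && len <= VLMAX);
    vstate.VLEN = len;
}

/* Vector load byte: copy VLEN bytes from memory into vector register vrd
 * (assumed semantics). */
static void vldb(int vrd, const uint8_t *addr)
{
    assert(vrd >= 0 && vrd < VECTOR_REGS);
    memcpy(vstate.VRF[vrd], addr, (size_t)vstate.VLEN);
}

/* Hypothetical vector store byte: copy VLEN bytes from register vrs to memory. */
static void vstb(int vrs, uint8_t *addr)
{
    assert(vrs >= 0 && vrs < VECTOR_REGS);
    memcpy(addr, vstate.VRF[vrs], (size_t)vstate.VLEN);
}
```

Because only VLEN elements are transferred, a partially valid vector never reads or writes memory beyond the data it actually represents.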
The process of defining vector ALU instructions is an intuitive one. Through profiling the
XviD encoder, specific functional blocks within the code have been identified as potential
locations where DLP exploitation could be successful. In such locations, the inner-most
loop of the algorithm is studied and a combination of possible SIMD operations
considered. Through studying the dataflow in these functions a finite set of vector
instructions is defined. With this finite set of vector instructions defined, the process of
vectorising and testing is undertaken.
for (j = 0; j < 8; j++)
{
    for (i = 0; i < 8; i++)
    {
        sad += ABS(*(ptr_cur + i) - *(ptr_ref + i));
    }
    ptr_cur += stride;
    ptr_ref += stride;
}
Figure 5-4 Reference sum of absolute difference (SAD) C code present within the XviD
encoder.
Figure 5-4 illustrates one of the most computationally expensive functions within the
video encoder, the sum of absolute difference computation. This is the original C code
present within the XviD encoder that examines each pixel in both the current frame and
reference frame MBs and produces a cumulative total of the absolute pixel differences.
Figure 5-5 through Figure 5-9 demonstrate a number of operations used to make up the
vectorised implementation of the original C loop of Figure 5-4.
uint8_t *from1 = (uint8_t *)(ptr_cur);
uint8_t *from2 = (uint8_t *)(ptr_ref);
Figure 5-5 Assignment of the inputs of the function, in this case the current and reference
frame pointers, to vector pointers representing vector registers.
The first step when vectorising a function is to identify the inputs and outputs of that
function and how they can be represented as vectors. Associating these vector streams
with 8-bit pointers allows the data to be represented using the 8-bit vector space
within the VRF.
sclr(1); // Clear scalar register
Figure 5-6 Clearing of scalar register 1 for future use within the SAD algorithm through the
use of the sclr() instruction.
The output from this SAD function is a scalar value and thus, before any calculations can
take place, the scalar register that will hold the result in our model has to be cleared. This
is done by issuing the sclr(...) instruction and passing to it the number of the scalar
register to be cleared.
ldvlen(VLMAX);
for (i = 0; i < 8 / VLMAX; i++)
{
    vldb(1, from1);
    vldb(2, from2);
    vsad(1, 1, 2); // perform sad and store result in scalar reg
    from1 += VLMAX;
    from2 += VLMAX;
}
Figure 5-7 Vectorised inner loop of the SAD algorithm illustrating the use of vector loads
(vldb) and the vector instruction vsad.
Figure 5-7 illustrates the main vector loop of the function in Figure 5-4. The first task is
to load the vlen register with VLMAX using the ldvlen(VLMAX) instruction. This
indicates that the whole of the vector length contains valid data. This is true while
executing the loop since it only deals with whole vector lengths. Here, as with TLP, the
original loop range is decreased by dividing the maximum loop iteration by the number of
vector elements available, VLMAX. Doing this in combination with incrementing the
vector pointers, from1 and from2, by VLMAX allows each iteration of the loop to
access the next set of vector data. Within this vector loop three custom vector instructions
are executed. The first two are vector loads which load the data from locations from1 and
from2 and put them into vector registers 1 and 2 respectively. The main SAD
functionality in the loop has been implemented within the vector instruction vsad(...).
This instruction takes three parameters: the output scalar register and the two input
vector registers.
if (8 % VLMAX)
{
    ldvlen(8 % VLMAX);
    vldb(1, from1);
    vldb(2, from2);
    vsad(1, 1, 2); // perform sad and store result in scalar reg
}
Figure 5-8 Remainder (stripmined) instruction calls executed if the original maximum
loop parameter, 8, is not exactly divisible by VLMAX. Note the setting of the vlen register
to specify the valid elements within the vector.
Since the original loop maximum iteration value, in this case 8, may not be exactly
divisible by VLMAX, a remainder section is inserted to ensure that all the original data are
processed, as shown in Figure 5-8. This code is only executed if the modulus operation
8%VLMAX is non-zero. If this is the case, the vector instructions are executed again. This
time, however, only a sub-set of the vector register block will be valid. To represent this
partially filled vector register the vlen register is updated by issuing the
ldvlen(8%VLMAX) instruction. vlen now indicates which subset of scalar elements
within the vector register are valid, and all further vector instructions access and update
only these scalar elements.
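The stripmining pattern above generalises to any row length n. The following self-contained sketch models it with plain functions standing in for the thesis macros; the register counts, the value of VLMAX, the function sad_stripmined and the scalar reference sad_scalar are assumptions made for illustration, and vldb is assumed to load exactly VLEN bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define VECTOR_REGS 4   /* assumed */
#define INT_REGS    4   /* assumed */
#define VLMAX       8   /* assumed vector length in bytes */

static struct { int VLEN; uint8_t VRF[VECTOR_REGS][VLMAX]; } vstate;
static struct { int RF[INT_REGS]; } sstate;

static void ldvlen(int len) { assert(len >= 0 && len <= VLMAX); vstate.VLEN = len; }
static void sclr(int rd)    { sstate.RF[rd] = 0; }
static void vldb(int vrd, const uint8_t *p)
{
    memcpy(vstate.VRF[vrd], p, (size_t)vstate.VLEN);
}
static void vsad(int rd, int vrs1, int vrs2)
{
    for (int i = 0; i < vstate.VLEN; i++)   /* valid elements only */
        sstate.RF[rd] += abs(vstate.VRF[vrs1][i] - vstate.VRF[vrs2][i]);
}

/* Stripmined SAD over n bytes: whole vectors first, then one partial vector. */
static int sad_stripmined(const uint8_t *a, const uint8_t *b, int n)
{
    sclr(1);
    ldvlen(VLMAX);
    for (int i = 0; i < n / VLMAX; i++) {
        vldb(1, a);
        vldb(2, b);
        vsad(1, 1, 2);
        a += VLMAX;
        b += VLMAX;
    }
    if (n % VLMAX) {                        /* remainder section */
        ldvlen(n % VLMAX);
        vldb(1, a);
        vldb(2, b);
        vsad(1, 1, 2);
    }
    return sstate.RF[1];
}

/* Scalar reference used to check the vector model. */
static int sad_scalar(const uint8_t *a, const uint8_t *b, int n)
{
    int sad = 0;
    for (int i = 0; i < n; i++)
        sad += abs(a[i] - b[i]);
    return sad;
}
```

With VLMAX set to 8, a row of 20 bytes is processed as two full vectors followed by a remainder of four valid elements, and the result matches the scalar reference.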
sstw(1, &sad);
Figure 5-9 Scalar store instruction to return the computed SAD value to the C variable sad.
The final operation in this specific function is to save the newly completed scalar SAD
output back to main memory. This is done by issuing the sstw(...) instruction, which
stores the contents of a scalar register to a given location (&sad) within the memory.
#define vsad(rd, vrs1, vrs2) \
({ \
    extern vstateT vstate; \
    extern sstateT sstate; \
    int index; \
    for (index = 0; index < vstate.VLEN; index += 1) \
    { \
        sstate.RF[rd] += ABS(vstate.VRF[vrs1][index] - \
                             vstate.VRF[vrs2][index]); \
    } \
});
Figure 5-10 Vector instruction for sum of absolute difference. Executing SAD operations on
all valid vector elements as defined by vlen.
Figure 5-10 shows the C macro implementation of the vsad vector instruction. As
illustrated in Figure 5-7, this instruction takes as its inputs the two vector registers and
produces one scalar value. The instruction implements the SAD functionality for all valid
scalar elements within the given data vectors. This is achieved by iterating through each
scalar element within the valid range (vlen) and executing the appropriate code.
The fully vectorised XviD encoder requires 41 vector instructions. Of these, 10 were
load/store based, leaving 31 algorithmic instructions. After the creation of these 41
instructions and the vectorisation of each function was completed, the system was validated
by encoding several test sequences with both the original and vectorised encoders and
performing a binary comparison of the resulting bitstream files. With the encoder
successfully verified, the performance improvement in the DIC metric was measured.
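The binary comparison used for validation can be sketched as a byte-by-byte check of the two bitstream files. This is a minimal illustration and not the tool used in the thesis; the function name bitstreams_identical and its return convention are assumptions.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if the two files are byte-identical, 0 if they differ in content
 * or length, and -1 if either file cannot be opened. */
static int bitstreams_identical(const char *path_a, const char *path_b)
{
    assert(path_a != NULL && path_b != NULL);
    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");
    int result = -1;

    if (fa != NULL && fb != NULL) {
        int ca, cb;
        do {                       /* stop at the first mismatch or at EOF */
            ca = fgetc(fa);
            cb = fgetc(fb);
        } while (ca == cb && ca != EOF);
        result = (ca == cb);       /* equal only if both reached EOF together */
    }
    if (fa != NULL) fclose(fa);
    if (fb != NULL) fclose(fb);
    return result;
}
```

A vectorised function would be accepted only when the call returns 1 for every test sequence, since any single differing byte indicates that a vector instruction does not reproduce the scalar behaviour exactly.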
uint32_t sad8_c(const uint8_t * const cur,
                const uint8_t * const ref, const uint32_t stride)
{
    uint32_t sad = 0;
    uint32_t i, j;
    uint8_t const *ptr_cur = cur;
    uint8_t const *ptr_ref = ref;
#ifdef ORIGINAL
    for (j = 0; j < 8; j++) {
        for (i = 0; i < 8; i++) {
            sad += ABS(*(ptr_cur + i) - *(ptr_ref + i));
        }
        ptr_cur += stride;
        ptr_ref += stride;
    }
#else
    uint8_t *from1 = (uint8_t *)(ptr_cur);
    uint8_t *from2 = (uint8_t *)(ptr_ref);
    sclr(1);
    ldvlen(VLMAX);
    for (j = 0; j < 8; j++) {
        from1 = (uint8_t *)ptr_cur;
        from2 = (uint8_t *)ptr_ref;
        for (i = 0; i < 8 / VLMAX; i++) {
            vldb(1, from1);
            vldb(2, from2);
            vsad(1, 1, 2);
            from1 += VLMAX;
            from2 += VLMAX;
        }
        // Remainder part
        if (8 % VLMAX) {
            ldvlen(8 % VLMAX);
            vldb(1, from1);
            vldb(2, from2);
            vsad(1, 1, 2);
        }
        ptr_cur += stride;
        ptr_ref += stride;
    }
    sstw(1, &sad);
#endif
    return sad;
}
Figure 5-11 8x8 SAD calculation function seen with both the original C and the vectorised 
implementation 
This process starts by inserting each of the 41 vector instructions into the custom vector
ISS (sim-vector). With each instruction successfully inserted into the simulator, the XviD
encoder was modified to call these embedded instructions instead of using the C macro
implementation. This is achieved by using macro switches to select either the C model
code or the appropriate assembly command to execute the desired instruction within the
simulator, as shown in Figure 5-11.
Figure 5-12 depicts the reduction in the relative DIC metric of the vectorised MPEG-4
encoder at varying vector lengths and quality settings. This was achieved by encoding 25
frames of the Tennis test sequence and recording the DIC produced for each VLMAX,
from 8 to 248 bytes, and for quality settings 2, 4 and 6. It can be seen that for each quality
setting chosen, a reduction of 70% in DIC is obtained for a vector length of 32 bytes [7].
This result strongly supports developing the vector architecture for MPEG-4 encoders
and further indicates that a vector length of 32 bytes is sufficient to capture most of the
DLP in that application.
[Plot: relative DIC (0.25 to 0.65) against vector length in bytes (8 to 248), one curve for
each of Quality 2, Quality 4 and Quality 6.]
Figure 5-12 Reduction in relative DIC observed through vectorising the XviD MPEG-4 
encoder with a vector length (VLMAX) ranging between 8 and 248. 
5.1.2. Hardware implementations
This section addresses the development of a vector datapath using established
methodologies (RTL) and ESL, having as the starting point the C-macro definition of a
number of vector instructions. This RTL datapath is designed to execute the 31 vector
instructions identified through the course of vectorising the MPEG-4 encoder and which
achieved the DIC reduction seen in Figure 5-12. Development of the vector datapath
using two methods leads to interesting conclusions on speed/power/throughput as well as
the maturity of existing ESL tools.
In order to arrive at a valid comparison of each methodology's ability to implement
the required datapath, both RTL and ESL designs must have the same constraints for
inputs, outputs and number of stages. The vector width of the datapaths is set by the
compile-time constant VLMAX, which represents how many bytes the vector length will
be. This physically hard-wires the size of the vector registers to VLMAX bytes.
Each datapath was designed with a theoretical upper limit for VLMAX of infinity.
Obviously, physical constraints and performance benefits limit this. The lower limit of
VLMAX is dependent on the vector instructions being implemented. In this specific vector
instruction set the smallest allowed value of VLMAX is 8 due to micro-architectural
constraints in cross vector operators (splat, extract etc.). The data inputs to the datapaths
consist of three vector operands (op2, op3 and op4), one 32-bit scalar (rs), a 5-bit vector
for instruction selection (select_op) and a signal to specify the vector length (vlen,
log2 VLMAX bits wide). The outputs of the datapath consist of one vector result (op1), one
scalar result (rd) and a single bit to indicate which of these two outputs is valid
(vector_scalar). The latency of the datapath was set to two clocks, which
allows the vector computation to be performed in stage 1 while the add-reduce is
performed in stage 2.
5.2. VHDL (RTL) Implementation
In the RTL implementation of the datapath three functional units were created. The main
vector computation is carried out in a replicated logic block entity, velement. The pack
and merge instructions, which cannot fit into the repetitive structure of velement as they
access all elements of all vector sources, are carried out in entity pack_merge. Both
these units are instantiated within stage one of the datapath. The single vector result from
either of these units, along with the pipelined scalar input, vlen and opcode, are passed
into stage two where the third functional unit, add_tree, resides. This two-stage design
as well as the three functional logic blocks are shown in Figure 5-13.
[Schematic: two-stage datapath with velement and pack_merge in stage one and add_tree in
stage two.]
Figure 5-13 Schematic view of the RTL vector datapath illustrating the three main functional
logic blocks: velement, pack_merge and add_tree; as well as the 2 stage design.
5.2.1. Vector element unit
The vector element unit (velement) is a block of logic that takes the vector source
operands and performs a vector operation on the valid data elements within the vectors.
Due to the replicating nature of a vector instruction, the velement unit is instantiated for
each 32 bits of the input vectors. Examining the vector macro instructions produced
previously, it is found that the minimum bit width of velement is 32 bits. Each 32-bit
velement is therefore instantiated with a VHDL generate loop, with each element
connected to the corresponding data range of the vector inputs and outputs.
[Diagram: the four functional blocks mult, shifter, add_sub and misc.]
Figure 5-14 Illustration of the four replicated vector element units.
Within the velement, eighteen vector instructions are implemented. These instructions
are subdivided into four categories: add_sub, mult, shifter and misc. The first three, as
their names suggest, perform addition and subtraction, multiplication, and shifting of the
data. The remaining instructions implemented within velement which do not fit
these categories are implemented inside a miscellaneous block (misc entity). Below, each
functional block is described in detail, with the combination of all four blocks to form
velement described after.
5.2.1.1. Addition and subtraction unit
add_sub entity (inputs: addsub_input1, addsub_input2, add_sel; outputs: addsub_output,
add_cout)
Instruction   Description
VADD16        2x 16-bit addition
VADD32        1x 32-bit addition
VSUB16        2x 16-bit subtraction
VSUB32        1x 32-bit subtraction
VNEGATE       2x 16-bit negation
VABS16        2x 16-bit negation
VAD           4x 8-bit absolute difference
Figure 5-15 Seven vector instructions and their accompanying description present within the
add_sub entity.
The add_sub unit executes seven instructions, all requiring the use of adders. The unit
has in total four 8-bit adders arranged in a manner that allows 4x 8-bit, 2x 16-bit or
1x 32-bit addition operations to be carried out. This is implemented by daisy chaining the
adders together and by selectively linking the carry-out of one adder to the carry-in of the
next. Binary subtraction uses this same set of 4x 8-bit adders and differs only in the carry-in
to the adders and in that the input signal addsub_input2 is inverted. To
achieve 16-bit negation, as required for VNEGATE and VABS16, the number system used
to describe negative numbers needs to be consistent. This datapath, like most modern
computers, uses two's complement arithmetic for signed numbers. To produce the
negative of a two's complement value, the operand is first inverted and then incremented
by 1. This is achieved within the add_sub unit by setting the 4 adders into 2x
16-bit mode, inverting the first input addsub_input1 and setting addsub_input2
to 0x00010001 (two 16-bit 1s). The VAD instruction, which is part of the overall VSAD
operation, is performed within the add_sub unit at 8-bit granularity. The 8-bit operands
are subtracted from each other and the difference is passed to the misc unit where the
absolute value is calculated. The sign of these differences is produced as the 4-bit signal
add_cout.
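The carry-chaining scheme can be modelled behaviourally in C. This is only a sketch of the idea, not the VHDL of the thesis: the function names, the width parameter and the encoding of the carry-outs as one bit per byte lane are assumptions. Under this model a carry-out of 1 from an 8-bit subtraction means that the difference is non-negative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Four 8-bit adders with selectable carry links. width (8, 16 or 32) selects
 * the element size; subtract inverts input 2 and injects a carry-in of 1 at
 * the start of each element. cout, if not NULL, receives one carry-out bit
 * per byte lane (bit 0 = least significant lane). */
static uint32_t addsub(uint32_t in1, uint32_t in2, int width, int subtract,
                       unsigned *cout)
{
    assert(width == 8 || width == 16 || width == 32);
    if (subtract)
        in2 = ~in2;                              /* a - b = a + ~b + 1 */
    uint32_t out = 0;
    unsigned carry = 0, couts = 0;
    for (int lane = 0; lane < 4; lane++) {
        if ((lane * 8) % width == 0)             /* lane starts a new element */
            carry = subtract ? 1u : 0u;
        unsigned a = (in1 >> (8 * lane)) & 0xFFu;
        unsigned b = (in2 >> (8 * lane)) & 0xFFu;
        unsigned sum = a + b + carry;            /* one 8-bit adder */
        carry = sum >> 8;                        /* linked to the next lane */
        couts |= carry << lane;
        out |= (uint32_t)(sum & 0xFFu) << (8 * lane);
    }
    if (cout != NULL)
        *cout = couts;
    return out;
}

/* 2x 16-bit negation as described above: invert input 1, add 0x00010001. */
static uint32_t vnegate16(uint32_t in1)
{
    return addsub(~in1, 0x00010001u, 16, 0, NULL);
}
```

For example, adding 1 to 0x0000FFFF propagates the carry into the upper half in 32-bit mode but not in 16-bit mode, which is the behaviour obtained by selectively linking the adders.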
Figure 5-16 Add_sub unit. 
5.2.1.2. Multiplication unit
mult entity (inputs: mult_input1, mult_input2 and a select input; output: mult_output)
Instruction   Description
VMULT18       2x 32-bit multiplication, lower 32-bit output
VMULTE16      lower 16-bit inputs, 2x 16-bit multiplication
VMULTO16      upper 16-bit inputs, 2x 16-bit multiplication
Figure 5-17 Three vector instructions and their accompanying description present within the
mult entity.
The multiplication unit carries out three instructions. The major datapath component in
this unit is a single 32-bit multiplier. For the instruction VMULT18, the two 32-bit
operands, mult_input1 and mult_input2, are connected directly to the multiplier
inputs and the lower 32 bits of the product are produced on the output mult_output. Overflows
occurring from this multiplication instruction are not recorded. In contrast, the other two
multiplication operations use only 16 bits from each input operand and produce a full
32-bit output, maintaining full representation of the product. Since the latter instructions
operate at the full 32-bit precision, even though only 16 of these bits are used in the
multiplication, the sign of each 16-bit input needs to be recorded. When operating on the
lower 16 bits from each input operand, with instruction VMULTE16, the sign bit is unused
and these 16-bit signals are zero extended to 32 bits before being driven to the multiplier.
On the other hand, the upper bits processed during VMULTO16 contain the sign bit and
this has to be taken into account when sign extending to 32 bits. Figure 5-18 depicts the
multiplier and the sign extension logic for the latter two operations.
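The extension logic for the two 16-bit multiplications can be sketched in C as follows. This is a behavioural illustration under the assumptions stated in the text (zero extension of the lower halves, sign extension of the upper halves, a single 32-bit product); the function names and the helper sext16 are hypothetical, and the exact instruction semantics in the thesis may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend a 16-bit value held in the low half of a 32-bit word. */
static int32_t sext16(uint32_t half)
{
    assert(half <= 0xFFFFu);
    return (int32_t)(half ^ 0x8000u) - 0x8000;
}

/* Lower halves: zero extended before being driven to the multiplier. */
static uint32_t vmulte16(uint32_t in1, uint32_t in2)
{
    uint32_t a = in1 & 0xFFFFu;
    uint32_t b = in2 & 0xFFFFu;
    return a * b;                      /* at most 0xFFFE0001, fits in 32 bits */
}

/* Upper halves: contain the sign bit, so they are sign extended. */
static uint32_t vmulto16(uint32_t in1, uint32_t in2)
{
    int32_t a = sext16(in1 >> 16);
    int32_t b = sext16(in2 >> 16);
    return (uint32_t)(a * b);          /* magnitude at most 2^30 */
}
```

Because each operand is limited to 16 significant bits, the 32-bit product is always exact, which is why no overflow needs to be recorded for these two instructions.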
Figure 5-18 Mult unit.
5.2.1.3. Shifter units
shifter entity (inputs: shifter_input, shifter_amount and a select input; output:
shifter_output)
Instruction   Description
VSHR16_L      2x 16-bit shift right
VSHR32_L      1x 32-bit shift right
VSHL32_L      1x 32-bit shift left
Figure 5-19 Three vector instructions and their accompanying description present within the
shifter entity.
The shifter entity is responsible for executing three operations that make use of
bitwise shifts. The unit has a 32-bit value input, shifter_input, an input which specifies
the amount of shifting to perform and a direction input. To perform these operations, two
16-bit barrel shifters are used. Each such barrel shifter allows a 16-bit number to be
shifted by a programmable number of bits in either direction. By controlling both the
carry-in and the carry-out signals of each barrel shifter it was possible to implement
both 2x 16-bit and 1x 32-bit shift operations in either direction. To preserve the
sign of shifter_input when executing the VSHR32_L instruction, the carry-in of the
barrel shifter representing the upper 16 bits, shift32cin, is set to either 0x0000 or
0xFFFF for a positive or negative value of shifter_input respectively. The least significant
bits barrel shifter, which in this configuration is daisy chained with the most
significant bits barrel shifter, has its carry-in, shift16cin, connected to the carry-out of
the MSB barrel shifter, shift32cout. When executing the shift left instruction, VSHL32_L,
the shift-in bitstream of the LSB barrel shifter is set to 0x0000 and the shift-out bitstream
of the MSB barrel shifter is discarded. When executing the final instruction, VSHR16_L,
the two barrel shifters work independently with no connections linking the shifters
together.
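The daisy-chained right shift can be modelled behaviourally as two 16-bit shifters linked by a 16-bit carry word. This is a sketch of the mechanism described above, not the VHDL of the thesis: the function names are hypothetical, the carry-in is modelled as a 16-bit word whose low bits enter at the top of the shifter, and shift amounts are limited to 0..16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 16-bit right shifter. The low n bits of cin enter at the top of the
 * result; the n bits shifted out of v are returned through cout. */
static uint16_t shr16(uint16_t v, unsigned n, uint16_t cin, uint16_t *cout)
{
    assert(n <= 16);
    uint32_t wide = ((uint32_t)cin << 16) | v;
    if (cout != NULL)
        *cout = (uint16_t)(v & ((1u << n) - 1u));
    return (uint16_t)(wide >> n);
}

/* 1x 32-bit sign-preserving right shift built from the two 16-bit shifters. */
static uint32_t vshr32(uint32_t in, unsigned n)
{
    uint16_t upper = (uint16_t)(in >> 16);
    uint16_t lower = (uint16_t)(in & 0xFFFFu);
    /* shift32cin: 0x0000 for a positive input, 0xFFFF for a negative one */
    uint16_t shift32cin = (in & 0x80000000u) ? 0xFFFFu : 0x0000u;
    uint16_t shift32cout;
    uint16_t hi = shr16(upper, n, shift32cin, &shift32cout);
    /* shift16cin is driven by the carry-out of the upper shifter */
    uint16_t lo = shr16(lower, n, shift32cout, NULL);
    return ((uint32_t)hi << 16) | lo;
}
```

Shifting 0x80000000 right by four places in this model gives 0xF8000000: the sign bits enter the upper shifter through its carry-in, while the bits leaving the upper shifter re-enter at the top of the lower one.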
[Schematic: two daisy-chained 16-bit barrel shifters.]
Figure 5-20 Shifter unit.
5.2.1.4. Miscellaneous unit
misc entity (inputs: misc_input1, misc_input2, misc_input3, misc_sel; output: misc_output)
Instruction    Description
VEXTRACTE16    Extract upper 16 bits and output to lower 16 bits
VEXTRACTO16    Extract upper 16 bits and output to upper 16 bits
VCMP_H_G16     At 16-bit granularity output input1 when input3 positive, else output input2
VCMP_H_L16     At 16-bit granularity output input1 when input3 negative, else output input2
VCLIP16        Clip each 16 bits to -256<x<255
VSPLAT_W_RD    Copy 32-bit input to output
VSPLAT_H_RD    Copy lower 16 bits to upper and lower 16 bits of output
VSPLAT_B_RD    Copy lower 8 bits to each 8-bit element of output
VABS16         2x 16-bit absolute value calculations
VSAD           4x 8-bit absolute value calculations
Figure 5-21 Ten vector instructions and their accompanying description present within the
misc entity.
The misc unit executes the remaining 10 instructions using five smaller datapath
components: extract, clip, compare, SAD and splat.
Figure 5-22 Misc unit.
Figure 5-23 Extract Functions.
The VEXTRACTE16 and VEXTRACT016 instructions are responsible for copying the upper
16 bits from the source operand, misc_input1, and placing them in a specific bit position
in the destination. In VEXTRACTE16 these 16 input bits are extracted to the 16 LSB
positions of the output operand, whereas in VEXTRACT016 these 16 bits are extracted to
the MSB of the output. To preserve the data already residing within the destination
register, this register's data is passed as an input into the misc block as signal
misc_input2. Using this data, the misc unit selects the 16 bits of the destination that
must be preserved, concatenates the extracted 16 bits to them and finally
replaces the whole 32-bit vector element with this amalgamation.
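A minimal C sketch of the two extract operations on a single 32-bit element follows. The function names and the argument dest_old (standing in for misc_input2) are illustrative assumptions rather than the thesis's own code.

```c
#include <stdint.h>

/* VEXTRACTE16 sketch: the upper 16 bits of src go to the lower 16 bits of
 * the result; the upper 16 bits of the old destination are preserved. */
uint32_t vextracte16(uint32_t src, uint32_t dest_old)
{
    return (dest_old & 0xFFFF0000u) | (src >> 16);
}

/* VEXTRACT016 sketch: the upper 16 bits of src go to the upper 16 bits of
 * the result; the lower 16 bits of the old destination are preserved. */
uint32_t vextract016(uint32_t src, uint32_t dest_old)
{
    return (src & 0xFFFF0000u) | (dest_old & 0x0000FFFFu);
}
```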
Figure 5-24 Clip Function. 
The VCLIP16 instruction is used to constrain the input signal, misc_input1, to within
the range -256 to +255. Clipping is done at 16-bit granularity, so the clipping hardware is
duplicated and the two results concatenated to produce the full 32-bit velement output.
Using two's complement notation this corresponds to a range between binary
1111111100000000 and binary 0000000011111111. By evaluating just the upper 8 bits of
misc_input1 it is possible to determine if the signal fits within this range. If this is the
case then that signal is kept untouched; otherwise one of the above clipping values is
passed to the output, depending on the sign of the input signal (uppermost bit).
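The upper-byte test can be sketched in C as follows. This is an illustrative model under the assumption that each 16-bit lane is held as an unsigned bit pattern; clip16_lane and vclip16 are hypothetical names.

```c
#include <stdint.h>

/* Clip one 16-bit two's complement lane to [-256, 255] by inspecting only
 * its upper 8 bits: all zeros or all ones means the value is in range. */
static uint16_t clip16_lane(uint16_t x)
{
    uint8_t upper = (uint8_t)(x >> 8);
    if (upper == 0x00u || upper == 0xFFu)
        return x;                               /* already within range */
    return (x & 0x8000u) ? 0xFF00u : 0x00FFu;   /* -256 or +255 by sign */
}

/* The lane logic is duplicated and the two results concatenated. */
uint32_t vclip16(uint32_t in)
{
    uint16_t hi = clip16_lane((uint16_t)(in >> 16));
    uint16_t lo = clip16_lane((uint16_t)in);
    return ((uint32_t)hi << 16) | lo;
}
```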
Figure 5-25 Compare Functions. 
The VCMP_H_G16, VCMP_H_L16 and VABS16 instructions have been grouped together
due to the similarity in their implementation. All three of these instructions work at 16-bit
granularity and use a combination of multiplexers to select one or the other of their
inputs. The absolute function takes two input signals, misc_input1 and misc_input2,
where misc_input2 is the negated version of misc_input1 produced by the add_sub
unit. The choice of which of these two inputs is driven to the output is based on the sign
(MSB) of misc_input1. The compare functions, on the other hand, use a third signal to
control which signals are driven to the output. The decision as to which of the inputs is
output within the compare functions is based on whether the third input is greater or
less than zero. By checking the value of the MSB in this third signal its sign is obtained and
hence whether it is greater or less than zero. Since the compare functions are closely
related, the MSB of the third signal is inverted depending on which compare function is
being executed.
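The shared multiplexer structure can be sketched in C for a single 16-bit lane. The function names and the is_less flag are assumptions made for illustration; because only the MSB is examined, a third input of zero is treated as positive in this model.

```c
#include <stdint.h>

/* 16-bit lane of VABS16: in1_neg is assumed to be the negated in1 that the
 * add_sub unit supplies; the MSB of in1 picks between the two. */
uint16_t vabs16_lane(uint16_t in1, uint16_t in1_neg)
{
    return (in1 & 0x8000u) ? in1_neg : in1;
}

/* 16-bit lane shared by VCMP_H_G16 (is_less = 0) and VCMP_H_L16 (is_less = 1).
 * The MSB of in3 is used directly for one form and inverted for the other. */
uint16_t vcmp16_lane(uint16_t in1, uint16_t in2, uint16_t in3, int is_less)
{
    int msb = (in3 >> 15) & 1;
    int pick_in1 = is_less ? msb : !msb;
    return pick_in1 ? in1 : in2;
}
```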
Figure 5-26 Sum of Absolute Differences Function. 
VSAD, unlike the other instructions within the misc unit, operates at 8-bit granularity.
This instruction takes both its inputs from the outputs of the add_sub unit. This is due to
the add/subtract part of this instruction having already been calculated in the latter,
leaving the remainder (the absolute value of each difference) to be calculated in the misc
unit. In the same manner as VABS16, a negation is performed on the input signal, although
this time the negation is at 8-bit granularity instead of 16 bits. Using both the original
add_sub result, misc_input1, and the newly calculated negated value, vnegate8d, a sign
check is performed and the positive value driven to the output. The sign of each is
determined by the four carry-out bits produced by the add_sub unit. Each of these bits
represents the sign of one of the 8-bit input signals.
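A possible C model of this stage-one step for one 32-bit element is sketched below. It assumes that add_sub delivers a wrapped 8-bit difference per lane together with a flag marking a negative result; the function name vsad_abs_diff is hypothetical.

```c
#include <stdint.h>

/* Stage-one part of VSAD for one 32-bit element holding four 8-bit lanes.
 * The per-lane subtraction and its sign flag stand in for the add_sub
 * outputs; the negate-and-select is the part attributed to the misc unit. */
uint32_t vsad_abs_diff(uint32_t a, uint32_t b)
{
    uint32_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t x = (uint8_t)(a >> (8 * lane));
        uint8_t y = (uint8_t)(b >> (8 * lane));
        uint8_t diff = (uint8_t)(x - y);          /* add_sub result (wraps)   */
        int negative = x < y;                     /* sign flag from add_sub   */
        uint8_t negated = (uint8_t)(0u - diff);   /* 8-bit negation           */
        uint8_t abs_diff = negative ? negated : diff;
        out |= (uint32_t)abs_diff << (8 * lane);
    }
    return out;
}
```

The four absolute differences are then summed by the adder tree in stage two.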
Figure 5-27 Splat Functions. 
The VSPLAT instructions, unlike the other instructions within the misc unit, take a single
32-bit scalar input, rs, along with a full vector operand. The same 32-bit value can then be
splat (replicated) across the entire vector length. There are three variations of the splat
instructions, each replicating specific bits from the scalar input to the output vector. These
three splat instructions operate at word (vsplat_w_rd), half word (vsplat_h_rd) and
byte (vsplat_b_rd) level granularity.
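For a single 32-bit element the three granularities can be sketched in C as below. The function names are shortened, hypothetical versions of the instruction names, and the multiplication is simply a compact software way of replicating a field; the hardware would use wiring.

```c
#include <stdint.h>

/* Word splat: the 32-bit scalar is copied unchanged. */
uint32_t vsplat_w(uint32_t rs) { return rs; }

/* Half-word splat: the lower 16 bits fill both 16-bit halves. */
uint32_t vsplat_h(uint32_t rs) { return (rs & 0xFFFFu) * 0x00010001u; }

/* Byte splat: the lower 8 bits fill all four bytes. */
uint32_t vsplat_b(uint32_t rs) { return (rs & 0xFFu) * 0x01010101u; }
```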
From the previous discussion, and as can be seen in Figure 5-14, each block within
velement is responsible for a number of instructions, with certain instructions straddling
two functional blocks. Each of the add_sub, mult, shifter and misc blocks contains its own
decode logic which, if the op-code in velement relates to its specific functionality,
decodes the 7-bit op-code into specific control signals. The output from the selected
datapath is then driven to the output of the velement block after going through a final
stage of multiplexing.
5.2.2. Pack/merge unit 
[Block diagram: pack_merge unit with inputs op2, op3, select_op and vlen, and output op1]
Instruction      Description
vpack16to8       Pack lower 8 bits of each 16-bit block from input to output
vpack32to16      Pack lower 16 bits of each 32-bit block from input to output
vunpacku8to16    Unpack 8-bit input to lower 16 bits in output, zero extend
vunpack16to32    Unpack 16-bit input to lower 32 bits in output, sign extend
vmerge_hi16      Merge lower half of input 1 and 2 at 16-bit granularity
vmerge_hi32      Merge lower half of input 1 and 2 at 32-bit granularity
vmerge_lo16      Merge upper half of input 1 and 2 at 16-bit granularity
vmerge_lo32      Merge upper half of input 1 and 2 at 32-bit granularity
Unlike velement, which contains instructions performing arithmetic operations, the
pack/merge unit consists entirely of instructions involved in data organisation within the
vector registers. To accommodate this intra-vector manipulation, the pack_merge unit
processes the vector registers as a whole. This is the opposite of velement, which operates
on a single 32-bit quantity of the vector register and is hence instantiated multiple times.
for (i = 0; i < 8; i++)
{
    uint8_t c = cur[j * stride + i];
    uint8_t r = ref[j * stride + i];
    cur[j * stride + i] = r;
    dct[j * 8 + i] = (int16_t) c - (int16_t) r;
}
Figure 5-28 The original unvectorised transfer_8to16sub_c() function present within the 
XviD encoder. 
Figure 5-28 depicts the transfer_8to16sub_c() function, which is used within the XviD
encoder to transfer data to and from the 8-bit current and reference frame buffers. It
expands these values to 16 bits, subtracts them and finally stores them in the 16-bit DCT
coefficients buffer. This function has potential for vectorisation due to its single repetitive
data operation (subtraction).
uint8_t *from1 = (uint8_t*)(cur + j*stride);
uint8_t *from2 = (uint8_t*)(ref + j*stride);
uint8_t *to1 = (uint8_t*)(cur + j*stride);
uint8_t *to2 = (uint8_t*)(dct + j*8);
ldvlen(VLMAX);
for (i = 0; i < 16/VLMAX; i++)
{
    ldvlen(VLMAX/2);
    vldb(1, from1);         // load cur into vector register 1
    vldb(2, from2);         // load ref into vector register 2
    vstb(2, to1);           // store vector register 2 (ref) to cur
    ldvlen(VLMAX);
    vunpacku8to16(4, 1);    // (int16_t) c
    vunpacku8to16(5, 2);    // (int16_t) r
    vsub16(3, 4, 5);        // dct = c - r
    vstb(3, to2);           // store vector register 3 into dct
    from1 += VLMAX/2;       // realign pointers for next iteration
    from2 += VLMAX/2;       //
    to1 += VLMAX/2;         //
    to2 += VLMAX;           //
}
Figure 5-29 Vectorised transfer_8to16sub_c() function illustrating vector manipulation of
the vunpacku8to16() vector instruction.
As described in section 5.2.1.1 and illustrated in Figure 5-29, the velement unit
implements a vector subtraction instruction (vsub16) which subtracts two vectors
containing 16-bit values and produces a third vector containing the results, also 16 bits wide.
However, in order for this instruction to be used, the input vector registers must contain
16-bit elements. One of the instructions implemented within the pack/merge unit is
vunpacku8to16, which can be used in this example to expand the 8-bit values in the lower
half of the input vector register and produce a complete vector register containing the
expanded 16-bit elements.
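The unpack-then-subtract sequence can be emulated in plain C on byte arrays. The sketch below assumes a register of VLMAX = 16 bytes and little-endian storage of 16-bit elements (low byte at the even index, as in the C macros); the function names unpacku8to16 and sub16 are illustrative and are not the vector intrinsics themselves.

```c
#include <stdint.h>

#define VLMAX 16  /* bytes per vector register; value chosen for illustration */

/* Zero-extend the lower VLMAX/2 bytes of src into VLMAX/2 16-bit elements,
 * stored little-endian (low byte at the even index). */
void unpacku8to16(uint8_t dst[VLMAX], const uint8_t src[VLMAX])
{
    for (int i = 0; i < VLMAX / 2; i++) {
        dst[2 * i]     = src[i];
        dst[2 * i + 1] = 0;
    }
}

/* Element-wise 16-bit subtraction dst = a - b over the whole register. */
void sub16(uint8_t dst[VLMAX], const uint8_t a[VLMAX], const uint8_t b[VLMAX])
{
    for (int i = 0; i < VLMAX; i += 2) {
        uint16_t x = (uint16_t)(a[i] | (a[i + 1] << 8));
        uint16_t y = (uint16_t)(b[i] | (b[i + 1] << 8));
        uint16_t d = (uint16_t)(x - y);
        dst[i]     = (uint8_t)d;
        dst[i + 1] = (uint8_t)(d >> 8);
    }
}
```

Applying unpacku8to16 to the cur and ref bytes and then sub16 to the two results gives the same 16-bit differences as the scalar loop of Figure 5-28.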
Due to this requirement to manipulate data present within the vector registers in order for
them to be in the correct location and granularity for the following arithmetic operation,
the pack/merge unit implements 8 instructions. These 8 instructions fall into two distinct
categories: pack/unpack and merge functions.
5.2.2.1. Pack and Unpack functions
The pack/unpack operations comprise four instructions: two for pack and two for
unpack.
[Figure: vpack16to8u and vpack32to16u blocks feeding mask_process]
Figure 5-30 Pack Functions. 
Both pack instructions extract data from a source vector register and pack it into the
lower portion of the destination vector register. Since that data comes from the entire
length of the vector into the least significant half of the vector, these instructions cannot
be processed in the replicated vector unit. Both pack instructions, vpack16to8u and
vpack32to16u, produce an intermediate vector comprising data packed from the
entire source register. Since the valid vector data held within the source register is not
necessarily of length VLMAX, data within the source register in the range [vlen, VLMAX]
are discarded. Since the valid packed data produced by both these instructions are of
length vlen/2, a mask signal of the same length is applied to the intermediate vector,
allowing the valid data, [0, vlen/2], to be preserved while eliminating the out-of-range
data beyond this limit.
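The pack-then-mask behaviour can be sketched in C on byte arrays. The sketch assumes VLMAX = 16 bytes and that vlen counts the valid bytes of the source register; pack16to8 is a hypothetical name for a software model of vpack16to8u followed by the mask process.

```c
#include <stdint.h>

#define VLMAX 16  /* bytes per vector register; value chosen for illustration */

/* Keep the low byte of every 16-bit element of the whole register, then
 * zero everything at or beyond vlen/2. */
void pack16to8(uint8_t dst[VLMAX], const uint8_t src[VLMAX], unsigned vlen)
{
    uint8_t intermediate[VLMAX] = {0};
    for (unsigned i = 0; i < VLMAX / 2; i++)
        intermediate[i] = src[2 * i];                    /* pack full register */
    for (unsigned i = 0; i < VLMAX; i++)
        dst[i] = (i < vlen / 2) ? intermediate[i] : 0;   /* mask to vlen/2     */
}
```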
[Figure: vunpacku8to16u and vunpacku16to32u blocks]
Figure 5-31 Unpack Functions. 
The unpack instructions operate in an opposite manner to the pack functions. In this case
the least significant half of the source register is extracted and inserted into the desired
locations in the destination register. Unlike the pack instructions, which need to mask
elements [0, vlen/2], the unpack instructions need to be masked up to vlen. This, like
the other instructions within this datapath block, is carried out immediately before the
output register of stage one.
5.2.2.2. Merge functions 
Figure 5-32 Merge Functions. 
There are four merge instructions, each responsible for combining data from two input
vectors and outputting the corresponding data to one vector. The choice of which data
range to merge (hi or low) and the granularity to use when merging the data are two
independent actions and are hence arranged to execute in series. The first of these two
sequential blocks is responsible for selecting which data is to be used from the input
vector by the second block. The two possible locations for selecting this data from are the
lower half of the vector, [0, VLMAX*4], or the upper half, [VLMAX*4, VLMAX*8]. This
simple division of the input vector does not take into account the situation when
vlen<VLMAX. If this occurs the midpoint of the vector is not VLMAX*4 but instead
vlen*4. The lower half of the valid vector can still be represented within the range
[0, VLMAX*4]; however, the upper half of the valid vector no longer falls within the
range [VLMAX*4, VLMAX*8]. The range of the upper valid bits of the vector is vlen*4
to vlen*8, which in reality is padded to bits vlen*4 to vlen*4+VLMAX*4. To
achieve this dynamic selection of the upper bits, a right shift by an amount vlen*4 is
implemented and from this the lower VLMAX*4 bits are used as an input to the second stage.
After selecting which data range is required as inputs to the second sequential block, the
two half-vector inputs are driven to vmergeu. vmergeu is a merging unit that can work
at both 16-bit and 32-bit granularities. The resulting merged output from vmergeu is a full
vector register length with [0, vlen*8] bits being valid.
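The dynamic half selection can be illustrated with a short C sketch working on bytes rather than bits, so that the right shift by vlen*4 bits becomes an offset of vlen/2 bytes. It assumes VLMAX = 16 bytes and an even vlen; select_half is a hypothetical name and the interleaving performed afterwards by vmergeu is not modelled.

```c
#include <stdint.h>

#define VLMAX 16  /* bytes per vector register; value chosen for illustration */

/* Select the lower or upper half of the valid part of src (vlen valid
 * bytes). The upper half is found by shifting right by vlen/2 bytes and
 * keeping the low VLMAX/2 bytes of the shifted register. */
void select_half(uint8_t half[VLMAX / 2], const uint8_t src[VLMAX],
                 unsigned vlen, int upper)
{
    unsigned shift = upper ? vlen / 2 : 0;
    for (unsigned i = 0; i < VLMAX / 2; i++) {
        unsigned from = i + shift;
        half[i] = (from < VLMAX) ? src[from] : 0;   /* zeros shift in at the top */
    }
}
```

With vlen = VLMAX the upper half starts at byte VLMAX/2, which matches the simple static division; with a shorter vlen the starting point moves down accordingly.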
With all the vector elements generated and the pack_merge unit implemented, the
bigger picture of the first stage of the vector datapath can be revisited (Figure 5-33).
Figure 5-33 Schematic view of stage one of the RTL based vector datapath illustrating the 
replicated velement process and the single pack_merge process. 
The registered inputs to the datapath are selectively driven to either the velement unit or
the pack_merge unit. Driving only the unit in which the specific instruction will be executed
conserves power due to reduced switching activity in the complex logic blocks. The
appropriate output from either velement or pack_merge is selected and passed on to a
masking process before it appears at the actual output of stage one. The outputs from
these units are full vectors, VLMAX*8 bits wide. Although a full complement of bits is
produced by each unit, the validity of each bit needs to be examined, because
each instruction is only valid for a specific vector length.
Figure 5-34 Stage One Masking Process. 
Since VLMAX is a compile-time constant that cannot be altered at run-time, vlen
indicates the valid scalar elements for each vector operation, at 8-bit granularity. We have
already seen, in some special cases within the pack_merge unit, the use of vlen to
eliminate data from outside this range. For the majority of instructions (those not already
covered in pack_merge) the elimination of invalid data is carried out just before data
exits stage one. A masking process, as shown in Figure 5-34, is implemented to mask out
data in the range [vlen*8, VLMAX*8]. Finally, the other outputs from stage one are the
registered scalar input along with vlen and the op-code, ready for use in stage two.
5.2.3. Stage two 
The second stage in the datapath is responsible for processing the vector outputs from 
stage one. This functionality is necessary for three instructions: vaddreduce16, vacc 
and vsad. 
Figure 5-35 Schematic view of stage 2 of the RTL vector datapath illustrating the add_tree 
logic block. 
When any of these three instructions reaches stage two, the input vector, along with the
pipelined scalar value, rs_r2, is fed into the adder tree. The output from the adder tree is
driven to the scalar output of stage two, whereas the vector output is driven to zero.
Alternatively, when any of the other instructions appears in stage two, the adder tree is
not activated and the vector data is passed through the stage unaffected (pass-through
operation). In addition to these outputs there is a further single-bit output,
scalar_vector, indicating which of the stage 2 outputs is valid.
Figure 5-36 Adder tree logic block composed of a number of adder components arranged
into a tree formation, with the number of adders per row decreasing as they travel down the
tree.
The adder_tree unit is divided into two sections: firstly the adder tree itself, which
adds up the scalar elements of the input vector, and secondly an adder to add this newly
created value to the scalar input from stage one. The adder tree component comprises a
two-dimensional array of signals and adder units implemented using RTL VHDL
generate statements. The initial row within the tree is fully occupied with 8-bit adder units
which take as their inputs the corresponding 8-bit elements from the input vector. As the
sums are produced they are sent down through the tree, while the number of adders is
halved at each level and one bit is added to the width of the sum. This growth in bits of the
adder is to eliminate overflows. The final result of the adder tree is added to rs_r2. This
is not done inside the adder tree component itself since the number of input operands to
the tree needs to be a power of two. The output from the adder tree is zero-extended to
32 bits, whereupon the addition of the scalar input takes place. The resulting 32-bit value is
output from the adder_tree unit and passed back into the second stage, where it is
driven to the output of the datapath.
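The row-by-row reduction can be mirrored in a short C sketch. It assumes VLMAX = 16 byte-wide elements and uses 32-bit intermediate sums throughout, rather than growing the width by one bit per row as the hardware does; add_reduce is a hypothetical name.

```c
#include <stdint.h>

#define VLMAX 16  /* number of 8-bit elements; must be a power of two */

/* Pairwise reduction of the vector followed by the separate scalar add. */
uint32_t add_reduce(const uint8_t v[VLMAX], uint32_t rs)
{
    uint32_t row[VLMAX];
    for (int i = 0; i < VLMAX; i++)
        row[i] = v[i];                       /* zero-extended tree inputs   */
    for (int width = VLMAX; width > 1; width /= 2)
        for (int i = 0; i < width / 2; i++)  /* one row of adders per pass  */
            row[i] = row[2 * i] + row[2 * i + 1];
    return row[0] + rs;                      /* final adder outside the tree */
}
```

Keeping the scalar addition outside the loop reflects the point made above that the tree itself needs a power-of-two number of operands.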
5.3. SystemC Implementation
Figure 5-37 SystemC vector datapath implementation illustrating the three asynchronous
and one synchronous process.
ESL models and methodologies for high-level silicon design have been proposed in the
past few years [8-10]. To provide for a true comparison between the two design flows, the
traditional VHDL approach and the SystemC-based approach, the vector datapath
discussed in section 5.2 was re-implemented in the latter language. The SystemC-based
vector datapath maintains the same 'top elemental' I/F as the VHDL-based one, while it
also follows a 2-stage organisation. In addition to this the SystemC code had to be
synthesisable. Existing compilers, such as the GNU compiler g++ [11], implement a large
subsection of the SystemC LRM [12] compared to that provided by leading ESL tools
such as Celoxica Agility [13]. SystemC can be seen as a set of classes additional to the
C++ compilers. They define all the constructs and data types needed to describe hardware,
such as signals, concurrency and time. Unfortunately the SystemC specification is still
relatively immature, thus leading to inconsistencies in the SystemC subset handled by the
software and hardware compilers. The subset of SystemC that can be translated to
hardware is known as synthesisable SystemC. This subset is not rigid and is very
compiler dependent. As compiler technology advances, more sections of the SystemC
LRM will be able to be translated into hardware. The compiler chosen to carry out this
conversion process from SystemC to Verilog was Celoxica's Agility 1.0. To produce a
synthesisable design starting from SystemC the designer needs knowledge of the hardware
design and the underlying microarchitecture.
Figure 5-38 depicts the SystemC I/F of the vector datapath. It is very similar to the RTL
model and includes 3 input vectors, 1 input scalar, 1 output vector, 1 output scalar,
select_op, vlen, and clock and reset.
public:
    /*
     * Ports
     */
    sc_in<bool> clk;
    sc_in<bool> reset;
    sc_in<bool> clock_enable;
    sc_in<sc_bv<VLMAX*8> > op2;
    sc_in<sc_bv<VLMAX*8> > op3;
    sc_in<sc_bv<VLMAX*8> > op4;
    sc_in<sc_bv<5> > select_op;
    sc_in<sc_bv<32> > rs;
    sc_in<sc_bv<log2vlmax> > vlen;
    sc_out<sc_bv<VLMAX*8> > op1;
    sc_out<sc_bv<32> > rd;
a) SystemC
entity datapath is
port(
    clk          : in  std_logic;
    reset        : in  std_logic;
    clock_enable : in  std_logic;
    op2          : in  std_logic_vector(VLMAX*8-1 downto 0);
    op3          : in  std_logic_vector(VLMAX*8-1 downto 0);
    op4          : in  std_logic_vector(VLMAX*8-1 downto 0);
    select_op    : in  std_logic_vector(4 downto 0);
    rs           : in  std_logic_vector(31 downto 0);
    vlen         : in  std_logic_vector(log2vlmax downto 0);
    op1          : out std_logic_vector(VLMAX*8-1 downto 0);
    rd           : out std_logic_vector(31 downto 0)
);
end datapath;
b) VHDL RTL
Figure 5-38 A comparison between the 10 interface declarations of a) SystemC and b) RTL 
designs. 
VHDL std_logic_vectors correspond to SystemC bit vectors (sc_bv), with the total
number of bits required as a parameter rather than the bit range. Unlike VHDL, where
statements can be concurrent operations (unless specifically specified to run sequentially
inside a process or a function), in SystemC all executable code must reside within
processes.
private:
    /*
     * Processes
     */
    void clock_proc();
    void trans_stage1_proc();
    void trans_stage2_proc();
    void bypass_proc();
Figure 5-39 Declaration of four processes present within SystemC design. 
The vector datapath includes four processes: a synchronous clock process responsible for
updating all of the registered state and 3 combinatorial processes, one for the
functionality required in each stage and a third to combinatorially produce the bypass
circuits.
5.3.1. Process: clock_proc()
The clock process maintains all state within the datapath. All three sets of registers (input,
stage 1 to 2 and output) are placed together in this process. There is no functionality
defined within this process, simply reading and writing of data on the rising edge of the
clock.
op2_sig_r.write(op2.read());
Figure 5-40 Use of read and write properties of SystemC signals to illustrate signal
assignment present within clock_proc() process.
By using the built-in methods .read() and .write() associated with the type sc_signal,
the transfer of information from one signal to the other is accomplished. Figure 5-40
illustrates how the input signal op2 is read and its value written directly to the internal
signal op2_sig_r.
5.3.2. Process: trans_stage1_proc()
The combinatorial transient process of the first stage is responsible for the majority of the
functionality of the datapath. As with the RTL description, all but three instructions are
executed in this stage. Since the macro definitions of the vector instructions are already
described in C, the method of translating them to SystemC is very simple. The most
time-consuming task was the description of the internal connectivity.
select_op_r = (sc_uint<5>)select_op_sig_r.read();
Figure 5-41 Use of casting to read in a specific number of bits from a signal and assigning to 
a variable. 
For direct mapping of signals to variables, as shown in Figure 5-41, reading the signal
and then casting it to the desired variable type is all that is needed.
sc_biguint<VLMAX*8> op2_biguint_r = op2_sig_r.read();
for (int i = 0; i < VLMAX; i++)
{
    op2_r[i] = (sc_uint<8>)(op2_biguint_r >> 8*i);
}
Figure 5-42 Assignment of a bit vector to a variable array by use of an intermediate integer
signal.
However, mapping bit-vector signals into an unsigned 8-bit array is not directly possible.
Firstly the signal is read into an unsigned integer variable, sc_biguint. An sc_biguint
is used rather than an sc_uint because the sc_uint type will only accommodate
numbers with 64 or fewer bits and hence would restrict our datapath to VLMAX<=8;
sc_biguint, on the other hand, does not have such a restriction applied to it. Once in a
variable form, a process of shifting and casting is applied to separate each 8-bit element
to the correct index of the required array.
Once data is converted into variables, instructions are enumerated with an if-else
statement. Within each such statement the macro code is used virtually unaltered. Two
sets of alterations were required though. The first relates to casting: since the variable
widths in the SystemC description are not necessarily those defined in ANSI C, casting
is needed. The second change is related to the loop parameters.
for (index = 0; index < vstate.VLEN; index += 2)\
{\
    src1 = (signed int)(vstate.VRF[vrs][index] |\
           vstate.VRF[vrs][index+1] << 8);\
    src1 = src1 >> (uint8_t)amount;\
    vstate.VRF[vrd][index] = (uint8_t)(src1);\
    vstate.VRF[vrd][index+1] = (uint8_t)(src1 >> 8);\
}\
Figure 5-43 C macro representation of shift right 16 instruction. 
Figure 5-43 shows the original C macro for the shift right 16 instruction. As can be seen,
the maximum number of iterations of the loop is determined by vlen. This is valid
in software, but in hardware a loop whose bound is unknown at compile time is not
possible, since the compiler is unable to instantiate the correct amount of hardware. As
with the RTL model, in SystemC the loops are carried out for the whole vector length.
This is a potential source of deficiency compared to RTL VHDL.
for (index = 0; index < VLMAX; index += 2)
{
    src1 = (sc_int<32>)(op2_r[index] |
           (sc_uint<16>)(op2_r[index+1] << 8));
    src1 = src1 >> (sc_uint<8>)rs_r;
    op1_i[index] = (sc_uint<8>)(src1);
    op1_i[index+1] = (sc_uint<8>)(src1 >> 8);
}
Figure 5-44 SystemC implementation of shift right 16 instruction. 
Figure 5-44 represents the same instruction but implemented in SystemC. Here the
changes to the loop parameters and to the casting of each variable are shown.
As with the RTL model, vsad and vaddreduce have their functionality spread between
stages one and two, and thus the macro codes have been split between the
trans_stage1_proc() and trans_stage2_proc() processes also.
for (int i = 0; i < VLMAX; i++)
{
    if (i < vlen_r)
        op1_mask_i[i] = op1_i[i];
    else
        op1_mask_i[i] = 0;
}
Figure 5-45 Function to mask elements in the array at locations greater than vlen. 
Due to the altering of the maximum loop iteration from vlen to VLMAX, as with the RTL
methodology, a mask is created to invalidate vector elements at index vlen and beyond,
as shown in Figure 5-45.
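The combination of a fixed-bound loop and a trailing mask can be tried out in plain C. The sketch below assumes VLMAX = 8 bytes, a shift amount of 0 to 15 and little-endian 16-bit elements; shr16_masked is a hypothetical name and standard integer types stand in for the SystemC types.

```c
#include <stdint.h>

#define VLMAX 8  /* bytes per vector register; value chosen for illustration */

/* Shift right 16 over the whole register (compile-time loop bound), then
 * zero every byte at index vlen and beyond. */
void shr16_masked(uint8_t dst[VLMAX], const uint8_t src[VLMAX],
                  unsigned amount, unsigned vlen)
{
    uint8_t full[VLMAX];
    for (unsigned i = 0; i < VLMAX; i += 2) {           /* fixed bound */
        uint32_t e = (uint32_t)src[i] | ((uint32_t)src[i + 1] << 8);
        e >>= amount;
        full[i]     = (uint8_t)e;
        full[i + 1] = (uint8_t)(e >> 8);
    }
    for (unsigned i = 0; i < VLMAX; i++)                /* mask        */
        dst[i] = (i < vlen) ? full[i] : 0;
}
```

The first vlen bytes match what a loop bounded by vlen would have produced, while the remaining bytes are forced to zero.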
sc_biguint<VLMAX*8> temp = op1_mask_i[0];
for (int i = 1; i < VLMAX; i++)
    temp = temp + (sc_biguint<VLMAX*8>)(op1_mask_i[i] << 8*i);
op1_sig_i = temp;
Figure 5-46 Assignment of a variable array to a bit vector by use of an intermediate integer
signal.
The final operation within the process is to drive the masked array, op1_mask_i[], to the
output signal of trans_stage1_proc(), op1_sig_i. As was previously shown,
reading signals into variable arrays is not a direct operation, and the same holds in the
reverse direction. To overcome this, an intermediate variable, temp, of type sc_biguint
is used. This variable is filled with the information stored within the array by shifting
each element and adding this shifted value to temp. Once all the array values are mapped
into temp, this is assigned to the final output signal op1_sig_i.
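The two conversions, wide value to byte array and back, can be illustrated in plain C for the case VLMAX = 8, where a uint64_t is wide enough to stand in for sc_biguint. The function names unpack_bytes and pack_bytes are hypothetical; wider registers would need a big-integer type.

```c
#include <stdint.h>

#define VLMAX 8  /* 8 bytes fit a uint64_t; chosen so no big integer is needed */

/* Split a wide value into an array of 8-bit elements by shifting and casting. */
void unpack_bytes(uint8_t out[VLMAX], uint64_t word)
{
    for (int i = 0; i < VLMAX; i++)
        out[i] = (uint8_t)(word >> (8 * i));
}

/* Rebuild the wide value by shifting each element into place and adding. */
uint64_t pack_bytes(const uint8_t in[VLMAX])
{
    uint64_t temp = in[0];
    for (int i = 1; i < VLMAX; i++)
        temp = temp + ((uint64_t)in[i] << (8 * i));   /* widen before shifting */
    return temp;
}
```

In this sketch each element is widened before it is shifted, so that no bits are lost when the shift amount exceeds the element width.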
5.3.3. Process: trans_stage2_proc()
Stage two in the datapath is responsible for the add-reduce operation (used in the VSAD
instruction). As with the combinatorial transient process of stage one, the input signals
are first converted into variables and from there if-else statements are used to select the
operation to be performed. This is required for three instructions, the remainder having
the vector input from stage one driven straight to the output ports of the datapath. As
shown previously, the loop iteration limit needs to be changed from the variable vlen to
the constant VLMAX. Issues arising from adding the entire vector length instead of only the
length indicated by vlen are overcome by the masking performed in stage one.
5.3.4. Process: bypass_proc() 
The bypass process is responsible for providing bypass data back to different stages
within the main vector processor. Bypassed signals are unregistered outputs from each of
the stages within the datapath. Three signals per stage are bypassed: one vector, one
scalar and the current vlen. The bypass process takes as its inputs the output signals
produced at the end of each combinatorial process and sends them directly to the bypass
outputs.
Both combinatorial processes and the bypass process are sensitive only to their specific
input signals. Conversely, the clock process is synchronous and maintains the registers of
the datapath, updating on the rising edge of the clock.
5.4. Power and Area Analysis 
Both the RTL and SystemC-based datapaths were verified using approximately 1 million test
vectors produced by performing encodings of real video sequences, recording the inputs and
outputs from each of the C macros. A communal testbench was created where each
design was instantiated in addition to a VHDL foreign language interface (FLI) entity.
The FLI entity allows the inspection of C code which reads in the test vectors from a text
file, produced by the vectorised simulator, and applies stimulus to the design in the
testbench. Both designs were simulated and verified in Mentor Graphics ModelSim [14].
This software package allows for event-driven simulation and direct comparison between
the outputs from the designs and the expected results produced by the C macros. Once it
had been confirmed that both datapaths produced the very same outputs, the process of
evaluating the 'quality' of each design was undertaken. Quality is a subjective measure
and is wholly based on the specifications imposed on the design. For these designs three
different metrics were studied, namely maximum operating frequency fmax, silicon area
and power consumption. Silicon technology was TSMC's 0.13 µm high speed (HS) [15].
Front-end synthesis was carried out using Synopsys Design Compiler (DC) [16] for
varying VLMAX parameters and various target frequencies. The front-end synthesis
output was an optimised structured Verilog netlist along with timing constraints in
Synopsys Design Constraints (*.sdc) format. Using the netlist produced by DC, each
design was then subjected to a full place and route flow using Cadence First Encounter
(FE) [17], where floorplanning, clock tree synthesis and place and route took place. At this
stage the final silicon area and the maximum operating frequency of the design are
calculated by FE. Along with these figures, FE produces path delays (full static timing
analysis), timing constraint values, interconnect delays in standard delay format (*.sdf)
and a second structured Verilog netlist which represents the final design. The new netlist,
sdf and sdc files are then fed back into DC, where statistical power analysis takes place.
[Plot: power (mW, 0 to 80) against requested frequency (MHz, 100 to 400) for VLMAX 8, 16,
32 and 64]
Figure 5-47 Statistical power consumed on RTL design at various operating frequencies and
for each VLMAX.
Figure 5-47 depicts the statistical power consumption observed for varying VLMAX of the
VHDL-RTL design. Here each measured fmax is plotted against its corresponding power.
This is repeated for various VLMAX values. Figure 5-48 depicts the same set of
measurements but for the SystemC-based design.
[Figure 5-48: plot of power (mW) against requested frequency (MHz, 100 to 225), with one series each for VLMAX 8, 16, 32 and 64]
Figure 5-48 Statistical power consumed on SystemC design at various operating frequencies
and for each VLMAX.
Observing both sets of results, a number of observations can be made. Firstly, the general
shapes of both graphs are similar. It can be seen that as VLMAX increases, so does the
power consumption: increasing VLMAX leads to a larger design, which leads to an increase
in power. Both designs illustrate how the power consumed has a direct relationship to the
maximum clock frequency and the final speed at which the design operates. As the synthesis
tool is pushed to optimise the design at higher frequencies, the power dissipated increases.
This research is particularly interesting as it directly compares two designs of the same
functionality, designed with different methodologies. A significant difference in power
consumption is observed for all VLMAX and fmax values tested; this can be seen as a fairly
constant three-fold increase in power consumption for the SystemC design as compared
to RTL. At a vector length of 8 and over a frequency range of 100 to 200 MHz the RTL design
consumed 2.3 to 4.6 mW whereas the SystemC design consumed 6.02 to 12.79 mW.
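The ratio implied by the vector length 8 figures quoted above can be checked with a few lines of arithmetic. The sketch below uses Python purely for illustration. The variable names are hypothetical, and pairing the lower quoted value of each range with 100 MHz and the upper value with 200 MHz is an assumption about how the ranges are meant to be read.

```python
# Power figures (mW) quoted in the text for VLMAX = 8.
# Assumption: the first value of each range is the 100 MHz point
# and the second value is the 200 MHz point.
rtl_mw = {100: 2.3, 200: 4.6}
systemc_mw = {100: 6.02, 200: 12.79}

def power_ratio(freq_mhz):
    """Return SystemC power divided by RTL power at one frequency."""
    return systemc_mw[freq_mhz] / rtl_mw[freq_mhz]

ratios = {f: round(power_ratio(f), 2) for f in sorted(rtl_mw)}
print(ratios)
```

Under that reading the ratio lies between roughly 2.6 and 2.8, which is in line with the description of a fairly constant three-fold increase.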
[Figure 5-49: plot of area against requested frequency (MHz, 100 to 400), with one series each for VLMAX 8, 16, 32 and 64]
Figure 5-49 Area of RTL design at varying operating frequencies and for each VLMAX.
In addition to studying the power consumption of each design methodology, the physical
area of each design was recorded. Figure 5-49 shows how this area varies for different
values of VLMAX and for different fmax for the RTL design. The corresponding SystemC
graph is depicted in Figure 5-50.
[Figure 5-50: plot of area against requested frequency (MHz, 100 to 225), with one series each for VLMAX 8, 16, 32 and 64]
Figure 5-50 Area of SystemC design at varying operating frequencies and for each VLMAX.
As with the results obtained for power, the area graphs show both similarities and
differences. The main property of these area graphs is that, for a given VLMAX, the
required area shows only a marginal change, with a slight tendency toward higher area at
higher frequency. The vast majority of the silicon area within the chip is accounted for by
the logic gates performing the functionality of the device. As the frequency requirement of
the device increases, various methods are used to allow the chip to operate at this higher
frequency. These methods often require an increase in silicon area, due to faster and
larger buffers being employed as well as place and route being optimised for the critical
paths, which affects the size and location of routes for other non-critical paths within the
system. These methods, employed by the back-end tool to reach higher maximum clock
frequencies, have an adverse effect on both power and area. When studying the change
in area over frequency scaling for a single VLMAX, the trend of increasing area with
frequency would be more apparent. In this study, and in Figure 5-49 and Figure 5-50, the
emphasis is on the relationship between the area of the device for each VLMAX and for
each design methodology. When studying at this scale, the effect of the operating
frequency on the area of a device is less apparent than the effect of the dramatic increase
in logic required for each change in VLMAX. The effect of vector length on the quantity of
logic required is evident when the RTL code is studied. With each increase in VLMAX an
additional vector element is instantiated within the design. This increase can also be seen
within the pack_merge unit, as it grows to include an ever increasing vector length, and
within the adder tree, where extra adders and rows are required to add up these additional
bytes. The increase in area required for an increased vector length completely
overshadows the increase due to frequency increases. For these reasons both graphs
show a near-parallel set of lines for both the RTL and SystemC designs. What can be seen,
though, is the dramatic difference between these two design methodologies. From the
results obtained it is seen that the SystemC design is a constant 30% greater in area than
its RTL counterpart.
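The growth of the adder tree with vector length can be made concrete with a small counting sketch. It assumes a balanced binary adder tree that reduces VLMAX inputs to a single sum. The exact tree structure is not stated here, so the function name and the binary-tree assumption are illustrative only.

```python
def adder_tree_size(vlmax):
    """Count adders and rows in a balanced binary tree summing vlmax inputs.

    Assumption: each row pairs up the values of the previous row, so a
    row with n values needs n // 2 adders and passes on (n + 1) // 2 values.
    """
    adders = 0
    rows = 0
    values = vlmax
    while values > 1:
        adders += values // 2       # one adder per pair in this row
        values = (values + 1) // 2  # an odd value out is forwarded unchanged
        rows += 1
    return adders, rows

for vlmax in (8, 16, 32, 64):
    adders, rows = adder_tree_size(vlmax)
    print(f"VLMAX={vlmax}: {adders} adders in {rows} rows")
```

Under this assumption the adder count is VLMAX - 1 and the row count is log2(VLMAX), so each doubling of VLMAX roughly doubles the number of adders and adds one row.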
[Figure 5-51: plot of SystemC and RTL area and power against requested frequency (MHz)]
Figure 5-51 Power and area of both design methodologies for vector length of 32.
From this study, and as illustrated in Figure 5-51, it is clear that, at the current moment in
time, SystemC is not yet a valid challenger to the dominance of RTL as the primary
design tool for hardware designers. It is my view that the RTL level of
design will eventually be overtaken by the higher, algorithmic level of SystemC or other
electronic system level (ESL) languages. By moving design from RTL to ESL,
designers will be able to concentrate more on the algorithm and less on the specific
hardware implementation.
5.5. Combining thread and data-level parallelism
Chapters 4 and 5 have discussed and demonstrated the individual power of TLP and DLP,
respectively. In section 5.5, however, the combined use of both TLP and DLP is
illustrated. Initial work on producing a thread and data parallel encoder focused on
combining each individual parallel implementation into a single application. This process
of combining parallelism was straightforward since each form focused on a different
level of the encoding algorithm and hence the two were independent of each other. The
separately modified vector and multi-processor simulation environments, each based on the
SimpleScalar toolset [18], were combined, allowing for the simulation of multiple vector
processors in a shared memory system.
[Figure 5-52: chart titled "Thread-Data-parallel MPEG-4 XViD Performance", foreman 352x244, 25 frames, 25 fps, 1 Mbit/s, Quality 2; groups for 4 contexts and 8 contexts]
Figure 5-52 Reduction in DIC count obtained with the combined TLP and DLP MPEG-4
encoders for the Foreman test sequence.
Figure 5-52 illustrates the DIC of each vector processor executing the TLP/DLP MPEG-4
encoder relative to the unmodified uni-scalar processor executing the stock encoder [19,
20]. It can be seen that the reductions obtained through each form of parallelism are
orthogonal and therefore combine to produce greater reductions in DIC than are
observed from either form of parallelism alone.
The final step in combining the TLP encoders and DLP vector datapath produced within this
work was the successful incorporation of the MPEG-4 vector accelerated datapath
into the SS_SPARC ASIC Processing Platform. As discussed in section 3.3.1, the
SS_SPARC processing platform has been designed by the ESD group at Loughborough
University to exploit three forms of parallelism: TLP, DLP and ILP. The vector datapath
described in section 5.2 was instantiated within the streaming co-processors of the
SS_SPARC, to work in tandem with each of the super-scalar processing kernels, forming
a simultaneous multithreaded (SMT) network of processing power.
[Figure 5-53, two panels: "Frequency vs Vector Length (VLMAX) vs Processor Count (CNTXT) vs Power for SS_SPARC with Vectorised MPEG-4 Co-Processor" and "Frequency vs Vector Length (VLMAX) vs Processor Count (CNTXT) vs Synthesis area for the SS_SPARC with Vectorised MPEG-4 Co-Processor"; series for 100, 125, 166.6, 200, 250 and 333 MHz; groups CNTXT 1, 2, 4 and 8 for VLMAX 16 and VLMAX 32]
Figure 5-53 Power consumption and synthesised area for the SS_SPARC with vectorised
MPEG-4 co-processor at varying operating frequencies, vector lengths of 16 and 32, and with
1, 2, 4 and 8 processors.
As can be seen from Figure 5-53, the combined SS_SPARC processor and MPEG-4
vector co-processor had the full front-end and back-end design flow applied to it, and the
power and area results were obtained for various configurations.
5.6. Conclusion
Within this chapter both DLP and ILP techniques have been examined. The first half of
the chapter was concerned with the creation of a custom vector datapath that executes
31 vector arithmetic instructions to accelerate the XviD video encoder. The datapath was
created using two different design methodologies.
Using traditional RTL-based design, the datapath was first designed and verified in
the VHDL hardware description language. Concurrently, the datapath was also
designed using the ESL language SystemC. By performing synthesis and place and
route, each datapath was evaluated for maximum operational speed, area and power
consumption. Comparing RTL with SystemC at comparable operational frequencies, it
was found that the SystemC design occupied an area 30% greater than RTL and
consumed on average approximately three times the power.
5.7. References
[1] C. Lampert, M. Militzer, P. Ross, E. Gomez, and R. Czyz, "XviD MPEG4 Core,"
www.xvid.org.
[2] V. A. Chouliaras, J. L. Nunez, K. Koutsomyti, S. R. Parr, D. J. Mulvaney, and S.
Datta, "Development of custom vector accelerator for high-performance speech
coding," IEE Electronic Letters, vol. 40, pp. 1559-1561, 2004.
[3] V. A. Chouliaras, J. L. Nunez, D. J. Mulvaney, F. S. Rovati, and D. Alfonso, "A
multi-standard video accelerator based on a vector architecture," Consumer
Electronics, IEEE Transactions on, vol. 51, pp. 160-167, 2005.
[4] V. A. Chouliaras, J. L. Nunez, F. S. Rovati, and D. Alfonso, "A multi-standard
video coding accelerator based on a vector architecture," in Consumer
Electronics, 2005. ICCE. 2005 Digest of Technical Papers. International
Conference on, 2005, pp. 135-136.
[5] V. A. Chouliaras, J. L. Nunez-Yanez, and S. Agha, "Silicon Implementation of a
Parametric Vector Datapath for Real-Time MPEG2 Encoding," in IASTED
Conference on Signal and Image Processing, 2004, pp. 98-303.
[6] V. A. Chouliaras, S. Agha, T. R. Jacobs, and V. M. Dwyer, "Quantifying the
benefit of thread and data parallelism for fast motion estimation in MPEG-2,"
IEE Electronics Letters, vol. 42, pp. 747-748, 2006.
[7] T. R. Jacobs, V. A. Chouliaras, and J. L. Nunez-Yanez, "A thread and
data-parallel MPEG-4 video encoder for a system-on-chip multiprocessor," in
Application-Specific Systems, Architecture Processors, 2005. ASAP 2005. 16th
IEEE International Conference on, Samos, Greece, 2005, pp. 405-410.
[8] A. Rose, S. Swan, J. Pierce, and J.-M. Fernandez, "Transaction Level Modeling
in SystemC," TLM Whitepaper, 2004.
[9] "SystemC Version 2.0 User's Guide," http://www.systemc.org, 2002.
[10] V. A. Chouliaras, K. Koutsomyti, T. R. Jacobs, S. Parr, D. Mulvaney, and R.
Thomson, "SystemC-defined SIMD instructions for a CMP/SMT ASIC
platform," NorCHIP 2006.
[11] "G++," in GNU Compiler Collection (GCC): Free Software Foundation.
[12] "SystemC Language Reference Manual," Open SystemC Initiative (OSCI), 2005.
[13] "Agility 1.0," Celoxica Ltd.
[14] "ModelSim 5.7g," Mentor Graphics Corp.
[15] "Advanced Logic Technology - 0.13µm," Taiwan Semiconductor Manufacturing
Company, 2006.
[16] "Design Compiler 2003.06," Synopsys Inc., 2003.
[17] "First Encounter," Cadence.
[18] "SimpleScalar," SimpleScalar LLC.
[19] V. A. Chouliaras, K. Koutsomyti, T. Jacobs, S. Parr, D. Mulvaney, and R.
Thomson, "SystemC-defined SIMD instructions for a CMP/SMT ASIC
platform," in Norchip conference in ASIC design, Proceedings of the 24th IEEE,
Linkoping, Sweden, 2006, pp. 285-288.
[20] V. A. Chouliaras, K. Koutsomyti, T. Jacobs, S. Parr, D. Mulvaney, and R.
Thomson, "SystemC-defined SIMD instructions for high performance SoC
architectures," in Electronics, Circuits and Systems, 13th IEEE International
Conference on, Nice, France, 2006.
CHAPTER 6: 
CONCLUSIONS & FUTURE WORK 
6.1. Parallelisation of Video Encoders
Within this thesis, the use of various parallel computing techniques has been examined
in conjunction with the study of state-of-the-art video compression algorithms used
within modern video compression standards. The work on parallel computing has focused
on two of the predominant forms of parallelism, thread-level parallelism (TLP) and
data-level parallelism (DLP), illustrating how both can be applied to modern embedded
systems in order to increase their video coding performance. In sections 6.1.1 and 6.1.2, a
summary of the design methodologies undertaken for each of these parallel techniques
is given, and the findings of each method are presented in section 6.2. This chapter
concludes with a look at how the knowledge obtained in this research can be further used
and continued.
6.1.1. Thread-Level Parallelism
The thread-level parallelism section of the work focused on the parallelisation of three
state-of-the-art video encoders. This was undertaken by
manually partitioning the control-flow-graph (CFG) of each of the encoders and
distributing the partitioned CFG to distinct processor contexts within a multi-processor
instruction-set simulator. Through software profiling, the computationally expensive
functions were highlighted and subsequently targeted as functions that
could potentially yield savings through the exploitation of TLP.
This targeting of potential functions focused on three separate levels of granularity within
the compressed stream structure, namely the block, the macroblock (MB) and the slice level.
The block, which was targeted by both FDCT and IDCT in MPEG-2, is an 8x8 pixel area
of the frame and is the smallest structure within the compressed bitstream. In the
thread-parallel MPEG-2 encoder, the blocks present within a given MB were distributed
amongst the available processor contexts. The MB, the unit of data targeted
in ME in MPEG-4 and MPEG-2, is a 16x16 luminance pixel area. In both
the TLP MPEG-2 and MPEG-4 encoders, the MBs within a single row of the
input frame were distributed amongst contexts, with each row being completely encoded
before the next row is started. The slice, which is targeted for H.264, is classed as a
continuous group of MBs within a given frame which can be encoded independently from
other slices within that frame. In the TLP H.264 encoder, the slice groups are sequential
groups of MBs evenly dividing up the whole frame and are independently encoded on
separate processor contexts.
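The macroblock-row and slice-level distributions described above can be sketched as two small partitioning functions. This is an illustrative Python sketch rather than the encoders' actual code: the function names, the round-robin assignment of MBs within a row, and the way a remainder is spread across slice groups are all assumptions.

```python
def assign_row_mbs(mbs_per_row, num_contexts):
    """Distribute the MBs of one row amongst contexts (round-robin assumed).

    Returns one list of MB column indices per context. A whole row is
    handed out at a time, mirroring the row-by-row encoding in the text.
    """
    return [list(range(ctx, mbs_per_row, num_contexts))
            for ctx in range(num_contexts)]

def slice_groups(total_mbs, num_contexts):
    """Split a frame's MBs into sequential, near-equal slice groups.

    Returns (first_mb, last_mb_exclusive) per context. Any remainder is
    given to the lowest-numbered contexts, one extra MB each (assumed).
    """
    base, extra = divmod(total_mbs, num_contexts)
    groups, start = [], 0
    for ctx in range(num_contexts):
        size = base + (1 if ctx < extra else 0)
        groups.append((start, start + size))
        start += size
    return groups

# A 352-pixel-wide frame holds 352 // 16 = 22 MBs per row.
print(assign_row_mbs(352 // 16, 4))
# Hypothetical frame of 100 MBs split across 8 contexts.
print(slice_groups(100, 8))
```

With 22 contexts, each context would receive exactly one MB of a 352-pixel-wide row.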
The CFG has been sliced and allocated to processors by
partitioning the high-level functional loops present within each encoder and manually
assigning data structures to each context for concurrent processing. To enable concurrent
processing of sequential loop iterations, a number of methods have been developed to
remove data dependencies between iterations, ensuring coherency of all variables across
contexts.
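As an illustration of removing a dependency between loop iterations, the sketch below privatises a running total so that each context accumulates into its own slot, and the partial results are combined after a barrier. This is one common technique, shown under assumptions: the text does not specify which methods were used, and the names (NUM_CONTEXTS, worker, partial) are hypothetical. Python threads stand in for processor contexts.

```python
import threading

NUM_CONTEXTS = 4                      # assumed known before the run starts
data = list(range(1, 101))            # stand-in for per-iteration work items
partial = [0] * NUM_CONTEXTS          # one private accumulator per context
result = []                           # filled in by context 0 only
barrier = threading.Barrier(NUM_CONTEXTS)

def worker(ctx_id):
    # Each context handles iterations ctx_id, ctx_id + N, ctx_id + 2N, ...
    # and writes only to its own accumulator, so no iteration depends
    # on a value produced by another context.
    for i in range(ctx_id, len(data), NUM_CONTEXTS):
        partial[ctx_id] += data[i]
    barrier.wait()                    # all partial sums are now complete
    if ctx_id == 0:
        result.append(sum(partial))   # sequential combine step

threads = [threading.Thread(target=worker, args=(c,)) for c in range(NUM_CONTEXTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(result[0])
```

A sequential version of this loop would carry the total from one iteration to the next. Giving each context a private accumulator removes that dependency at the cost of a short combine step after the barrier.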
6.1.2. Data-Level Parallelism
The data-level parallelism aspect of this work has focused on the SIMD infrastructure
required to support the execution of custom vector instructions. This continues the work
of the ESD group at Loughborough University [1-6], where computationally expensive
audio and video algorithms have been vectorised and simulated for functional accuracy.
Through creating functional models of that vector hardware, custom vector instructions
were produced to perform the required arithmetic operations. Using this vectorised
programmer's model, substantial optimisation of the performance metrics of these
algorithms has been achieved.
After successfully validating the model at the functional level, an accurate representation
of the vector instructions and the hardware that executes them was produced. The process
of producing a cycle-accurate representation of that model was undertaken using two
different methodologies in tandem. Firstly, following the traditional RTL-based design
methodology, a two-stage datapath was designed in VHDL. Through the creation of
functional logic blocks that perform arithmetic operations on a small subsection of the
input vector operands, and the instantiation of multiple instances of these, a highly
configurable vector datapath was implemented. This configurable vector datapath was
instantiated in the exposed vector unit of the SS_SPARC processor.
Designed in tandem with, but independently of, the RTL-based model was an ESL-based
datapath. This datapath was implemented in SystemC and adheres to the same interface
and two-stage design as the RTL-based model. The procedure of modelling a concurrent
hardware entity in a sequential programming language resulted in each cycle being
divided into two distinct functional blocks. These were the transient processes, where
internal combinatorial computations are performed, and the clock process, where these
newly calculated values are driven to the appropriate state-holding elements.
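The split between transient and clock processes can be illustrated with a small two-phase model. The sketch below is plain Python rather than SystemC, and the accumulator it models, along with the class and method names, is hypothetical. It only shows the pattern of computing next-state values combinationally and then committing them on the clock.

```python
class TwoStageAccumulator:
    """Toy two-stage pipeline: stage 1 registers the input, stage 2 accumulates."""

    def __init__(self):
        # State-holding elements (registers).
        self.stage1 = 0
        self.total = 0
        # Next-state values written by the transient process.
        self._next_stage1 = 0
        self._next_total = 0

    def transient_process(self, data_in):
        # Combinational logic: reads current register values and inputs,
        # and writes only the next-state variables.
        self._next_stage1 = data_in
        self._next_total = self.total + self.stage1

    def clock_process(self):
        # Clock edge: drive the newly calculated values into the registers.
        self.stage1 = self._next_stage1
        self.total = self._next_total

    def cycle(self, data_in):
        self.transient_process(data_in)
        self.clock_process()

dut = TwoStageAccumulator()
for value in (1, 2, 3, 0):
    dut.cycle(value)
print(dut.total)
```

In this toy model the transient process never writes a register directly, so every register update within a cycle is based on the values held before the clock process runs.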
6.2. Experimentation Findings
During this work several key points on the use of parallel computing techniques for
accelerating multimedia workloads have been illustrated. It has been shown that through
exploiting various forms of parallelism it is possible to significantly reduce the
computational workload of the processing elements within a multi-processor system,
compared to a scalar uni-processor environment.
6.2.1. Thread-Level Parallelism 
The results obtained through exploiting TLP on each of the video encoders showed
dramatic reductions in the dynamic instruction count metric per processor context. A
configurable, extensible PRAM multi-processor simulator known as sim-system was used
as the TLP and DLP test environment. This simulator allows for the execution of
specially modified source code within a configurable multi-processor environment with
the production of system statistics, including the metric used within this work, the
dynamic instruction count (DIC). By using this measure for each available processor
context it is shown how, through exploiting TLP, the overall workload of the system is
evenly distributed amongst contexts and hence the per-context workload is reduced.
6.2.1.1. MPEG-2
To evaluate how successful the threading process has been in reducing the per-processor
DIC for the MPEG-2 encoder, a number of test sequences were encoded. These
were encoded at various 'quality' settings whilst changing the number of
available processor contexts present within the simulated environment. The 'quality'
setting chosen for experimentation in MPEG-2 was the search range used in the
full-search implementation of ME. Using the results presented in section 4.5.2, it can be seen
that there is a substantial reduction of the relative DIC per processor for all search
ranges. For the smallest search range evaluated (3x3), the relative DIC
per processor in a 32-processor system was reduced to ~42% of that of the
single-threaded uni-processor system. When the search range was increased to 40x40 pixels,
the relative DIC dropped further to 8%.
6.2.1.2. MPEG-4 
As with MPEG-2, the MPEG-4 encoder's control flow graph (CFG) was modified to
expose its inherent TLP. After modification and verification, the TLP MPEG-4 encoder
was subjected to multiple test input sequences at a variety of quality settings, and with
varying numbers of available processors. These quality settings represent encoder
features that are selectively made available to the encoding engine in order to decrease
the bitrate of a given sequence whilst maintaining a constant visual quality. As with
MPEG-2, the TLP MPEG-4 encoder produced dramatic reductions in relative DIC per
processor when compared to the single-threaded uni-processor system. Within a
22-processor system, this relative DIC was reduced to 12% of that of a uni-processor system.
6.2.1.3. H.264 
The exploitation of TLP within the H.264 encoder focused on reducing the DIC
metric over multiple quality settings and numbers of processor contexts. The quality
setting chosen for experimentation in the TLP H.264 encoder was the number of
quantisation steps. These ranged between 20 and 28 and specify the number of distinct
values available to represent the transformed pixel data. The relative DIC metric was
reduced to 22% with a four-processor system and was further reduced to 8% when the
number of available processors increased to sixteen.
6.2.2. Data-Level Parallelism
Chapter 5 discusses the exploitation of an equally important form of parallelism, data-level
parallelism. In this case two different design methodologies were used to produce vector
datapaths that implement 31 extended vector instructions for accelerating the MPEG-4
video coding standard. The aim of that work on DLP was twofold: firstly, to
demonstrate that these instructions can be easily designed and synthesised into a vector
datapath and taken through a full design flow; secondly, to perform a comparative study
between the two design methodologies, conventional RTL VHDL-based and ESL-based.
The vector datapath that was designed using traditional RTL was capable of reaching
speeds of over 400 MHz while consuming less than 20 mW for a vector length of 32. This
is in comparison to the SystemC-based design, which consumed 50 mW at half that
frequency, 200 MHz. On average the SystemC design consumed
three-fold the power of the RTL-based design. Similarly, when examining the silicon area
required for each design, the RTL-based design required 0.4 mm² compared to 0.6 mm² for
the SystemC-based design for a vector length of 32 and at a frequency of 200 MHz,
demonstrating the average 30% extra area required by the SystemC-based design.
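The area figures quoted above can be related to the stated percentage with a short calculation. The sketch assumes the 0.4 and 0.6 mm² values are rounded readings for a vector length of 32 at 200 MHz. The variable names are illustrative.

```python
rtl_area_mm2 = 0.4       # RTL-based design, VLMAX 32, 200 MHz (from the text)
systemc_area_mm2 = 0.6   # SystemC-based design, same configuration

# Extra area of SystemC expressed relative to the RTL design.
extra_vs_rtl = (systemc_area_mm2 - rtl_area_mm2) / rtl_area_mm2
# The same difference expressed relative to the SystemC design.
saving_vs_systemc = (systemc_area_mm2 - rtl_area_mm2) / systemc_area_mm2

print(f"SystemC is {extra_vs_rtl:.0%} larger than RTL")
print(f"RTL is {saving_vs_systemc:.0%} smaller than SystemC")
```

For this single rounded data point the difference is about a third when measured against the SystemC area. The 30% figure in the text is an average over the configurations tested, so it need not match any one data point exactly.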
6.3. Future work 
This thesis has demonstrated that through exploiting TLP there is significant potential for
accelerating state-of-the-art video encoders. This potential, as illustrated by the
experimental results, has been quantified by porting each of the TLP encoders onto a
simulated multi-processor environment that maps closer to a final SoC architecture.
Through undertaking further research, practical issues such as memory management,
and the trade-off between the performance gains obtained from increasing the number of
processors and vector length and the power/area considerations of the system, would be
examined. Through the use of a simple yet concise parallel API consisting of processor
identification, a barrier mechanism and compile-time knowledge of the number of
processors within the system, the design effort of porting each of the encoders to a
different simulation environment was kept to a minimum.
In addition to the work on TLP, the DLP section demonstrated the practical feasibility
and speed/power/area characteristics of custom vector accelerators for video compression
designed with two methodologies. The datapaths could be either implemented as a single
vector datapath processor, such as the early CRAY [7] super-computers, or incorporated
into an existing scalar processor by acting as a co-processor, as in ARM's NEON
multimedia co-processor [8]. The next step within the ongoing research into parallelism
and acceleration of video encoders was achieved with the successful incorporation of
the MPEG-4 vector datapath into the SS_SPARC ASIC Processing Platform. Here the
datapath was introduced within the streaming co-processors of the SS_SPARC, to work
in cooperation with the super-scalar processing kernels, forming a simultaneous
multithreaded (SMT) network of processing power.
Whether through exploiting thread-level techniques to distribute the CFG of state-of-the-art
video encoders to a number of processing units, or through exploiting data-level techniques
to create vector instructions and implement a vector architecture, this thesis has
demonstrated the substantial potential performance improvement that can be achieved
when applying these parallel computing methods to computationally expensive video
encoders.
6.4. References
[1] V. A. Chouliaras, J. L. Nunez, K. Koutsomyti, S. R. Parr, D. J. Mulvaney, and S.
Datta, "Development of custom vector accelerator for high-performance speech
coding," IEE Electronic Letters, vol. 40, pp. 1559-1561, 2004.
[2] V. A. Chouliaras, J. L. Nunez, D. J. Mulvaney, F. S. Rovati, and D. Alfonso, "A
multi-standard video accelerator based on a vector architecture," Consumer
Electronics, IEEE Transactions on, vol. 51, pp. 160-167, 2005.
[3] V. A. Chouliaras, J. L. Nunez, F. S. Rovati, and D. Alfonso, "A multi-standard
video coding accelerator based on a vector architecture," in Consumer
Electronics, 2005. ICCE. 2005 Digest of Technical Papers. International
Conference on, 2005, pp. 135-136.
[4] V. A. Chouliaras, J. L. Nunez-Yanez, and S. Agha, "Silicon Implementation of a
Parametric Vector Datapath for Real-Time MPEG2 Encoding," in IASTED
Conference on Signal and Image Processing, 2004, pp. 98-303.
[5] V. A. Chouliaras, S. Agha, T. R. Jacobs, and V. M. Dwyer, "Quantifying the
benefit of thread and data parallelism for fast motion estimation in MPEG-2,"
IEE Electronics Letters, vol. 42, pp. 747-748, 2006.
[6] T. R. Jacobs, V. A. Chouliaras, and J. L. Nunez-Yanez, "A thread and
data-parallel MPEG-4 video encoder for a system-on-chip multiprocessor," in
Application-Specific Systems, Architecture Processors, 2005. ASAP 2005. 16th
IEEE International Conference on, Samos, Greece, 2005, pp. 405-410.
[7] R. M. Russell, "The CRAY-1 computer system," Communications of the ACM,
vol. 21, pp. 63-72, 1978.
[8] "ARM Architecture Reference Manual - ARMv7-A and ARMv7-R edition,"
ARM DDI 0406A, ARM Ltd., 2007.
BIBLIOGRAPHY 
(1997) lnte l Architecture Optimizati on Manual. Tnte l Corp. , 242816-003. 
(1998) 3D Now! Technica l Manual. 2 1928G/O, Advanced Micro Devices. Inc. 
(1998) Informati on Technology - Portable Ope rating System Interface (POSIX). IEEE 
1003. 
(1999) AltiVec Technology Programming Interface Manual. Freescale Semiconductors, 
ALTIVECPJM/D. 
(1999) AMBATM Specification (Rev 2.0). ARM Ltd., ARM IJ-!1 00 1 lA. 
( 1999) AMD Extensions to the 3DNow! and MMX Instruction Sets Manual. 224660/0, Advanced 
Micro Devices, Inc. 
(2000) AMD-K6-2 Processor Data Sheet. 218501/0, Advanced Micro Devices, Inc. 
(2000) Enhanced 3DNow!TM Technology for the AMD AthlonTM Processor. Advanced Micro 
Devices. lNC. 
(2000) Informational technology-- Coding of audio-visual objects. ISO/IEC 14496. 
(2001) Desktop Performance and Opti mization for Inte l Pentium 4 Processor. Intel Corp. , 
249438-01 . 
(2002) Advanced Video Coding. ITU-T Rec. H.264 I ISO/IEC 
(2002) Portable Operating System Interface for Computer E nvironments. IEEE J 003. l. 
(2002) SystemC Version 2.0 User's Guide. http://www.systemc.org. 
(2003) The AMD AthlonTM XP Processor with 512KB L2 Cache. Ad vanced Micro 
Devices, INC. 
202 
~B~ib~l~io~r ~~~~v~--------------------------------------------------------203 
(2003) I nformational techno logy -- Portable Operating System Interface (POSlX). ISO/lEC 9945. 
(2003) Standardizatio n o f Gro up 3 facsimile terminals fo r document transmission. Internat io nal 
Telecommunicatio n Union. 
(2004) AltiVec Technology. Freescale Semiconductor ALTIVECFACT Rev.4. 
(2004) AMBA™ 3 AXI System Compo nents Data Sheet. ARM Ltd .. ARM DOl 0 194-l/09.04. 
(2004) Digital Video Broadcast ing (DVB); Framing structure, channel coding and modulation for 
digi ta l terrestrial television. European Telecommunications Standards Institute. 
(2004) Informatio n techno logy-- Coding of audio-visual objects-- Part 2: Visual. lSO/IEC 14496-
2. 
(2004) NEON Technology Data Sheet. ARM Ltd., ARM DOl 0192-1109.04(5). 
(2005) Advanced video coding for generic audiovisual services. 1TU-T Recommendation H.264. 
(2005) Archi tecture and Implementa tion of the ARM® CortexTM_A8 Microprocessor. ARM Ltd., 
White paper. 
(2005) Information techno logy -- Coding o f audio-visual objects - Part lO: Advanced Video 
Coding. ISO/IEC 14496-10. 
(2005) SystemC Language Reference Manual. Open SystemC Initiat ive (OSCI). 
(2006) ARM1176JZF-Sn1 Revision: r0p2 Technical Reference Manual. ARM Ltd. , ARM DDI 
030lD. 
(2006) CortexTM_A8 Revision: rlpO Technical Reference Manual. ARM Ltd., ARM DDI 0344A. 
(2007) ARM Architecture Reference Manual - ARM v7-A and ARMv7-R edit ion. ARM Ltd ., 
ARM DD! 0406A. 
ACOST A, C., FALCON, A., RAMIREZ, A. & V ALERO, M. (2005) A Complexity-Effective 
Sim ultaneous Multithreading Architecture. Parallel Processing, 2005. ICPP 2005. lmemarional 
Conference on. 
~B~ib~l~io~,~~~~~v~--------------------------------------------------------204 
AGARWAL, A., S1MONI, R., HENNESSY, J. & HOROWITZ, M. (1988) An evaluation of 
directory schemes for cache coherence. CompuTer ArchiTecture, I 988. Conference Proceedings. 
I 5th Annual lmemaTiona/ Symposium on. 
AHMAD. 1. , HE, Y. & LIOU, M. L. (2002) Video compression with parallel processing. Parallel 
compwing in image and video processing, 28, I 039 - 1078 
AI-IMED, N .. NATARAJAN , T. & RAO, K. R. (1974) Discrete Cosine Transform. Compwers, 
IEEE TransacTions on, C-23, 90- 93. 
AIMAR, L., MERR1TT, L., PETIT. E., CHEN, M., CLAY, J., RULLGAAD, M., CZYZ. R. , 
HElNE, C., IZVORSKI, A. & WRIGHT ., A. (2006) x264. http://developers.videolan.org. 
AKRAMULLAH, S. M. , AHMAD, 1. & LIOU, M. L. (1995) A data-parallel approach for real-
time MPEG-2 video encoding. Jouma/ of Parallel and DisTribuTed CompuTing, 30, 129- 146 
BABU, D., BABU, M. 1. , SARAVANA. M. , GOVINDAN, S. & PARTHASARATHI, R. (2001) 
Functional Unit Usage Based Thread Selection in a Simultaneous Multithreaded Processor. High 
Pe1jonnance Compllling, InTernational Conference On 
BARNEY, B. (2006) Introduction to OpenMP. Livermore Computing. 
BATEMAN, A. & PATERSON-STEPHENS, I. (200 1) The DSP handbook algorirluns, 
applications and design Techniques, P rentice Hall. 
BELLARD, F. (2006) FFMPEG. http://ffmpeg.mplayerhq.hu. 
BERNE, R. M. & LEVY, M. N. (1996) Principles of Physiology, Mosby Publishers.
BILAS, A., FRITTS, J. & SINGH, J. P. (1997) Real-time parallel MPEG-2 decoding in software.
BOZOKI, S., WESTEN, S. J. P., LAGENDIJK, R. L. & BIEMOND, J. (1996) Parallel algorithms
for MPEG video compression with PVM. EUROSIM 1996.
BRASH, D. (2002) The ARM Architecture Version 6 (ARMv6). ARM Ltd. 
CARDINAL, J. (2001) Fast fractal compression of greyscale images. Image Processing, IEEE
Transactions on, 10, 159-164.
CASCAVAL, C., CASTANOS, J. G., CEZE, L., DENNEAU, M., GUPTA, M., LIEBER, D.,
MOREIRA, J. E., STRAUSS, K. & WARREN, H. S., JR. (2002) Evaluation of a multithreaded
architecture for cellular computing. High-Performance Computer Architecture, 2002.
Proceedings. Eighth International Symposium on.
CHANDRA, R., DAGUM, L., KOHR, D., MAYDAN, D., MCDONALD, J. & MENON, R.
(2001) Parallel Programming in OpenMP.
CHANG-SU, K., RIN-CHUL, K. & SANG-UK, L. (1998) Fractal coding of video sequence using
circular prediction mapping and noncontractive interframe mapping. Image Processing, IEEE
Transactions on, 7, 601-605.
CHOULIARAS, V. A., AGHA, S., JACOBS, T. R. & DWYER, V. M. (2006) Quantifying the
benefit of thread and data parallelism for fast motion estimation in MPEG-2. IEE Electronics
Letters, 42, 747-748.
CHOULIARAS, V. A., KOUTSOMYTI, K., JACOBS, T., PARR, S., MULVANEY, D. &
THOMSON, R. (2006) SystemC-defined SIMD instructions for a CMP/SMT ASIC platform.
Norchip conference in ASIC design, Proceedings of the 24th IEEE Linkoping, Sweden.
CHOULIARAS, V. A., KOUTSOMYTI, K., JACOBS, T., PARR, S., MULVANEY, D. &
THOMSON, R. (2006) SystemC-defined SIMD instructions for high performance SoC
architectures. Electronics, Circuits and Systems. 13th IEEE International Conference on. Nice,
France.
CHOULIARAS, V. A., NUNEZ, J. L., KOUTSOMYTI, K., PARR, S. R., MULVANEY, D. J. &
DATTA, S. (2004) Development of custom vector accelerator for high-performance speech
coding. IEE Electronic Letters, 40, 1559-1561.
CHOULIARAS, V. A., NUNEZ, J. L., MULVANEY, D. J., ROVATI, F. S. & ALFONSO, D.
(2005) A multi-standard video accelerator based on a vector architecture. Consumer Electronics,
IEEE Transactions on, 51, 160-167.
CHOULIARAS, V. A., NUNEZ, J. L., ROVATI, F. S. & ALFONSO, D. (2005) A multi-standard
video coding accelerator based on a vector architecture. Consumer Electronics, 2005. ICCE. 2005
Digest of Technical Papers. International Conference on.
CHOULIARAS, V. A., NUNEZ-YANEZ, J. L. & AGHA, S. (2004) Silicon Implementation of a
Parametric Vector Datapath for Real-Time MPEG2 Encoding. IASTED Conference on Signal and
Image Processing.
CHUNG, Y., PARK, K., HAHN, W., PARK, N. & PRASANNA, V. K. (2000) Performance of
On-Chip Multiprocessors for Vision Tasks. 15 IPDPS 2000 Workshops on Parallel and
Distributed Processing, Proceedings of the
COTE, G., EROL, B., GALLANT, M. & KOSSENTINI, F. (1998) H.263+: video coding at low
bit rates. Circuits and Systems for Video Technology, IEEE Transactions on, 8, 849-866.
COTE, G. & WINGER, L. (2002) Recent Advances in Video Compression Standards. IEE
Canadian Review, 21-24.
CRAWFORD, J. (1990) The execution pipeline of the Intel i486 CPU. Compcon Spring '90.
'Intellectual Leverage', Thirty-Fifth IEEE Computer Society International Conference.
CULLER, D. E., SINGH, J. P. & GUPTA, A. (1998) Parallel Computer Architecture: A
Hardware/Software Approach.
EKMAN, M., DAHLGREN, F. & STENSTRÖM, P. (2002) Evaluation of Snoop-Energy
Reduction Techniques for Chip-Multiprocessors. Workshop on Duplicating, Deconstructing, and
Debunking WDDD-1, In Proceedings of
ERCAL, F., ALLEN, M. & FENG, H. (2000) A systolic image difference algorithm for
RLE-compressed images. Parallel and Distributed Systems, IEEE Transactions on, 11, 433-443.
FAIRCHILD, M. D. (2005) Color appearance models, Chichester, John Wiley.
FARRENS, M. K. & PLESZKUN, A. R. (1991) Strategies for Achieving Improved Processor
Throughput. Computer Architecture, 1991. The 18th Annual International Symposium on.
FLYNN, M. J. (1972) Some Computer Organizations and Their Effectiveness. Computers, IEEE
Transactions on, C-21, 948-960.
FREEK, C., SOUSA, J. M. M., HENTSCHEL, W. & MERZKIRCH, W. (1999) On the accuracy
of a MJPEG-based digital image compression PIV-system. Experiments in Fluids, 27, 310-320.
FULLER, S. (1999) Motorola's AltiVec™ Technology. Motorola Inc., ALTIVECWP/D.
GHANBARI, M. (1990) The cross-search algorithm for motion estimation. Communications, IEEE
Transactions on, 38, 950-953.
GSCHWIND, M. & MAURER, D. (1996) An extendible MIPS-I processor kernel in VHDL for
hardware/software co-design. European Design Automation Conference. Geneva.
HAMMOND, L., HUBBERT, B. A., SIU, M., PRABHU, M. K., CHEN, M. & OLUKOTUN, K.
(2000) The Stanford Hydra CMP. Micro, IEEE, 20, 71-84.
HART, J. C. (1996) Fractal image compression and recurrent iterated function systems. Computer
Graphics and Applications, IEEE, 16, 25-33.
HENNESSY, J. L. & PATTERSON, D. A. (2003) Computer architecture a quantitative
approach.
HUFFMAN, D. A. (1952) A Method for the Construction of Minimum-Redundancy Codes.
Proceedings of the IRE, 40, 1098-1101.
JACOBS, T. R., CHOULIARAS, V. A. & NUNEZ-YANEZ, J. L. (2005) A thread and
data-parallel MPEG-4 video encoder for a system-on-chip multiprocessor. Application-Specific
Systems, Architecture Processors, 2005. ASAP 2005. 16th IEEE International Conference on.
Samos, Greece.
JACQUIN, A. E. (1993) Fractal image coding: a review. Proceedings of the IEEE, 81, 1451-1465.
JAIN, J. & JAIN, A. (1981) Displacement Measurement and Its Application in Interframe Image
Coding. Communications, IEEE Transactions on, 29, 1799-1808.
JIN-SOO, K., SOONHOI, H. & CHU SHIK, J. (1997) Evaluation of various node configurations
for fine-grain multithreading on stock processors. HPC Asia '97. High Performance Computing on
the Information Superhighway.
KARCZEWICZ, M. & KURCEREN, R. (2003) The SP- and SI-frames design for H.264/AVC.
Circuits and Systems for Video Technology, IEEE Transactions on, 13, 637-644.
KASSIM, A. A., PINGKUN, Y., WEI SIONG, L. & SENGUPTA, K. (2005) Motion compensated
lossy-to-lossless compression of 4-D medical images using integer wavelet transforms.
Information Technology in Biomedicine, IEEE Transactions on, 9, 132-138.
KATHAIL, V., SCHLANSKER, M. S. & RAU, B. R. (2000) HPL-PD Architecture Specification:
Version 1.1. Hewlett Packard.
KOUTSOMYTI, K., PARR, S. R., CHOULIARAS, V. A. & NUNEZ, J. (2005) Applying
Data-Parallel and Scalar Optimizations for the efficient implementation of the G.729A and G.723.1
Speech Coding Standards. Signal and Image Processing, Seventh IASTED International
Conference on (SIP 2005). Honolulu, Hawaii.
LAMPERT, C., MILITZER, M., ROSS, P., GOMEZ, E. & CZYZ, R. (2006) XviD MPEG4 Core.
www.xvid.org.
LE GALL, D. (1991) MPEG: A video compression standard for multimedia applications.
Communication of the ACM, 34, 46-58.
LEE, R. (2000) Subword permutation instructions for two-dimensional multimedia processing in
MicroSIMD architectures. Application-Specific Systems, Architectures, and Processors.
LENOSKI, D., LAUDON, J., GHARACHORLOO, K., GUPTA, A. & HENNESSY, J. (1990) The
directory-based cache coherence protocol for the DASH multiprocessor. Computer Architecture,
1990. Proceedings. 17th Annual International Symposium on.
LI, R., ZENG, B. & LIOU, M. L. (1994) A new three-step search algorithm for block motion
estimation. Circuits and Systems for Video Technology, IEEE Transactions on, 4, 438-442.
LU, G. & YEW, T. L. (1994) Image compression using partitioned iterated function systems.
Image and Video Compression, Proceedings of SPIE.
LU, N. (1997) Fractal imaging, San Diego; London, Academic Press.
LUTHER, A. C. (1997) Principles of digital audio and video, Norwood, Mass.; London, Artech
House.
MADON, D., SANCHEZ, E. & MONNIER, S. (1999) A Study of a Simultaneous Multithreaded
Processor Implementation. Proceedings of the 5th International Euro-Par Conference on Parallel
Processing.
MAHLKE, S. A., CHEN, W. Y., GYLLENHAAL, J. C., HWU, W. W., CHANG, P. P. &
KIYOHARA, T. (1992) Compiler code transformations for superscalar-based high-performance
systems.
MARPE, D., SCHWARZ, H. & WIEGAND, T. (2003) Context-based adaptive binary arithmetic
coding in the H.264/AVC video compression standard. Circuits and Systems for Video
Technology, IEEE Transactions on, 13, 620-636.
MARTINEZ, J. F. & TORRELLAS, J. (2002) Speculative synchronization: applying thread-level
speculation to explicitly parallel applications. 10th international conference on Architectural
support for programming languages and operating systems. San Jose, California.
MARTUCCI, S. A. (1994) Symmetric convolution and the discrete sine and cosine transforms.
Signal Processing, IEEE Transactions on, 42, 1038-1051.
MCNAIRY, C. & BHATIA, R. (2005) Montecito: a dual-core, dual-thread Itanium processor.
Micro, IEEE, 25, 10-20.
MICHAEL, M. M. & NANDA, A. K. (1999) Design and performance of directory caches for
scalable shared memory multiprocessors. High-Performance Computer Architecture, 1999.
Proceedings. Fifth International Symposium On.
MORAD, T. Y., WEISER, U. C., KOLODNY, A., VALERO, M. & AYGUADE, E. (2006)
Performance, power efficiency and scalability of asymmetric cluster chip multiprocessors.
Computer Architecture Letters, IEEE, 5, 14-17.
MOUDGILL, M., PINGALI, K. & VASSILIADIS, S. (1993) Register renaming and dynamic
speculation: an alternative approach.
MUKHERJEE, J., KUMAR, P. & GHOSH, S. K. (2000) A graph-theoretic approach for studying
the convergence of fractal encoding algorithm. Image Processing, IEEE Transactions on, 9,
366-377.
NAKADA, A., SHIBATA, T., KONDA, M., MORIMOTO, T. & OHMI, T. (1999) A fully
parallel vector-quantization processor for real-time motion-picture compression. Solid-State
Circuits, IEEE Journal of, 34, 822-830.
NASRABADI, N. M. & KING, R. A. (1988) Image coding using vector quantization: a review.
Communications, IEEE Transactions on, 36, 957-971.
NUNEZ, J. L. & CHOULIARAS, V. A. (2005) High-performance arithmetic coding VLSI macro
for the H264 video compression standard. Consumer Electronics, IEEE Transactions on, 51,
144-151.
NUNEZ-YANEZ, J. L., CHOULIARAS, V. A., ALFONSO, D. & ROVATI, F. S. (2006) Hardware
Assisted Rate Distortion Optimization with Embedded CABAC Accelerator for the H.264
Advanced Video Codec. Consumer Electronics, IEEE Transactions on, 52, 590-597.
OEHRING, H., SIGMUND, U. & UNGERER, T. (1999) Simultaneous Multithreading and
Multimedia. Multi-Threaded Execution, Architecture and Compilation (MTEAC 99), Workshop on
OLUKOTUN, K., NAYFEH, B. A., HAMMOND, L., WILSON, K. & CHANG, K. (1996) The
Case for a Single-Chip Multiprocessor. Architectural support for programming languages and
operating systems. ASPLOS-VII: Proceedings of the seventh international conference on
PK, A., H. K. & R. P. (1987) Identification of a subtype of cone photoreceptor, likely to be blue
sensitive, in the human retina. The Journal of comparative neurology, 225, 18-34.
PO, L.-M. & MA, W.-C. (1996) A novel four step search algorithm for fast block matching.
Circuits and Systems for Video Technology, IEEE Transactions on, 6, 313-317.
PURI, A., HANG, H.-M. & SCHILLING, D. L. (1987) An efficient block-matching algorithm for
motion compensated coding. Acoustics, Speech, and Signal, IEEE International Conference on.
RICHARDSON, I. E. G. (2002) H.264 White Papers. Video & Image Compression Resources and
Research.
RIOUL, O. & VETTERLI, M. (1991) Wavelets and signal processing. Signal Processing
Magazine, IEEE, 8, 14-38.
ROSE, A., SWAN, S., PIERCE, J. & FERNANDEZ, J.-M. (2004) Transaction Level Modeling in
SystemC. TLM Whitepaper.
ROTA, J. (1998) Divx ;) 3.11a.
RUSS, J. C. (2002) The image processing handbook, Boca Raton, Fla.; London, CRC.
RUSSELL, R. M. (1978) The CRAY-1 computer system. Communication of the ACM, 21, 63-72.
SALDANHA, C. & LIPASTI, M. H. (2001) Power Efficient Cache Coherence. Workshop on
Memory Performance Issues.
SALOMON, D. (2000) Data compression: the complete reference, New York; London, Springer.
SANKARALINGAM, K., KECKLER, S. W., MARK, W. R. & BURGER, D. (2003) Universal
mechanisms for data-parallel architectures. 36th Annual IEEE/ACM International Symposium on
Microarchitecture. IEEE Computer Society.
SANKARALINGAM, K., NAGARAJAN, R., HAIMING, L., CHANGKYU, K., JAEHYUK, H.,
BURGER, D., KECKLER, S. W. & MOORE, C. (2003) Exploiting ILP, TLP, and DLP with the
polymorphous trips architecture. Micro, IEEE, 23, 46-51.
SCHMITZ, B. E. & STEVENSON, R. L. (1995) Enhancement of sub-sampled chrominance image
data. 38th Midwest Symposium on Circuits and Systems.
SEAL, D. (2000) ARM Architecture Reference Manual. 2nd edition. Addison-Wesley Longman
Publishing Co., Inc.
SHAN, Z. & KAI-KUANG, M. (2000) A new diamond search algorithm for fast block-matching
motion estimation. Image Processing, IEEE Transactions on, 9, 287-290.
SHEN, K., ROWE, L. A. & DELP, E. J. (2005) A parallel implementation of an MPEG1 encoder:
Faster than real-time! Digital Video Compression: Algorithms and Technologies, Proceedings of
SPIE Conference on. San Jose.
SOLARI, S. J. (1997) Digital video and audio compression, New York, McGraw-Hill.
SORIN, D. J., PLAKAL, M., CONDON, A. E., HILL, M. D., MARTIN, M. M. K. & WOOD, D.
A. (2002) Specifying and verifying a broadcast and a multicast snooping cache coherence
protocol. Parallel and Distributed Systems, IEEE Transactions on, 13, 556-578.
SPRACKLEN, L. & ABRAHAM, S. G. (2005) Chip Multithreading: Opportunities and
Challenges. High-Performance Computer Architecture, Proceedings of the 11th International
Symposium on.
SRINIVASAN, R. & RAO, K. (1985) Predictive Coding Based on Efficient Motion Estimation.
Communications, IEEE Transactions on, 33, 888-896.
STOLLNITZ, E. J., DEROSE, A. D. & SALESIN, D. H. (1995) Wavelets for computer graphics:
a primer. Computer Graphics and Applications, IEEE, 15, 76-84.
TUDOR, P. N. (1995) MPEG-2 video compression. IEE Electronics & Communication
Engineering Journal, 7, 257-264.
TULLSEN, D. M., EGGERS, S. J. & LEVY, H. M. (1995) Simultaneous multithreading:
Maximizing on-chip parallelism. Computer Architecture, 1995. Proceedings. 22nd Annual
International Symposium on.
UNGERER, T., ROBIC, B. & SILC, J. (2002) Multithreaded Processors. The Computer Journal,
45.
VAN DER PAS, R. (2005) An Introduction into OpenMP. First International Workshop on
OpenMP, IWOMP 2005. Eugene, Oregon USA.
WATKINSON, J. (1996) The Engineer's guide to compression, Petersfield, Snell & Wilcox.
WELCH, T. A. (1984) A Technique for High-Performance Data Compression. IEEE Computer,
17, 8-19.
WIEGAND, T., SULLIVAN, G. J., BJONTEGAARD, G. & LUTHRA, A. (2003) Overview of the
H.264/AVC video coding standard. Circuits and Systems for Video Technology, IEEE
Transactions on, 13, 560-576.
WOHLBERG, B. & DE JAGER, G. (1999) A review of the fractal image coding literature. Image
Processing, IEEE Transactions on, 8, 1716-1729.
ZAHARIADIS, T. & KALIVAS, D. (1996) Fast algorithms for the estimation of block motion
vectors. Electronics, Circuits, and Systems ICECS '96., Third IEEE International Conference on
ZIV, J. (1978) Coding theorems for individual sequences. Information Theory, IEEE Transactions
on, 24, 405-412.
ZIV, J. & LEMPEL, A. (1977) A Universal Algorithm for Sequential Data Compression.
Information Theory, IEEE Transactions on, 23, 337-343.
ZIV, J. & LEMPEL, A. (1978) Compression of individual sequences via variable-rate coding.
Information Theory, IEEE Transactions on, 24, 530-536.
ZUO-DIAN, C., RUEY-FENG, C. & WEN-JIA, K. (1999) Adaptive predictive multiplicative
autoregressive model for medical image compression. Medical Imaging, IEEE Transactions on,
18, 181-184.
LIST OF PUBLICATIONS 
T.R. Jacobs, V.A. Chouliaras, D.J. Mulvaney, and J.L. Nunez, "The Application of
Thread-Level Parallelism to the Architectural Complexity Reduction of an MPEG2
Encoder", IEE ACM SoC Design, Test and Technology Postgraduate Seminar, IEE,
Loughborough University, 15th September 2004, 49-52, ISBN 0 86341 460 5
T.R. Jacobs, V.A. Chouliaras, and D.J. Mulvaney, "Investigation of Thread-Level
Parallelism in the Architectural Complexity Reduction of MPEG2, XVID Video
Encoders", Postgraduate Research conference in Electronics, Photonics, Communication
and Networks and Computing Science, Lancaster University, March 2005, 140-141.
A.K. Kumaraswamy, V.A. Chouliaras, T.R. Jacobs, and J.L. Nunez-Yanez, "System-on-Chip
Design Framework (SDF) Unifying Specification Capture and Design Modeling",
2005 Electronic Design Processes (EDP) Workshop, Monterey CA USA, April 2005
V.A. Chouliaras, T.R. Jacobs, A.K. Kumaraswamy, and J.L. Nunez-Yanez, "Configurable
Multiprocessors for High-Performance MPEG-4 Video Coding", IEEE Computer Society
Annual Symposium on VLSI: New Frontiers in VLSI Design (ISVLSI '05), IEEE,
Tampa, Florida, USA, May 2005, 274-275, ISBN 0 7695 2365 X
T.R. Jacobs, V.A. Chouliaras, and J.L. Nunez, "A Thread and Data-Parallel MPEG-4
Video Encoder for a System-On-Chip Multiprocessor", IEEE Application Specific
System, Architectures and Processors (ASAP 2005), IEEE, Samos, Greece, July 2005,
23-25, ISBN 0 7695 2407 9
T.R. Jacobs, V.A. Chouliaras, and D.J. Mulvaney, "Thread-Parallel MPEG-4 and H.264
Coders for System-on-Chip Multi-Processor Architectures", IEEE 2006 International
Conference on Consumer Electronics (ICCE 2006), IEEE, Las Vegas USA, 11th
January 2006, 91-92, ISBN 0 7803 9459 3/0
T.R. Jacobs, V.A. Chouliaras, and D.J. Mulvaney, "Thread-parallel MPEG-2, MPEG-4
and H.264 video encoders for SoC multi-processor architectures," Consumer Electronics,
IEEE Transactions on, vol. 52, no. 1, pp. 120-126, Feb. 2006
V.A. Chouliaras, S. Agha, T.R. Jacobs, and V.M. Dwyer, "Quantifying the benefit of
thread and data parallelism for fast motion estimation in MPEG-2", IET Electronics
Letters, vol. 42, issue 13, pp. 747-748, 22 June 2006
V.A. Chouliaras, K. Koutsomyti, T.R. Jacobs, S. Parr, D.J. Mulvaney, and R. Thomson,
"SystemC-defined SIMD instructions for high performance SoC architectures," in
Electronics, Circuits and Systems, 13th IEEE International Conference on, Nice, France,
2006.
V.A. Chouliaras, K. Koutsomyti, T.R. Jacobs, S. Parr, D.J. Mulvaney, and R. Thomson,
"SystemC-defined SIMD instructions for a CMP/SMT ASIC platform," in Norchip
conference in ASIC design, Proceedings of the 24th IEEE, Linkoping, Sweden, 2006, pp.
285-288.
V.A. Chouliaras, T.R. Jacobs, J.L. Nunez-Yanez, K. Manolopoulos, K. Nakos, and D.
Reisis, "Thread-Parallel Mpeg-2 and Mpeg-4 Encoders for Shared-Memory
System-On-Chip Multiprocessors," Computers and Applications, International Journal of, vol. 29,
issue 4, 2007


