A configurable vector processor for accelerating speech coding algorithms by Konstantia Koutsomyti (7201031)

A Configurable Vector Processor for Accelerating 
Speech Coding Algorithms 
By 
Konstantia Koutsomyti, MSc, BEng (Hons) 
A Doctoral Thesis submitted in partial fulfilment of the requirements for the 
award of Doctor of Philosophy of Loughborough University 
September 2007 
To my family 
ABSTRACT 
The growing demand for voice-over-packet (VoIP) services and multimedia-rich
applications has made the efficient, real-time implementation of low-bit-rate speech
coders on embedded VLSI platforms increasingly important. Such speech coders are
designed to substantially reduce bandwidth requirements, thus enabling dense
multi-channel gateways in a small form factor. This, however, comes at a high
computational cost, which mandates the use of very high-performance embedded processors.
This thesis investigates the potential acceleration of two major ITU-T speech coding
algorithms, namely G.729A and G.723.1, through their efficient implementation on a
configurable, extensible vector embedded CPU architecture. New scalar and vector ISAs
were introduced, which resulted in a reduction of up to 80% in the dynamic instruction
count of both workloads. These instructions were subsequently encapsulated into a
parametric, hybrid SISD (scalar processor)-SIMD (vector) processor. This work presents
the research and implementation of the vector datapath of this vector coprocessor, which
is tightly coupled to a SPARC V8-compliant CPU, the optimization and simulation
methodologies employed, and the use of Electronic System Level (ESL) techniques to
rapidly design SIMD datapaths.
ACKNOWLEDGEMENTS 
I would like to thank my supervisors Dr. Sekhrujit Datta and especially Dr. Vassilios A. 
Chouliaras for the continuous guidance and support throughout all the stages of this work. 
Their advice has been invaluable. 
I would like to deeply thank a very special person in my life, Vasilis, for his continuous
support and understanding through all these years. He drove me away from my little
home in Rafina and gave me the greatest opportunity of all: to open my mind, to learn and
to believe in myself. Without him none of this would have happened.
I would like to express my deep gratitude and love to my family for always standing by 
me and supporting me to follow my dreams and especially my beloved mother who 
taught me to always aim high. 
I acknowledge all my colleagues at Loughborough University, and especially Tom Jacobs,
for their support and companionship throughout these years. All the wonderful people I
met during my years in Loughborough, who have become good friends, have contributed
to this work, even without knowing it, by making these years special.
Finally, I would like to express my gratitude to the EPSRC for providing me with 
financial support during the course of this thesis. 
TABLE OF CONTENTS 
List of Figures .................................................................................................... ix 
List of Tables .................................................................................................. xiii 
List of Abbreviations ........................................................................................... xv 
Chapter 1 Introduction ............................................................................... 1 
1.1 Problem Formulation .................................................................................... 1 
1.2 VoIP .............................................................................................................. 4 
1.2.1 Description of the VoIP process ............................................................. 5 
1.2.2 VoIP Applications .................................................................................. 8 
1.2.3 Current state of the art ............................................................................ 8 
1.3 Programmable Architectures ......................................................................... 9 
1.3.1 General Purpose Processors ................................................................... 9 
1.3.2 DSP Processors .................................................................................... 10 
1.3.3 ASIC (Embedded) processors .............................................................. 10 
1.3.3.1 Configurable processors ................................................................ 10 
1.3.3.2 Reconfigurable Processors ............................................................ 11 
1.3.3.3 Fixed Processors ............................................................................ 12 
1.4 Hardwired Architectures ............................................................................. 12 
1.5 Research contribution and overview ........................................................... 13 
1.6 Thesis Outline ............................................................................................. 16 
1.7 References ................................................................................................... 18 
Chapter 2 Speech Coding Theory ............................................................. 22 
2.1 Introduction ................................................................................................. 22 
2.2 Speech Coding Objectives and Requirements ............................................ 22 
2.3 Speech production system ........................................................................... 24 
2.4 Coding strategies ......................................................................................... 26 
2.4.1 Waveform Coders ................................................................................ 26 
2.4.2 Voice Coders (Vocoders) ..................................................................... 27 
2.4.3 Hybrid Coders ...................................................................................... 28 
2.4.3.1 Analysis by Synthesis .................................................................... 29 
2.5 G.729A Speech Coding Standard ............................................................... 31 
2.6 G.723.1 Speech Coding Standard ............................................................... 33 
2.7 Summary ..................................................................................................... 36 
2.8 References ................................................................................................... 37 
Chapter 3 Software and Hardware Parallelism ..................................... 38 
3.1 Overview of Parallelism .............................................................................. 38 
3.2 Data Dependences ....................................................................................... 39 
3.2.1 Name Dependences .............................................................................. 40 
3.2.2 Control Dependences ........................................................................... 40 
3.3 Types of Parallelism .................................................................................... 41 
3.3.1 Instruction Level Parallelism ............................................................... 42 
3.3.1.1 Superscalar Processors .................................................................. 44 
3.3.1.2 VLIW Processors .......................................................................... 46 
3.3.2 Data Level Parallelism ......................................................................... 48 
3.3.2.1 Advantages of vector architectures ............................................... 48 
3.3.2.2 Vector Processors .......................................................................... 50 
3.3.3 Thread Level Parallelism ..................................................................... 52 
3.3.3.1 Shared-Memory Architecture ........................................................ 53 
3.3.3.2 Distributed-Memory Architecture ................................................. 54 
3.3.3.3 Multithreading Architecture .......................................................... 55 
3.3.4 Hybrid Approaches and Research ........................................................ 56 
3.4 Summary ..................................................................................................... 57 
3.5 References ................................................................................................... 58 
Chapter 4 Methodology and Architectural Results ............................... 62 
4.1 Introduction ................................................................................................. 62 
4.2 Simulation Infrastructure ............................................................................. 62 
4.2.1 SimpleScalar Toolset ............................................................................ 66 
4.2.2 Customizing the SimpleScalar Toolset ................................................ 68 
4.3 Workload Optimization ............................................................................... 69 
4.3.1 Profiling ................................................................................................ 69 
4.3.2 Vector ISA Development and Experimentation Methodology ............ 74 
4.3.3 Identification of Data Parallel Loops ................................................... 78 
4.3.4 Implementation of vector loop using custom ISA ............................... 80 
4.3.5 Scalar Optimization .............................................................................. 83 
4.3.6 Validation Tests .................................................................................... 84 
4.3.7 The extended ISA (Scalar and Vector Extensions) .............................. 86 
4.3.8 Inline Assembly .................................................................................... 87 
4.4 Architectural Results ................................................................................... 88 
4.5 Summary ................................................................................................... 102 
4.6 References ................................................................................................. 104 
Chapter 5 Vector Processor Architecture ............................................. 107 
5.1 Vector Architectural State ......................................................................... 107 
5.2 Programmer's Model ................................................................................. 109 
5.3 Vector Processor Instruction Set Architecture .......................................... 110 
5.3.1 Vector ISA .......................................................................................... 111 
5.3.1.1 Load/Store Instructions ............................................................... 111 
5.3.1.2 Move Instructions ........................................................................ 112 
5.3.1.3 Arithmetic Instructions ................................................................ 113 
5.3.1.4 Shift Instructions ......................................................................... 116 
5.3.1.5 Miscellaneous Instructions .......................................................... 117 
5.3.2 Scalar ISA ......................................................................................... 118 
5.3.2.1 Load/Store Instructions ............................................................... 118 
5.3.2.2 Move Instructions ........................................................................ 118 
5.3.2.3 Arithmetic Instructions ................................................................ 119 
5.3.2.4 Shift Instructions ......................................................................... 119 
5.3.2.5 Miscellaneous Instructions .......................................................... 120 
5.4 Leon3 CPU ................................................................................................ 120 
5.5 Overall System Architecture ..................................................................... 124 
5.5.1 Processor-coprocessor programmable unit ........................................ 125 
5.5.2 DMA taps ........................................................................................... 125 
5.5.3 PCI I/F ................................................................................................ 125 
5.5.4 External Memory Controller .............................................................. 125 
5.5.5 APB Subsystem .................................................................................. 126 
5.6 Summary ................................................................................................... 126 
5.7 References ................................................................................................. 127 
Chapter 6 Vector Processor Implementation ....................................... 128 
6.1 Overview ................................................................................................... 128 
6.2 Vector Decode Stage (VDEC) .................................................................. 130 
6.3 Vector Registers Stage (VREG) ................................................................ 133 
6.3.1 Reverse Data Process ......................................................................... 134 
6.3.2 Splat Data Process .............................................................................. 135 
6.3.3 Masking Process ................................................................................. 135 
6.3.4 Bypass process ................................................................................... 137 
6.3.5 Operands Selection ............................................................................. 139 
6.3.6 Register enable ................................................................................... 139 
6.3.7 Vector Register File (gxx_vreg_file) ................................................. 140 
6.3.7.1 Parameterisation .......................................................................... 140 
6.3.7.2 The vector register file implementation ...................................... 140 
6.3.8 Scalar Register File (gxx_sreg_file) ................................................... 143 
6.3.8.1 Parameterisation .......................................................................... 143 
6.3.8.2 Scalar register file implementation ............................................. 143 
6.3.9 Vlen register ....................................................................................... 145 
6.3.10 Overflow and Pred Flags .................................................................. 146 
6.4 Vector Load/Store Unit (gxx_vlsu) ........................................................... 146 
6.5 Vector Datapath Stage (VDP) ................................................................... 149 
6.5.1 Vector Adder Unit (gxx_vadd_dp) .................................................... 152 
6.5.2 Vector Multiplier Unit (gxx_vmult_dp) ............................................. 153 
6.5.3 Vector Shifter Unit (gxx_vshift_dp) .................................................. 155 
6.5.4 Vector Miscellaneous Unit (gxx_vmisc_dp) ...................................... 158 
6.5.5 Reverse Data Logic ............................................................................ 158 
6.5.6 Masking Process Logic ...................................................................... 159 
6.5.7 Bypassing network of the first VDP stage ......................................... 160 
6.5.8 Register Enable for the input VDP2 registers .................................... 160 
6.5.9 Second stage adder ............................................................................. 160 
6.5.10 Vector Accumulator File (gxx_vaccs) ............................................. 161 
6.5.10.1 Parameterisation ........................................................................ 162 
6.5.10.2 The vector accumulator implementation ................................... 163 
6.5.11 Vector Adder Tree (gxx_adder_tree) ............................................... 164 
6.5.12 VLSU unit interface with VDP2 ...................................................... 165 
6.5.13 Overflow and Predicate Flags .......................................................... 165 
6.5.14 Bypassing network of the second stage ............................................ 166 
6.5.15 Write Back ....................................................................................... 166 
6.6 Output Register Bunch .............................................................................. 166 
6.7 Leon3 ......................................................................................................... 166 
6.7.1 Decode Stage ...................................................................................... 167 
6.7.2 Register Access stage ......................................................................... 167 
6.7.3 Execute Stage ..................................................................................... 168 
6.7.4 Memory Stage .................................................................................... 168 
6.7.5 Exception Stage .................................................................................. 168 
6.8 Summary ................................................................................................... 170 
6.9 References ................................................................................................. 171 
Chapter 7 Vector Processor VLSI Implementation ............................. 172 
7.1 Design Verification ................................................................................... 172 
7.2 Synthesis and Place & Route Design Flow ............................................... 174 
7.2.1 Design Compiler Stage (Logical Synthesis) ...................................... 175 
7.2.2 SoC Encounter script Stage (Place and Route) .................................. 176 
7.2.3 Statistical Power Analysis Stage (Design Compiler) ......................... 176 
7.3 Implementation Campaign for Vector Datapath ....................................... 177 
7.4 Implementation Campaign for Vector Coprocessor .................................. 179 
7.5 VLSI Layout ............................................................................................. 182 
7.5.1 Vector Datapath Layout for VLMAX 16 ........................................... 182 
7.5.2 Vector Datapath Layout for VLMAX 32 ........................................... 184 
7.5.3 Vector Processor Layout for VLMAX 16 .......................................... 185 
7.6 ESL Implementation ................................................................................. 187 
7.6.1 SS_SPARC Platform .......................................................................... 187 
7.6.2 ESL Methodology .............................................................................. 191 
7.6.3 Micro-Architecture Results ................................................................ 191 
7.7 Summary ................................................................................................... 194 
7.8 References ................................................................................................. 195 
Chapter 8 Conclusions ............................................................................ 196 
8.1 Contribution of this thesis ......................................................................... 196 
8.2 Suggestions for future research ................................................................. 198 
8.3 References ................................................................................................. 200 
Appendix A Vector and Scalar ISA ................................................................. 201 
Appendix B Signal Description ........................................................................ 244 
Appendix C G.729A and G.723.1 Function Results ....................................... 246 
Author's Publications ....................................................................................... 260 
LIST OF FIGURES 
Figure 1-1: Traditional voice and data networks (a) and VoIP network (b) .................................... 1 
Figure 1-2: The architecture of H.323 protocol stack ....................................................................... 2 
Figure 1-3: Simplified representation of possible IP telephony network connections ..................... 5 
Figure 1-4: VoIP signalling and transport flow between endpoints ................................................. 6 
Figure 1-5: Open Systems Interconnection (OSI) and network protocols ........................................ 7 
Figure 2-1: Diagram of the human organs involved in speech production and the Spectral Range of 
Speech .................................................................................................................................... 24 
Figure 2-2: General speech production model ................................................................................ 25 
Figure 2-3: Analysis by Synthesis Code ......................................................................................... 30 
Figure 2-4: G.729A Encoder ........................................................................................................... 32 
Figure 2-5: G.729A Decoder ........................................................................................................... 33 
Figure 2-6: G.723.1 Encoder ........................................................................................................... 35 
Figure 2-7: G.723.1 Decoder ........................................................................................................... 35 
Figure 3-1: Code snippet that shows the data dependences ............................................................. 39 
Figure 3-2: Multiple-issuing of instructions in an ILP architecture ................................................ 43 
Figure 3-3: Dynamic Instruction Scheduling .................................................................................. 45 
Figure 3-4: Static Instruction Scheduling ........................................................................................ 46 
Figure 3-5: Basic Vector Processor Architecture ............................................................................ 52 
Figure 3-6: The basic architecture of a centralised shared-memory multiprocessor system ........... 53 
Figure 3-7: The basic architecture of a distributed-memory multiprocessor system ....................... 54 
Figure 4-1: SimpleScalar Infrastructure .......................................................................................... 67 
Figure 4-2: Machine instruction count for the BASOP.C functions ................................................ 71 
Figure 4-3: Experimentation Methodology ..................................................................................... 75 
Figure 4-4: The extended processor state as defined in the configuration file vstate.h ................... 76 
Figure 4-5: Example of a C macro Instruction Definition ............................................................... 77 
Figure 4-6: Example of a non-vectorizable loop, as statement S5 depends on a previous result 
of its own execution. The same dependency appears in statement S9 ................................... 79 
Figure 4-7: Example of a vectorizable loop with statements S2 and S3 being independent from 
previous results of their execution ......................................................................................... 79 
Figure 4-8: Example of loop with DLP within the original C code ................................................ 80 
Figure 4-9: Assign pointers and load the vlen register ................................................................... 81 
Figure 4-10: Main vector loop ......................................................................................................... 81 
Figure 4-11: Strip mining loop ........................................................................................................ 82 
Figure 4-12: Scalar optimization example ...................................................................................... 84 
Figure 4-13: Instruction Definition in Vector.def ........................................................................... 86 
Figure 4-14: Inline Assembly Instruction Definition ...................................................................... 87 
Figure 4-15: G.729A Encoder (Vector Only) Results ..................................................................... 89 
Figure 4-16: G.729A Decoder (Vector Only) Results ..................................................................... 90 
Figure 4-17: G.729A Encoder (Full Optimization) Results ............................................................ 90 
Figure 4-18: G.729A Decoder (Full Optimization) Results ............................................................ 91 
Figure 4-19: G.723.1 Encoder Vector Optimization Results ........................................................... 92 
Figure 4-20: G.723.1 Decoder Vector Optimization Results .......................................................... 92 
Figure 4-21: G.723.1 Encoder Full Optimization Results ............................................................... 93 
Figure 4-22: G.723.1 Decoder Full Optimization Results ............................................................... 93 
Figure 4-23: Cor_h_x (Full Optimization) Results ......................................................................... 94 
Figure 4-24: Syn_filt (Full Optimization) Results .......................................................................... 95 
Figure 4-25: Pitch_ol_fast (Full Optimization) Results .................................................................. 96 
Figure 4-26: Residu (Full Optimization) Results ............................................................................ 96 
Figure 4-27: Autocorr (Full Optimization) Results ......................................................................... 97 
Figure 4-28: Lsp_pre_select (Full Optimization) Results ............................................................... 98 
Figure 4-29: Agc (Full Optimization) Results ................................................................................. 98 
Figure 4-30: Find_Best (Full Optimization) Results ....................................................................... 99 
Figure 4-31: Estim_Pitch (Full Optimization) Results .................................................................. 100 
Figure 4-32: Comp_Lpc (Full Optimization) Results ................................................................... 101 
Figure 4-33: Decod_Acbk (Full Optimization) Results ................................................................ 101 
Figure 4-34: Comp_Pw (Full Optimization) Results ..................................................................... 102 
Figure 5-1: Example of an operation that is performed in two vector registers with vector length 
64-bits. Each functional unit is driven by the pair of the corresponding slices (vector 
elements) of the source vector registers. The produced results are stored back to the 
corresponding slices (vector elements) of the destination vector register ............................ 108 
Figure 5-2: Vector and Scalar coprocessor programmer's model ................................................. 109 
Figure 5-3: Vector Short Addition ................................................................................................ 114 
Figure 5-4: Vector Short Multiplication for even/odd elements .................................................... 114 
Figure 5-5: Vector multiply-add/sub ............................................................................................. 115 
Figure 5-6: Instruction Formats of Leon3 ..................................................................................... 122 
Figure 5-7: Unimplemented Instruction ........................................................................................ 122 
Figure 5-8: Overall system architecture ........................................................................................ 124 
Figure 6-1: The vector speech coprocessor microarchitecture with the four-stage pipeline: Vector 
Decode Stage (VDEC), Vector Register Access Stage (VREG) and two stages for the Vector 
Datapath Stage (VDP1 and VDP2) ....................................................................................... 129 
Figure 6-2: The electrical interface of the VDEC Stage ............................................................... 131 
Figure 6-3: The Unimplemented instruction format of the Sparc V8 architecture ........................ 131 
Figure 6-4: Different types of instruction formats of the vector processor ISA ............................ 132 
Figure 6-5: Vector Register Access Stage (VREG) microarchitecture ......................................... 134 
Figure 6-6: Reverse Data Process ................................................................................................. 134 
Figure 6-7: Splat Data Process ...................................................................................................... 135 
Figure 6-8: Mask width function .................................................................................................. 136 
Figure 6-9: Mask extract function ................................................................................................ 136 
Figure 6-10: Vector bypass process for one of the vector source operands and the intermediate 
result of one of the two VDP stages ..................................................................................... 138 
Figure 6-11: Scalar bypass process for the selection of one of the scalar operands (first) ........... 138 
Figure 6-12: Electrical Interface of Vector Register File .............................................................. 141 
Figure 6-13: Detailed microarchitecture of the Vector Register File with R/W conflict avoidance 
.............................................................................................................................................. 142 
Figure 6-14: Electrical Interface of Scalar Register File ............................................................... 143 
Figure 6-15: Detailed microarchitecture of the Scalar Register File with R/W conflict avoidance 
.............................................................................................................................................. 144 
Figure 6-16: VLSU Electrical Interface ........................................................................................ 147 
Figure 6-17: Parallel TAG/DATA configuration and Cascade TAG/DATA configuration caches 
.............................................................................................................................................. 148 
Figure 6-18: Microarchitecture of VLSU in cascade TAG/DATA configuration ........................ 149 
Figure 6-19: Microarchitecture of the VDP stage ......................................................................... 150 
Figure 6-20: Electrical interface of the vector adder unit .............................................................. 152 
Figure 6-21: Microarchitecture of a functional unit of the vector adder ....................................... 153 
Figure 6-22: Electrical interface of the vector multiplier unit ...................................................... 154 
Figure 6-23: Microarchitecture of a functional unit of the vector multiplier ................................ 155 
Figure 6-24: Electrical interface of the vector shifter unit ............................................................ 156 
Figure 6-25: Two Barrel Shifters connected in series for short or long shift operations .............. 156 
Figure 6-26: Microarchitecture of a functional unit of the vector shifter ..................................... 157 
Figure 6-27: Electrical interface of the vector miscellaneous unit ................................................ 158 
Figure 6-28: Masking process logic for low (vrf_opr2_r='1') or high (vrf_opr2_r='0') deposit for
the even elements of the input vectors to the accumulator .......... 159
Figure 6-29: Electrical interface of the second VDP stage vector adder .......... 161
Figure 6-30: Electrical Interface of Vector Accumulator File ....................................................... 162 
Figure 6-31: Write data and write-enable selection logic for the vector accumulator file ............ 163
Figure 6-32: Adder tree configuration for VLMAX 16 ................................................................. 164 
Figure 6-33: Leon3 integer unit and vector coprocessor datapath diagram ................................... 169 
Figure 6-34: Leon3 processor core block diagram ........................................................................ 170 
Figure 7-1: Example of recording the inputs and the outputs of the L_mult operation C macro .. 172
Figure 7-2: Test bench for the vector mult unit of the vector datapath ......................................... 173 
Figure 7-3: Vector coprocessor testbench configuration ............................................................... 174 
Figure 7-4: Script in pseudocode for the design flow of the vector coprocessor ....................... 175
Figure 7-5: Statistical power results of vector datapath for different vector lengths ..................... 177 
Figure 7-6: Statistical area results of vector datapath for different vector lengths ........................ 178 
Figure 7-7: Frequency results of vector datapath for different vector lengths ............................... 179 
Figure 7-8: Statistical power results of vector coprocessor for different vector lengths ............... 180 
Figure 7-9: Statistical area results of vector coprocessor for different vector lengths ................... 181 
Figure 7-10: Frequency results of vector coprocessor for different vector lengths ....................... 182 
Figure 7-11: Vector Datapath macrocell for VLMAX 16 ............................................................. 183 
Figure 7-12: Vector Datapath macrocell for VLMAX 32 ............................................................. 185 
Figure 7-13: Layout for the whole vector processor (vector datapath and VLSU unit) ................ 186
Figure 7-14: High level view of a 3-instance SS SPARC kernel ................................................. 187
Figure 7-15: Superscalar SMT pipeline organisation .................................................................... 188 
Figure 7-16: Scalar core (SCORE) pipeline organization ............................................................. 189 
Figure 7-17: Dual-pipeline vector unit organization ..................................................................... 190 
Figure 7-18: ITU VCore Power Results ........................................................................................ 192 
Figure 7-19: ITU VCore Area-Delay Results ................................................................................ 193 
Figure 7-20: Two-context, 256-bit ITU vector engine .................................................................. 193
LIST OF TABLES 
Table 1-1: ITU Standards for Voice Compression ............................................................................ 4
Table 4-1: SimpleScalar baseline simulator models ........................................................................ 67 
Table 4-2: Relative amount of time spent outside the basic instructions ........................................ 70 
Table 4-3: Relative number of total instructions executed outside the DSP emulation instructions 
............................................................................................................................................... 70 
Table 4-4: G.723.1 Unmodified Workloads Instruction Count ....................................................... 72 
Table 4-5: G.729A Unmodified Workloads Instruction Count.. ..................................................... 72 
Table 4-6: Profiling the G.729A functions by using the speech workload ...................................... 73 
Table 4-7: Profiling the G.723.1 functions by using the 6.3kbits/s workload ................................. 74 
Table 4-8: G729 Encoder Test Vectors ........................................................................................... 84 
Table 4-9: G729 Decoder Test Vectors ........................................................................................... 85 
Table 4-10: G.723.1 Encoder and Decoder Test Vectors ................................................................ 85 
Table 5-1: Vector Load/Store Instructions .................................................................................... 112 
Table 5-2: Vector Move Instructions ............................................................................................ 113 
Table 5-3: Arithmetic Instructions ................................................................................................ 116 
Table 5-4: Vector Shift Instructions .............................................................................................. 117 
Table 5-5: Vector Miscellaneous Instructions ............................................................................... 117 
Table 5-6: Scalar Load/Store Instructions ..................................................................................... 118 
Table 5-7: Scalar Move Instructions ............................................................................................. 118 
Table 5-8: Scalar Arithmetic Instruction ....................................................................................... 119 
Table 5-9: Scalar Shift Instructions ............................................................................................... 120 
Table 5-10: Scalar miscellaneous instructions .............................................................................. 120 
Table 5-11: Enhanced op2 Encoding (Format 2) .......................................................................... 123 
Table 6-1: Compile-time vector processor parameters for its architectural and microarchitectural
state that are contained in gxx_config.vhd file .................................................................... 130
Table 6-2: The allowed silicon technologies that are used for synthesis and place and route
contained in gxx_config.vhd file ......................................................................................... 130
Table 6-3: Compile-time vector register file parameters for its architectural and microarchitectural
state that are contained in gxx_config.vhd file .................................................................... 140
Table 6-4: Compile-time vector accumulator file parameters for its architectural and
microarchitectural state that are contained in gxx_config.vhd file ...................................... 163
Table 7-1: VLSI Layout physical parameters for VDP with VLMAX 16 ................................... 183
Table 7-2: VLSI Layout physical parameters for VDP with VLMAX 32 .................................... 184
Table 7-3: VLSI Layout physical parameters for VCOP with VLMAX 16 ................................. 186
LIST OF ABBREVIATIONS 
Abbreviation Expansion
ABI Application Binary Interface
AbS Analysis by Synthesis
ADL Architecture Description Language
ADM Adaptive Delta Modulation
ADPCM Adaptive Differential Pulse Code Modulation
AHB Advanced High-performance Bus
ALU Arithmetic Logic Unit
AMBA Advanced Microcontroller Bus Architecture
APB Advanced Peripheral Bus
ASIC Application Specific Integrated Circuit
ATC Adaptive Transform Coding
ATM Asynchronous Transfer Mode
BASOP Basic Operations
CAS Cycle Accurate Simulators
CATV Cable TV
CCITT International Telephone and Telegraph Consultative Committee
CELP Code Excited Linear Prediction
CISC Complex Instruction Set Computer
CLB Configurable Logic Block
CMP Chip Multi-Processing
CNG Comfort Noise Generation
CPI Cycles per Instruction
CPU Central Processing Unit
CS-ACELP Conjugate-Structure Algebraic Code Excited Linear Prediction
DLP Data Level Parallelism
DMA Direct Memory Access
DSVD Digital Simultaneous Voice and Data
DSL Digital Subscriber Line
DSM Distributed Shared Memory
DSP Digital Signal Processing
EDA Electronic Design Automation
EPIC Explicitly Parallel Instruction Computing
ESL Electronic System Level
FEC Forward Error Correction
FLI Foreign Language Interface
FLOPS Floating point Operations Per Second
FPGA Field Programmable Gate Array
FTTH Fibre to the Home
ILP Instruction Level Parallelism
IP Internet Protocol
ISA Instruction Set Architecture
ISDL Instruction Set Description Language
ISPS Instruction Set Processor Specification
ISS Instruction-accurate Simulator
ITU International Telecommunication Union
LAN Local Area Network
LISA Language for Instruction Set Architecture
LPC Linear Predictive Coding
LSP Line Spectral Pair
MAC Multiply and Accumulate
MBEV Multi-Band Excited Vocoder
MELP Multi-pulse Excited Linear Prediction
MIMD Multiple Instruction Multiple Data
MIPS Million Instructions Per Second
MISD Multiple Instruction Single Data
MOS Mean Opinion Score
MPEG Moving Picture Experts Group
MP-MLQ Multi-Pulse Maximum Likelihood Quantization
NUMA Non Uniform Memory Access
OS Operating System
OSI Open Systems Interconnection
PCI Peripheral Component Interconnect
PISA Portable Instruction Set Architecture
PCM Pulse Code Modulation
PSVQ Predictive Split Vector Quantizer
QoS Quality of Service
RAS Registration/Admission/Status channel
RAM Random Access Memory
RISC Reduced Instruction Set Computer
PSTN Public Switched Telephone Network
RELP Residual Excited Linear Prediction
RTL Register Transfer Level
RTP Real Time Transport Protocol
RTCP RTP Control Protocol
SBC Sub-Band Coding
SDRAM Synchronous Dynamic RAM
SISD Single Instruction Single Data
SIMD Single Instruction Multiple Data
SIP Session Initiation Protocol
SMT Simultaneous Multi-Threading
SMP Symmetric Multi-Processing
SoC System on Chip
SPARC Scalable Processor Architecture
SRAM Static RAM
SREGS Scalar Registers
SRF Scalar Register File
TSMC Taiwan Semiconductor Manufacturing Company
TCP Transmission Control Protocol
TLP Thread Level Parallelism
UART Universal Asynchronous Receiver Transmitter
UDL/I Unified Design Language for Integrated circuit
UDP User Datagram Protocol
ULIW Ultralong Instruction Word
UMA Uniform Memory Access
VACC Vector Accumulator
VDEC Vector Decode Stage
VDP Vector Datapath Stage
VHDL Very high speed integrated circuit HDL
VLIW Very Long Instruction Word
VLMAX Vector Length MAXimum
VLSU Vector Load/Store Unit
VoIP Voice over Internet Protocol
VREG Vector Register access Stage
VREGS Vector Registers
VRF Vector Register File
WAN Wide Area Network
WiFi Wireless Fidelity
XST Xilinx Synthesis Technology
CHAPTER 1 
INTRODUCTION 
1.1 Problem Formulation 
Ever-advancing technologies have enabled the worldwide convergence of voice and data
communications in a single network infrastructure. This is the domain of packet-switched
networks such as those based on the Internet Protocol (IP), which lead to significant savings
in cost and infrastructure deployment as well as to bandwidth efficiency [1]. Voice over
Internet Protocol (VoIP) is one such example, which uses IP to send digitised voice/data as a
reliable alternative to traditional circuit-switched communication. In VoIP, the voice
network is integrated into the Local Area Network (LAN) and is connected to the
traditional Public Switched Telephone Network (PSTN) through a gateway. The gateway
is a special piece of equipment which handles the translation of signals from the PSTN
into IP packets, required for the transmission across the Internet, and vice versa [2]. Figure
1-1 depicts the general model of traditional voice and data networks that are separated (a)
and a VoIP network that encompasses both in the same infrastructure (b).
Figure 1-1: Traditional voice and data networks (a) and VoIP network (b)
The transition from circuit-switched to packet-switched networks enables applications
that go beyond simple voice transmission, embracing other forms of data and allowing
them all to travel over the same infrastructure [2]. Packet-switched networks such as the
Internet, intranets, LANs and WANs encode the message and transmit it in the form of
packets, which are blocks of data with added header and trailer information. Packet networks
do not need a dedicated link between transmitter and receiver, hence there is a lower cost per
communication session as most interconnection charges are avoided. Additionally, the
required infrastructure is minimal because all the real-time applications use the existing
network. Consolidation of the different networks into one simplifies the equipment,
protocols and software, and hence enables better service to be provided at low cost and with
more efficient use of the resources. In the last few years, large corporations have been
migrating their communications to a single network infrastructure. The
Japanese government decided in 2002 to establish an environment for the widespread use
of IP telephony services. This decision initiated the development of key technologies for
IP telephony [3]. In November 2006, BT began to replace its existing telephone
network with one based entirely on the Internet Protocol (IP). When this is completed, the
telephone system and the Internet will share the very same network infrastructure [4].
Since the early days of VoIP, it has been clear that a common protocol stack was needed
in order to enable its development and spread. In 1996
the H.323 [5] recommendation was issued by the International Telecommunication Union
(ITU) and revised in 1998, at which time the framework of an IP network was defined.
H.323 was the basis for the first widely used VoIP systems. It specifies a number of
protocols for speech coding, call setup, signalling, data transport and other areas [6]. The
architecture of the H.323 protocol stack is depicted in Figure 1-2.
[Figure 1-2 diagram; legible layers: RTP/RTCP, TCP, IP, Data Link Layer, Physical Layer]
Figure 1-2: The architecture of the H.323 protocol stack
The H.323 standard incorporates the following ITU protocols:
• Audio Codecs: G.7xx Series 
• Video Codecs: H.26x Series 
• RTP: Real Time Transport Protocol 
• RTCP: RTP Control Protocol 
• RAS (H.225): Registration/Admission/Status channel controlled by the 
H.225 gatekeeper protocol 
• H.245: Call (connection) Control, selects the compression algorithms, bit 
rate etc 
• Q.931 (H.225): Call signalling
• UDP: User Datagram Protocol
• TCP: Transmission Control Protocol
H.323 provides a complete protocol stack for real-time multimedia, conferencing (voice
and video) and data transfer [2]. It played a key role in the widespread use of VoIP
services, as H.323 gateways are the interface between the PSTN and packet-switched
networks [7]. These gateways employ speech coding algorithms that encode the audio
signal prior to transmission and decode it on reception. VoIP specifies a significantly
smaller voice bandwidth than a traditional PSTN, which operates at a constant rate of
64 kbits/s. Speech coding is the process of digitally encoding speech in order to reduce the bit
rate of its representation during digital transmission, while maintaining an acceptable
speech quality. Speech coding or compression algorithms provide good quality
communication over packet-based networks and reduce network bandwidth requirements.
Hence efficient coding of human speech is of paramount importance. The H.323
multimedia standard supports a number of common ITU codecs such as G.711 [8], G.726
[9], G.728 [10], G.729A [11] and G.723.1 [12] for interoperability reasons. These codecs
differ in bit rate, implementation complexity, coding delay and voice quality. G.711
is a compulsory recommendation that specifies a simple A/µ-law codec that produces toll
quality speech with low computational complexity, typically 1 MIPS, but requires up to
64 kbits/s of bandwidth. G.729A and G.723.1 are the most popular for bandwidth-limited
transmission channels. G.729A was designed for simultaneous voice and data
applications while G.723.1 was intended for low-bit-rate videophones [13]. These speech
coding algorithms are very computationally intensive and consist of a number of sections
of code executing in tight loops and processing arrays of data. More details about these
codecs and speech coding theory are given in Chapter 2. Table 1-1 shows the
characteristics of the aforementioned codecs that are widely employed in VoIP services.
With the growing demand for VoIP services, it has become increasingly important to
implement these algorithms efficiently. Codec optimization minimizes the processor
loading and enables the system to support more voice channels per silicon area, while
maintaining low power consumption [7][14].
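The logarithmic companding behind the A/µ-law codec of G.711 can be illustrated with a minimal sketch. It uses the continuous µ-law formula with µ = 255 and a uniform 8-bit quantizer; this is an illustrative assumption and not the bit-exact segmented encoder defined in the recommendation. The function names are hypothetical.

```python
import math

MU = 255.0  # mu-law constant (assumed; the A-law variant uses a different curve)

def mu_law_compress(x: float) -> float:
    """Map a sample in [-1, 1] to [-1, 1] using logarithmic companding."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y: float) -> float:
    """Inverse of mu_law_compress."""
    y = max(-1.0, min(1.0, y))
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize_8bit(y: float) -> int:
    """Uniformly quantize a companded value to a signed 8-bit code (illustrative only)."""
    return max(-127, min(127, round(y * 127)))

# 8000 samples/s at 8 bits/sample gives the 64 kbits/s rate quoted for G.711.
BIT_RATE = 8000 * 8
```

Small amplitudes receive finer resolution than large ones, which is why 8 bits per sample are sufficient for toll quality speech at very low computational cost.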
Table 1-1: ITU Standards for Voice Compression
ITU Specification   Transmission Rate   Computation Complexity   Mean Opinion
                    (kbits/s)           (MIPS)                   Score
G.711               56/64               1                        4.1
G.723.1             5.3/6.3             16                       3.65/3.9
G.726               32                  2                        3.85
G.728               16                  30                       3.61
G.729/G.729A        8                   20/11                    3.92/3.7
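The trade-off in Table 1-1 between bandwidth and computation can be made concrete with a short sketch. The bit rates and MIPS figures are taken from the table; the 400 MIPS processor budget and the helper names are hypothetical and serve only to illustrate the arithmetic.

```python
PCM_RATE_KBPS = 64.0  # PSTN / G.711 reference rate quoted in the text

# (bit rate in kbits/s, complexity in MIPS), values from Table 1-1
CODECS = {
    "G.711": (64.0, 1),
    "G.723.1 (6.3)": (6.3, 16),
    "G.723.1 (5.3)": (5.3, 16),
    "G.726": (32.0, 2),
    "G.728": (16.0, 30),
    "G.729": (8.0, 20),
    "G.729A": (8.0, 11),
}

def compression_ratio(rate_kbps: float) -> float:
    """Bandwidth reduction relative to 64 kbits/s PCM."""
    return PCM_RATE_KBPS / rate_kbps

def channels_for_budget(budget_mips: int, codec_mips: int) -> int:
    """Number of voice channels a processor with the given MIPS budget can host."""
    return budget_mips // codec_mips

if __name__ == "__main__":
    budget = 400  # hypothetical processor budget in MIPS
    for name, (rate, mips) in CODECS.items():
        print(f"{name:14s} ratio {compression_ratio(rate):5.1f}x, "
              f"{channels_for_budget(budget, mips):3d} channels")
```

The low-bit-rate codecs save the most bandwidth but also consume the most processing per channel, which is the motivation for accelerating them.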
This research presents the design and implementation of a high performance custom
vector processor to accelerate these speech coding algorithms, which are typically used for
voice compression at the gateway of a VoIP network or for multimedia applications.
More specifically, a controlling CPU (Leon3) and a closely-coupled, configurable,
extensible vector coprocessor were researched and developed as SoC components [15]. A
vector processor was selected as it is generally accepted that, for multimedia processing,
SIMD execution units with wide datapaths are able to achieve significant speedups
compared to existing scalar architectures without much added complexity [16]. The
vector coprocessor is a hybrid SISD (scalar processor)-SIMD (vector processor). The idea
of closely coupling vector coprocessors to a superscalar CPU was proposed in
the late 1980s [17]. This combined scalar/vector architecture can lead to an order of
magnitude improvement in workload performance and result in reduced area/power/cost
per voice channel compared to the existing solutions.
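The tight loops over arrays of data mentioned above are typically multiply-accumulate loops built from saturating fixed-point basic operations. The following sketch models two such operations in the style of the ITU-T basic operators L_mult and L_mac, and contrasts a scalar loop with a loop that consumes VLMAX elements per step. The chunked loop is only a software model of the SIMD idea under the assumption of 16-bit elements; it is not the instruction set developed in this thesis.

```python
MAX_32 = 0x7FFFFFFF
MIN_32 = -0x80000000

def saturate32(value: int) -> int:
    """Clamp a result to the signed 32-bit range."""
    return max(MIN_32, min(MAX_32, value))

def l_mult(a: int, b: int) -> int:
    """Fractional multiply of two 16-bit values: (a * b) << 1 with saturation."""
    return saturate32(a * b * 2)

def l_mac(acc: int, a: int, b: int) -> int:
    """Saturating multiply-accumulate: acc + l_mult(a, b)."""
    return saturate32(acc + l_mult(a, b))

def dot_scalar(x, y) -> int:
    """One multiply-accumulate per loop iteration."""
    acc = 0
    for a, b in zip(x, y):
        acc = l_mac(acc, a, b)
    return acc

def dot_vector(x, y, vlmax: int = 16):
    """Model of a vector loop: vlmax products per step, accumulated in order."""
    acc = 0
    steps = 0
    for i in range(0, len(x), vlmax):
        products = [l_mult(a, b) for a, b in zip(x[i:i + vlmax], y[i:i + vlmax])]
        for p in products:  # same order as the scalar loop, so results match bit for bit
            acc = saturate32(acc + p)
        steps += 1
    return acc, steps
```

For a 40-element input the scalar loop iterates 40 times while the modelled vector loop with VLMAX 16 needs 3 steps. A hardware adder tree would change the accumulation order, which can alter results when saturation occurs.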
1.2 VoIP 
VoIP supports near-real-time, multidirectional voice exchanges by employing the Internet
Protocol as transport technology. VoIP is a technology that has changed the
way that people communicate, and its power and versatility make it increasingly pervasive
in embedded applications [18]. By merging the two traditional network infrastructures,
data (LAN) and voice (PSTN), the required equipment and the expertise for their
maintenance are simplified. Figure 1-3 shows possible IP telephony network connections
and components of a typical VoIP system.
[Figure 1-3 diagram; legible components: Analog Phone, IP Phone, PC, Mobile Phone, Gateway, Switch]
Figure 1-3: Simplified representation of possible IP telephony network connections
1.2.1 Description of the VoIP process 
Traditional voice networks such as the PSTN employ digital switching technology to
establish a dedicated link (circuit) between nodes and terminals for communication [2].
Each such dedicated circuit cannot be used by other callers, even if it is not active, until it
is released and a new call is set up. On the other hand, in packet-switched networks, the
digital information is encapsulated in packets that are routed between nodes over data
links shared with other packet traffic. In each network node, packets are queued or
buffered, resulting in variable delay, whereas in circuit switching there is constant delay
and transmission bit rate between the nodes. Packet switching is categorized into
datagram (connectionless) switching, such as Ethernet and IP networks, and virtual circuit
(connection-oriented) switching, such as Asynchronous Transfer Mode (ATM), X.25 etc. [19].
The connection stages between two endpoints in a VoIP system are illustrated in Figure
1-4. These stages incorporate the following functions: the signalling process,
encoding/decoding, the transport mechanism, and the switching gateway. In the
beginning the signalling process takes place and establishes the communication between
the handset and the phone network.
[Figure 1-4 diagram; legible blocks: Encoder/decoder (e.g. G.723.1), (e.g. RTP), Transport (e.g. UDP), IP; legend: dashed line = Signalling, solid line = Data Flow]
Figure 1-4: VoIP signalling and transport flow between endpoints
The signalling process is responsible for maintaining and terminating the connection
between the nodes and hence it is active for the whole duration of the communication. As
VoIP transmission is packet-based, the data/voice message that is sent during the
communication is digitized and separated into frames which are encoded by the chosen
speech coder to reduce bandwidth requirements. The resulting bitstream is then
packetized and inserted into the IP network, where it follows one or more transport
protocols. Afterwards, it goes through a number of switches and eventually reaches the
receiving gateway. The switching gateway ensures the packet set's interoperability with a
different destination IP-based system or a PSTN system. At the receiving end, the
bitstream is de-packetized, decoded by the equivalent speech decoder and converted
back to an audio signal [2].
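The packetization step described above can be sketched as follows. The example wraps already encoded speech frames in the fixed 12-byte RTP header (version 2, no padding, extension or CSRC entries) and parses them back at the receiver. The frame contents are placeholder bytes, since no speech coder is run here; the 10-byte frame size, 80 samples per frame and payload type 18 correspond to G.729, and the SSRC value and function names are assumptions made for illustration.

```python
import struct

RTP_VERSION = 2
PT_G729 = 18  # static RTP payload type assigned to G.729

def build_rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int,
                     payload_type: int = PT_G729) -> bytes:
    """Prefix one encoded frame with a fixed 12-byte RTP header."""
    byte0 = RTP_VERSION << 6       # version 2, no padding, no extension, CSRC count 0
    byte1 = payload_type & 0x7F    # marker bit clear
    header = struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + payload

def parse_rtp_packet(packet: bytes) -> dict:
    """De-packetize: split the header fields from the encoded frame."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {"version": byte0 >> 6, "payload_type": byte1 & 0x7F, "seq": seq,
            "timestamp": timestamp, "ssrc": ssrc, "payload": packet[12:]}

def packetize(frames, ssrc: int = 0x1234, samples_per_frame: int = 80):
    """One packet per frame; the timestamp advances by the samples each frame covers."""
    return [build_rtp_packet(frame, seq=i, timestamp=i * samples_per_frame, ssrc=ssrc)
            for i, frame in enumerate(frames)]

if __name__ == "__main__":
    encoded = [bytes(10)] * 3      # placeholder 10-byte frames
    for pkt in packetize(encoded):
        print(parse_rtp_packet(pkt))
```

In a real system each packet would then be handed to UDP/IP; the sequence number and timestamp let the receiver detect loss and reorder or time the decoded frames.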
The communication protocols enable interoperability of the system and are part of the H.323
or SIP protocol stacks. SIP, or Session Initiation Protocol, is an alternative to the large,
complex and inflexible H.323. It was developed specifically for IP telephony and other
Internet services, but is simpler than H.323 and can adapt more easily to future
applications. There are several types of signalling protocol running concurrently at
various levels. The various levels of protocol are categorised according to their function
in a standardised seven-layer model that is called Open Systems Interconnection (OSI)
and is depicted in Figure 1-5 [20].
[Figure 1-5 diagram; OSI Model layers: Application, Presentation, Session, Transport, Network, Data link, Physical; shown alongside the TCP/IP Model and TCP/IP Protocols; legible physical-level entries: Ethernet IEEE 802.3 (twisted pair, optical fibre), IEEE 802.11 Wi-Fi]
Figure 1-5: Open Systems Interconnection (OSI) and network protocols
The three upper layers (Application, Presentation and Session) support the users' applications,
allowing the data moving through the network to be defined in an abstract, higher-level way
so that it can be exchanged between different users. The four lower layers (Physical, Data link,
Network and Transport) are used for formatting, encoding and transmission of the data
over the network. An IP network operates in the first three layers and the transport layer
passes the data from above to the network layer. The transport layer isolates the upper
layers from changes of the hardware and controls the movement of the packets, performs
error checking etc. [6]. Voice and video for real-time communications use UDP (User
Datagram Protocol) packet transport instead of TCP (Transmission Control Protocol) as
the shortest delivery time is more critical than packet loss. However, media delivery
using UDP is sensitive to packet delay and loss, hence QoS (Quality of Service) for
multimedia communications is very important [21]. More specifically, the level of
intrinsic QoS (latency, jitter, dropped packet rate) for the packet-switched services must
be determined in order to assure adequate perceived QoS [18].
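The three intrinsic QoS quantities named above (latency, jitter, dropped packet rate) can be estimated from packet timestamps and sequence numbers. The sketch below uses the running interarrival jitter estimator defined for RTP, J = J + (|D| - J)/16; the input lists of send and receive times in milliseconds and the function names are illustrative assumptions.

```python
def mean_latency(send_times, recv_times) -> float:
    """Average one-way delay over the packets that arrived."""
    delays = [r - s for s, r in zip(send_times, recv_times)]
    return sum(delays) / len(delays)

def interarrival_jitter(send_times, recv_times) -> float:
    """Running jitter estimate: J += (|D| - J) / 16 for each consecutive packet pair."""
    jitter = 0.0
    for i in range(1, len(send_times)):
        d = (recv_times[i] - recv_times[i - 1]) - (send_times[i] - send_times[i - 1])
        jitter += (abs(d) - jitter) / 16.0
    return jitter

def loss_rate(received_seqs, expected_count: int) -> float:
    """Fraction of expected packets that never arrived."""
    return 1.0 - len(set(received_seqs)) / expected_count

if __name__ == "__main__":
    send = [0, 20, 40, 60]   # ms, one packet every 20 ms
    recv = [50, 72, 89, 110]
    print(mean_latency(send, recv), interarrival_jitter(send, recv))
    print(loss_rate([0, 1, 3], expected_count=4))
```

A constant network delay yields zero jitter regardless of its size, which is why latency and jitter have to be tracked separately.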
1.2.2 VoIP Applications 
Even though VoIP is a technology for transferring voice over IP packets, it is not
restricted only to that. Broadband IP networks using xDSL (Digital Subscriber Line),
FTTH (Fibre-to-the-home), and CATV (Cable TV) lines have increased the available
bandwidth and hence the voice quality in VoIP has improved while making additional
concurrent visual communication possible [3]. The VoIP infrastructure facilitates an
entirely new set of networked real-time applications, such as videoconferencing, remote
video surveillance, analog telephone adapters, multicasting, instant messaging, gaming,
electronic whiteboards etc. Another feature added by IP services is the automatic rerouting
of phone calls on the PSTN to a user's VoIP phone connected to a network node. In this
way a globally enabled cellphone network is possible without roaming charges, as the user's
location is seen as just another network connection point. IEEE 802.11-enabled VoIP
handsets allow conversation in worldwide WiFi hotspots without compatibility issues
[2].
1.2.3 Current state of the art 
VoIP implementation depends heavily on the evolution of hardware and software
technology. Much effort has focused on developing techniques to meet the QoS
requirements and on ensuring the performance and reliability of PSTN networks at
significantly lower cost. Many protocols and standards have emerged in recent years and
made VoIP feasible. At the same time, many factors need to be balanced to produce a
cost-effective product with toll quality voice [19].
In a VoIP ASIC, processor selection is very important as this has a direct consequence on
the allocation of time-critical (speech coding, voice activity detection, echo cancellation)
and non-critical (signalling protocols, operating system, user interface) tasks. In addition,
it significantly affects the ASIC cost as the core CPU system is typically the most
expensive piece of silicon IP. The processor is usually a stand-alone 32-bit RISC engine
with either a) custom instruction extensions, b) large-capacity on-board DSP processing or
c) a loosely-coupled external DSP/coprocessor. The custom instruction extensions or the
DSP/coprocessor perform the voice processing operations while the RISC processor
handles only the control functions, thereby enabling the main processor to support more
than one channel.
A popular architecture in VoIP gateways is the dual core processor organization (c) that
integrates both RISC and DSP cores within a single package. The software development,
debugging and management of inter-processor communication for this solution are
complicated and time consuming. Another popular solution is the RISC/DSP organization (b),
which has dual execution units but a single instruction set architecture. In this way, there
is no need for inter-processor communication and hence there is smaller overhead and
better voice quality [19]. A targeted architecture that can efficiently perform the
mathematically intensive operations and has zero-overhead loops, barrel shifters and
modulo addressing can therefore improve the system performance dramatically.
Dedicated on-chip DSP/coprocessor memories keep
the algorithm coefficients and voice sample data on-board, maintaining the processing
throughput. Additionally, an integrated solution simplifies the overall complexity and
reduces time to market [22].
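Modulo addressing, one of the DSP features listed above, can be illustrated with a FIR filter whose delay line is a circular buffer. Instead of shifting every stored sample on each input, only an index is updated modulo the buffer length, which is the operation that dedicated address generators perform in hardware. The class below is a minimal software sketch with hypothetical names and integer coefficients.

```python
class CircularFIR:
    """FIR filter whose delay line is addressed modulo its length."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.delay = [0] * len(self.coeffs)
        self.head = 0  # index of the most recent sample

    def step(self, sample: int) -> int:
        n = len(self.delay)
        # Modulo address update: move the write position instead of shifting the array.
        self.head = (self.head - 1) % n
        self.delay[self.head] = sample
        acc = 0
        for k, c in enumerate(self.coeffs):
            # delay[(head + k) % n] holds the sample received k steps ago.
            acc += c * self.delay[(self.head + k) % n]
        return acc

if __name__ == "__main__":
    fir = CircularFIR([1, 2, 3])
    print([fir.step(x) for x in [1, 0, 0, 0]])  # impulse response reproduces the coefficients
```

On a processor with zero-overhead loops the inner loop would also run without branch or counter instructions, leaving one multiply-accumulate per tap.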
1.3 Programmable Architectures 
1.3.1 General Purpose Processors 
In the past, general-purpose processor design was driven mostly by non-real-time, stand-
alone applications which were largely nonnumeric with little inherent parallelism. The
proliferation of multimedia-rich applications that involve significant real-time processing
of continuous media data streams has forced profound changes in computer architecture.
Since semiconductor technology imposes no limitations, general-purpose
processors can significantly accelerate media-intensive processing with relatively simple
architectural support and the addition of instruction set extensions [16]. Over recent
years the major vendors of general-purpose processors have announced the addition of
instruction extensions to their ISAs to increase the performance of multimedia
applications. These instruction extensions are based on a subword execution model. This
model uses the whole width of the processor datapath by processing smaller data types,
typically found in signal processing (8 or 16 bits), in parallel when executing common
multimedia operations [23]. Examples of general-purpose processors with added
multimedia extensions are Intel's x86 with MMX [24] and SSE [25] extensions, Sun's
UltraSparc enhanced with VIS [26], PowerPC with AltiVec [27], Silicon Graphics' MIPS
V with MDMX [28], Compaq's Alpha with MVI, and Hewlett-Packard's PA-RISC with
MAX2 [29] extensions.
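The subword execution model can be sketched in software by packing four 16-bit values into one 64-bit word and adding all four lanes with a single wide operation while preventing carries from crossing lane boundaries. The masks and function names below are illustrative; the addition wraps modulo 2^16 per lane, whereas multimedia extensions usually offer saturating variants as well.

```python
LANES = 4
LANE_BITS = 16
WORD_MASK = (1 << 64) - 1
HIGH_BITS = 0x8000_8000_8000_8000  # top bit of every 16-bit lane
LOW_BITS = 0x7FFF_7FFF_7FFF_7FFF   # remaining 15 bits of every lane

def pack(lanes) -> int:
    """Pack four 16-bit values into one 64-bit word, lane 0 in the low bits."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFFFF) << (i * LANE_BITS)
    return word

def unpack(word: int):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & 0xFFFF for i in range(LANES)]

def packed_add(a: int, b: int) -> int:
    """Add four 16-bit lanes at once, wrapping modulo 2**16 within each lane."""
    # Adding only the low 15 bits of each lane cannot carry into the next lane.
    low = (a & LOW_BITS) + (b & LOW_BITS)
    # The top bit of each lane is then restored with an exclusive OR.
    return (low ^ ((a ^ b) & HIGH_BITS)) & WORD_MASK

if __name__ == "__main__":
    x = pack([1, 0xFFFF, 1000, 0x8000])
    y = pack([2, 1, 2000, 0x8000])
    print(unpack(packed_add(x, y)))
```

A hardware subword adder achieves the same effect by cutting the carry chain at the lane boundaries, so one instruction performs four independent additions.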
1.3.2 DSP Processors 
The software implementation of speech codecs on DSP processors is a popular choice as
these processors are better tuned to signal processing algorithms than general-
purpose processors. This is due to the advances in DSP architecture that effectively
execute the repetitive computations on data streams present in these algorithms through
multiple functional units that operate in parallel and through SIMD operations. These
techniques are implemented using mechanisms with lower complexity than those of
general-purpose processors and significantly speed up the execution of these applications
while keeping power consumption low [30]. Several projects [31] [32] [33] employ Texas
Instruments DSPs to implement G.723.1 in real time after applying some iterative
refinement and optimization to the reference C code. Motorola implements G.729A on the
StarCore SC140 [34] after optimizing the C reference code of the algorithm, and Samsung
implements G.723.1 on its SSP1820 DSP [35]. Another optimized solution is integrating
conventional general-purpose RISC processors and DSP cores with dedicated
functionality into a single, unified architecture such as the Hitachi SHx-DSP [36] and the
Infineon TriCore [37].
1.3.3 ASIC (Embedded) processors 
In general, application specific hardware design is the most popular candidate to meet
cost, performance and power demands for VoIP applications. ASIC implementations can
be divided into the following three categories:
1.3.3.1 Configurable processors 
In high-speed communication system design, the simplest and most common architectures
use embedded 32-bit processors (such as ARM, MIPS, PowerPC etc.) or DSPs in either
discrete or integrated form. Though this provides a lot of flexibility and general
applicability, the processing of some software-based algorithms limits the system
performance to a great extent. In addition, DSPs may not be as attractive for
computationally expensive operations such as error correction algorithms or filters, where
hardware implementations tend to be more efficient. On the other hand, ASICs achieve
very high performance but require significant design cost and effort and offer no
flexibility [38, 39]. An alternative architecture that promises high performance,
extendibility, flexibility, code size and power dissipation reduction and also lower cost is
the configurable processor. Configurable processors can be modified and their ISAs
extended to target a specific application domain by changing the processor's feature set in
order to accelerate the critical parts of the algorithm. A processor can be configured in
three general ways:
• By altering the processor's predefined architectural framework such as cache
size, number of registers, multipliers or barrel shifters etc.
• By adding custom, high-performance interfaces and streaming memories
• By adding custom instruction extensions to optimally map to the target
application
Configurable processors are typically delivered as synthesizable RTL, ready to be
synthesized and integrated into an FPGA or SoC design. They usually come with vendor
tools, EDA synthesis scripts and verification environments to verify correct operation
on a target system [40]. Apart from the vector processor of the current work, examples of
configurable processors employed for audio processing include Tensilica's Xtensa [41],
a 32-bit microprocessor that can run any C or C++ program and allows execution units
to be added that implement instruction extensions to speed up the target
application. A pioneer in this field is ARC International with its ARC 700 family [42]
architecture with a 128-bit SIMD configuration. Other vendors include Silicon Hive [43]
with its UltraLong Instruction Word (ULIW) architecture.
1.3.3.2 Reconfigurable Processors 
Reconfigurable processors dynamically adapt their microarchitecture to address the
application requirements. This type of processor utilizes microcode and custom
configured hardware to improve performance. The microcode is used to perform both
the reconfiguration process and the execution of the code, and its frequently used parts are
located permanently in a fixed part of on-chip storage [44]. In the past, reconfigurable
architectures referred exclusively to the gate level (fine-grain), with every computation
being built up from Boolean gates. An example of a device that functions at
this level is the FPGA. An architecture can also be reconfigured at the
microarchitecture or architecture level. These levels of the computational hierarchy are
implemented by coarser basic computational units that are incorporated in FPGA devices.
The FPGA can contain hard (e.g. multipliers) or soft (e.g. components of a standard
library) macros to customize its functionality. The hard macro is a fixed ASIC core
embedded into the fabric of the FPGA, while the soft macro is a sequence of computations
implemented as fixed entities on the FPGA fabric [45]. Examples of reconfigurable
architectures used for multimedia applications at the microarchitecture level are the
PipeRench [46] and RaPiD [47] processors, whereas examples at the architecture level are
the RAW project [48] and Pleiades of Berkeley University [49]. Reconfigurable
architectures offer flexibility, the functional efficiency of hardware together with the
programmability of software, the logic capacity of programmable devices and advanced
automated design techniques.
1.3.3.3 Fixed Processors 
This category incorporates fixed-architecture processors typically integrated in an ASIC
infrastructure (buses, local memories, coprocessors). In order to achieve high
performance, modifications are usually made to either the C code or the assembly of
the application in order to take full advantage of the processor architecture. Examples of
fixed ASIC processors that realise speech codecs are the ARM9, which implements the
G.723.1/G.729AB codecs [50] or the G.729E codec [51] by using optimized ARM
assembly code. Another example is a low power DSP core that implements the G.723.1
codec within the H.324 standard [52].
1.4 Hardwired Architectures 
There are very few instances of research projects focused on the acceleration of the
G.723.1 and G.729 standards using configurable, extensible, vector architectures as
proposed in this work. An architecture for the hardware implementation of parts
of both codecs was proposed by Olausson and Liu [14]. Their paper briefly discusses
three hardware structures to accelerate conditional moves and branches before or after the
calculation of the 32-bit absolute value (L_abs) of the 6.3 kbit/s G.723.1. Another, more
focused approach was the hardware/software co-design of G.723.1 by Mishra and Balaram
[7]. In that work, parts of the codec (pitch estimator, formant perceptual weighting filter
and harmonic noise shaper) were implemented in hardware using a single MAC unit that
operates in parallel with a DSP processor which executes the rest of the algorithm.
Additionally, the normalisation operation is implemented in hardware.
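As an illustration of the kind of basic operation discussed above, the following minimal Python sketch models a saturating 32-bit absolute value in the spirit of L_abs. The function names, the Python modelling and the branch-free variant are assumptions made for illustration only; they are not the hardware structures proposed in [14].

```python
MAX_32 = 0x7FFFFFFF      # largest 32-bit signed value
MIN_32 = -0x80000000     # smallest 32-bit signed value


def l_abs(x):
    """Saturating 32-bit absolute value written with explicit branches."""
    if x == MIN_32:
        # Plain negation of -2^31 does not fit in 32 bits, so saturate.
        return MAX_32
    return -x if x < 0 else x


def l_abs_branch_free(x):
    """Same result using a sign mask and a clamp instead of branches.

    This mimics what a conditional-move or saturating unit could do;
    x is assumed to lie in the 32-bit signed range.
    """
    sign = x >> 31                 # -1 for negative inputs, 0 otherwise
    magnitude = (x ^ sign) - sign  # two's-complement negate when negative
    return min(magnitude, MAX_32)  # clamp the single overflowing case
```

Both variants map the most negative input to the largest positive value, which is the one case that ordinary two's-complement negation would get wrong.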
The hardware implementation of speech codecs is not common practice, as the C
reference codes have to be ported to VHDL, which is a tedious and time-consuming
task. Another problem is that the arithmetic logic, and especially the
multipliers, are very complex and their implementation in an FPGA requires many
CLBs (approximately 160 CLBs per multiplier on a Xilinx Virtex FPGA). Since the
codecs are typical DSP codes, their execution on DSP processors generally leads to much
better performance. ASICs, on the other hand, seem a better solution for multi-channel
codec implementation, but the integration of several DSP cores on an ASIC to offer
multiple-channel capabilities is a more effective and appealing solution [53].
1.5 Research contribution and overview 
The main objective of this work was to research and develop a configurable, extensible 
vector embedded CPU architecture for accelerating speech coding algorithms employed 
in VoIP networks. This research was funded by the Engineering and Physical Sciences 
Research Council (EPSRC) under grant GR/S44976/01. The contributions of this project
are outlined in this section. 
At the beginning of this research and in order to investigate the potential acceleration, 
both C reference codes were profiled to identify the computation workload distribution. 
This is described in section 4.3.1 of this thesis. The results showed that the most CPU-
intensive parts of the code were in the DSP emulation functions of the reference 
implementation. Further study of the code revealed that a significant number of the
basic operations appear in data-parallel loops. It was apparent that the creation of vector 
instructions that closely match these basic operations could lead to high performance. 
This is a major contribution of this work. 
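To make the notion of a basic operation inside a data-parallel loop concrete, the sketch below models a saturating fractional multiply-accumulate applied element by element. It is a minimal Python model written for illustration: the names l_mult, l_mac and mac_loop and the saturation behaviour shown are assumptions modelled on the usual fixed-point basic operations, not code taken from the reference implementations.

```python
MAX_32 = 0x7FFFFFFF
MIN_32 = -0x80000000


def saturate32(x):
    """Clamp an arbitrary integer to the 32-bit signed range."""
    return max(MIN_32, min(MAX_32, x))


def l_mult(a, b):
    """Fractional multiply of two 16-bit values: doubled product, saturated."""
    return saturate32((a * b) << 1)


def l_mac(acc, a, b):
    """Multiply-accumulate with saturation on the 32-bit accumulator."""
    return saturate32(acc + l_mult(a, b))


def mac_loop(acc, x, y):
    """Element-wise loop: iteration i reads and writes only element i,
    so all iterations are independent and could execute in parallel."""
    return [l_mac(acc[i], x[i], y[i]) for i in range(len(x))]
```

Because no iteration depends on the result of another, a loop of this shape is a natural candidate for a single vector instruction that applies the same operation to every element.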
The next task was to define the custom vector instructions and the data-level-parallel 
architecture of the vector coprocessor. Parallel exploitation is essential for the efficient 
execution of DSP codes. However, the reference implementations have to be fully 
vectorized in order to benefit from data-parallel processing which is the primary 
capability of the proposed vector architecture. The custom vector instructions were 
represented by C macros and were introduced into the C reference codes to implement the 
data-level-parallel inner loops. As speech coding algorithms consist of small loops or 
kernels that dominate overall processing time it was important to perform manual vector 
assembly coding and hand optimization of such tight loops [16]. In order to check the 
correct operation of the vectorized speech codecs after the vectorization of every loop the 
codes were verified against the ITU test vectors by comparing the output bitstream of the 
optimized code with the original one. The vectorization methodology is described in
sections 4.3.2 to 4.3.4, and the full vectorization of both the G.723.1 and G.729A speech
coders and decoders is another major contribution.
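The vectorization and verification steps described above can be sketched as follows. This is a minimal Python model under stated assumptions: vadd16 stands in for a hypothetical vector instruction that processes up to vlmax 16-bit elements at once, and the bit-exact comparison against the scalar loop stands in for the comparison of output bitstreams against the ITU test vectors. None of these names come from the thesis.

```python
MAX_16, MIN_16 = 32767, -32768


def add16(a, b):
    """Saturating 16-bit addition (scalar basic operation)."""
    return max(MIN_16, min(MAX_16, a + b))


def reference_loop(x, y):
    """Scalar reference: one basic operation per element."""
    return [add16(a, b) for a, b in zip(x, y)]


def vadd16(va, vb):
    """Hypothetical vector instruction: the same operation on every element
    of two equally long vector operands in a single step."""
    return [add16(a, b) for a, b in zip(va, vb)]


def vectorized_loop(x, y, vlmax):
    """Strip-mined loop: process the arrays in chunks of at most vlmax."""
    out = []
    for start in range(0, len(x), vlmax):
        stop = min(start + vlmax, len(x))   # last chunk may be shorter
        out.extend(vadd16(x[start:stop], y[start:stop]))
    return out


def bit_exact(x, y, vlmax):
    """True when the vectorized loop reproduces the reference exactly."""
    return vectorized_loop(x, y, vlmax) == reference_loop(x, y)
```

Checking bit-exactness after each loop is converted, rather than once at the end, keeps any mismatch local to the loop that was just changed.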
The remainder of the code that consists of the non-vectorizable loops and other parts of 
the code which contain basic DSP operations was optimized through the addition of 
custom scalar instructions. Again algorithmic equivalence between the optimized and the 
original (reference) code was established. The scalar optimization and the verification are 
presented in sections 4.3.5 and 4.3.6 respectively. In addition, both vector and scalar 
instructions are described in Chapter 5 and are listed in more detail in Appendix A. The 
joint scalar optimization and vectorization of the reference ITU-T codes is a third 
contribution of this work. 
The next step was to evaluate the performance of the vector architecture before it is 
implemented in hardware. For this purpose, the SimpleScalar toolset was used to evaluate 
the coprocessor architecture under study. The simulator was modified and extended to 
include the added state (coprocessor scalar and vector state) and the scalar and vector 
extensions. The extended instructions that were represented in C macros were replaced 
with inline assembly and executed on the simulator. The modifications of the 
SimpleScalar simulator are described in sections 4.3.7 and 4.3.8. Simulations were run for 
all ITU-T input vectors and for vector lengths of up to 128 16-bit elements. Results, in the 
form of relative dynamic instruction count, were taken for the vector only and for full 
optimization (scalar and vector) of both speech coding algorithms. These results show the
performance improvement which the instruction-accurate model of the vector
coprocessor achieves. The results are presented in section 4.4 and Appendix C.
Methodologies for the introduction of scalar and vector state and the addition of instructions
in the SimpleScalar infrastructure are another contribution of this work.
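The relative dynamic instruction count used as the metric above is simple arithmetic, sketched below in Python. The loop model and all numbers in the usage note are hypothetical and chosen only to show the calculation; they are not simulation results from this work.

```python
import math


def loop_instruction_count(n_elements, instr_per_iteration, elements_per_iteration=1):
    """Toy model: dynamic instructions executed by a loop over n_elements,
    where each iteration handles elements_per_iteration elements."""
    iterations = math.ceil(n_elements / elements_per_iteration)
    return iterations * instr_per_iteration


def relative_count(baseline, optimized):
    """Optimized dynamic instruction count as a fraction of the baseline."""
    return optimized / baseline


def reduction_percent(baseline, optimized):
    """Percentage reduction in dynamic instruction count."""
    return 100.0 * (baseline - optimized) / baseline
```

For example, a hypothetical loop over 240 elements costing 5 instructions per iteration executes 1200 instructions in scalar form and 75 when each iteration covers 16 elements, a relative count of 0.0625. A relative count of 0.2 corresponds to an 80% reduction.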
Another task of this project was the modelling in SystemC of the vector instruction set
extensions and their subsequent synthesis to low-level RTL in order to be introduced into the
multi-parallel, configurable SS_SPARC processor. This work was undertaken to study
faster routes to silicon for the SIMD extensions, compared to the established RTL flow;
it is presented in paper [54] and discussed in section 7.8. The SystemC model is the
behavioural description of the same vector instructions that were introduced in the speech
codecs. The "packing" of the SIMD ISA was verified by using the ITU test vectors to
validate its functionality. The statistical power analysis results
for both the SystemC-accelerator and the RTL-accelerator synthesis are presented in
section 7.8.3. This is a major contribution of this work as it compares the benefits of
synthesizing a configurable, extensible SIMD datapath with those of a highly optimized
RTL-based implementation.
The author's main contribution to the research project was the full design and
implementation of the proposed vector datapath of the vector processor. The vector
datapath was verified by using an FLI-based testbench and this process is described in
section 7.1. The vector processor was attached to the fifth stage (memory stage) of the
main Leon3 scalar processor. Modifications were made to the pipeline of the scalar core
and extra decode logic was added to accommodate the vector unit. The microarchitecture of
the vector datapath and its interfacing to the Leon3 are explained in Chapter 6. Finally,
statistical power analysis was performed for the vector datapath and the vector
coprocessor as a whole for different configurations (VLMAX, frequency) in order to
explore their effects on area/power/frequency results. These results, along with the layouts
of the vector datapath and vector processor, are presented in sections 7.4 to 7.7. This is the
final and major contribution of this work.
1.6 Thesis Outline 
The remainder of this thesis is organized as follows. In Chapter 2 a background section in 
speech coding is given describing the general models of speech representation, coding 
schemes and types of speech coders that exist. In addition, the characteristics and 
principles of the two ITU standards that are used in this project, namely the G.729A and
G.723.1 standards are presented. Chapter 3 gives an overview of parallelism including the 
limitations imposed from dependences and description of their types. Additionally the 
three different types of parallelism are introduced along with the appropriate processor 
architectures for their efficient exploitation. Emphasis is given to DLP which is the 
primary form of parallelism addressed in this project. This form of parallelism is most 
effectively exploited with vector architectures. Chapter 4 discusses the optimization 
methodology and the performance improvement achieved with the introduction of custom 
scalar and vector ISA extensions in both speech coding standards. Following that it 
presents the modifications made to the SimpleScalar instruction-set simulator to 
incorporate a large number of scalar and vector instruction extensions. Finally this 
chapter presents the performance benefits achieved via the introduction of the 
aforementioned instructions for different vector lengths and workloads. In Chapter 5 the 
vector coprocessor architectural state and programmer's model are presented followed by 
the introduction of the Leon processor and the overall system architecture. Chapter 6 
gives a detailed description of the pipeline organization and its constituent components. 
This is followed by a brief description of the VLSU which is part of another research 
work. The modifications to the Leon3 pipeline are then presented to enable the tight-
coupling of the vector coprocessor. Chapter 7 deals with the verification, synthesis and 
back-end flow of the vector datapath and vector processor as a whole. This is followed by 
the SystemC modelling and the parametric ESL implementation of the vector datapath. 
The latter was then inserted in the exposed vector engine of the SS_SPARC processor.
Chapter 7 also includes a detailed description of the SS_SPARC ASIC processor.
Finally this chapter presents the statistical power analysis results for both the SystemC 
and RTL-designed vector datapaths. The Conclusions chapter discusses suggestions for 
further research, potential applications and additions to this work. Appendix A includes 
the details of the vector processor instruction set. Each instruction is presented 
individually with its format, a short description of the instruction's operation and a 
software example. Appendix B includes the internal control and data signals and their 
combinations as used in the vector pipeline. Finally, the performance improvement results 
at function level of both speech codecs obtained from the first year's work are presented 
in Appendix C. 
1.7 References 
[1] Todd Wynia, "Laying the foundation for VoIP: A perspective on platforms,
protocols and technologies," in Embedded Computing Design, Spring 2001.
[2] Jim Doherty and Neil Anderson, Internet Phone Services Simplified (VoIP):
Cisco Press, 2006.
[3] M. Mineo, A. Niimura, H. Ooboshi, et al., "IP Telephony Terminal Solutions for
Broadband Networks," Hitachi Review, vol. 51, June 2002.
[4] Steven Cherry, "Nothing but Net," in IEEE Spectrum, vol. 44, January 2007, pp.
18-21.
[5] ITU-T Recommendation H.323, "Packet-based Multimedia communication
systems," 1998.
[6] Andrew S. Tanenbaum, Computer Networks, 4th ed.: Pearson Education
International, pp. 685-691, 2003.
[7] S. M. Mishra and A. Balaram, "Efficient hardware-software co-design for the
G.723.1 algorithm targeted at VoIP applications," in IEEE International
Conference on Multimedia and Expo, 2000, pp. 1379-1382.
[8] ITU-T Recommendation G.711, "General Aspects of Digital Transmission
Systems," 1989.
[9] ITU-T Recommendation G.726, "40, 32, 24, 16 kbit/s Adaptive Differential Pulse
Code Modulation (ADPCM)."
[10] ITU-T Recommendation G.728, "Coding of Speech at 16 kbit/s using Low-Delay
Code Excited Linear Prediction."
[11] ITU-T Recommendation G.729A, "Coding of speech at 8 kbit/s using conjugate-
structure algebraic-code-excited linear-prediction (CS-ACELP)," 3/96.
[12] ITU-T Recommendation G.723.1, "Dual Rate Speech coder for multimedia
communications transmitting at 5.3 and 6.3 kbit/s," 3/96.
[13] R. V. Cox and P. Kroon, "Low bit-rate speech coders for multimedia
communication," in IEEE Communications Magazine, vol. 34, December 1996,
pp. 34-41.
[14] M. Olausson and D. Liu, "Instruction and hardware accelerations in
G.723.1(6.3/5.3) and G.729," in the 1st IEEE International Symposium on Signal
Processing and Information Technology, 2001, pp. 34-39.
[15] V. A. Chouliaras, "Vector Coprocessor for Speech Coding: Case of Support,"
Engineering and Physical Sciences Research Council (EPSRC) - GR/S44976/01,
Loughborough University 2002.
[16] K. Diefendorff and P. Dubey, "How Multimedia Workloads Will Change
Processor Design," in IEEE Computer, vol. 30, September 1997, pp. 43-45.
[17] Francisca Quintana, Roger Espasa, and Mateo Valero, "A Case for Merging the
ILP and DLP Paradigms," in 6th Euromicro Workshop on Parallel and
Distributed Processing, Madrid, Spain, 1998, pp. 217-224.
[18] William C. Hardy, VoIP Service Quality: Measuring and Evaluating Packet-
Switched Voice: McGraw-Hill Networking, 2003.
[19] J. Dionne and B. Davis, "Embedded VoIP implementations using SIP," in EE
Times Asia, 16 September 2004,
www.eetasia.com/ART_8800346844_499491_TA-e55eI221.HTM.
[20] Andy Bateman, Digital Communications: Design for the real world: Addison-
Wesley, 1999.
[21] Henry Sinnreich and Alan B. Johnston, Internet Communications Using SIP:
Delivering VoIP and Multimedia Services with Session Initiation Protocol,
Second ed.: Wiley, 2006.
[22] A. M. Kondoz, "Digital Speech: Coding for Low Bit Rate Communications
Systems," John Wiley & Sons, 1994, pp. 117-123.
[23] T. M. Conte, P. K. Dubey, M. D. Jennings, et al., "Challenges to Combining
General-Purpose and Multimedia Processors," in IEEE Computer, vol. 30,
December 1997, pp. 33-37.
[24] A. Peleg and U. Weiser, "MMX Technology Extension to the Intel Architecture,"
in IEEE Micro, vol. 16, August 1996, pp. 42-50.
[25] K. Diefendorff, "Pentium III = Pentium II + SSE: Internet SSE Architecture
Boosts Multimedia Performance," in Microprocessor Report, vol. 13, March
1999.
[26] Marc Tremblay, J. Michael O'Connor, Venkatesh Narayanan, et al., "VIS Speeds
New Media Processing," in IEEE Micro, vol. 16, August 1996, pp. 10-20.
[27] K. Diefendorff, P. K. Dubey, R. Hochsprung, et al., "AltiVec Extension to
PowerPC Accelerates Media Processing," in IEEE Micro, vol. 20, March 2000,
pp. 85-95.
[28] "MIPS Digital Media Extension," Instruction Set Architecture Specification,
http://www.mips/MDMXspec.pc, October 1997.
[29] R. B. Lee, "Subword Parallelism with MAX-2," in IEEE Micro, vol. 16, August
1996, pp. 51-59.
[30] J. H. Moreno, V. Zyuban, U. Shvadron, et al., "An innovative low-power high-
performance programmable signal processor for digital communications," IBM
Journal of Research and Development, vol. 47, pp. 299-326, 2003.
[31] A. Z. R. Langi, "Rapid development of a real-time speech coder on a
TMS320C54x DSP," in Proceedings of the IEEE Canadian Conference on
Electrical and Computer Engineering, 2002, pp. 1045-1048.
[32] Y. Choi, C. Ahn, and T. Kang, "Implementation of a Multi-channel G.723.1
Annex A using DSP," in International Conference on Consumer Electronics
(ICCE), 2002, pp. 320-321.
[33] Y. Huang, Y. Juan, S. Zhang, et al., "Implementation of ITU-T G.723.1 Dual
Rate Speech Codec based on TMS320C601 DSP," in the Proceedings of the 5th
International Conference on Signal Processing (ICSP), Beijing, China, August
2005.
[34] R. Ungureanu, B. Costinescu, and C. Ilas, "ITU-T G.729A Implementation on
StarCore SC140," Application Note, Motorola 2001.
[35] S. Lee, S. Park, and Y. Jang, "Cost-effective implementation of ITU-T G.723.1
on a DSP chip," in Proceedings of 1997 IEEE International Symposium on
Consumer Electronics, December 1997, pp. 31-34.
[36] M. Schlett, "The RISC challenge in signal processing," in Proceedings of the 3rd
IEEE International Conference on Electronics, Circuits, and Systems,
October 1996, pp. 550-553.
[37] H. Shi, "RISC+SIMD=DSP," in Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), June 2000,
pp. 3211-3214.
[38] A. Wang, E. Killian, D. Maydan, et al., "Hardware/software instruction set
configurability for system-on-chip processors," in Proceedings of the 38th IEEE
conference on Design automation, Las Vegas, United States, 2001, pp. 184-188.
[39] S. Leibson and J. Kim, "Configurable processors: a new era in chip design," in
IEEE Computer, vol. 38, July 2005, pp. 51-59.
[40] David Fritz, "Configurable Processors: Ready for Prime Time," in RTC,
http://www.rtcmagazine.com/home/article.php?id=100066, January 2004.
[41] R. E. Gonzalez, "Xtensa: A configurable and extensible processor," in IEEE
Micro, March/April 2000, pp. 60-70.
[42] "ARC Cores Ltd, www.arc.com/subsystems ...
[43] Tom R. Halfhill, "Silicon Hive breaks out," in Microprocessor Report, December
2003.
[44] G. Kuzmanov, G. Gaydadjiev, and S. Vassiliadis, "The MOLEN processor
prototype," in Proceedings of the 12th Annual IEEE Symposium on Field-
Programmable Custom Computing Machines, 2004, pp. 296-299.
[45] R. Kastner, A. Kaplan, S. Ogrenci Memik, et al., "Instruction Generation for
Hybrid Reconfigurable Systems," ACM Transactions on Design Automation of
Electronics Systems, vol. 7, pp. 605-627, October 2002.
[46] Y. Chou, P. Pillai, H. Schmit, et al., "PipeRench Implementation of the
Instruction Path Coprocessor," in Proceedings of the 33rd annual ACM/IEEE
international symposium on Microarchitecture, Monterey, California, 2000, pp.
147-158.
[47] C. Ebeling, D. C. Cronquist, and P. Franklin, "RaPiD-reconfigurable pipelined
datapath," in Proceedings of the 6th International Workshop on Field-
Programmable Logic, Smart Applications, New Paradigms and Compilers, 1996,
pp. 126-135.
[48] M. B. Taylor, J. Kim, J. Miller, et al., "The Raw Microprocessor: A
Computational Fabric for Software Circuits and General-Purpose Programs," in
IEEE Micro, vol. 22, March 2002, pp. 25-35.
[49] M. Wan, H. Zhang, V. George, et al., "Design Methodology of a Low-Energy
Reconfigurable Single-Chip DSP System," Journal of VLSI Signal Processing
Systems, vol. 28, pp. 47-61, May 2001.
[50] Y. Choi and G. Lee, "Real-time implementation of G.723.1/G.729AB on a
RISC processor for personal IP telephony devices," in Proceedings of the 9th
International Symposium on Consumer Electronics (ISCE), South Korea, 2005,
pp. 20-24.
[51] A. Tripathi, S. Verma, and D. D. Gajski, "G.729E Algorithm Optimization for
ARM926EJ-S Processor," University of California, Irvine 2003.
[52] H. Okuhata, M. H. Miki, T. Onoye, et al., "A low-power DSP core architecture
for low bitrate speech codec," in Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal Processing, Seattle, USA, May
1998, pp. 3121-3124.
[53] C. Plessl and S. Maurer, "Hardware/Software Codesign in speech compression
applications," in Institut für Technische Informatik und Kommunikationsnetze
Zurich: Eidgenössische Technische Hochschule, February 2000.
[54] V. A. Chouliaras, K. Koutsomyti, T. Jacobs, et al., "SystemC-defined SIMD
instructions for high performance SoC architectures," in 13th IEEE International
Conference on Electronics, Circuits and Systems, Nice, France, December 2006,
pp. 822-825.
CHAPTER 2 
SPEECH CODING THEORY 
2.1 Introduction 
As already identified, there is a major trend toward integrating voice-related applications
in the context of multimedia applications such as VoIP networks, simultaneous voice and
data (DSVD) applications, speech recognition, videoconferencing and so on [1]. This is
consistent with the growing demand for wireless and satellite communications, which
require enhanced privacy and high bandwidth. To meet these needs, the speech signal is
transformed to digital format in order to be processed, stored and transmitted efficiently
under software control. Digital speech offers flexibility and supports
encryption/decryption and error correction; however, it requires high transmission
bandwidth and storage capacity. To reduce these requirements, speech coding or speech
compression has emerged as the research field concerned with efficient digital
representations of voice signals for high-quality speech at low data rates [2]. Even though
the sampling rate cannot be lower than twice the bandwidth of analog speech, over the past
decades several methods have been proposed to represent the sampled waveform with a
minimum number of bits while preserving its perceptual quality. These methods have
been adopted in a number of speech coder standards that are based on an optimum
tradeoff between efficient low-bit-rate transmission, perceptual quality for the available
bandwidth and a combination of other objectives according to the requirements of each
application [2] [1]. In the next sections, a brief description of the speech coding
objectives and requirements will be given along with the speech production system. In
addition, the main coding strategies will be introduced and the two ITU standards used in
this research will be presented.
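The bandwidth saving that motivates speech coding can be illustrated with a short calculation. The Python sketch below assumes narrowband speech sampled at 8 kHz with 16-bit linear samples, which is an assumption made here for illustration; the coded bit rates are the nominal rates quoted in the titles of [11] and [12].

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate of an uncompressed sampled waveform in bit/s."""
    return sample_rate_hz * bits_per_sample


def compression_ratio(uncompressed_bps, coded_bps):
    """How many times smaller the coded stream is than the raw one."""
    return uncompressed_bps / coded_bps


# Assumed input format: 8 kHz sampling, 16-bit linear samples.
linear_pcm = pcm_bit_rate(8000, 16)

# Nominal coded bit rates in bit/s.
CODER_RATES = {
    "G.729A": 8000,
    "G.723.1 (6.3 kbit/s)": 6300,
    "G.723.1 (5.3 kbit/s)": 5300,
}

ratios = {name: compression_ratio(linear_pcm, bps)
          for name, bps in CODER_RATES.items()}
```

Under these assumptions the uncompressed stream runs at 128 kbit/s, so the 8 kbit/s coder reduces the bit rate by a factor of 16 and the two G.723.1 rates by factors of roughly 20 and 24.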
2.2 Speech Coding Objectives and Requirements 
There are several objectives and requirements that a speech coder must meet for specific
target applications. These requirements define the basic bit rate, perceptual speech quality,
algorithmic complexity, cost and system delay of the selected speech codec. These
factors therefore require careful consideration in order to converge towards an
optimum compromise between often conflicting objectives. Speech quality and bit
rate are two factors that directly conflict with each other: the lower the bit rate of the
speech coder, the higher the signal compression and the greater the speech quality
degradation. The Public Switched Telephone Network (PSTN) and associated systems such as
CCITT require high quality of encoding, usually referred to as 'toll quality'. For private
commercial networks and military systems, the quality requirement may be relaxed to lower
processing and bandwidth requirements. Although absolute quality is often specified,
it is sometimes compromised for a lower standard if other factors are given a higher
overall rating. In general, in a mobile radio system it is the overall average quality that is
the deciding factor, taking into account both good and bad transmission conditions.
Other important factors in the choice of a speech coding algorithm are the coding delay,
the immunity to errors, the algorithm complexity and the implementation cost [3].
Coding delay includes algorithmic (the buffering of speech for analysis), computational
(the time taken to process the stored speech samples) and transmission contributions. Only
the first two concern the speech coding subsystem, though sometimes transmission can
be initiated before the algorithm has completed processing all the information in the
analysis frame. In this case, the encoder starts transmission of the spectral parameters as
soon as they become available. Low delay is essential if the major issue of echo is to be
minimised. For mobile system applications and satellite communication systems, echo
cancellation is already included, as substantial propagation delays exist. In the PSTN, where
the delay is very small, extra echo cancellers will be required if coders with long delays
are introduced [3]. The other problem with delay is the subjective annoyance factor.
Therefore, all the standardised speech coders have specific requirements for the delay. As
is known [4], the speech coding bandwidth occupies only a small fraction of the total
channel capacity; the rest is used for Forward Error Correction (FEC) and signalling. For
mobile connections, which suffer from both random and burst errors, a coding scheme's
built-in tolerance to channel errors is essential for acceptable communication quality. By
utilising built-in robustness, less FEC can be used, resulting in higher source coding
capacity. Achieving this trade-off between quality and robustness is difficult and it is
considered from the beginning of the speech coding algorithm design. In order to achieve
a good average overall performance, more sophisticated algorithms are created with
increased computational complexity. Therefore, the real-time implementation of such
algorithms under the additional constraints of size and power consumption is a major
issue and several techniques are employed to balance the multiple conflicting objectives
[4]. Before we describe the basic model of a vocoder and the various methods that exist,
we need to describe the principles of the speech production model.
2.3 Speech production system 
The diagram of the main organs of the human anatomy involved in the speech production
mechanism is shown in Figure 2-1. The compressed air forced from the lungs to the vocal
apparatus pushes apart the vocal cords and creates an opening known as the glottis. When
the air passes through the glottis the pressure decreases and the opening closes. The
repetition of this process causes the vibration of the vocal cords, and a high-energy quasi-
periodic speech waveform is produced and sent into the mouth and nose cavities. The
excitation of the vocal cords is filtered through the vocal apparatus, which operates like a
spectral shaping filter with a transfer function that represents the spectral shaping action
of the glottis, vocal tract (pharynx and mouth cavity), lip radiation characteristics and so
forth [4] [5].
Figure 2-1: Diagram of the human organs involved in speech production and the spectral
range of speech
The excitation of the vocal apparatus with glottal vibrations generates voiced sounds, and
the fundamental frequency of vibration is known as the pitch frequency. Unvoiced sounds,
such as whisper or aspirate, are lower-energy signals, as the vocal cords do not participate
and the excitation behaves like a noise generator. These sounds are produced by
deliberately constricting the air flow through the mouth. Constrictions can be produced by the
tongue, the position of the velum, the coupling of the vocal tract with the nasal cavity, the
teeth and the lips [4] [5].
Speech can be classified as voiced (e.g. /a/, /k/, etc.), unvoiced (e.g. /sh/, /h/, etc.) or mixed.
As mentioned above, voiced speech is quasi-periodic in the time domain while unvoiced
speech is random-like. The pitch period, which is identified by the positions of the largest
peaks of the quasi-periodic segments of the voiced signals, consists of approximately 80
samples [2]. The pitch frequency, a term used interchangeably with pitch period, typically
ranges between 40 and 120 Hz for male speakers, whereas for female speakers it is much
higher and ranges between 300 and 400 Hz [3]. In the frequency domain, voiced speech is
harmonically structured and its spectrum is characterized by its fine and formant
structure. The fine harmonic structure, also known as long-term correlation, is attributed
to the vibrating vocal cords. The formant structure, also called spectral envelope or short-term
correlation, is attributed to the interaction of the excitation and the vocal tract and is
characterized by a set of widened but distinctive spectral needles (peaks) that are multiples
of the pitch period and are called formants. Typically, for an average vocal tract, three to
five spectral envelope peaks can be observed, which usually appear around 500 Hz,
1500 Hz and 2700 Hz and represent the resonances of the vocal tract. The amplitudes and
locations of the first three formants are vital for speech synthesis and perception [2].
In contrast, unvoiced speech does not have a formant structure and exhibits a more
high-pass nature with a peak around 2500 Hz. In addition, the energy of unvoiced speech is
generally lower than that of voiced speech [4].
Figure 2-2: General speech production model (excitation model E(z), spectral shaping
filter H(z), radiation model R(z), output speech S(z))
Speech reproduction is based on the extraction of the key information of the speech
signal [5]. The model of the speech production process is based on digital techniques and
a simplified block diagram is shown in Figure 2-2. In this model, the input is the
excitation signal, which is generally approximated by an impulse sequence for voiced
speech or random noise for unvoiced speech. The excitation signal, denoted by E(z), is
filtered through a time-varying linear digital filter that represents the combined spectral
contributions of the glottis, vocal tract and lip radiation characteristics. The filter has a
transfer function H(z) that can be approximated by an all-pole model and whose
coefficients directly depend on the time-varying geometry of the vocal apparatus. This
speech production model can produce high quality synthetic speech if the underlying
model parameters, speech power spectral envelope and the excitation model are
appropriately chosen [3] [5]. Even though the process of speech production is known, the
perception of speech by the human auditory system remains a puzzle. It is still
unexplained how the recognition between voiced and unvoiced sounds takes place, or how
the auditory system locates the position of a sound source (binaural hearing) or separates a
specific voice from a noisy background (cocktail party effect) [2]. Hence there is ongoing
research in all these areas.
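As an illustration of the source-filter model of Figure 2-2, the following Python sketch passes an impulse-train (voiced) or random-noise (unvoiced) excitation through an all-pole filter. It is a minimal sketch under stated assumptions: the function names, the second-order coefficient set A and the 80-sample pitch period are chosen for illustration only and are not taken from any standard, and the radiation model R(z) is omitted.

```python
import random


def synthesize(excitation, a):
    """All-pole filter H(z) = 1 / (1 + a[0] z^-1 + ... + a[p-1] z^-p).

    Implements s[n] = e[n] - sum_k a[k-1] * s[n-k] with zero initial state.
    """
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * out[n - k]
        out.append(acc)
    return out


def voiced_excitation(num_samples, pitch_period):
    """Impulse train: one unit pulse every pitch_period samples."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]


def unvoiced_excitation(num_samples, seed=0):
    """Zero-mean Gaussian noise (seeded so the example is repeatable)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]


# Hypothetical, stable second-order filter with a single resonance
# (pole radius 0.9); a real vocal tract model would use a higher order.
A = [-1.3, 0.81]

voiced = synthesize(voiced_excitation(160, 80), A)
unvoiced = synthesize(unvoiced_excitation(160), A)
```

The same filter shapes both excitations, so only the excitation type and the coefficient set would need to be conveyed to reproduce either sound class.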
2.4 Coding strategies 
Speech coding schemes can be broadly divided into three main categories: Waveform 
coders, Hybrid coders and Vocoders. The general operations that these coding schemes 
perform are to analyse the signal, remove the redundancy and efficiently code its non-
redundant parts in order to preserve its perceptual quality. These coding schemes are 
classified based on their encoding methodology and each has optimal operation within a 
certain bitrate region [3].
2.4.1 Waveform Coders 
Waveform coders are signal independent as they do not exploit any specific properties of
speech. They are designed to work with any input signal that is appropriately limited in
amplitude and bandwidth. This has the advantage that waveform coders can also encode
other types of information such as signalling tones, voice-band data, or even music. The
price of this generality is that their coding efficiency is quite modest and limited to rates
above 16kbit/s [4]. However, they are still popular due to their simplicity and ease of
implementation. Waveform coders are further divided into time-domain and frequency-domain
coders. The best known time-domain representatives are the first speech
encoding standard, 64kbit/s Pulse Code Modulation (PCM), the 32kbit/s Adaptive
Differential PCM (ADPCM) that has been standardised in ITU Recommendation
G.721, and Adaptive Delta Modulation (ADM) [3]. Time-domain coders utilise the
redundancy in the speech waveform by exploiting the correlations between adjacent 
samples and encode only the difference between them. In addition, they use predictors at 
the receiver end to reduce the variance of the encoded signal and consequently the 
number of bits needed to represent it. Frequency-domain waveform coders exploit the 
redundancy of the signal in the transform domain. The signal is split into a number of 
sub-bands and each sub-band is encoded by using a different number of bits. The various 
methods differ in the way they represent the short-time power spectrum of speech and 
also in the perceptual properties of the human ear. The most well known frequency-
domain coders are the Sub-Band Coding (SBC) and the Adaptive Transform Coding 
(ATC) [4]. 
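The idea of encoding only the difference between adjacent samples can be sketched with a first-order differential PCM loop in Python. This is an illustrative sketch and not the G.721 ADPCM algorithm: the predictor is simply the previous reconstructed sample, the step size is fixed rather than adaptive, and the function names and sample values are hypothetical.

```python
def dpcm_encode(samples, step):
    """Quantise the difference between each sample and its prediction.

    The prediction is the previous reconstructed sample, so the encoder
    tracks exactly what the decoder will rebuild.
    """
    codes = []
    prediction = 0
    for x in samples:
        code = round((x - prediction) / step)  # uniform quantiser
        codes.append(code)
        prediction += code * step              # decoder-side reconstruction
    return codes


def dpcm_decode(codes, step):
    """Rebuild the waveform by accumulating the dequantised differences."""
    out = []
    prediction = 0
    for code in codes:
        prediction += code * step
        out.append(prediction)
    return out


samples = [0, 12, 25, 33, 38, 36, 30, 21]
codes = dpcm_encode(samples, step=4)
decoded = dpcm_decode(codes, step=4)
```

Because adjacent samples are correlated, the difference codes span a much smaller range than the samples themselves, which is where the bit saving comes from; the reconstruction error stays within half a quantiser step.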
2.4.2 Voice Coders (Vocoders) 
Vocoders lie at the opposite end of waveform coders. They deal with speech-specific 
signals and in particular the physical principles behind speech and as such they do not 
attempt to reproduce the input waveform [2]. Hence, the performance of vocoders 
degrades significantly for nonspeech signals. The design and implementation of a
vocoder is based on the speech production model that was described in section 2.3. This
model represents the human speech production mechanism and specifies the basic
parameters that need to be extracted from the input speech signal in order to reproduce it as
faithfully as possible [4]. Vocoders traditionally operate at rates below 4.8kbit/s, which is
their main advantage; however, the produced speech often sounds crude and synthetic.
The preservation of the speech power spectral envelope and the preservation of the 
voicing information are the two factors that vocoder engineers use when designing speech 
codecs. These can then be used to re-synthesise speech sounds [3]. A vocoder consists of 
two parts: analysis and synthesis. The analysis takes place in the encoder where the 
parameters that describe the vocal excitation and the vocal transmission are extracted 
from the speech signal. At the decoder the received information is utilised to synthesize 
the signal that sounds like the original speech. The concepts that are associated with the 
vocoders were introduced as early as 1939. These concepts incorporate the two-state 
excitation (pulse/noise), voicing and pitch detection, and filter-bank representation. The 
simple excitation model is associated with very low bit-rates but at the same time it is
responsible for the synthetic quality of speech that is one of the main disadvantages of
vocoders. In addition, the parameters that describe the spectral envelope need
reliable envelope estimators. Estimators based on linear prediction and homomorphic
signal processing were developed around 1960 and this challenging area provoked further
research and spawned the development of several methods with improved quality and
increased complexity. The channel vocoder is one of the first vocoding systems. It uses a
bank of band-pass filters (typically 16 channels) to represent the speech spectrum. The
two-state excitation is utilised: if the speech is voiced the fine structure is represented using
pitch-periodic pulse-like waves, while if it is unvoiced it is reproduced using noise-like
excitation. Even though the resulting speech is intelligible, the quality is quite synthetic.
Formant vocoders use a similar method to the channel vocoders but the
representation of the spectrum needs only the frequencies and the spectral amplitudes of
the formants. As a result, they achieve further bandwidth savings. Another category is the
homomorphic vocoders, which are based on the idea that the convolution of the vocal tract
impulse response and the vocal excitation can represent the speech log-magnitude
spectrum. The output speech has good quality and by applying predictive encoding the
transmission rate can be reduced to 4kbit/s. In general, frequency-domain vocoders are
more robust to channel errors and background noise but have low, synthetic speech
quality. Time-domain vocoders such as the Linear-Predictive vocoders, however, produce
highly intelligible speech, making them one of the most popular techniques for speech
coding, but they are very sensitive to channel errors and noise [2].
2.4.3 Hybrid Coders 
Hybrid coders fill the gap for coding rates between 4.8-16kbit/s by incorporating the
advantages of both vocoders and waveform coders in order to provide acceptable and
natural speech at lower bit rates [4]. Like the vocoders, these codecs model the spectral
properties of speech and exploit the perceptual properties of the ear for the minimal
representation of the voice signal. Like the waveform coders, hybrid codecs produce a more
faithful waveform representation and, as a result, more robust and better quality speech [2].
Hybrid coders are broadly divided into two main categories: frequency domain and time
domain. The frequency domain coders divide the speech spectrum into frequency bands
or components by using a filter bank or block transform respectively. These coders are
based on the assumption that the signal is slowly time-varying. Hence a short-time
segment of the input signal can be modelled with a short-time spectrum. The most
commonly known coding schemes in this category are Sub-band Coding (SBC) and
Adaptive Transform Coding (ATC), which operate at bit rates between 9.6 and 16kbit/s.
Another frequency-domain codec is the Multi-band Excited Vocoder (MBEV). With
effective pitch modelling, this codec can produce good quality speech for bit rates as low
as 4.8kbit/s. A lower bit rate can be achieved by using a modified version of MBEV that
represents the harmonic magnitudes by an LPC filter [3].
Time-domain hybrid coders are very similar to the Linear-Predictive coders, except that a
portion of the original signal is transmitted instead of pitch and voicing information. They
employ the speech source model described in section 2.3, in which speech production is
represented by a linear time-varying filter excited by a periodic pulse train for voiced
speech or random noise for unvoiced speech. Though there are several forms of time-domain
hybrid coders, the most successful and commonly used are time-domain Analysis-by-Synthesis
(AbS) codecs. Examples of AbS codecs are the Residual Excited Linear
Prediction (RELP), the Code-Excited Linear Prediction (CELP), the Voice Excited Linear
Prediction (VELP) and the Multipulse Excited Linear Prediction (MELP) coders [3].
2.4.3.1 Analysis by Synthesis 
Analysis-by-Synthesis speech coders have been widely adopted as they produce good 
quality speech while maintaining a low bit-rate (between 4.8-16kbit/s) at the cost of high
computational complexity [4]. In the AbS approach, the encoder (analysis) incorporates 
the decoder (synthesis) to determine the excitation signal and uses linear prediction 
techniques to calculate the coefficients of the speech synthesis filter. The basic structure 
of an AbS-LPC coding system is depicted in Figure 2-3. There are three main sub-blocks 
in the model that are used to obtain a good synthesised speech signal [3]. 
• Time-varying filter (synthesis filter) 
• Excitation generator 
• Perceptually based minimisation procedure 
In the analysis procedure, the input speech is partitioned into blocks of samples (frames) 
whose length and update rate determines the bit rate of the coding scheme [4]. The 
decoded speech is produced by filtering the signal produced by the excitation generator 
through both a long-term (pitch synthesis) filter and a short-term (LPC synthesis) filter. 
The excitation signal is found by minimising the mean-squared error over a block of 
samples. The error signal is the difference between the original and decoded signals and it 
is perceptually weighted by a weighting filter. In the end, the quantized filter parameters 
and the vector quantized excitation are transmitted to the decoder. As shown in Figure 2-3,
the decoder uses a structure identical to that of the encoder, where the synthesized speech is
generated by filtering the decoded excitation signal through the synthesis filter. The long-term
predictor filter models the long-term correlation (spectral fine structure) in the
speech signal and its coefficients are adapted at rates varying from 100-200 times/s. An
alternative structure for the pitch filter is the adaptive codebook in which the filter is 
replaced by a codebook that contains the previous excitation at different delays. The 
resulting vectors are searched and the one that best matches is selected and scaled with an 
optimal scaling factor. The short-term synthesis filter models the short-term correlation 
(spectral envelope) in the speech signal. This is an all-pole filter with an order between 8 
and 16 and its coefficients are determined using linear prediction techniques for each 
frame. 
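The closed-loop principle described above can be sketched in Python as an exhaustive codebook search: each candidate excitation is passed through the synthesis filter, scaled by its optimal gain, and the candidate with the smallest squared error against the target is kept. This is a simplified sketch under stated assumptions: the perceptual weighting filter, the long-term predictor and all quantisation are omitted, and the function names, the three-entry codebook and the first-order filter are hypothetical.

```python
def synthesis_filter(excitation, a):
    """All-pole filter: s[n] = e[n] - sum_k a[k-1] * s[n-k], zero initial state."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * out[n - k]
        out.append(acc)
    return out


def abs_search(target, codebook, a):
    """Return (index, gain, error) of the codebook entry that best matches target."""
    best = None
    for index, code in enumerate(codebook):
        synth = synthesis_filter(code, a)
        energy = sum(v * v for v in synth)
        if energy == 0.0:
            continue  # an all-zero candidate cannot match anything
        # Gain that minimises the squared error for this candidate.
        gain = sum(t * v for t, v in zip(target, synth)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, synth))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


# Hypothetical data: the target is entry 2 of the codebook, scaled by 2.
A = [-0.5]
CODEBOOK = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, -1, 0]]
target = [2.0 * v for v in synthesis_filter(CODEBOOK[2], A)]
index, gain, error = abs_search(target, CODEBOOK, A)
```

Only the winning index and gain need to be transmitted, because the decoder holds the same codebook and synthesis filter. The cost is that every candidate is synthesised in the encoder, which is the source of the high computational complexity noted above.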
Figure 2-3: Analysis by Synthesis Codec (encoder: excitation generation, synthesis filter,
error weighting and error minimisation; decoder: excitation generation and synthesis filter
producing the reproduced speech)
The synthetic speech is generated in both the encoder and the decoder so that both ends
hold identical conditions in their filter memories. In this way, all the parts of the codec
remain synchronised without the need to transmit the filter memory contents.
Preserving identical conditions at both ends is one of the biggest challenges, as this
type of codec is very sensitive to channel errors [3]. Another important factor is the
representation of the excitation signal of the time-varying filter. Three main excitation 
models for Analysis-by-Synthesis Linear Predictive Coding (AbS-LPC) are the multi-
pulse model, the regular pulse excitation model and the vector or code excitation model 
[2]. 
The International Telecommunication Union (ITU) has created a number of speech
coding standards for different voice qualities and bandwidth requirements. All current
low-rate speech coders are based on AbS-LPC coding. In the following sections the two
ITU standards studied in this research, G.729A and G.723.1, will be presented.
2.5 G.729A Speech Coding Standard 
The G.729A [6] speech coding standard is a reduced complexity version of the Conjugate-Structure
Algebraic-Code-Excited Linear-Prediction (CS-ACELP) coder of the ITU
G.729 recommendation [7]. It is designed for multimedia digital simultaneous voice and
data (DSVD) applications, though its use is not limited to these areas. G.729A grew in 1995
from the need for low complexity (around 10 MIPS) speech codecs with speech quality
equivalent to G.726 at 32kbit/s and an operating bit rate of 11.4kbit/s or lower.
G.729A produces high quality speech (almost toll quality), in most conditions equivalent
to G.726 at 32kbit/s, at a low bit rate of 8kbit/s. The complexity of this algorithm is
typically 11 MIPS, which is 50% less than that of G.729 (22 MIPS), with a small
degradation in performance in the case of three tandems and in the presence of
background noise [4]. The G.729A has a 5ms look-ahead, 10 ms processing delay and 10 ms
transmission delay; together with the 10 ms frame, the overall one-way system delay is
35ms. The amount of RAM that is required is 3000 words [8]. This coder belongs to the
time-domain Analysis-by-Synthesis class of speech coders. The encoder and the decoder
dataflows of G.729A are depicted in Figure 2-4 and Figure 2-5 respectively [9].
Figure 2-4: G.729A Encoder
The excitation for the synthesis filter is obtained by combining the outputs of two
codebooks based on the analysis-by-synthesis search procedure. An adaptive codebook is
used to model the long-term periodicities which represent the pitch (fine) structure of
voiced speech, while a fixed codebook models the random noise-like unvoiced sounds
such as nasal or plosive utterances. The excitation signal is then applied to a tenth-order
synthesis filter whose transfer function models the human vocal tract. The residual error
between the reconstructed speech produced by the synthesis filter and the original input
speech is processed by a perceptual weighting filter in order to produce the perceptually
weighted error. The minimization of this error determines the adaptive codebook index
and gain for the optimum excitation sequence. The closed-loop search of the fixed
codebook is implemented by using an algebraic codebook that simplifies the
determination of the codebook parameters and makes real-time operation possible. The
indices and gains for both codebooks are assembled together with the synthesis filter
coefficients to form the bitstream transmitted to the decoder. This entire process is
repeated for every 10ms frame of the speech signal [7].
Figure 2-5: G.729A Decoder
At the decoder the received bitstream is used to extract and decode the encoder
parameters corresponding to a 10 ms speech frame. These parameters give the synthesis
filter coefficients and select the entries for the adaptive and fixed codebooks to represent
the excitation to this filter. The excitation is constructed by adding the adaptive and
fixed-codebook vectors scaled by their respective gains. The excitation is filtered afterwards by
the synthesis filter and the speech is reconstructed. Additional post-processing of the
reconstructed speech signal is performed to enhance its perceptual quality [7] [10].
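The construction of the decoder excitation can be sketched in Python as follows. This is a simplified sketch under stated assumptions: only integer pitch delays are handled, the gains and vectors are made-up values rather than decoded bitstream parameters, and the function names are hypothetical. The adaptive codebook vector is read from the past excitation at the given delay and added to the fixed codebook vector, each scaled by its gain.

```python
def adaptive_vector(past_excitation, delay, length):
    """Copy `length` samples starting `delay` samples back in the excitation history.

    When delay < length, samples generated in this call are reused, which
    repeats the last pitch cycle periodically.
    """
    if not 1 <= delay <= len(past_excitation):
        raise ValueError("delay must lie within the stored excitation history")
    buf = list(past_excitation)
    start = len(buf)
    for _ in range(length):
        buf.append(buf[len(buf) - delay])
    return buf[start:]


def build_excitation(past_excitation, delay, pitch_gain, fixed_vector, fixed_gain):
    """Excitation = pitch_gain * adaptive vector + fixed_gain * fixed vector."""
    adaptive = adaptive_vector(past_excitation, delay, len(fixed_vector))
    return [pitch_gain * v + fixed_gain * c for v, c in zip(adaptive, fixed_vector)]


# Hypothetical values for illustration only.
history = [1.0, 2.0, 3.0, 4.0]
fixed = [1.0, 0.0, 0.0, -1.0, 0.0]   # sparse, signed pulses
excitation = build_excitation(history, delay=2, pitch_gain=0.5,
                              fixed_vector=fixed, fixed_gain=2.0)
```

In a complete decoder the resulting excitation would be appended to the history, so that it can serve as adaptive codebook content for the following subframe, and then passed through the synthesis filter.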
Most of the G.729A codec is identical to G.729, with changes to the following parts of the
codec in order to reduce complexity:
• The perceptual weighting filter uses a more traditional error weighting filter.
• The open-loop search for the pitch delay uses only the even samples of the weighted
input for the calculation of the autocorrelation function.
• The closed-loop pitch search is achieved by maximizing a simpler
(approximated) term than in G.729, which causes some degradation as the chosen
adaptive codebook delay differs by 1/3 from the one chosen in G.729.
• The algebraic codebook search is simplified by searching only 640 codebook
entries per frame compared to 2880 codebook entries in G.729, using a depth-first
tree search method.
• The decoder post-processing is simplified by using only integer delays and thus
the complexity is reduced to 1 MIPS compared to 2.5 MIPS of G.729 [6].
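The second simplification in the list above, computing the autocorrelation over the even samples only, can be sketched in Python as follows. This is an illustrative sketch rather than the procedure of the Recommendation: the lag range, the test signal and the function name are hypothetical, and the normalisation and delay-range logic of the real open-loop search are omitted.

```python
def open_loop_pitch(weighted, min_lag, max_lag, even_only=True):
    """Return the lag in [min_lag, max_lag] that maximises the autocorrelation.

    With even_only=True only every second sample contributes to each
    correlation sum, which roughly halves the multiply-accumulate count.
    """
    step = 2 if even_only else 1
    start = max_lag + (max_lag % 2 if even_only else 0)  # first usable (even) index
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        corr = sum(weighted[n] * weighted[n - lag]
                   for n in range(start, len(weighted), step))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag


# Hypothetical test signal: a pulse every 50 samples.
signal = [1.0 if n % 50 == 0 else 0.0 for n in range(240)]
lag_decimated = open_loop_pitch(signal, 20, 80, even_only=True)
lag_full = open_loop_pitch(signal, 20, 80, even_only=False)
```

For this signal both variants return the same lag, which illustrates why the decimated sum is an attractive complexity reduction; on real speech the two estimates can occasionally differ.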
2.6 G.723.1 Speech Coding Standard 
ITU Recommendation G.723.1 [11] was designed for low-bit rate videophone, internet
phone and particularly as part of the H.324 multimedia standard. The G.723.1 has two
transmitting bit rates at 5.3 and 6.3kbit/s. The higher bit rate has greater quality while the
lower bit rate gives good quality and offers more design flexibility. It is possible to switch
between the two rates at any 30ms frame boundary. The G.723.1 dual-rate codec was
initially referred to as G.723. However, because the older ADPCM-based G.723 standard
already existed under this name, the scheme was renamed G.723.1 in order to avoid
confusion. The G.723.1 is based on Linear Prediction Analysis-by-Synthesis coding
carried out for 30 ms or 240-sample speech segments with a look-ahead of 7.5ms, giving
a total delay of 37.5 ms [4]. This codec employs Algebraic-Code-Excited Linear-Prediction
(ACELP) for its 5.3kbit/s rate, for which it has an algorithmic complexity of 14.6 MIPS.
For its 6.3kbit/s mode of operation it uses Multi-Pulse Maximum Likelihood Quantization
(MP-MLQ) excitation and has a complexity of 16 MIPS. Both modes of operation use
2200 words of RAM [8]. Its dual-rate principle is very useful for intelligent multimode
transceivers which are reconfigured at each speech frame boundary to provide more
robust but lower speech quality or higher speech quality with less immunity to error. In
addition, the G.723.1 utilises voice-activity controlled transmission (higher rate for active
speech and lower rate for background) and comfort noise generation (CNG) for passive
speech intervals [4].
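The frame figures quoted above can be checked with a few lines of Python. The sketch assumes an 8 kHz sampling rate, which is what 240 samples per 30 ms frame implies; the function name is hypothetical, and the bits-per-frame values are nominal (rate multiplied by frame duration), so the exact bit allocation should be taken from the Recommendation itself.

```python
SAMPLE_RATE_HZ = 8000  # assumption: implied by 240 samples per 30 ms frame


def frame_budget(bit_rate_bps, frame_ms, look_ahead_ms):
    """Return (samples per frame, nominal bits per frame, algorithmic delay in ms)."""
    samples = SAMPLE_RATE_HZ * frame_ms // 1000
    bits = bit_rate_bps * frame_ms / 1000
    delay_ms = frame_ms + look_ahead_ms
    return samples, bits, delay_ms


high_rate = frame_budget(6300, 30, 7.5)   # MP-MLQ mode
low_rate = frame_budget(5300, 30, 7.5)    # ACELP mode
```

Both modes share the same frame length and delay, which is what makes switching between them at a frame boundary straightforward; only the number of bits available per frame changes.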
The G.723.1 encoder operates on blocks of 30 ms [11]. Each block is first high-pass
filtered to remove the DC component and then divided into 4 subframes of 60 samples
each. For every subframe, the coefficients of the 10th order Linear Prediction Coding
(LPC) filter are determined. The LP coefficients of the last subframe are converted to
Line Spectral Pairs (LSP) and quantized using a Predictive Split Vector Quantizer
(PSVQ). The other subframes are used to construct the short-term perceptual weighting
filter in order to obtain the perceptually weighted speech signal that is used for the open
loop pitch period computation. The estimated open loop pitch period is used to construct
a harmonic noise shaping filter. Then the combination of the LPC synthesis filter, the
formant perceptual weighting filter, and the harmonic shaping filter produces the impulse
response. An initial pitch period estimate is derived from the formant-weighted speech
signal in an open-loop search. The impulse response along with the pitch period
estimate is used for a more accurate closed-loop search which takes place in the fifth-order
pitch predictor. Subsequently, the pitch period is calculated as a small differential
value around the open loop pitch estimate and the effect of the refined pitch predictor is
removed from the speech signal. Depending on the operation mode, the resultant residual
signal is subjected to either MP-MLQ for the 6.3kbit/s rate or ACELP for the 5.3kbit/s
rate. Finally, the pitch period and the differential value along with the LPC coefficients are
transmitted to the decoder [11]. The detailed block diagram of the G.723.1 encoder is
depicted in Figure 2-6.
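One common way to determine the coefficients of a 10th order LPC filter is the autocorrelation method followed by the Levinson-Durbin recursion, sketched below in Python. This is a generic textbook sketch under stated assumptions, not the fixed-point procedure of the Recommendation: windowing, bandwidth expansion and the LSP conversion are omitted, the function names are hypothetical, and the 60-sample frame is filled with seeded random noise purely as test data.

```python
import random


def autocorrelation(frame, order):
    """r[k] = sum_n frame[n] * frame[n-k] for k = 0..order."""
    return [sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for the predictor x[n] ~ sum_k a[k] * x[n-k].

    Returns (a[1..order], final prediction error energy).
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate input, stop refining
        # Reflection coefficient for stage i.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


rng = random.Random(0)
frame = [rng.gauss(0.0, 1.0) for _ in range(60)]   # stand-in for one subframe
r = autocorrelation(frame, 10)
lpc, residual_energy = levinson_durbin(r, 10)
```

The recursion needs on the order of p squared operations for a filter of order p, instead of the p cubed of a general linear solver, and its inner sums are dot products over short vectors.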
Figure 2-6: G.723.1 Encoder
At the decoder, the quantized LPC indices are decoded and used to construct the LPC
synthesis filter. The adaptive codebook excitation and fixed codebook excitation are
decoded for every subframe and feed the synthesis filter.
Figure 2-7: G.723.1 Decoder
The excitation signal is first passed through the pitch postfilter in order to improve the
quality of the synthesized signal, and the output of the postfilter subsequently feeds the
synthesis filter. The output of the synthesis filter feeds the formant postfilter, whose output
energy level is maintained by the gain scaling unit. The block diagram of the G.723.1
decoder is shown in Figure 2-7.
2.7 Summary 
In this chapter a brief introduction to speech coding was given by discussing the coding
objectives and requirements and presenting the basic speech source models.
Subsequently, the basic principles of the main coding techniques were introduced, with
more emphasis placed on the analysis-by-synthesis hybrid codecs as this type is
employed in low bit-rate speech coders for multimedia applications. Finally, the
characteristics and the basic operation of both ITU standards employed in this research,
G.729A and G.723.1, were discussed.
2.8 References
[1] K. Diefendorff and P. Dubey, "How Multimedia Workloads Will Change
Processor Design," in IEEE Computer, vol. 30, September 1997, pp. 43-45.
[2] A. S. Spanias, "Speech Coding: A tutorial review," Proceedings of the IEEE, vol.
82, pp. 1541-1582, October 1994.
[3] A. M. Kondoz, "Digital Speech: Coding for Low Bit Rate Communications
Systems," John Wiley & Sons, 1994, pp. 117-123.
[4] L. Hanzo, C. Somerville, and J. Woodard, "Voice Compression and
Communications: Principles and Applications for Fixed and Wireless Channels,"
Wiley-Interscience, 2001, pp. 3-10, 65-67, 269-274.
[5] R. M. Nickel, "Automatic speech character identification," in IEEE Circuits and
Systems, vol. 4, Fourth Quarter 2006, pp. 10-31.
[6] ITU-T Recommendation G.729A, "Annex A: Reduced complexity 8 kbit/s CS-ACELP
speech codec," 11/96.
[7] ITU-T Recommendation G.729, "Coding of speech at 8 kbit/s using conjugate-structure
algebraic-code-excited linear-prediction (CS-ACELP)," 3/96.
[8] R. V. Cox and P. Kroon, "Low bit-rate speech coders for multimedia
communication," in IEEE Communications Magazine, vol. 34, December 1996,
pp. 34-41.
[9] ITU-T Recommendation G.729A, "Coding of speech at 8 kbit/s using conjugate-structure
algebraic-code-excited linear-prediction (CS-ACELP)," 3/96.
[10] K. Koutsomyti, S. R. Parr, V. A. Chouliaras, et al., "Scalar and parametric vector
accelerators for the G.729A speech coding standard," in Proceedings of
IEE/ACM SoC Design, Test and Technology Postgraduate Seminar,
Loughborough University, September 2004, pp. 53-57.
[11] ITU-T Recommendation G.723.1, "Dual Rate Speech coder for multimedia
communications transmitting at 5.3 and 6.3 kbit/s," 3/96.
CHAPTER 3 
SOFTWARE AND HARDWARE PARALLELISM 
3.1 Overview of Parallelism 
The proliferation of dynamic multimedia applications such as videoconferencing,
image/speech processing and compression, 3D graphics, animation, Virtual Reality
Modelling Language, encryption etc. has changed the processing workloads of embedded
processors significantly [1]. In order to run these multimedia codes efficiently and in real
time there is a need for high-performance application-specific processors. One approach to
improving processor performance is to increase the clock speed. Though this may seem
easy at first, the achievable clock speed of a circuit is a direct function of the chosen
implementation technology. More importantly, a higher clock speed causes a large increase
in the dynamic (switching) power dissipation, rendering high-frequency designs unusable
for power-constrained consumer applications. An alternative approach to improving processor
performance is to increase the number of operations executed per clock cycle [2]. This
approach yields very high performance and it is independent of the underlying circuit
technology. In order to achieve this, multiple operations must be scheduled to execute in
parallel on the extra functional units or processors. To this end, various techniques have
been employed to exploit the inherent parallelism in modern applications and speed up
their execution. The key to achieving high performance in current and emerging
workloads is parallelism. The performance limit is set by the available parallelism in the
application and the amount of adaptation needed in the source code in order to allow
the processor to exploit it [3].
The idea of using parallelism to increase processor performance was introduced as early
as 1961 with the pipelining technique of Stretch, the IBM 7030 processor [4].
Pipelining is a micro-architectural technique that exploits the parallelism that exists among
the actions (steps) needed to complete the execution of an instruction. In this way,
different parts of multiple instructions in a sequential instruction stream are overlapped in
execution and thus their completion time decreases [5]. Pipelining is the first form of
Instruction Level Parallelism (ILP), even though it is nowadays considered a low-level
parallelism mechanism. Other forms of parallelism (DLP, TLP) were exploited in 1964
with the Control Data Corporation (CDC) 6600 CPU [6]. This processor used ten
functional units that could operate in parallel and could perform ten unrelated operations
per cycle, thereby introducing the concepts of Data Level Parallelism (DLP) and ILP.
In addition, it had ten identical peripheral processors that operated independently and
simultaneously and could execute up to ten programs at a time, introducing the idea of
Thread Level Parallelism (TLP). Conclusively addressing all forms of parallelism,
however, is a recent achievement enabled by advances in silicon technology and
EDA tools/compilers. In the following sections an overview of the three main techniques
of parallelism (ILP, DLP and TLP) will be given, with more emphasis on DLP as it forms
the basic target of this research. Limitations in parallelism exploitation are imposed by
dependences that are found in every code sequence. These dependences can cause
structural stalls, data hazard stalls or control stalls, thus reducing the performance [2].
There are three different types of dependences: data dependences, name dependences, and
control dependences [2].
3.2 Data Dependences 
An instruction is data dependent on another instruction if its execution uses as input a
value created by a previous execution of the latter [7]. A data dependence implies that the
two instructions cannot execute in parallel or be overlapped, as this would affect the
correctness of the program [5]. An example of this type of dependence, also known as
true data dependence, is illustrated in Figure 3-1.
S1  Loop: ld   r1, [a0]      ; load array element a
S2        add  r4, r1, r2    ; add array element to r2
S3        st   r4, [b, r0]   ; store result to array b
S4        add  r0, r0, #1    ; increment counter
S5        bnez r0, Loop      ; branch to loop if r0 != 0
Figure 3-1: Code snippet that shows the data dependences 
The data hazards caused by data dependences are known as RAW (Read after Write),
referring to the order in which instructions are presented in the pipeline. This type of data
hazard occurs when one instruction reads a register operand before that operand is
produced by an earlier instruction, resulting in the use of the wrong register operand.
Within a processor, dependences are detected and data hazards are avoided with pipeline
interlocks that force an execution stall. In VLIW architectures a compiler performs the
instruction scheduling and hides such data dependences, rendering the use of interlock
logic unnecessary [5].
3.2.1 Name Dependences 
There are two types of name dependence: the antidependence and the output dependence. An
antidependence occurs when an instruction reads from the same register or memory
location that a later instruction writes. This gives rise to WAR (Write after Read) data
hazards, which occur when the later instruction writes to the destination before it has been
read by the earlier instruction, resulting in the earlier instruction reading the new
(incorrect) value. This violates program semantics [5].
An output dependence occurs when two instructions write to the same register or memory
location. This type of dependence causes a WAW (Write after Write) data hazard, which
occurs when the value left in the destination was written by the wrong instruction.
Again the order is important, as the final value must come from the last instruction (in
program order). Neither type is a true data dependence, and the
involved instructions can execute simultaneously or even be reordered as long as the
common register name or memory location is renamed statically by a compiler or
dynamically by the hardware [5].
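The three hazard types discussed so far can be told apart mechanically from the destination and source registers of two instructions. The following Python sketch does this for register operands only; the tuple encoding of the instructions and the function name are hypothetical, memory dependences are ignored, and the register lists are a reading of the code in Figure 3-1.

```python
def classify(first, second):
    """Classify the register dependences from `first` to `second`.

    Each instruction is (destination register or None, set of source registers),
    and `first` precedes `second` in program order.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = set()
    if dest1 is not None and dest1 in srcs2:
        found.add("RAW")   # true data dependence
    if dest2 is not None and dest2 in srcs1:
        found.add("WAR")   # antidependence
    if dest1 is not None and dest1 == dest2:
        found.add("WAW")   # output dependence
    return found


# Register operands of the instructions in Figure 3-1.
S1 = ("r1", {"a0"})          # ld   r1, [a0]
S2 = ("r4", {"r1", "r2"})    # add  r4, r1, r2
S3 = (None, {"r4", "r0"})    # st   r4, [.., r0]  (writes memory, no register)
S4 = ("r0", {"r0"})          # add  r0, r0, #1
S5 = (None, {"r0"})          # bnez r0, Loop

raw_pair = classify(S1, S2)      # S2 reads r1 produced by S1
war_pair = classify(S3, S4)      # S4 overwrites r0 that S3 reads
waw_pair = classify(S2, S2)      # S2 of successive loop iterations both write r4
```

Only the RAW pairs constrain the schedule fundamentally; the WAR and WAW pairs disappear if a different register name is used for the later write, which is the renaming referred to above.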
3.2.2 Control Dependences 
Control dependence determines the ordering of an instruction with respect to a branch so
that the instruction executes in correct program order [5]. Hence an instruction
that is control dependent on a branch (e.g. in the THEN statement of an IF conditional)
cannot be moved before the branch so that its execution is no longer controlled by it. In
addition, an instruction that is not control dependent on a branch (e.g. before an IF
conditional) cannot be moved after the branch so that its execution becomes controlled by the
branch. Therefore, branches limit the ways that code can be re-arranged for optimum
execution performance. According to Intel, 20-30% of the processor performance is left
untapped due to branch mispredictions [8]. Branch prediction and predication are among
the methods used to increase parallelism without causing any exceptions or changing the
data flow [5].
3.3 Types of Parallelism 
As mentioned above, the objective behind exploiting parallelism at multiple levels is to
maximise the execution performance of an application. Several architectural techniques
have been employed to exploit these forms of parallelism effectively. Flynn's taxonomy
(1986) [9] classifies computer architectures into four categories according to the
parallelism in the instruction and data streams that they can handle:
• Single instruction single data (SISD) 
• Multiple instruction single data (MISD) 
• Single instruction multiple data (SIMD) 
• Multiple instruction multiple data (MIMD) 
According to Flynn's taxonomy, a scalar uni-processor is classified as a SISD system, as
only one instruction is issued per cycle and that instruction operates on a single piece of
data. The MISD category has not been implemented in any commercial multiprocessor as it
does not improve the performance of a system. However, it is expected to have
applications in fault-tolerant architectures for aerospace. Since it allows a degree of
redundancy (it issues multiple instructions on the same dataset) it can be introduced in
safety-critical systems. The other two categories correspond to the three different forms
of parallelism. Data-Level Parallelism (DLP) is the case of SIMD, where identical
operations are applied to arrays of data. This form of parallelism is typically found within
loops where the same transformations apply to arrays of data. Vector architectures are the
most efficient means of exploiting this type of parallelism. Instruction-Level Parallelism
(ILP) is a case of MIMD since multiple instructions are issued that operate on multiple
data. This can take the form of a number of microarchitectures differentiated by their
instruction scheduling techniques and dispatch width. Finally, Thread-Level Parallelism
(TLP), which is a different aspect of MIMD, is regarded as one of the most profound forms
of parallelism since it involves multiple processors operating in parallel. Within the TLP
domain, separate instruction streams execute on separate functional units (processor 
contexts) on separate (multi-programming) or the same (multi-threading) datasets. 
Flynn's taxonomy does not apply precisely to today's architectures, as modern embedded
processors typically belong to more than one of its categories. It is nevertheless a useful
framework in the processor design space. In the following sections an overview
of the three different types of parallelism is given along with the architectural
techniques required to exploit these forms effectively.
3.3.1 Instruction Level Parallelism 
Instruction-level parallelism (ILP) is the architectural technique that exploits the available 
parallelism at the instruction (operation) level and executes multiple such operations 
concurrently [2], [10]. This overlap in the execution of instructions is achieved by 
extracting independent instructions from a program sequence [11]. The idea of ILP
appeared as early as the 1960s in the IBM Stretch 7030 (1961) [4] and the Control Data
6600 (1964) [6]. Even though Stretch was a commercial failure, it introduced ideas such
as pipelining and a dynamic instruction issue mechanism based on Tomasulo's algorithm
[12] that are in use even today [4]. Pipelining is a primitive form of ILP as it allows the
execution of multiple instructions in different stages of the processor simultaneously.
Thus, different multicycle operations may share the same hardware by using different
parts of it in different cycles. Nowadays pipelining is considered a low-level mechanism
that has contributed significantly to the performance of modern computers and since 1985
it has been part of every processor architecture [5], [13]. Seymour Cray's CDC 6600 removed
the instructions handling the memory and the I/O from the main CPU and implemented 
them in a set of peripheral processors. Additionally, it included ten functional units that 
performed arithmetical-logical instructions at the same time. In this way, the main CPU 
(arithmetical-logical instructions) and the peripheral processors (memory and I/O 
instructions) could operate in parallel improving considerably the performance and 
making it the world's fastest computer until 1969 [6]. In the following years there was a 
wide range of techniques that extended the idea of ILP and increased the amount of 
parallelism exploited among instructions. 
A processor that employs ILP is typically called multiple-issue and follows a similar
execution model to a normal RISC machine [10]. Resources operate in parallel and there
may be a multiplicity of functional units that implement the same datapath functions in
order to enable more parallelism. Thus, ILP involves two extra factors to accelerate
programs: multiple issue and extra functional units. More than one operation can be
issued in a given cycle and executed by using replicated or different functional units. ILP
is dependent on developments in hardware technology such as circuit speed and power
optimization [10] [14]. Since ILP is an architectural technique for achieving higher
performance by executing multiple low level operations (such as adds, multiplies, loads
etc) at the same time it requires special logic in the fetch stage of the processor [10]. This
additional logic unrolls the program sequence and reschedules the order of the
instructions in order to arrange multiple operations in a parallel manner before execution
and avoid or reduce the stalls caused by data dependences while maintaining the
program data flow [5].
Figure 3-2: Multiple-issuing of instructions in an ILP architecture
These instructions are subsequently issued to the functional units that operate in parallel
[15]. The number of instructions that can be issued and executed each cycle
determines the width of the processor. Figure 3-2 shows multiple issue in a 4-wide ILP
architecture. Special logic detects dependences, reorders the instruction sequence,
unrolls loops, and ensures that instructions are committed in order to maintain a
precise-exception environment for software [10]. This is achieved using either dynamic or
static scheduling [2]. The dynamic scheduling approach uses special logic to identify data
dependences and rearrange the instructions dynamically in order to reduce stalls while
maintaining the exception and data flow behaviour [2]. The disadvantage of this method
is the large amount of extra hardware and thus extra power consumption compared to an
in-order processor. This approach is used in superscalar and dataflow processors [10].
On the other hand, static scheduling uses the compiler instead of dedicated hardware to
exploit the available parallelism and keep as many functional units busy as possible.
Advances in compiler technology can achieve a similar result, thus dispensing with all
these hardware structures. Very Long Instruction Word (VLIW) [15] [16] processors
employ such static, compiler-intensive scheduling. A compiler's ability to perform static
scheduling depends on the amount of ILP available in the program, the latencies of
the functional units in the pipeline and the number of registers (storage) in the processor.
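As an illustration of how a scheduler, whether in hardware or in a compiler, can group independent operations into issue packets, the following is a minimal sketch of greedy list scheduling. It is not the scheduling algorithm of any particular processor or compiler. The function name `list_schedule`, the tuple layout (name, destination register, source registers) and the example program are hypothetical, and the sketch assumes unit latency for every operation and that a WAR pair may share a cycle because reads precede writes within a packet.

```python
def list_schedule(instrs, width=4):
    """Place each (name, dest, sources) instruction in the earliest legal cycle.

    Assumptions (illustrative only): unit latency, `width` issue slots per
    cycle, and registers are read before they are written within a cycle.
    """
    cycle_of = []      # cycle chosen for each instruction, in program order
    slots_used = {}    # cycle -> number of operations already placed there
    for i, (_, dest, sources) in enumerate(instrs):
        earliest = 0
        for j in range(i):
            _, prev_dest, prev_sources = instrs[j]
            if prev_dest in sources or prev_dest == dest:
                # RAW or WAW: issue no earlier than the following cycle
                earliest = max(earliest, cycle_of[j] + 1)
            elif dest in prev_sources:
                # WAR: may share a cycle with the earlier reader, not precede it
                earliest = max(earliest, cycle_of[j])
        cycle = earliest
        while slots_used.get(cycle, 0) >= width:
            cycle += 1  # all issue slots taken, try the next cycle
        cycle_of.append(cycle)
        slots_used[cycle] = slots_used.get(cycle, 0) + 1
    packets = [[] for _ in range(max(cycle_of, default=-1) + 1)]
    for (name, _, _), cycle in zip(instrs, cycle_of):
        packets[cycle].append(name)
    return packets


# Hypothetical five-operation sequence: three loads, a multiply and an add.
program = [
    ("a", "r1", ("r10",)),       # r1 = load [r10]
    ("b", "r2", ("r11",)),       # r2 = load [r11]
    ("c", "r3", ("r1", "r2")),   # r3 = r1 * r2
    ("d", "r4", ("r12",)),       # r4 = load [r12]
    ("e", "r5", ("r3", "r4")),   # r5 = r3 + r4
]
```

With a 4-wide machine the five operations fit in three packets, because the independent load `d` is hoisted next to `a` and `b`; with a single issue slot the same sequence needs five cycles.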
3.3.1.1 Superscalar Processors 
Superscalar processors are either statically scheduled (using compiler techniques) with
in-order execution or dynamically scheduled (using techniques based on Tomasulo's
algorithm) with out-of-order execution [2]. Superscalar processors began to appear in the
mid-to-late 1980s and for many years they were viewed as the logical next step in the RISC
movement [14]. There is a wide range of superscalar implementations with different
degrees of complexity, ranging from the DEC Alpha [17], which has a strictly RISC ISA, to
the Intel x86 [18], which is considered a CISC ISA [14]. The first commercial single-chip
superscalar microprocessors were the Intel i960CA (1988) [19] and the AMD
29000-series 29050 (1990) [20].
The statically scheduled approach was used in the early superscalar processors, in which
instructions were issued in order and all types of hazards were checked at issue
time [2]. The pipeline control logic detects data or structural hazards only across the
instruction packet currently at the decode stage. This type of superscalar processor
employs hardware to perform the instruction issuing and hazard detection but scheduling
uses software techniques. The Sun UltraSPARC II/III are statically scheduled superscalar
processors.
Figure 3-3: Dynamic Instruction Scheduling
Dynamically scheduled superscalar processors employ hardware to rearrange the
instruction execution to reduce stalls and simultaneously dispatch multiple instructions
per cycle to multiple functional units [14] [2]. This technique handles data dependences
unknown at compile time (e.g. a memory reference), simplifying the compiler at the
expense of hardware complexity. Additionally, it can reschedule already compiled
code to run on a different pipeline to increase the processor performance [2]. Figure 3-3
depicts the dynamic instruction scheduling of a superscalar processor. Dynamically
scheduled superscalar processors do not require any re-compilation of the source code as
they adapt their execution behaviour dynamically according to the application binary.
Such processors exhibit limited scalability due to their complexity. Superscalar
processors employ speculative execution to overcome the limitation of control
dependences caused by branches. Branch prediction alone is not sufficient in the ILP case as, for a
wide issue processor, one or more branches may execute in every cycle. Speculative
execution combines dynamic branch prediction to select the instruction stream that will
be fetched, hardware resources dedicated to undoing the effects of a
misprediction, and dynamic scheduling [2]. The dynamic scheduling approach
dominates the desktop and server markets and it is used in many successful processors
such as the Pentium III and IV, MIPS R10000/12000, AMD Athlon, PowerPC etc [2].
3.3.1.2 VLIW Processors
The alternative to the superscalar approach is to employ compiler technology to check for
dependences across the instructions of a program sequence, reorder them to minimize the
potential hazard stalls and group them into fixed-length packets that will be issued to the
processor. Each fixed-length packet resembles a very long instruction that contains
multiple independent operations that can execute in parallel and for this reason this type
of architecture was named Very Long Instruction Word (VLIW) [2]. Figure 3-4 illustrates
static instruction scheduling of a VLIW processor.
Figure 3-4: Static Instruction Scheduling 
An early form of VLIW was the processor using horizontal microcode, originally designed
for signal processing applications [10]. An example of this type of processor designed to
accelerate floating-point computations was the Floating Point Systems [21] FPS-164 and
FPS-264 CPUs. These processors were very fast but limited in programmability and
application area due to their complexity. The term VLIW was introduced by J. Fisher
who developed a compiler that relied on trace scheduling in order to generate horizontal
microcode (LIWs) for ordinary programs [22]. Trace scheduling is an optimizing
compiler technology that performs loop unrolling and static branch prediction and allows
the processor to exploit the available parallelism beyond basic blocks [22], [2].
Additionally, Fisher suggested the co-design of the compiler and the VLIW processor in
order to simplify the scheduling algorithms. In the 1980s there were three general-purpose
VLIWs with varying degrees of parallelism [10]: TRACE from Multiflow Computers Inc
[23], Cydra 5 from Cydrome [24] and the Culler-7 from Culler Scientific Systems. These
processors, though not commercially successful, developed methods and technologies
that influenced the VLIW design philosophy. Contemporary examples of VLIW
CPUs include the TriMedia media processors [25] by NXP (formerly Philips
Semiconductors), the SHARC DSP by Analog Devices [26], the C6000 DSP family by
Texas Instruments [27], and the STMicroelectronics ST200 [28] family based on the Lx
architecture. These contemporary VLIW CPUs are primarily successful as embedded
media processors for consumer electronic devices. In addition, the new Intel IA-64 [29]
architecture utilizes VLIW techniques to create a scalable instruction-level parallel
processor family.
Because of the nature of VLIW processor instructions, they are generally statically
scheduled by a compiler, removing the need for complicated scheduling logic. In
addition they are highly scalable but require re-compilation of the source code across
implementations [2]. VLIW is more effective as the number of issues per cycle becomes
larger [2]. When there are not enough independent instructions to execute in
parallel, the fixed-length packet is padded with NOP instructions, which can lead to oversized
code. There are several solutions to this problem: Sun MAJC and Tensilica's Xtensa
LX2 [30] processors, for example, use variable-length issue packets, while the
TriMedia TM3270 compresses the code stream in memory and decompresses it when
it is loaded into the instruction cache.
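The code-size cost of NOP padding can be quantified with a short sketch. The helper names `pad_packets` and `nop_fraction` and the example schedule are hypothetical; the sketch simply assumes that every packet is padded to a fixed number of issue slots and that each slot occupies the same amount of instruction memory.

```python
def pad_packets(packets, width):
    """Pad every packet to exactly `width` slots with explicit NOPs."""
    for packet in packets:
        if len(packet) > width:
            raise ValueError("packet holds more operations than issue slots")
    return [packet + ["nop"] * (width - len(packet)) for packet in packets]


def nop_fraction(packets, width):
    """Fraction of encoded slots that hold a NOP after fixed-length padding."""
    padded = pad_packets(packets, width)
    total_slots = width * len(padded)
    useful_slots = sum(len(packet) for packet in packets)
    return (total_slots - useful_slots) / total_slots


# Hypothetical schedule for a 4-wide machine: 5 useful operations in 3 packets.
schedule = [["ld", "ld", "ld"], ["mul"], ["add"]]
padded = pad_packets(schedule, 4)
```

For this schedule seven of the twelve encoded slots are NOPs, which is the kind of overhead that variable-length packets or code compression aim to remove.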
An advanced form of VLIW that is not used in embedded processors and embodies new
principles is Explicitly Parallel Instruction Computing (EPIC) [2][16].
EPIC is a design philosophy that enhances instruction level parallelism and supports
explicit parallelism. Explicit parallelism is supported by large parallel execution resources
and large register files. EPIC architectures use the compiler to perform full speculative
execution and instruction predication to increase parallelism in a program sequence.
Speculation is a technique that reduces the effects of memory latency by performing
speculative loading. Predication allows conditional execution without branches, implying
larger basic blocks [31]. Furthermore, this type of architecture allows some degree of
scalability in issue-width implementation to accommodate the resource limits in various
applications [16]. The ISA that implements the ideas embodied in EPIC is the IA-64; the
first implementation of that ISA was the Merced processor.
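The effect of predication on control flow can be sketched in a few lines. The example below is not taken from any EPIC instruction set; it uses saturation to a signed 16-bit range purely as an assumed illustration, and the function names are hypothetical. The first version contains two branches, while the second computes predicates and selects results arithmetically, so that the whole body forms a single basic block.

```python
def saturate_branching(x, lo=-32768, hi=32767):
    """Reference version: two conditional branches split the code into blocks."""
    if x > hi:
        return hi
    if x < lo:
        return lo
    return x


def saturate_predicated(x, lo=-32768, hi=32767):
    """If-converted version: predicates select results, no branch is needed."""
    p_hi = int(x > hi)                  # predicate: 1 if overflow, else 0
    x = p_hi * hi + (1 - p_hi) * x      # predicated move of the upper bound
    p_lo = int(x < lo)                  # predicate: 1 if underflow, else 0
    x = p_lo * lo + (1 - p_lo) * x      # predicated move of the lower bound
    return x
```

Both functions return the same value for every integer input; the predicated form trades a small amount of extra computation for straight-line code that a static scheduler can pack more freely.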
3.3.2 Data Level Parallelism 
Data Level Parallelism (DLP) is a very important lever in high performance
computing. This paradigm uses vectorization techniques to operate on a large amount of
independent data by executing a single instruction (vector instruction) simultaneously on
arrays of elements [5]. Multimedia-rich applications involve real-time processing of
continuous data streams in the form of vectors of packed 8-, 16- and 32-bit integers and
floating point numbers that undergo identical processing such as filtering, transformation
etc. The microarchitectures capable of extracting this fine-grained data parallelism are
different from those used for fine-grained instruction-level parallelism. The most efficient
method to exploit this type of parallelism is by employing machines with SIMD hardware units
that can execute whole loops in parallel [1]. These machines are known as vector
processors and their advantages over the other architectures are explained in the
following sections.
3.3.2.1 Advantages of vector architectures 
The exploitation of DLP by the use of vector instruction architectures has many
advantages compared to a classic scalar system. First, a vector instruction performs a
number of individual operations in parallel and thus has higher semantic content.
Hence the vector program exhibits better code density compared to an equivalent scalar
program and therefore smaller instruction fetch overhead. Higher code density implies
less instruction fetch bandwidth and thus reduced pressure on the instruction fetch engine.
The smaller overhead is due to fewer address computations and loop counter increments
as well as branch computations. In addition, relatively simple control can dispatch a large
number of operations every time and can better utilize the wide datapath [32]. Examples
of the application kernels and the vectorization techniques employed in this work are
described in more detail in Chapter 4. Another benefit that DLP machines deliver is
better memory system performance than superscalar processors. Despite out-of-order
execution, non-blocking caches and pre-fetching mechanisms, the predictive model for
the caches is inefficient. This happens because the data retrieved from the previous level
of the cache hierarchy are not necessarily needed. Furthermore, since load/store instructions are
mixed with computation and/or conditional execution, possible dependences and resource
constraints prevent a memory operation from being performed on every cycle. Therefore the
superscalar CPU cannot efficiently utilise the data cache subsystem in vectorized kernels.
These problems are avoided in vector memory operations as the requested data usually
follow a unit-stride (stride-1) memory pattern. By requesting an array of data with a single memory
address a DLP machine uses the available memory bandwidth effectively without
requiring extra issue slots and complex decode hardware. Thus by sending a single
address it can achieve a bandwidth of approximately N words per cycle. Finally, the
datapath control remains simple as a vector engine can be easily scaled to higher levels of
parallelism by replicating the functional units and adding wider paths from the vector
registers to the functional units [32]. DLP, however, is the least flexible form of
parallelism compared to ILP and TLP. It is also interesting to note that the available
DLP in an application can also be exploited by non-DLP architectures, by scheduling
multiple independent instructions to execute in parallel in a superscalar architecture (ILP)
or by computing the elements in parallel instruction streams in a multiprocessor system
(TLP) [3]. This inflexibility, though, makes DLP the easiest form of parallelism to
exploit with vector machines. These machines are easily scalable to exploit varying
amounts of DLP in whole application domains. This is one of the greatest advantages of
DLP over ILP, since ILP architectures cannot scale easily because the number of dependence
checks between instructions increases quadratically with the number of instructions
issued in parallel; TLP also requires duplicated instruction management logic for each
instruction stream and duplicated processor state, and suffers overheads from inter-thread
synchronization and communication [3]. In addition, superscalar processors with wider
issue (>4) exhibit diminishing performance returns and require a large area dedicated to control
rather than to the datapath. Research has shown that vector processors are able to execute
some highly parallel, integer-based applications 1.5-7.3 times faster than superscalar
processors [33]. Therefore vector processors with wide datapaths could deliver significant
performance without the hardware complexity of architectures that exploit the
other forms of parallelism.
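The code density argument can be made concrete with a rough count of dynamic instructions for an element-wise addition of two arrays. The sketch below is illustrative only: the per-iteration instruction mixes (six scalar instructions per element, seven instructions per vector strip) are assumptions rather than measurements from this work, and the function names and the vector length parameter `vl` are hypothetical.

```python
import math


def scalar_instruction_count(n):
    """Assumed scalar loop body: 2 loads, 1 add, 1 store, 1 index update, 1 branch."""
    return 6 * n


def vector_instruction_count(n, vl):
    """Assumed strip-mined loop body per strip of up to `vl` elements:
    set vector length, 2 vector loads, 1 vector add, 1 vector store,
    1 index update, 1 branch."""
    return 7 * math.ceil(n / vl)


def vector_add_strip_mined(a, b, vl):
    """Functional model: each pass of the loop stands for one vector instruction
    sequence operating on up to `vl` elements."""
    result = []
    for start in range(0, len(a), vl):
        strip_a = a[start:start + vl]
        strip_b = b[start:start + vl]
        result.extend(x + y for x, y in zip(strip_a, strip_b))
    return result
```

Under these assumptions, a 240-element array needs 1440 scalar instructions but only 210 instructions on a machine with eight elements per vector register, because the loop overhead is paid once per strip rather than once per element.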
3.3.2.2 Vector Processors 
In this section, the fundamental concepts of vector architectures are presented, as this
research is based on this processing paradigm. Vector processor architectures made their
appearance in the late 1960s and early 1970s to support massive vector and matrix
calculations. The first implementations of vector processors were the Control
Data Corporation (CDC) STAR-100 [34] and the Texas Instruments Advanced Scientific
Computer (TI ASC) [35], both announced in 1972. These architectures were memory-to-memory with
high bandwidth memory systems centred on a vector processing unit. However, they were
not commercially successful due to the long start-up overhead of vector instructions and
the deep pipelining [36]. They did, however, present several innovative ideas that
influenced the design of vector supercomputers over the following years. A vector architecture
with a different philosophy was the CRAY-1 computer system [37],
which was introduced in 1976 and was the first commercially successful vector
supercomputer. This machine was centred on scalar processing but used a
vector-register architecture and thus had significantly lower overhead and lower memory
bandwidth requirements. The CRAY-1 was the fastest processor of its time and its successors,
the CRAY-2 and CRAY X-MP, developed by two different groups at Cray Research, were
amongst the most successful vector machines until 1991. During the same period, CDC
continued the development of memory-to-memory vector processors with the Cyber 200
series, which used the same basic architecture as the CDC STAR but offered better
performance and wider vector datapaths. Still, their performance could not compete with
the CRAY machines since they had long memory latencies and could not handle
non-unit strides efficiently [36] [38]. In the 1980s, CDC created a group called ETA that built
the supercomputer ETA-10, which again was based on the same memory-to-memory
architecture as the Cyber 200 series and had a configuration of up to 10 processors. This
processor achieved a performance of 10 GFLOPS but its scalar performance was not as
good and in 1989 its production stopped completely. In the 1980s smaller-scale vector
processors appeared, the most successful being those designed by Convex and Alliant. At this
time Japanese supercomputers made their appearance, starting with the Fujitsu VP100,
Hitachi S810 and the NEC SX/2, which were vector-register architectures with similar
performance to the CRAY X-MP [36]. These computers continued to evolve with the NEC
SX/5, which was the fastest vector supercomputer in 2001 with a 16-processor
configuration clocking at 312 MHz, and the Fujitsu VPP5000 with a 128-processor
configuration clocking at 300 MHz. Historically, the fastest supercomputer would have been the
CRAY-4 with 64 processors running at 1 GHz, but it was never completed as the company
went bankrupt in 1995 [36]. After the appearance of superscalar architectures in the early
nineties, research concentrated on superscalar and VLIW architectures as there was
a prevailing belief that vector processing would become redundant [3]. However, the
emergence of multimedia-rich applications as the dominant application domain has changed
computer architecture and microprocessor design, and interest in vector processing
has been revived [1].
Vector architectures can be either memory-to-memory or register-to-register based, with
the latter being the dominant type. A typical vector processor consists of pipelined
scalar and vector units. The scalar unit handles memory addressing and control, whereas
the vector unit performs the actual processing. Vector architectures are similar to RISC
architectures, with instruction sets that include arithmetic and memory instructions, but
instead of processing scalar values they execute the same operation simultaneously on
arrays of elements. In other words, a single opcode defines a large number of identical, yet
independent, operations on the elements of one or more arrays. The arrays of operands
are stored in a vector register file in a similar way to the operands in a RISC architecture.
However, the vector register file is a two-dimensional storage array where each row
contains all the elements of a single vector [36]. The number of elements per register
is defined by the vector processor ISA/programmer's model. A general vector processor
architecture is depicted in Figure 3-5. It consists of a number of functional units that
operate in parallel. Each unit is fully pipelined and can start a new operation every clock
cycle. The vector functional units generate interim results that are used immediately
without the time-costly memory references that slowed down the first vector computers
[37]. This takes place in combination with the scalar unit, which detects structural and
data hazards and handles memory accesses.
Figure 3-5: Basic Vector Processor Architecture (main memory, vector load-store unit,
vector registers, scalar registers, and functional units for add/subtract, multiply, divide,
integer and logical operations)
Each vector register has at least two read ports and one write port in order to allow
RISC-like 3-operand execution. Another important component of a vector processor is the
load-store unit, which is fully pipelined and is responsible for loading vectors from memory
and storing them back [39] [36]. A more detailed description of the vector processor architecture and
microarchitecture developed in this work is given in Chapters 5 and 6.
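The organisation described above, a two-dimensional vector register file fed by a load-store unit and operated on by 3-operand vector instructions, can be modelled in a few lines. This is a behavioural sketch only and does not represent the vector datapath developed in this work; the class name `VectorMachine`, the method names `vload`, `vadd` and `vstore`, and the default sizes are hypothetical.

```python
class VectorMachine:
    """Behavioural model of a register-to-register vector unit (illustrative)."""

    def __init__(self, num_regs=8, vlen=4):
        self.vlen = vlen
        # Two-dimensional register file: one row per vector register,
        # each row holding all `vlen` elements of that vector.
        self.vregs = [[0] * vlen for _ in range(num_regs)]

    def vload(self, vd, memory, base):
        """Unit-stride vector load: one base address fetches `vlen` elements."""
        self.vregs[vd] = list(memory[base:base + self.vlen])

    def vadd(self, vd, vs1, vs2):
        """3-operand vector add: two register reads and one register write."""
        self.vregs[vd] = [x + y for x, y in zip(self.vregs[vs1], self.vregs[vs2])]

    def vstore(self, vs, memory, base):
        """Unit-stride vector store of register `vs` back to memory."""
        memory[base:base + self.vlen] = self.vregs[vs]


memory = list(range(12))      # addresses 0..11 hold the values 0..11
vm = VectorMachine()
vm.vload(0, memory, 0)        # v0 = memory[0:4]
vm.vload(1, memory, 4)        # v1 = memory[4:8]
vm.vadd(2, 0, 1)              # v2 = v0 + v1, element by element
vm.vstore(2, memory, 8)       # memory[8:12] = v2
```

Four vector instructions replace the loop that a scalar machine would need, and the interim result in `v2` is available to a following vector instruction without a memory reference.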
3.3.3 Thread Level Parallelism 
Another approach to achieving high execution performance is to exploit the available
parallelism at the thread/process level. Architectures that exploit this form of parallelism
belong to the MIMD category [2]. Such architectures consist of a collection of
interconnected single-thread processors, with each processor executing independent
instruction streams operating on multiple data items. When the processors run
independent tasks (programs) this is the case of a multiprogrammed environment. When
the multiple processors execute different parts of the same program and share most of
their address space this is known as multithreading. The independent parts or processes of
the program are called threads. These threads execute concurrently and define another
type of parallelism that is known as Thread-Level Parallelism or TLP [2]. TLP is a
coarse-grained type of parallelism since each processor works on a specific process and
communicates with the other processors only if necessary. The theoretical performance
improvement of an n-wide TLP processor is n-fold compared to a single processor, where n
is the number of processors that comprise the multiprocessor.
There are two classes of MIMD multiprocessors depending on the number of
processors, the memory organization, and the type of their interconnection: the
centralised shared-memory architecture and the distributed-memory architecture [2].
3.3.3.1 Shared-Memory Architecture
Shared-memory architectures consist of a number of processors that share the same
memory and are connected via some interconnect scheme, typically a bus. When the
single main memory has similar (symmetric) access time from all processors this is the
case of Symmetric Multiprocessing (SMP) or Uniform Memory Access (UMA) [2].
Figure 3-6: The basic architecture of a centralised shared-memory multiprocessor system
In shared-memory architectures it is easier to balance the processor workload efficiently.
This class is the most popular organization with a reasonably simple programming model
and it is used in tightly-coupled architectures [40]. Support for SMP must be built into the
operating system in order to take advantage of the additional processors. SMP was first
implemented on the Burroughs B5500 in 1961 and by 2006 had come to dominate the server and
workstation market. With the introduction of dual-core devices, it became prevalent in
most new desktops and laptops based on processors such as Intel's Xeon and Core Duo, AMD's Athlon64 X2
and Opteron etc that use the x86 instruction set; other non-x86 architectures are Sun
Microsystems UltraSPARC, Intel Itanium, Hewlett Packard PA-RISC etc and are used
primarily in the server domain. An alternative architecture is Asymmetric
Multiprocessing or ASMP, in which only specific locations in memory and specific tasks
are allocated to each processor. An example of this architecture can be found in the
high-performance 3D chips in modern video cards.
3.3.3.2 Distributed-Memory Architecture
The second class is known as Distributed-Memory architectures, in which the memory is
physically distributed among a number of processors. This approach can more easily
support the bandwidth demands of the individual processors as there is no need to access
a centralised resource such as the shared memory.
Figure 3-7: The basic architecture of a distributed-memory multiprocessor system
The interconnection between processors and memory can be direct (direct interconnection
networks) using for example switches, or indirect using typically multidimensional
meshes [2]. The distributed-memory architecture can be implemented by using two
different approaches for communicating data among processors. In the first approach the
communication takes place through a shared address space. This happens by addressing
the physically separate memories as one logically shared address space. The
multiprocessors that use this approach are called Distributed Shared-Memory
(DSM) multiprocessors. DSM multiprocessors are also known as Non-Uniform Memory
Access (NUMA) since the access time depends on the data word location in memory. An
alternative approach is when the address space of the processors consists of multiple,
logically disjoint address spaces and the same physical address corresponds to two
different locations in two different processors' memories. Each processor-memory module
is a separate computer and this type of architecture is called a multicomputer.
Additionally, a multicomputer can consist of separate computers connected in a local area
network, known as a cluster. This approach is very cost effective when little or no
communication is required [2].
3.3.3.3 Multithreading Architecture 
Multi-threaded processors are based on a hybrid approach that combines ILP and TLP
and improves performance by exploiting the pipeline parallelism available through
multiplexing independent threads. In this case, multiple threads execute concurrently and
share the functional units of a single, wide processor. Each thread has a separate register
file, program counter and memory page table that are duplicated in the processor
(processor contexts). Multi-threaded processors hide the operation latency by switching
threads at appropriate times or by interleaving operations from multiple threads at the
same time using superscalar techniques. Apart from successfully hiding operation
latency, multi-threaded processors improve processor utilization by keeping many
functional units active on every cycle. There is special hardware to switch between different
threads [2]. When one thread runs until it is blocked by an event that would cause a long
latency stall, such as a level-2 cache miss (requiring access to off-chip memory), execution
switches to another thread that is ready to run. This technique is called blocked or
coarse-grained multithreading [41]. This is the simplest type of multithreading; it issues
instructions from only a single thread per cycle and is effective on high-cost stalls [41]
[2]. An alternative is to switch between threads on every
instruction so that the execution of multiple threads is interleaved. This is called
interleaved or fine-grained multithreading [41]. This type of switching occurs each clock
cycle and eliminates control and data dependence stalls from the execution pipeline since
threads are relatively independent of each other. With this technique the processor skips
any threads that are stalled at that time, and it has a very simple and fast pipeline.
As with blocked multithreading, this type also issues from a single thread per cycle.
When instructions can be issued from multiple threads per cycle this is the case of
Simultaneous Multithreading (SMT). SMT is the most advanced type of multithreading
and it is a variation of fine-grained multithreading that applies to superscalar
processors to exploit the available ILP and TLP across multiple threads [2]. Simultaneous
multithreading improves utilization by sharing many of the resources within the processor
and can enhance the performance of a superscalar when the available ILP is not enough
[41]. This technique was first researched by IBM in 1968 and the first commercial CPU
designed around it was the DEC 21464 [42]. At another architectural extreme lies Chip
Multiprocessing (CMP). CMP enables multiple cores to share chip resources such as the memory
controller, off-chip bandwidth and the L2 cache, thereby improving the utilisation of
these resources [43]. It is an integrated form of Symmetric Multiprocessing: in this
configuration, instead of having separate processing units in the computing system, the
individual processors (CPU cores) are integrated in a single high performance chip.
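The three issue policies described above can be contrasted with a small toy model. This is only a sketch: the set of stalled (thread, cycle) pairs is hypothetical, thread-switch penalties are ignored, and each thread is assumed to have an instruction ready whenever it is not stalled.

```python
# Toy model of thread selection; the stall set is hypothetical and
# thread-switch penalties are ignored.

def coarse_grained(stalled, cycles, n_threads):
    """Blocked multithreading: keep issuing from one thread until it stalls."""
    current, schedule = 0, []
    for cycle in range(cycles):
        if (current, cycle) in stalled:
            ready = [(current + k) % n_threads for k in range(1, n_threads)
                     if ((current + k) % n_threads, cycle) not in stalled]
            if not ready:
                schedule.append(None)  # every thread is stalled: pipeline bubble
                continue
            current = ready[0]
        schedule.append(current)
    return schedule


def fine_grained(stalled, cycles, n_threads):
    """Interleaved multithreading: round-robin every cycle, skipping stalled threads."""
    schedule, next_thread = [], 0
    for cycle in range(cycles):
        chosen = None
        for k in range(n_threads):
            candidate = (next_thread + k) % n_threads
            if (candidate, cycle) not in stalled:
                chosen = candidate
                break
        schedule.append(chosen)
        if chosen is not None:
            next_thread = (chosen + 1) % n_threads
    return schedule


def smt(stalled, cycles, n_threads, width):
    """SMT: up to `width` ready threads issue in the same cycle."""
    return [[t for t in range(n_threads) if (t, c) not in stalled][:width]
            for c in range(cycles)]


stalled = {(0, 2), (0, 3)}  # thread 0 waits on a long-latency event in cycles 2 and 3
print(coarse_grained(stalled, 5, 2))  # [0, 0, 1, 1, 1]
print(fine_grained(stalled, 5, 2))    # [0, 1, 1, 1, 0]
print(smt(stalled, 5, 2, 2))          # [[0, 1], [0, 1], [1], [1], [0, 1]]
```

The first two policies issue from exactly one thread per cycle, whereas the SMT schedule contains several threads in the same cycle whenever more than one is ready.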
3.3.4 Hybrid Approaches and Research 
The various forms of machine parallelism are not clearly separated and they can be
combined to increase computer performance even further. For example, the NEC SX-4
vector supercomputer is a pipelined superscalar vector microprocessor architecture
which can exploit ILP, DLP, and TLP [3]. Simultaneous multithreading processors
employ TLP and ILP at the same time [44]. Another example that combines all the
parallelism techniques is the SS_SPARC [45], which is a configurable, extensible,
simultaneous multithreaded vector processor. More details of this processor are given in
Chapter 7. There has also been a great amount of interest in adding extensions to
existing instruction sets to accommodate vector processing. Examples of general-purpose
microprocessors with vector extensions are Intel's MMX [46], PowerPC's AltiVec [47],
Sun UltraSparc's VIS [48] and Tarantula [49], which adds a vector unit to Alpha (EV8).
Another interesting combination is the merging of the ILP and DLP paradigms in a single
architecture [32] and the SMV architecture that combines simultaneous multithreading
and DLP [50]. Research is currently underway into the potential performance benefit
obtainable through the combination of different forms of parallelism within a single
system-on-chip architecture.
3.4 Summary 
This chapter presented an overview of parallelism and the performance advantages of 
exploiting it within given architectures. The limitations and the hazards caused by 
dependences across instructions in the application binary were also presented along with 
their main types. In addition, the three basic forms of parallelism were introduced,
together with the processor architectures that exploit them and their advantages and
disadvantages.
3.5 References 
[1] K. Diefendorff and P. Dubey, "How Multimedia Workloads Will Change
Processor Design," in IEEE Computer, vol. 30, September 1997, pp. 43-45.
[2] Kevin W. Rudd, "VLIW Processors: Efficiently Exploiting Instruction-Level
Parallelism," in Electrical Engineering: PhD Thesis, Stanford University,
December 1999.
[3] K. Asanovic, "Vector Microprocessors," PhD Thesis, University of California at
Berkeley, May 1998.
[4] W. Buchholz, Planning a computer system: Project Stretch: McGraw-Hill Inc,
1962.
[5] John L. Hennessy and David A. Patterson, "Computer Architecture: A
Quantitative Approach," 3 ed: Morgan Kaufmann, 2003.
[6] J.E. Thornton, "Parallel Operation in the Control Data 6600," in Proceedings of
the 26th AFIPS Conference, 1964, pp. 34-40.
[7] R. Allen and K. Kennedy, "Automatic translation of FORTRAN programs to
vector form," ACM Transactions on Programming Languages and Systems
(TOPLAS), vol. 9, pp. 491-542, October 1987.
[8] John Crawford and Jerry Huck, "Motivations and Design Approach for the IA-64
64-Bit Instruction Set Architecture," in Microprocessor Forum, San Jose,
California, October 1997.
[9] M. J. Flynn, "Some Computer Organisations and Their Effectiveness," IEEE
Transactions on Computers, vol. 21, pp. 948-960, 1972.
[10] Joseph A. Fisher and Ramakrishna Rau, "Instruction-level Parallel Processing,"
Science, vol. 253, pp. 1233-1241, September 13 1991.
[11] Roger Espasa and Mateo Valero, "Simultaneous Multithreaded Vector
Architectures Merging ILP and DLP for High Performance," in the Proceedings
of the Fourth International Conference on High-Performance Computing,
December 1997, pp. 350-357.
[12] R. M. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic
Units," IBM Journal of Research and Development, pp. 25-33, January 1967.
[13] Ralph Duncan, "A Survey of Parallel Computer Architectures," in IEEE
Computer, February 1990, pp. 5-16.
[14] James E. Smith and Gurindar S. Sohi, "The Microarchitecture of Superscalar
Processors," in Proceedings of the IEEE, vol. 83, December 1995, pp. 1609-1624.
[15] Alexandru Nicolau and Joseph A. Fisher, "Measuring the Parallelism Available
for Very Long Instruction Word Architectures," IEEE Transactions on
Computers, vol. 33, pp. 968-976, November 1984.
[16] J. A. Fisher, P. Faraboschi, and C. Young, "Embedded Computing: A VLIW
Approach to Architecture, Compilers, and Tools," Morgan Kaufmann, 2005.
[17] R. L. Sites, "Alpha AXP Architecture," in Communications of the ACM, vol. 36,
February 1993, pp. 33-44.
[18] K. Diefendorff, "Pentium III = Pentium II + SSE: Internet SSE Architecture
Boosts Multimedia Performance," in Microprocessor Report, vol. 13, March
1999.
[19] Steve McGeady, "Inside Intel's i960CA superscalar processor," in
Microprocessors and Microsystems, vol. 14, July 1990, pp. 385-396.
[20] Daniel Mann, "Evaluating and Programming the 29K RISC Family," Advanced
Micro Devices (AMD), 3rd edition, 1995.
[21] A. E. Charlesworth, "An Approach to Scientific Array Processing: The
Architectural Design of the AP-120B/FPS-164 Family," in IEEE Computer, vol.
14, 1981, pp. 18-27.
[22] Joseph A. Fisher, "Very Long Instruction Word architectures and the ELI-512,"
in Proceedings of the 10th annual international symposium on Computer
architecture, Stockholm, Sweden, 1983, pp. 140-150.
[23] R. P. Colwell, R. P. Nix, J. J. O'Donnell, et al., "A VLIW architecture for a trace
scheduling compiler," in ACM SIGARCH Computer Architecture News, vol. 15,
October 1987, pp. 180-192.
[24] G. R. Beck, D. W. L. Yen, and T. L. Anderson, "The Cydra 5
minisupercomputer: Architecture and implementation," The Journal of
Supercomputing, vol. 7, pp. 143-180, May 1993.
[25] J. W. Van de Waerdt, S. Vassiliadis, D. Sanjeev, et al., "The TM3270
media-processor," in MICRO '05: Proceedings of the 38th International Symposium on
Microarchitecture, November 2005, pp. 331-342.
[26] Analog Devices, www.analog.com/processors/sharc/.
[27] "Processor Comparison: Texas Instruments C6000 DSP and Motorola G4
PowerPC," http://www.pentek.com/dspcentral/powerpc/articles.cfm.
[28] Benoit Dupont de Dinechin, "From Machine Scheduling to VLIW Instruction
Scheduling," ST Journal of Research Processor Architecture and Compilation for
Embedded Systems, vol. 1, September 2004.
[29] Martin Hopkins, "A Critical Look at IA-64: Massive Resources, Massive ILP,
But Can It Deliver?," in Microprocessor Report, February 2000.
[30] R. E. Gonzalez, "Xtensa: A configurable and extensible processor," in IEEE
Micro, March/April 2000, pp. 60-70.
[31] R. Arnold, R. Bhatia, and D. Soltis, "Reducing the Physical Cost of Large
Register Files in EPIC Architectures with Stacked Register Aliasing," in
Proceedings of the Workshop on EPIC Architectures and Compiler Techniques,
Istanbul, Turkey, November 2002.
[32] Francisca Quintana, Roger Espasa, and Mateo Valero, "A Case for Merging the 
ILP and DLP Paradigms," in 6th Euromicro Workshop on Parallel and 
Distributed Processing, Madrid, Spain, 1998, pp. 217-224. 
[33] C. G. Lee and D. J. DeVries, "Initial Results on the Performance and Cost of
Vector Microprocessors," in the Proceedings of the 30th Annual International
Symposium on Microarchitecture, pp. 171-182, December 1997.
[34] R. G. Hinz and D. P. Tate, "Control data STAR-100 processor design," in IEEE
COMPCON, September 1972.
[35] W. Watson, "The TI-ASC, A highly modular and flexible super computer
architecture," in American Federation of Information Processing Societies
AFIPS, 1972, pp. 221-228.
[36] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative
Approach, 2nd ed.: Morgan Kaufmann, 1996.
[37] Richard M. Russell, "The CRAY-1 computer system," Communications of the
ACM, vol. 21, pp. 63-72, 1978.
[38] R. Espasa, M. Valero, and J. E. Smith, "Vector architectures: past, present and
future," in Proceedings of the 12th international conference on Supercomputing,
Melbourne, Australia, 1998, pp. 425-432.
[39] C. Kozyrakis, "A Media-Enhanced Vector Architecture for Embedded Memory
Systems," Technical Report: CSD-99-1059, University of California at Berkeley,
1999.
[40] Rajkumar Buyya, High Performance Cluster Computing: Architectures and
Systems, vol. 1, 1999.
[41] T. Ungerer, B. Robic, and J. Silc, "A Survey of Processors with Explicit
Multithreading," in ACM Computing Surveys (CSUR), vol. 35, March 2003, pp.
29-63.
[42] M. Meswani and P. J. Teller, "Evaluating the Performance Impact of Hardware
Thread Priorities in Simultaneous Multithreaded Processors using SPEC
CPU2000," in 2nd International Workshop on Operating Systems Interference In
High Performance Applications, Seattle, WA, September 2006.
[43] L. Spracklen and S. G. Abraham, "Chip Multithreading: Opportunities and
Challenges," in Proceedings of the 11th International Symposium on High-Performance
Computer Architecture, 2005.
[44] S. J. Eggers, J. S. Emer, H. M. Levy, et al., "Simultaneous Multithreading: A
Platform for Next-Generation Processors," in IEEE Micro, vol. 17, October 1997,
pp. 12-19.
[45] V. A. Chouliaras, K. Koutsomyti, T. Jacobs, et al., "SystemC-defined SIMD
instructions for high performance SoC architectures," in 13th IEEE International
Conference on Electronics, Circuits and Systems, Nice, France, December 2006,
pp. 822-825.
[46] A. Peleg and U. Weiser, "MMX Technology Extension to the Intel Architecture,"
in IEEE Micro, vol. 16, August 1996, pp. 42-50.
[47] K. Diefendorff, P. K. Dubey, R. Hochsprung, et al., "AltiVec Extension to
PowerPC Accelerates Media Processing," in IEEE Micro, vol. 20, March 2000,
pp. 85-95.
[48] Marc Tremblay, J. Michael O'Connor, Venkatesh Narayanan, et al., "VIS Speeds
New Media Processing," in IEEE Micro, vol. 16, August 1996, pp. 10-20.
[49] R. Espasa, F. Ardanaz, J. Gago, et al., "Tarantula: A Vector Extension to the
Alpha Architecture," in the Proceedings of the 29th Annual International
Symposium on Computer Architecture (ISCA '02), Anchorage, Alaska, 2002, pp.
281-292.
[50] R. Espasa and M. Valero, "Exploiting Instruction- and Data-Level Parallelism,"
in IEEE Micro, vol. 17, September 1997, pp. 20-27.
CHAPTER 4 
METHODOLOGY AND ARCHITECTURAL RESULTS 
4.1 Introduction 
This chapter presents the optimization methodology and the architectural exploration of
the ITU G.729A and G.723.1 speech coders for a data parallel processor. The
methodology addresses target-independent optimizations of both reference codes. Thus,
the workload optimizations presented here can be utilised on any DSP code with
data-parallel infrastructure for acceleration. As instruction level simulation was the base for
the adopted experimentation methodology, the description of the software tools and their
development to suit the purpose of this research is given in the following sections along
with a brief survey of computer system simulators. Both speech coding algorithms
were benchmarked using the SimpleScalar toolset [1], before and after
data-parallelization and optimization, to obtain the instruction count (also called dynamic
instruction count). The instruction count is the total number of instructions executed by
an ideal scalar processor when running the codes. Using this information, the
vectorization of the speech algorithms was performed and the performance improvement
was recorded after every new vector instruction was introduced. Finally, the architecture of
the coprocessor was defined.
4.2 Simulation Infrastructure 
There is a growing need for efficient techniques to predict the performance of future 
computer systems and evaluate candidate, novel microarchitectures in the research phase 
of a new computer before implementing them in hardware [2]. Simulation has been 
essential for the research and design of processors, compilers or any hardware that 
comprises a computer system or platform. It accelerates the hardware development 
process by employing software models for the proposed hardware. Simulation can reveal 
the dynamic characteristics of the hardware model and the software system that executes 
on it and allows for rapid design space exploration. Such models can be implemented in 
a traditional programming language such as C/C++ or a hardware description language such
as Verilog or VHDL and then exercised with appropriate workloads to validate the
performance and correctness of the proposed hardware at a very early stage. In addition,
computer system models allow the development and testing of the software before the
hardware is available [3], [4], [5]. Typically, such software models are substantially
slower than the equivalent hardware; however, they can be built in a very short time [1].
The implementation of the software model can vary in the following quality
features/requirements:
Performance: The performance depends on the amount of workload that can be
exercised. The greater the number of workloads that can exercise the model, the more
thorough the model study and verification can be, thereby increasing the probability
of a correct-by-construction design. Performance also concerns the speed (simulated
MIPS) of the actual simulation model and therefore the speed of each of the
components that comprise it.
Detail: The detail or simulation model accuracy determines the level of abstraction of the
implemented model's components. It can describe the simulated system from a purely
functional processor state level all the way to cycle-accurate timing including memory
wait states and interrupt latency of all its components. Different levels of abstraction
provide complementary amounts of information to the system designer; more detail,
however, comes at increased execution time.
Flexibility: Flexibility indicates how well the simulator is structured so that design
variants of the simulated system can easily be modified or added, in order to re-use it for
slightly or completely different models.
There is a trade-off between these three aspects of the computer system simulators. A 
highly detailed model can faithfully simulate all aspects of the system's operation but 
does so at low execution speeds and has reduced flexibility. On the other hand, a simpler 
model is less accurate but faster and certainly more flexible. Thus, there are several 
different simulator implementation models that meet different requirements in terms of 
performance, detail and flexibility. There is ongoing research in this area as researchers 
strive to achieve a reasonable trade-off [1, 6].
Certain types of simulators are described with an Architecture Description Language
(ADL) [7], [8]. ADLs are computer languages designed specifically for representing and
analysing a system's microarchitecture. Such formal descriptions of architecture and
microarchitecture have been the subject of research for years [9] and several models and
techniques have been proposed on this front in an effort to facilitate architecture and
microarchitecture description and design space exploration. ADL-based simulators belong
primarily to one of two categories depending on whether the ADL captures the
behaviour (instruction-set) or the structure (microarchitecture) of the system [10].
Recently, a third category has emerged which effectively combines both behaviour and
microarchitecture [7]. Behaviour-centric [7], also known as instruction set [10], [11],
simulators describe instruction functionality but do not allow detailed pipeline and
control-path specification. They are primarily used during the development phase of an
architecture (before the actual hardware specification is written and implementation begins),
providing an execution model of the system that allows the first programs to be written and
the compiler code generation to be tested [11]. Such ADLs are good for regular architectures
and provide a programmer's model but they are tedious for irregular architectures [8]. This
type of simulator is simple, relatively fast (low MHz range) and can be easily retargeted to
various ISAs. Examples of ADLs that generate behaviour-centric simulators are nML [12],
ISDL [13] and ISPS [14]. nML is based on the concept that the majority of instructions share
common properties. By exploiting these common properties, a hierarchy scheme is
developed to describe instruction sets. The instructions are the topmost elements in the
hierarchy and partial instructions are the intermediate elements. Each instruction
definition in nML can be in the form of an AND-OR tree of intermediate elements that
has a few attributes [10], [12]. The Instruction Set Description Language (ISDL) [13] was
developed at MIT in order to express parallelism with explicit specification and it targets
mainly VLIW processors [10]. The Instruction Set Processor Specification (ISPS) [14]
appeared in the early 1970s and has been the basis for many design tools [15]. ISPS was
used to model the architecture of processors and analyse their performance rather than to
describe a complete computer system [15], [10]. On the other hand, structure-centric [7],
also known as cycle-accurate [11], simulators simulate the microarchitecture of a system
and provide performance metrics such as cycle counts, cache hit ratios and resource
utilization statistics amongst others. Examples of structure-centric simulators are
MIMOLA and UDL/I. MIMOLA [10] describes application programs with a Pascal-like
syntax while the processor model has the form of a component netlist. UDL/I [16] stands
for the Unified Design Language for Integrated circuits. It is a Register Transfer level
description language for simulation and logic synthesis. The techniques that are used in
this ADL category to describe computer microarchitectures in detail are very
complex, quite slow and sometimes architecture specific [11]. Mixed types of simulators
such as LISA and EXPRESSION capture both the structure and behaviour of the
architecture. The Language for Instruction Set Architecture (LISA) [17] explicitly models
both the datapath and control that are necessary for cycle accurate simulation. This
description comprises two types of declaration: resources and operations. Resources
refer to hardware structures such as registers, pipelines and memory systems whereas
operations are the basic objects that represent the programmer's view of the behaviour,
structure and the instruction set of the architecture. EXPRESSION [8] describes a
processor as a netlist of functional units and storage elements and automatically generates
Reservation Tables (RT) based on that netlist. This netlist representation is at a higher
level of abstraction, similar to a block-diagram level description.
Simulators are also classified depending on whether they are trace driven [5] or execution
driven [2], [15]. Trace-based simulation is a more traditional simulation technique that
uses a stream of pre-recorded instructions to drive a hardware timing model. It employs a
variety of techniques, both hardware and software, in order to obtain the instruction
traces. Such techniques include hardware monitoring, binary instrumentation that inserts
probe functions at various locations in the to-be-traced code in order to collect event traces,
or trace synthesis [1]. Trace-based simulation is faster than execution-driven simulation but
requires a large amount of storage for traces and incurs large time overheads as traces can
contain billions of references. In addition, it can be less accurate because of the difficulty in
characterizing the behaviour of real programs stochastically, meaning that it can capture
only a part of processor behaviour, e.g. cache misses. Since a trace is obtained from the
logical execution paths of a workload, it cannot model speculative execution such as branch
directions or load addresses [2]. On the other hand, execution-driven simulation permits
greater accuracy as the execution of the program and the simulation of the architecture
are closely related and interleaved. It can reproduce a device's internal operation by
replicating the execution of instructions on the simulated machine. In this way it provides
all the data produced or consumed inside all microarchitecture components. The typical
output of this type of simulation is a large number of statistics that can help to understand
how the components of the simulated system behave and a precisely-estimated execution
time. Execution-driven simulation can also be employed for dynamic power analysis as it
can precisely record the change in the inputs of microarchitecture blocks and calculate
relative dynamic power metrics accordingly. The drawbacks of execution-driven
simulation are the high model complexity and the difficulty in reproducing experiments
[1]. There is ongoing research to overcome these issues such as retargetable instruction
set simulators [18], where the goal is to generate a simulator automatically from a
machine description language. Additionally, traces can record the precise system state
and can help to recreate the record-of-execution [1].
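The trace-driven approach can be sketched as follows: an address trace is recorded once and then replayed through a simple timing model, here a direct-mapped cache. The workload (two passes over an array), the cache geometry and all function names are hypothetical and serve only to illustrate the idea.

```python
def record_trace(n_words):
    """Stand-in for trace collection: word addresses touched by two passes over an array."""
    return [addr for _ in range(2) for addr in range(n_words)]


def replay_direct_mapped(trace, num_lines, line_words):
    """Drive a direct-mapped cache model with a pre-recorded address trace."""
    lines = [None] * num_lines      # block number currently held by each cache line
    hits = misses = 0
    for addr in trace:
        block = addr // line_words
        index = block % num_lines
        if lines[index] == block:
            hits += 1
        else:
            misses += 1
            lines[index] = block    # replace the line on a miss
    return hits, misses


trace = record_trace(16)
print(replay_direct_mapped(trace, num_lines=4, line_words=4))  # (28, 4)
print(replay_direct_mapped(trace, num_lines=2, line_words=4))  # (24, 8)
```

The same stored trace is reused for both cache configurations without re-running the workload, which reflects the speed advantage of trace-driven simulation; the cost is that the trace must be stored and captures only the recorded execution path.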
Finally, simulators can be classified depending on the amount of detail that they employ
for system representation, from Instruction-accurate simulators (ISS) [4], [18], [1] to
Cycle-accurate simulators (CAS) [6], [19]. An ISS imitates the behaviour of a mainframe or
microprocessor by "executing" instructions and maintaining internal variables which
represent the processor's registers. The ISS represents the system at a higher level of
abstraction, allowing the simulator to be developed in a short time. It is preferred over
cycle-accurate simulators in the early stages of a project to model the
architectural features of the system quickly, but it can also be used in later stages to validate
the functionality of the system since it can rapidly run the complete benchmark. An ISS,
however, cannot be used for performance analysis as it does not contain pipeline detail or
timing information [4], [18]. Cycle-accurate simulators, on the other hand, can perform
timing (CPI) analysis and give quite accurate performance estimates. They are more
complex to develop because of the great amount of detail and are thus more time-consuming
to build and slower to run than an ISS. Additionally, a different CAS needs to be developed
for any new implementation of an architecture whereas an ISS undergoes only minor changes
between implementations of the same architecture [19].
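The distinction can be made concrete with a toy example. The machine below, its three opcodes, its per-opcode latencies and the program are all hypothetical. In functional mode the model only maintains register state and counts instructions, as an ISS does; with the crude timing annotation enabled it also accumulates cycles, from which a CPI figure follows. A real cycle-accurate simulator models the pipeline and memory system in far more detail than a fixed latency per opcode.

```python
LATENCY = {"li": 1, "add": 1, "bnez": 2}  # hypothetical cycles per opcode


def run(program, timing=False):
    """Execute a toy program; return (registers, instruction count, cycle count)."""
    regs = {"r1": 0, "r2": 0, "r3": 0}
    pc = instructions = cycles = 0
    while pc < len(program):
        op, *args = program[pc]
        instructions += 1
        if timing:
            cycles += LATENCY[op]
        pc += 1
        if op == "li":                      # load immediate
            regs[args[0]] = args[1]
        elif op == "add":                   # rd = rs1 + rs2
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "bnez" and regs[args[0]] != 0:
            pc = args[1]                    # taken branch
    return regs, instructions, cycles


# r3 = 4 + 3 + 2 + 1, computed with a countdown loop
PROGRAM = [
    ("li", "r1", 4), ("li", "r2", -1), ("li", "r3", 0),
    ("add", "r3", "r3", "r1"), ("add", "r1", "r1", "r2"), ("bnez", "r1", 3),
]

regs, n_instr, _ = run(PROGRAM)               # functional (ISS-like) run
_, _, n_cycles = run(PROGRAM, timing=True)    # run with timing annotation
cpi = n_cycles / n_instr
print(regs["r3"], n_instr, n_cycles)          # 10 15 19
```

The functional run already yields the dynamic instruction count (15 here), which is the metric used in this chapter; the cycle count and CPI require the additional timing information.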
4.2.1 SimpleScalar Toolset
The primary architecture exploitation was carried out on Version 3.0 of the
SimpleScalar toolset, which is publicly available. Since its open-source release (1995),
SimpleScalar has been widely used for research in the computer architecture community
[18]. The toolset provides an infrastructure for simulation and architectural modelling that
simplifies the implementation of hardware models for simulation of complete
applications [1], [18]. It can perform program performance analysis, measure the dynamic
characteristics of the hardware model and contribute to software-hardware
co-verification and co-optimization. It comprises a compiler, assembler, linker and
simulation tools for a range of modern processor architectures. SimpleScalar comprises
several simulator models ranging from a simple functional instruction emulator (sim-safe)
to a detailed microarchitectural model with dynamic scheduling (sim-outorder). Table 4-1
lists the seven simulators at different levels of microarchitectural abstraction that are
contained in the current release (version 3.0) of SimpleScalar. These simulator models are
Instruction Set Simulators (ISS), also called functional simulators, apart from sim-outorder,
which is a full cycle-accurate simulator and provides detailed microarchitectural timing [1].
Table 4-1: SimpleScalar baseline simulator models
Simulator      Description                                Code Lines   Typical Speed
sim-safe       Simple functional simulator                320          6 MIPS
sim-fast       Speed-optimized functional simulator       780          7 MIPS
sim-profile    Dynamic program analyser                   1,300        4 MIPS
sim-bpred      Branch predictor simulator                 1,200        5 MIPS
sim-cache      Multilevel cache memory simulator          1,400        4 MIPS
sim-fuzz       Random instruction generator and tester    2,300        2 MIPS
sim-outorder   Detailed microarchitectural timing model   3,900        0.3 MIPS
Figure 4-1 illustrates the SimpleScalar infrastructure and its main components. The 
behaviour of the simulator depends on the processor model that is defined at three levels: 
ISA, ABI (Application Binary Interface) and microarchitecture [13]. 
Figure 4-1: SimpleScalar Infrastructure [diagram not reproduced; recoverable block labels: Simulators, OS]
Only two ISAs are supported in the current release, the Portable Instruction Set
Architecture (PISA) and the Alpha instruction set architecture. The instructions have a
specific format that comprises the assembly format, binary opcode, register sources and
destinations, execution unit, instruction class and enum opcode that are assigned by the
infrastructure [13]. Each instruction is associated with a semantic action statement that
provides a comprehensive mechanism for describing how the instruction modifies the
state of the registers and memory. The OS handles only the trap instructions with the help
of the system call simulation. The Application Binary Interface (ABI) establishes the
communication between the simulated system and the external I/O. The instructions are
loaded from a binary file containing the machine code after they are linked and relocated
statically, as no dynamic linking is supported. SimpleScalar uses the provided COFF
binary file loader or GNU's binary file descriptor library [13]. Since the SimpleScalar
toolset is an execution-driven simulator, there is no need for instruction trace files as all
the instructions are generated dynamically [1]. It models several microarchitectural
components such as cache, memory, functional unit resource, scheduler and branch
predictor. Its microarchitectural modelling ability can be extended easily due to its simple
design that allows the addition of more components [13].
4.2.2 Customizing the SimpleScalar Toolset
For this research the SimpleScalar PISA instruction set was used. This is an extension of
Hennessy and Patterson's DLX instruction set [20] that also includes a number of
instructions and addressing modes from the MIPS-IV [21] and RS/6000 (IBM pSeries). It
utilizes a 64-bit instruction encoding to provide an easily extensible research
environment for instruction-set and system design. This extended encoding can support
modification or addition of instructions, variation of the number of registers used by the
program, etc. [22]. The simulation tool utilised was sim-fast, a speed-optimized
functional simulator that provides instruction-accurate simulation but no timing. It
executes all the instructions serially without assuming the existence of a cache. Based on
this simulation tool, sim-vector was created, which incorporates, apart from the existing
PISA, a file with the proposed vector ISA (vector.def). The vector.def file contains
the definition of all the instruction extensions (scalar and vector) of the proposed
coprocessor. The development of the coprocessor ISA and its introduction in the
vector.def file are described in more detail in section 4.3.7. The two target
workloads (G.729A and G.723.1) run on the model using execution-driven simulation.
They use the statistical package, which tracks updates to statistical counters and produces
a detailed report. Sim-system is another tool based on sim-fast that was created to
model a shared memory multiprocessor environment. The sim-system simulator, also
called a PRAM model (Parallel RAM), is multithreaded and allows the execution of
shared-memory applications. Sim-system was not utilised in this research as sim-vector
provided the entire infrastructure in terms of a single processor and the ability to add
scalar-vector extensions. It has however been part of another closely linked research
project in which the identified scalar-vector extensions were implemented in SystemC
and attached to the vector unit of a high performance configurable extensible processor
[23]. This research project and its results are detailed in Chapter 7.
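The principle of extending a functional simulator with new instruction definitions and measuring the effect on the dynamic instruction count can be sketched as follows. This is not the vector.def format nor the sim-vector code: the opcode table, the opcode names (mac, vmac, addi, blt), the vector length VLEN and the two toy programs are hypothetical, and the array length is assumed to be a multiple of VLEN.

```python
# Toy instruction-set simulator: every opcode maps to a semantic action that
# updates the machine state. Opcodes, VLEN and programs are hypothetical.
VLEN = 4  # elements processed by one vector instruction (assumed)


def op_mac(state):
    """Scalar multiply-accumulate on element i (loads are folded in for brevity)."""
    i = state["i"]
    state["acc"] += state["a"][i] * state["b"][i]


def op_vmac(state):
    """Vector multiply-accumulate over VLEN consecutive elements."""
    i = state["i"]
    state["acc"] += sum(x * y for x, y in
                        zip(state["a"][i:i + VLEN], state["b"][i:i + VLEN]))


def op_addi(state, step):
    """Advance the element index."""
    state["i"] += step


def op_blt(state, target):
    """Branch to target while the index is below the array length."""
    if state["i"] < len(state["a"]):
        state["pc"] = target


ISA = {"mac": op_mac, "vmac": op_vmac, "addi": op_addi, "blt": op_blt}


def simulate(program, a, b):
    """Run the program and return (result, dynamic instruction count)."""
    state = {"a": list(a), "b": list(b), "i": 0, "acc": 0, "pc": 0}
    count = 0
    while state["pc"] < len(program):
        opcode, *operands = program[state["pc"]]
        state["pc"] += 1
        ISA[opcode](state, *operands)
        count += 1
    return state["acc"], count


SCALAR = [("mac",), ("addi", 1), ("blt", 0)]
VECTOR = [("vmac",), ("addi", VLEN), ("blt", 0)]

a, b = range(1, 9), [2] * 8
scalar_result, scalar_count = simulate(SCALAR, a, b)
vector_result, vector_count = simulate(VECTOR, a, b)
reduction = 100.0 * (scalar_count - vector_count) / scalar_count
print(scalar_count, vector_count, reduction)  # 24 6 75.0
```

Both programs compute the same dot product, but the vector version executes a quarter of the instructions in this toy setting. The figure depends entirely on the assumed VLEN and loop structure and says nothing about the reductions obtained for the actual speech coders.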
4.3 Workload Optimization 
4.3.1 Profiling 
As mentioned previously, ITU-T provides reference C code for a number of speech
coders. Every such reference implementation defines a set of universal, basic arithmetic
operations (functions), essential for the implementation of speech coding algorithms. For
the purpose of this research and in order to investigate the potential acceleration, the ITU
G.729A reference code was profiled initially in native mode (Intel x86) in order to
identify the computation workload distribution in these basic functions [24]. This was
achieved by compiling the code with the compile flag -pg (for embedding profile
instrumentation in the resulting binary) and running it with one of the ITU-T supplied test
vectors, to produce a single profile data file. Subsequently, this was processed by the
gprof Linux utility. Profiling revealed that the average relative amount of time spent
outside the basic-op functions in the reference code was 30.4% and 26.9% for the G.729A
coder and decoder respectively, as shown in Table 4-2 [24]. The same profiling was also
performed for the ITU G.723.1 reference code and the results are also given in Table
4-2.
Table 4-2: Relative amount of time spent outside the basic instructions
Algorithm          Relative CPU Time (%) in Native Mode
G.729A Coder       30.4
G.729A Decoder     26.9
G.723.1 Coder      31.3
G.723.1 Decoder    22.8
As general applicability and consistency of the profiling data were desirable, the 
workloads were profiled again in the SimpleScalar environment which is our simulation 
infrastructure. Table 4-3 depicts the highest percentage of the dynamic instruction count 
spent outside the basic operations of both the application codes, for encoding and 
decoding [25]. 
Table 4-3: Relative number of total instructions executed outside the DSP emulation
instructions
Algorithm          Relative Instructions (%, simulated)
G.729A Coder       34.2
G.729A Decoder     37.2
G.723.1 Coder      34.5
G.723.1 Decoder    33.3
Even though two fundamentally different instruction set architectures and profiling
collection/execution environments were used, both respective profiling metrics of the
codecs were within 5% of one another. The experiments were therefore continued
using the simulated infrastructure, as the results it produces are reasonably independent of
the sampling issues of profiling in native mode and closer to real implementations of
RISC/DSP processing kernels for multimedia applications [25]. The profiling results, as
expected, revealed that the workloads spend a significant amount of time/instructions
executing the basic emulation functions. Table 4-3 shows that up to 66.7% of the total
machine instructions executed (the G.723.1 decoder case) are inside the set of basic functions. A further, very
important observation relates to parallelism exploitation within the tight DSP loops
utilising these basic operations. In general, visual inspection of the code suggests that a
significant number of the basic operations appear in data-parallel loops [24]. It was
apparent that efficient implementation of the basic operations via a configurable
microprocessor with a targeted, data-parallel architecture that closely matches these basic
operations could lead to high performance. These basic instructions are listed in the chart
of Figure 4-2 along with the number of executed machine instructions that they need.
Therefore the creation of vector instructions was based primarily on the profiling
information, selecting the operations that consume the most machine instructions.
Figure 4-2: Machine instruction count for the BASOP.C functions
Additional acceleration of these computationally expensive operations can be achieved by
taking advantage of the Data Level Parallelism (DLP) to recast the DSP emulation
instructions as vector operations in a data-parallel form. As explained in the
previous chapter, vector instructions are a simple yet very powerful mechanism to
significantly improve the performance of the system [26]. The unmodified speech coders
G.729A and G.723.1 were profiled once more using the SimpleScalar toolset for all ITU-T
test vectors. The results of the comprehensive profiling for both codecs are shown in
Table 4-5 and Table 4-4 respectively.
Table 4-4: G.723.1 Unmodified Workloads Instruction Count
Workloads              Instruction Count   Frames
Encoder
Dtx63.tin              10,159,684,865      864
Dtx53mix.tin (r53)     925,852,798         120
Dtx53mix.tin (mixed)   1,062,686,614       120
Decoder
Dtx63.rco              680,066,056         864
Dtx53.rco              90,359,083          120
Dtxmix.rco             90,305,154          120
Dtx63e.tco             925,852,811         120
Dtx63b.tco             9,093,395           11
These results were used as a baseline during the research and optimization phases of the
scalar and vector ISA in order to quantify the benefit precisely.
Table 4-5: G.729A Unmodified Workloads Instruction Count
Workloads   Instruction Count   Frames
Encoder
Algthm      62,613,638          34
Fixed       213,961,855         119
Lsp         3,977,183,269       2231
Pitch       3,253,175,283       1834
Tame        230,917,008         127
Test        311,692,276         175
Speech      6,656,624,952       3749
Decoder
Algthm      13,456,279          34
Fixed       45,865,491          119
Lsp         865,256,672         2231
Pitch       706,161,011         1834
Tame        49,456,050          127
Speech      1,440,402,972       3749
Erasure     114,722,597         299
Overflow    148,851,504         383
Parity      115,390,78          288
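One way to read these baseline counts is as a cost per frame. The following is a minimal sketch, not part of the original toolflow, that converts a workload's dynamic instruction count into instructions per frame and into the instruction rate that real-time operation would imply. The frame durations (10 ms for G.729A, 30 ms for G.723.1) are assumed from the codec standards rather than taken from the tables, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Frame durations in milliseconds; assumed from the codec standards,
   not taken from Tables 4-4 and 4-5. */
#define G729A_FRAME_MS 10.0
#define G7231_FRAME_MS 30.0

/* Average dynamic instructions executed per speech frame. */
double instr_per_frame(uint64_t instr_count, uint32_t frames)
{
    assert(frames > 0);
    return (double)instr_count / (double)frames;
}

/* Instructions per second needed to process frames as fast as they arrive. */
double realtime_instr_rate(uint64_t instr_count, uint32_t frames, double frame_ms)
{
    assert(frame_ms > 0.0);
    return instr_per_frame(instr_count, frames) * (1000.0 / frame_ms);
}
```

For instance, instr_per_frame(6656624952ULL, 3749) gives the per-frame cost of the G.729A Speech encoder workload, against which any optimized version can be compared.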
Table 4-6 lists the top ten most computationally intensive functions of the G.729A
speech coder. As can be seen, the most demanding function is Cor_h_x, which
computes the correlation of the impulse response with the target vector.
Table 4-6: Profiling the G.729A functions by using the speech workload
Function         No of calls   Dynamic Instruction Count   DLP    Description
Cor_h_x          15,000        247,349,024                 High   Compute correlation of target vector
Syn_filt         30,000        236,497,500                 High   Linear Prediction synthesis filter
D4i40_17_fast    7,500         217,172,751                 Low    Algebraic codebook with 4 nonzero pulses
pitch_ol_fast    3,750         213,394,987                 High   Compute the open pitch lag
Autocorr         3,750         203,658,564                 High   Find autocorrelations of signal with windowing
Lsp_pre_select   7,500         199,979,638                 High   First stage quantizer using LSP codebook
Residu           7,500         58,402,500                  High   Compute the LPC residual
pit_pst_filt     7,500         43,319,433                  Low    Find Pitch period and perform Postfiltering
Copy             63,755        41,693,130                  High   Copy input to output vector
Agc              7,500         25,141,988                  High   Scale postfilter output by automatic control
The next function, Syn_filt, implements the synthesis filtering [27]. Visual inspection
of these functions identified the amount of Data Level Parallelism (DLP) that can be
effectively exploited, and this is also shown in Table 4-6. Table 4-7 shows the top ten
most computationally intensive functions of G.723.1 for the 6.3 kbit/s workload. In this
case, the most demanding function is Find_Best, which performs the fixed codebook
search for the high-rate encoder [28]. It contains significant DLP and thus has high
vectorization potential. The next function in the list, Find_Acbk, computes the adaptive
codebook contribution in the closed loop around the open-loop pitch lag. Unfortunately,
this function does not possess sufficient DLP [28].
Table 4-7: Profiling the G.723.1 functions by using the 6.3kbits/s workload
Function      No of calls   Dynamic Instruction Count   DLP      Description
Find_Best     4,408         1,370,009,644               High     Fixed Codebook Search
Find_Acbk     2,772         915,225,959                 Low      Adaptive Codebook Calculation
Estim_Pitch   1,728         430,602,013                 High     Open-loop pitch estimation
Lsp_Svq       926           141,876,220                 Medium   Search for the LSP indices
Comp_Lpc      864           126,386,784                 High     Computes the LPC filter coefficients
Upd_Ring      3,456         98,506,368                  Medium   Update memory of the filters
Sub_Ring      2,772         78,871,716                  Low      Computes the zero-input response and target speech vector
Comp_Ir       2,772         78,048,432                  Low      Computes the combined impulse response from the filters
Error_Wght    864           66,604,896                  Low      Implements the formant perceptual weighting filter
Decod_Acbk    6,228         31,267,516                  High     Computes the adaptive codebook contribution
This functional profiling in conjunction with visual inspection indicates that a vector
implementation of the basic operations can lead to a high performance processing 
platform for these workloads. This is the basic premise around which a vector ISA and 
microarchitecture have been defined in this work. 
4.3.2 Vector ISA Development and Experimentation Methodology 
This section describes the optimization methodology adopted for both ITU G.729A and
G.723.1 reference codes on the vector coprocessor software model. The main steps of the
software optimization process are depicted in Figure 4-3. The selection of the kernels for
optimization was based primarily on the profiling information for both ITU speech
coding algorithms and focused on the most time/instruction-critical functions.
[Flowchart: Profile Algorithm; Architecture Specification; Vector and Scalar Extensions; Vectorize Data Parallel Loops; Run Tests (X86 Mode); Scalar optimization of non-vectorizable section; Run Tests (X86 Mode); Tests O.K?; Instructions in Inline Assembly; Run Tests (SS Mode); Tests O.K?; Simulation; Architectural Results]
Figure 4-3: Experimentation Methodology
The architectural state of the proposed vector accelerator was defined in the architectural
configuration file vstate.h. That file precisely describes the extended processor state,
on top of the existing one (the SimpleScalar-specified processor state). Figure 4-4 illustrates
the contents of vstate.h that relate to the extended vector state (the vstateT
structure). The #define directives specify the number of vector and scalar registers,
the vector accumulators and the predication and overflow flag bits. As the coprocessor is
parametric, the parameter VLMAX is defined at the beginning of the file and
determines the maximum vector length of the vector components. In this particular case
VLMAX is equal to 8. This means that a vector register holds 8 x 16-bit elements
and a vector accumulator holds (8/2) x 32-bit elements. The structure
vstateT encapsulates the total vector coprocessor state. It includes the definition of all
the above-mentioned programmer-visible registers as two-dimensional arrays, apart from
the overflow flag, which is a one-dimensional array.
//*****************
#define VLMAX 8
//*****************
typedef signed short int VECTOR[VLMAX];
#define VECTOR_REGS 16
#define VACCUMULATORS 2
#define PRED_REGS 1
#define SCALAR_REGS 16
typedef struct
{
// Vector length register
int VLEN;
// Vector register file
signed short int VRF[VECTOR_REGS][VLMAX];
// Vector accumulators
signed int VACC[VACCUMULATORS][VLMAX/2];
// Predicate registers
unsigned short int PRED[PRED_REGS][VLMAX];
// Scalar registers
signed int SRF[SCALAR_REGS];
// Vector overflow
unsigned short int V16[VLMAX];
} vstateT;
Figure 4-4: The extended processor state as defined in the configuration file vstate.h 
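The relationship between VLMAX and the register widths can be checked with a little arithmetic. The sketch below is illustrative only and its function names are hypothetical. It computes the width of a vector register and of a vector accumulator for a given VLMAX, under the element sizes stated in the text (16-bit register elements, 32-bit accumulator elements, VLMAX/2 accumulator elements).

```c
#include <assert.h>

/* Bits in one vector register: vlmax elements of 16 bits each. */
unsigned vreg_bits(unsigned vlmax)
{
    return vlmax * 16u;
}

/* Bits in one vector accumulator: vlmax/2 elements of 32 bits each. */
unsigned vacc_bits(unsigned vlmax)
{
    assert(vlmax % 2u == 0u);   /* the halving assumes an even VLMAX */
    return (vlmax / 2u) * 32u;
}

/* Total bits held by the vector register file plus the accumulators,
   for register counts such as those #defined in Figure 4-4. */
unsigned vector_state_bits(unsigned vlmax, unsigned vregs, unsigned vaccs)
{
    return vregs * vreg_bits(vlmax) + vaccs * vacc_bits(vlmax);
}
```

With VLMAX equal to 8, both a vector register and an accumulator come out 128 bits wide, which is why the accumulator needs only half as many (double-width) elements.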
Subsequently, vector instruction extensions were developed that match the basic
operations of the speech coding algorithms, as these had proved to be the most critical. In
order to check the coprocessor at the functional level without the need to specify any
underlying technology, C macros were created to represent the vector instruction
extensions. This resulted in a new codebase which included these new instructions and
could thus benefit from the power of the vector hardware. With this method, the
instruction-accurate model of the coprocessor was verified with the help of the test vectors by
directly comparing the output of the modelled coprocessor with that of the original scalar
one.
In order to be able to run the algorithm in vector mode, it was essential to re-write the
data-level-parallel loops of the code in vector assembly in such a manner that no semantic
difference exists between the vectorized and the original code. At this point, it must be made
clear that the architecture specification and the ISA development are interlocked and both
evolved during the vectorization of the workloads. The remaining (non-vectorizable) part
of the code was also optimized by re-writing it in scalar assembly using Scalar
Instruction Set extensions. Initially, all the created instructions were modelled in C and were
included in the x86_visa.h header file located in the source directory of the ITU
codecs. This step allowed for at-speed validation, in native mode, of the custom
instructions, with the original instructions replaced by the instruction extensions. An
example of such an instruction, as defined in the x86_visa.h file, is shown in Figure 4-5.
#ifdef X86
//Vector register shift right
/*-------------------*/
#define vshri(vrf,amount)\
/*-------------------*/\
({\
extern vstateT vstate;\
int index;\
stats_start;\
update_stats("vshri");\
for (index = 0; index < vstate.VLEN; index++)\
{\
putv\
if (vrf != 0)\
vstate.VRF[vrf][index] = shr_simple(index, vstate.VRF[vrf][index], (Word16)amount);\
}\
orvi\
regv;\
stats_end;\
});
Figure 4-5: Example of a C macro Instruction Definition 
The pre-processor directive #ifdef X86 at the beginning of the instruction is used as a
switch to enable or disable the C macros when executing on Linux x86. The vshri
instruction performs an arithmetic shift right of the source vector register, vrf, by as
many positions as the variable amount defines. It calls the shr_simple function for each
vector element to perform the shift right operation. The result is stored in the destination
register vrf. The shift is performed within a loop of vlen iterations, which is the
dynamic number of elements (16 bits each) that comprise the operand vector (vrf). This
number is specified in the vlen_r register. A similar format was followed for all the
instructions. The main difference between the scalar and vector instructions is that
the former do not contain a loop, as the length of the scalar operands is constant while
the length of the vector operands is parametric (set at run time). After all the identified DLP
loops were replaced with vector assembly, the optimized workloads were validated by
running the test vectors to ensure that there is no semantic difference between the
vectorized and the original code. The remainder of the code was optimised using
scalar assembly and again verified by running the same ITU test vectors. When the
optimization of the workloads was complete, the vector and scalar instructions were re-written
in inline assembly and inserted in the SimpleScalar simulation infrastructure to extend its
functionality and thus obtain the architectural simulation results. The steps mentioned
above are described in more detail in the following sections of this chapter.
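The behaviour described for vshri can be summarised in a small stand-alone model. The following is a minimal sketch under stated assumptions, not the thesis macro: the state structure is cut down to the fields needed, shr16 is a hypothetical stand-in for shr_simple (a plain arithmetic shift with no statistics or overflow handling), and the guard on register 0 simply mirrors the vrf != 0 test visible in Figure 4-5.

```c
#include <assert.h>
#include <stdint.h>

#define VLMAX       8
#define VECTOR_REGS 16

/* Reduced, illustrative version of the coprocessor state. */
typedef struct {
    int     VLEN;                     /* dynamic vector length, 0..VLMAX */
    int16_t VRF[VECTOR_REGS][VLMAX];  /* vector register file */
} vstate_sketch;

/* Hypothetical stand-in for shr_simple: arithmetic shift right of one
   16-bit element. Shifts of 15 or more collapse to the sign. */
int16_t shr16(int16_t x, int amount)
{
    assert(amount >= 0);
    if (amount >= 15)
        return (int16_t)(x < 0 ? -1 : 0);
    /* Avoids relying on implementation-defined >> of negative values. */
    return (int16_t)(x < 0 ? ~((~x) >> amount) : (x >> amount));
}

/* One loop iteration per active element, as in the C macro model. */
void vshri_sketch(vstate_sketch *s, int vrf, int amount)
{
    assert(s->VLEN >= 0 && s->VLEN <= VLMAX);
    if (vrf == 0)
        return;                       /* writes to register 0 are skipped */
    for (int i = 0; i < s->VLEN; i++)
        s->VRF[vrf][i] = shr16(s->VRF[vrf][i], amount);
}
```

Only the first VLEN elements are touched, so elements beyond the dynamic vector length keep their previous contents.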
4.3.3 Identification of Data Parallel Loops 
As already discussed, parts of both C reference codes had to be re-written in vector
assembly in order to run efficiently on the vector accelerator. The replacement of scalar
operations by vector extensions is called vectorization. Vectorization takes place in
functions that exhibit Data Level Parallelism (DLP). Such functions typically operate
iteratively on blocks of data without the presence of loop-carried data
dependences. Careful examination of the code made it apparent that the main area of
interest is the loops, as the overwhelming majority of them perform DSP-type operations on
arrays of data. These loops were therefore targeted and their bodies were replaced with
vector operations semantically equivalent to the original code. Any mismatch in the
output bitstreams between the original (ITU-T) and vectorized codes is
attributed to loop-carried dependences which cannot be eliminated [29]. Chapter 3
described all the types of data dependences that can be detected in a program. In this case
only the true dependences between statements in a loop were considered. More
specifically, every loop was examined to determine whether a statement depends upon
itself (loop-carried dependence) or whether a statement that writes a memory location precedes
a statement that uses that memory location as an input [29]. Figure 4-6 and Figure 4-7
illustrate a data-dependent (non-vectorizable) loop and a data-independent
(vectorizable) loop respectively. In Figure 4-6 the loop calculates the Line Spectral Pair (LSP)
coefficients in the G.729A encoder and shows inter-statement (iteration-carried) dependences.
This loop cannot be vectorized. As can be seen, statements S5 and S9 depend upon input
values that were created by the previous execution (iteration) of S5 and S9 respectively.
for (i = 0; i< NC; i++)
{
S2 t0 = L_mult(a[i+1], 8192); /* x = (a[i+1]+a[M-i])>>1 */
S3 t0 = L_mac(t0, a[M-i], 8192); /* -> From Q11 to Q10 */
S4 x = extract_h(t0);
S5 f1[i+1] = sub(x, f1[i]); /* f1[i+1]=a[i+1]+a[M-i]-f1[i] */
S6 t0 = L_mult(a[i+1], 8192); /* x = (a[i+1]-a[M-i]) >> 1 */
S7 t0 = L_msu(t0, a[M-i], 8192); /* -> From Q11 to Q10 */
S8 x = extract_h(t0);
S9 f2[i+1] = add(x, f2[i]); /* f2[i+1]=a[i+1]-a[M-i]+f2[i] */
}
Figure 4-6: Example of a non-vectorizable loop as the statement S5 depends on a previous
result of the S5 execution. The same dependency appears in statement S9.
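The effect of such a loop-carried dependence can be demonstrated on a reduced recurrence of the same shape as statement S5. The sketch below is an illustration under simplifying assumptions (plain int arithmetic instead of the saturating basic operations, hypothetical function names). It evaluates the recurrence once in scalar order and once in a naive "vector" order where all inputs are read before any result is written, which is what a straightforward vector load, subtract and store would do.

```c
#include <assert.h>

#define N 8   /* number of iterations; f[] holds N + 1 values */

/* Scalar order: each iteration reads the value the previous one wrote. */
void recurrence_scalar(const int *x, int *f)
{
    for (int i = 0; i < N; i++)
        f[i + 1] = x[i] - f[i];
}

/* Naive vector order: all f[i] inputs are read up front, so iteration i
   sees the stale value of f[i] rather than the freshly computed one. */
void recurrence_naive_vector(const int *x, int *f)
{
    int old[N];
    for (int i = 0; i < N; i++)
        old[i] = f[i];
    for (int i = 0; i < N; i++)
        f[i + 1] = x[i] - old[i];
}
```

The two versions agree on the first result only; from the second element onwards the naive vector order diverges, which is the kind of mismatch that would show up in the output bitstream.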
Figure 4-7 presents a loop that subtracts the unquantized LSP frequencies for the current
frame in order to compute the VQ weighting vectors. It selects the frequencies that are
closer in value to each other in order to produce weights of greater precision. As
shown, this loop is vectorizable, as both statements (S2 and S3) are independent of
previous results of their execution (the producer/consumer iteration indexes are linear
combinations of one another and independent). The inputs of these statements are arrays of
the current frequencies, which can be loaded from assigned pointers into the vector registers.
for ( i = 1 ; i < LpcOrder-1 ; i ++ )
{
S2 Tmp0 = sub( CurrLsp[i+1], CurrLsp[i] ) ;
S3 Tmp1 = sub( CurrLsp[i], CurrLsp[i-1] ) ;
S4 if ( Tmp0 > Tmp1 )
S5 Wvect[i] = Tmp1 ;
S6 else
S7 Wvect[i] = Tmp0 ;
}
Figure 4-7: Example of a vectorizable loop with statements S2 and S3 being independent
from previous results of their execution.
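Because no iteration of this loop consumes a value produced by another, it can be evaluated in blocks without changing the result. The following sketch is illustrative only: plain subtraction stands in for the saturating sub basic operation, the function names are hypothetical, and the per-lane select models what a predicated vector operation might do. It compares the original loop order with a blocked order that processes up to VLMAX iterations at a time.

```c
#include <assert.h>
#include <stdint.h>

#define VLMAX 8

/* Reference form of the loop in Figure 4-7. */
void wvect_scalar(const int16_t *lsp, int16_t *w, int order)
{
    for (int i = 1; i < order - 1; i++) {
        int16_t t0 = (int16_t)(lsp[i + 1] - lsp[i]);
        int16_t t1 = (int16_t)(lsp[i] - lsp[i - 1]);
        w[i] = (t0 > t1) ? t1 : t0;
    }
}

/* Blocked form: all differences of a block are computed first, then a
   per-lane select writes the smaller one. */
void wvect_blocked(const int16_t *lsp, int16_t *w, int order)
{
    for (int base = 1; base < order - 1; base += VLMAX) {
        int16_t t0[VLMAX], t1[VLMAX];
        int len = order - 1 - base;          /* iterations left */
        if (len > VLMAX)
            len = VLMAX;
        for (int k = 0; k < len; k++)
            t0[k] = (int16_t)(lsp[base + k + 1] - lsp[base + k]);
        for (int k = 0; k < len; k++)
            t1[k] = (int16_t)(lsp[base + k] - lsp[base + k - 1]);
        for (int k = 0; k < len; k++)
            w[base + k] = (t0[k] > t1[k]) ? t1[k] : t0[k];
    }
}
```

Reordering the work this way is safe here precisely because the loop only reads CurrLsp and only writes Wvect, so no block can observe a value written by another.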
The same methodology was followed for all the loops in both C reference codes, for both
encoder and decoder. Whenever iteration-carried data dependences did not arise between
loop statements, the loops were re-written in vector assembly as described in the following
section.
4.3.4 Implementation of vector loop using custom ISA
Figure 4-8 shows a loop that quantizes the difference between the computed and
predicted coefficients at the first-stage vector quantizer in the LP analysis of the G.729A
encoder. This vectorizable loop has M iterations (the value is specified at compile time). Each
iteration subtracts (sub) an element of one array from the other, multiplies the result by itself and
adds the product to the accumulator (L_mac), for the entire current frame M. As can be
seen, these two operations are data independent, as the iterated statements do not use
values computed in previous iterations. Therefore they can safely be replaced and
directly converted to vector form. The pre-processor directive #ifdef ORIGINAL selects
conditional compilation of the code so that this non-optimized part runs when the original
mode is selected in the compile.h header file.
/******************LOOP1*******************/
#ifdef ORIGINAL
for ( j = 0 ; j < M ; j ++ )
{
tmp = sub(rbuf[j], lspcb1[i][j]);
L_tmp = L_mac( L_tmp, tmp, tmp );
}
Figure 4-8: Example of loop with DLP within the original C code 
Figure 4-9 depicts the first part of the transformed loop with the introduction of vector
assembly. Having identified that the loop is vectorizable, it is necessary to identify the
inputs and outputs of the loop that have to be loaded into or stored from vectors. Associating
these I/O vectors with 16-bit or 32-bit pointers allows the data to be represented
using the 16-bit elements of the vector registers or the 32-bit elements of the vector
accumulators respectively. The newly created pointers point to the first values in both
data arrays (inputs). All the intermediate values are stored temporarily in the vector
registers or accumulators, depending on the instruction. Once the pointers are set, the
vector length register (vlen_r) needs to be loaded with the maximum vector length
(VLMAX). The last instruction in this code snippet, vsplatacci(...), loads the value
zero into vector accumulator zero in order to clear it before any calculations take place.
#else
{
//Set Pointers
signed short int *from1=rbuf;
signed short int *from2=&lspcb1[i][0];
//Load VLMAX into vlen_r register
ldvlen_r(VLMAX);
//Clear accumulator
vsplatacci(0,0);
Figure 4-9: Assign pointers and load the vlen_r register
Figure 4-10 illustrates the main vectorized loop (modulus part). This loop only
deals with whole vector lengths. In the figure the
original loop range is decreased by dividing the initial iteration count by the maximum
number of vector elements available, VLMAX. Doing this, in combination with
incrementing the vector pointers from1 and from2 by VLMAX, makes the pointers
point to a new set of vector data on each iteration of the loop. This part of the code is
executed as many times as the quotient of this division.
//Modulus Part
for (i=0; i < M/VLMAX; i++)
{
//Load vector register from rbuf
S1 vldw(1, from1);
//Load vector register from &lspcb1
S2 vldw(2,from2);
//Perform subtraction to vr1, vr2
S3 vitu_sub_r(3,1,2);
//Multiply even word and add to VACC0
S4 vmace(0,3,3);
//Multiply odd word and add to VACC0
S5 vmaco(0,3,3);
//Increase address pointers
S6 from1 += VLMAX;
S7 from2 += VLMAX;
}
Figure 4-10: Main vector loop 
Within this loop five custom vector instructions are executed. The first two (statements
S1, S2) are vector loads which load the data from the pointer addresses from1 and
from2 and deposit them in vector registers 1 and 2 respectively. The next three
instructions (S3, S4, S5) perform the main functionality of the loop, that is, the vector
subtraction and multiply-accumulate operations. First the subtraction is executed on
vector source registers 1 and 2 and the result is stored in vector register 3. The
multiply-accumulate calculation is performed as a pair of instructions, for the even and odd
elements respectively. Each of these instructions multiplies register 3 by itself and
adds the product to the corresponding even or odd elements of accumulator 0. The last
two statements increment the pointers by VLMAX to prepare the data for the next loop
iteration.
Since the original loop count, in this case M, may not be exactly divisible by VLMAX,
a remainder section (loop strip mining code) is required to ensure that all the original data
is processed. Strip mining handles a loop whose number of iterations
is not an exact multiple of the VLMAX architecture constant. This code is only
executed if there is a remainder from the modulus operation, M%VLMAX. If this is the case,
the vlen register (dynamic vector length) is loaded with the new vector length M%VLMAX
in order to indicate on which elements the vector instructions will operate during the
strip-mined section. This section is executed only once, for the specified vector elements, and
thus only a subset of the vector datapath is utilised during this section.
//Remainder Part
if (M % VLMAX)
{
S1 ldvlen_r(M % VLMAX);
//Load vector register from rbuf
S2 vldw(1, from1);
//Load vector register from &lspcb1
S3 vldw(2,from2);
//Perform subtraction to vr1, vr2
S4 vitu_sub_r(3,1,2);
//Multiply even word and add to VACC0
S5 vmace(0,3,3);
//Multiply odd word and add to VACC0
S6 vmaco(0,3,3);
}
S7 ldvlen_r(VLMAX);
// Do ADD reduction of VACC0
S8 vaccaddreduce(0);
//Store accumulator value in element 0 to L_tmp
S9 vstacc(0,0,&L_tmp);
}
#endif
Figure 4-11: Strip mining loop 
The last section of the code snippet in Figure 4-11 restores the dynamic vector length
register to the maximum vector length so that the vector accumulator can perform an
add-reduce operation over all its elements and produce a final 32-bit scalar result. This result is
deposited in the lowest element (element 0) of accumulator 0 and is stored into memory
at the address of L_tmp by statement S9. In the vectorized code the arithmetic
instructions calculating the displacement from the index base are reduced by
(VLMAX + 1) times, as the number of iterations is divided by VLMAX, plus the modulus
calculation for loop strip mining.
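The whole transformation of Figures 4-9 to 4-11 can be modelled in plain C to check that the strip-mined version computes the same sum as the original loop. This is a minimal sketch under several assumptions: ordinary 32-bit arithmetic replaces the saturating sub and L_mac basic operations (so the doubling performed by L_mac is not modelled), even and odd lanes 2k and 2k+1 are assumed to accumulate into accumulator element k, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define VLMAX 8

/* Scalar reference: sum of squared differences over n elements. */
int32_t ssd_scalar(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    for (int j = 0; j < n; j++) {
        int32_t d = a[j] - b[j];
        acc += d * d;
    }
    return acc;
}

/* One vector step over vlen active lanes. */
static void vector_step(const int16_t *a, const int16_t *b, int vlen,
                        int32_t vacc[VLMAX / 2])
{
    int16_t d[VLMAX];
    for (int k = 0; k < vlen; k++)            /* models vitu_sub_r */
        d[k] = (int16_t)(a[k] - b[k]);
    for (int k = 0; k < vlen; k += 2)         /* models vmace: even lanes */
        vacc[k / 2] += (int32_t)d[k] * d[k];
    for (int k = 1; k < vlen; k += 2)         /* models vmaco: odd lanes */
        vacc[k / 2] += (int32_t)d[k] * d[k];
}

/* Main loop over whole vectors, one remainder step, then add-reduce. */
int32_t ssd_strip_mined(const int16_t *a, const int16_t *b, int n)
{
    int32_t vacc[VLMAX / 2] = {0};            /* models vsplatacci(0,0) */
    for (int i = 0; i < n / VLMAX; i++) {
        vector_step(a, b, VLMAX, vacc);
        a += VLMAX;
        b += VLMAX;
    }
    if (n % VLMAX)
        vector_step(a, b, n % VLMAX, vacc);   /* shortened vector length */
    int32_t sum = 0;                          /* models vaccaddreduce(0) */
    for (int k = 0; k < VLMAX / 2; k++)
        sum += vacc[k];
    return sum;
}
```

With n = 10 and VLMAX = 8 the main loop runs once and the remainder step handles the last two elements, which matches the case where M is not a multiple of VLMAX.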
4.3.5 Scalar Optimization 
All the data-parallel loop constructs that did not exhibit any data (iteration-carried)
dependences were re-written in vector assembly, using vector instruction extensions and
the techniques discussed previously. The remainder of the code, which comprises
non-vectorizable loops and parts that contain BASOP instructions, was optimized through the
addition of custom scalar instructions. Figure 4-12 depicts an example of scalar assembly
that replaces part of the original code. This loop transforms the LSP coefficients back to
LPC coefficients. As can be seen, this loop presents a data dependency, as both statements of
the original code depend upon input values that were created by the previous
execution/iteration (f1[i-1] and f2[i-1]). Therefore this loop is not vectorizable and
can only be optimized by replacing the BASOP operations with scalar instructions. The
pre-processor directive #ifdef METHOD2 is used at compile time to allow this
scalar-optimized part of the reference code to run, and it is activated by the METHOD2 switch in
the compile.h header file. In a similar manner to the vector-optimized loops, the
operands are loaded into the coprocessor registers. The difference is that these
registers are scalar and the loop iteration count is the same as the original one. The next
instructions perform long addition (L_add) and long subtraction (L_sub) on the scalar
registers, and the results are stored back to memory.
for (i = 5; i > 0; i--)
{
/******************** METHOD2 ********************/
#ifdef METHOD2
//Load variable f1[i] in register[1]
m2sld32(1,f1[i]);
//Load variable f1[i-1] in register[2]
m2sld32(2,f1[i-1]);
//Load variable f2[i] in register[3]
m2sld32(3,f2[i]);
//Load variable f2[i-1] in register[4]
m2sld32(4,f2[i-1]);
//Perform L_add
m2sladd(1,1,2);
//Perform L_sub
m2slsub(3,3,4);
//Store to f1[i]
m2sst32(1,f1[i]);
//Store to f2[i]
m2sst32(3,f2[i]);
#else //ORIGINAL CODE
f1[i] = L_add(f1[i], f1[i-1]); /* f1[i] += f1[i-1]; */
f2[i] = L_sub(f2[i], f2[i-1]); /* f2[i] -= f2[i-1]; */
#endif
}
Figure 4-12: Scalar optimization example
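The load, operate and store pattern of the scalar extensions can be modelled with a small register-file emulation. This is a sketch only: the sk_ helpers are hypothetical functions that merely mirror the roles of the m2sld32, m2sladd, m2slsub and m2sst32 macros, they take pointers where the macros take variable names, and the 32-bit saturation is an assumption based on the usual behaviour of the L_add and L_sub basic operations.

```c
#include <assert.h>
#include <stdint.h>

#define SCALAR_REGS 16

/* Illustrative scalar register file of the coprocessor. */
typedef struct { int32_t SRF[SCALAR_REGS]; } sstate_sketch;

/* Clamp a 64-bit intermediate to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

void sk_ld32(sstate_sketch *s, int r, const int32_t *src) { s->SRF[r] = *src; }
void sk_st32(const sstate_sketch *s, int r, int32_t *dst) { *dst = s->SRF[r]; }

void sk_ladd(sstate_sketch *s, int rd, int ra, int rb)
{
    s->SRF[rd] = sat32((int64_t)s->SRF[ra] + s->SRF[rb]);
}

void sk_lsub(sstate_sketch *s, int rd, int ra, int rb)
{
    s->SRF[rd] = sat32((int64_t)s->SRF[ra] - s->SRF[rb]);
}

/* The loop of Figure 4-12 expressed with the helpers above; the iteration
   count and order are the same as in the original scalar loop. */
void lsp_pairs_sketch(int32_t *f1, int32_t *f2, int n)
{
    sstate_sketch s = {{0}};
    for (int i = n; i > 0; i--) {
        sk_ld32(&s, 1, &f1[i]);
        sk_ld32(&s, 2, &f1[i - 1]);
        sk_ld32(&s, 3, &f2[i]);
        sk_ld32(&s, 4, &f2[i - 1]);
        sk_ladd(&s, 1, 1, 2);
        sk_lsub(&s, 3, 3, 4);
        sk_st32(&s, 1, &f1[i]);
        sk_st32(&s, 3, &f2[i]);
    }
}
```

The gain in the real system comes from each helper corresponding to a single custom instruction instead of a multi-instruction BASOP function call; the emulation above only reproduces the data flow.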
4.3.6 Validation Tests 
Every time a vector or scalar assembly instruction was added to one of the C reference
codes, tests were run using the test vectors provided by the ITU-T. This was to verify the
full algorithmic equivalence between the optimized and the original (reference) codes.
The test vectors employed for both algorithms are listed in Tables 4-8 to 4-10 below.
Table 4-8: G729 Encoder Test Vectors
Input vector   ITU Reference output   Description
Algthm.in      Algthm.bit             Conditional parts of the algorithm
Fixed.in       Fixed.bit              Fixed codebook search
Lsp.in         Lsp.bit                Lsp quantization
Pitch.in       Pitch.bit              Pitch search
Speech.in      Speech.bit             Generic speech file
It is important to note that these vectors are not exhaustive and thus can only be part of a 
more comprehensive validation suite. 
Table 4-9: G729 Decoder Test Vectors
Input vector   ITU Reference output   Description
Algthm.bit     Algthm.pst             Conditional parts of the algorithm
Fixed.bit      Fixed.pst              Fixed codebook search
Lsp.bit        Lsp.pst                Lsp quantization
Pitch.bit      Pitch.pst              Pitch search
Speech.bit     Speech.pst             Generic speech file
Tame.bit       Tame.pst               Taming procedure
Erasure.bit    Erasure.pst            Frame erasure recovery
Overflow.bit   Overflow.pst           Overflow detection in synthesizer
Parity.bit     Parity.pst             Parity test
Passing these vectors can be considered a minimum requirement, and is not a guarantee
that the implementation is correct for every possible input signal.
Table 4-10: G.723.1 Encoder and Decoder Test Vectors
Input vector                ITU Reference output   Description
Encoder
dtx63.tin                   dtx63.rco              Encoder input / 6.3 rate
dtx53mix.tin                dtx53.rco              Encoder input / 5.3 and mixed rate
dtx53mix.tin / dtxmix.rat   dtxmix.rco             Encoder rate input
Decoder
dtx63.rco                   dtx63.rou              Decoder input / rate 6.3
dtx53.rco                   dtx53.rou              Decoder input / rate 5.3
dtxmix.rco                  dtxmix.rou             Decoder input / mixed rate
dtx63e.tco / dtx63e.crc     dtx63e.rou             Decoder input / rate 6.3 with Cyclic Redundancy Check (CRC) input
dtx63b.tco                  dtx63b.rou             Decoder input / rate 6.3
For the purpose of this research these ITU-supplied test vectors were used to ensure
compliance of the reference speech coders throughout the optimization phase. The
compilers used to compile the vectorized reference code were gcc 3.3.2 (Linux x86) [30]
and the gcc 2.7.3 [30] cross-compiler for the SimpleScalar ISA.
4.3.7 The extended ISA (Scalar and Vector Extensions) 
This section describes the modifications that were made to the core SimpleScalar toolset
in order to emulate the coprocessor architecture under study. The sim-vector tool that was
used is an extended simulator based on the sim-fast simulator but modified with added
state (coprocessor scalar and vector state) and instructions (coprocessor scalar and vector
instructions). This code includes the extra processor state and the instructions that operate
on that extended state. The extended state specifies the additional registers on top of the
existing architectural state (SimpleScalar processor state). The vector.def file includes
the definition of all the existing SimpleScalar instructions along with the extended
instruction set architecture. The vector.def file contains PISA.def, which includes
C macro implementations of all the base-case SimpleScalar instructions.
Vector.def example
switch ((inst.b >> CATEGORY_LSB) & CATEGORY_MASK) \
{\
/**************************/\
case 2: /* CATEGORY 2 */\
/**************************/\
switch (OPCODE)\
{\
case 1:\
{\
switch ((inst.b >> EXT_OPCODE_LSB) & EXT_OPCODE_MASK)\
{\
case 3:\
{\
/* VSHRI */\
extern vstateT vstate;\
enum md_fault_type _fault;\
int index; \
Word16 amount;\
amount=GPR(IMM9_ADDR);/*(Word16)IMM8;*/\
for (index=0; index< vstate.VLEN; index++)\
{\
if (RD_ADDR != 0)\
vstate.VRF[RD_ADDR][index]=my_shr_simple(index,vstate.VRF[RS1_ADDR][index],(Word16)amount);\
}\
break;
Figure 4-13: Instruction Definition in Vector.def
The vector.def file contains the opcode definitions of the whole extended instruction
set. Extended opcodes are split into 3 parts: the opcode (bits 20-24), the category (bits 25-28)
and the extended opcode (bits 29-31). A typical opcode is implemented with 3 levels of
switch statements. The first level is the category switch, the second level the opcode
switch and the final level the extended opcode switch. Figure 4-13 shows the C
description of the vector shift right coprocessor command which, as can be seen, is similar
to the C macro definition of the instruction in Figure 4-5. The main difference is how the
source/destination registers are decoded: they have been extracted from the instruction
opcode at an earlier stage. Inside the loop the vector instruction is performed, and every
loop iteration represents a vector datapath lane. This replicates the functionality of the
vector processor. When the extended SimpleScalar toolset is running, the extended vector
instruction count is added to the default instruction count to derive precise execution
statistics for the whole (base and extended) processor architecture.
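The three-level decode follows directly from the bit positions quoted above. The sketch below is a hedged illustration rather than the contents of vector.def: the LSB and mask values are assumptions derived from the stated bit ranges (20-24, 25-28, 29-31), the three register fields at bits 15, 10 and 5 are taken from the shifts used in the inline assembly macro of section 4.3.8 and assumed to be 5 bits wide, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions; values assumed from the bit ranges given in the text. */
enum {
    RS2_LSB = 5, VRS1_LSB = 10, VRD_LSB = 15,
    OPCODE_LSB = 20, CATEGORY_LSB = 25, EXT_OPCODE_LSB = 29
};

/* Field masks: 5-bit registers and opcode, 4-bit category, 3-bit ext opcode. */
enum {
    REG_MASK = 0x1F, OPCODE_MASK = 0x1F,
    CATEGORY_MASK = 0xF, EXT_OPCODE_MASK = 0x7
};

/* Pack the fields of one extended instruction word. */
uint32_t encode_ext(unsigned ext, unsigned cat, unsigned op,
                    unsigned vrd, unsigned vrs1, unsigned rs2)
{
    return ((uint32_t)(ext  & EXT_OPCODE_MASK) << EXT_OPCODE_LSB) |
           ((uint32_t)(cat  & CATEGORY_MASK)   << CATEGORY_LSB)   |
           ((uint32_t)(op   & OPCODE_MASK)     << OPCODE_LSB)     |
           ((uint32_t)(vrd  & REG_MASK)        << VRD_LSB)        |
           ((uint32_t)(vrs1 & REG_MASK)        << VRS1_LSB)       |
           ((uint32_t)(rs2  & REG_MASK)        << RS2_LSB);
}

/* Extract one field, as each level of the switch would. */
unsigned field(uint32_t word, unsigned lsb, unsigned mask)
{
    return (unsigned)((word >> lsb) & mask);
}
```

Under these assumptions, vshri on vector register 4 with host register 10 would be encode_ext(3, 2, 1, 4, 4, 10), and decoding it yields category 2, opcode 1 and extended opcode 3, the three case labels visible in Figure 4-13.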
4.3.8 Inline Assembly 
The macro-based C representation of the extended instructions adds considerable
execution overhead, as every opcode expands to a number of instructions; it is therefore
used only to model the execution of these instructions. In order to derive a final
optimized implementation, the extended instructions were instead inserted with inline
assembly.
#endif
#ifdef ss
// simplescalar
#define vshri(vrf,amount) \
({\
asm volatile ("addu $10,%0,$0" : : "r"(amount) : "$10");\
asm volatile (".word 0x00010000");\
asm volatile (".word \
3 << 29 | /* EXT_OPCODE */\
2 << 25 | /* CATEGORY */\
1 << 20 | /* OPCODE */\
"#vrf" << 15 | /* VRD = VRF */\
"#vrf" << 10 | /* VRS1 = VRF */\
10 << 5"); /* RS2 = HOST REG */ \
});
#else
// Sparc
#endif
Figure 4-14: Inline Assembly Instruction Definition
With this method, every scalar/vector opcode corresponds to one instruction only, and
the added instructions can run in SimpleScalar mode to produce reliable statistics.
Figure 4-14 illustrates an example of inline assembly for the vshri vector instruction,
whose C macro was shown in Figure 4-5. The asm volatile statement is divided into three
parts. The first part is the code section, where the first (source) operand (%0) is added
to register 0 ($0) and stored into the target register 10 ($10) [30]. Since there are no
output operands, two consecutive colons are placed where the output operands would go.
The "r"(amount) entry is the input register operand. The "r" is a constraint string which
indicates that the following C variable (amount) is placed in a general register. The
last part of the asm statement, the clobber list, informs the compiler which registers
are clobbered (modified) by the assembly code. In this example "$10" indicates to the
compiler that register 10 has been modified by the inline assembly. The next two assembly
lines comprise a 64-bit opcode that is decoded dynamically by sim-vector at run time.
The first part of the opcode, which is 32 bits wide (0x00010000), represents the nop
instruction with annotation 1 (flag), whereas the second part builds the remaining 32
bits (word), which is the actual vector instruction [30]. This word is the binary pattern
of the instruction set extensions. The compiler composes the opcode binary according to
the above inline assembly statements. At run time SimpleScalar encounters the nop opcode
and checks the binary pattern. If it is an extended instruction, it performs the
transformation on the processor state as specified by the extended opcode.
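To make the composition of the 64-bit opcode concrete, the following C sketch assembles the same two words that the .word directives of Figure 4-14 emit. It is a minimal illustration under stated assumptions: the shift amounts and the marker value come from the figure, while the 5-bit register field width, the ext_insn_pair type and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Marker word from Figure 4-14: a nop carrying annotation 1. */
#define EXT_MARKER_WORD 0x00010000u

/* The two 32-bit words emitted by the two .word directives. */
typedef struct {
    uint32_t marker;  /* annotated nop that flags an extended instruction */
    uint32_t insn;    /* binary pattern of the extended instruction */
} ext_insn_pair;

/* Assemble the extension word with the shifts used in Figure 4-14.
 * Register fields are assumed to be 5 bits wide (hypothetical). */
static uint32_t encode_ext_word(uint32_t ext_opcode, uint32_t category,
                                uint32_t opcode, uint32_t vrd,
                                uint32_t vrs1, uint32_t rs2)
{
    assert(ext_opcode < 8u && category < 16u && opcode < 32u);
    assert(vrd < 32u && vrs1 < 32u && rs2 < 32u);
    return (ext_opcode << 29) | (category << 25) | (opcode << 20) |
           (vrd << 15) | (vrs1 << 10) | (rs2 << 5);
}

/* vshri: EXT_OPCODE 3, CATEGORY 2, OPCODE 1; the shift amount is
 * passed in host register 10, as in the figure. */
static ext_insn_pair encode_vshri(uint32_t vrf)
{
    ext_insn_pair p = { EXT_MARKER_WORD,
                        encode_ext_word(3u, 2u, 1u, vrf, vrf, 10u) };
    return p;
}

/* Simulator-side check: does the first word carry the marker? */
static int is_extended(ext_insn_pair p)
{
    return p.marker == EXT_MARKER_WORD;
}
```

For vector register 1, this sketch produces the marker word 0x00010000 followed by the instruction word 0x64108540.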
4.4 Architectural Results 
As described in the previous sections, both workloads were optimized with the
development of vector and scalar extended ISAs. Throughout the experimentation phase,
the modified workloads were validated using the ITU-T test bitstreams to ensure
compliance with the reference speech coders. In order to study the optimization benefits,
simulations were run for all ITU-T input vectors and for vector lengths of up to 128
16-bit elements. At compile time, the user can select the mode in which the coprocessor
will run. A special file (compile.h) contains all the switches needed to select this
mode. By selecting x86 (native mode) or SS (SimpleScalar mode) the compiler uses the
C macros (Figure 4-5) or the inline assembly implementation (Figure 4-14) of the extended
instructions respectively. The ORIGINAL switch selects whether the code will run in
original or vector mode, while the METHOD2 switch adds the scalar features. The results
are segmented into two major groupings, with the first group showing the performance of
the vector ISA only. The second group reflects the performance of the full optimizations
and exposes the additional performance benefits of the scalar ISA. Figure 4-15 and
Figure 4-16 show the results of the extended, architecture-level performance simulation
of the G.729A encoder and decoder respectively for vector optimizations only. The
performance metric used is the relative dynamic instruction count, which is reduced by
approximately 59.1% and 60.7% respectively at a vector length of sixteen 16-bit elements.
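The compile-time switches mentioned above could be organised along the following lines. This is a hypothetical stand-in for compile.h, not its actual contents: only the switch names SS, ORIGINAL and METHOD2 come from the text, and the way they combine here, together with the enum and function names, is an assumption made for illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for compile.h. The switch names come from the
 * text; their exact semantics here are assumptions. */
#define SS        1   /* SimpleScalar mode; remove for native (x86) mode */
#define METHOD2   1   /* add the scalar extensions to the vector ones */
/* #define ORIGINAL 1 */  /* uncomment to build the unmodified reference code */

typedef enum { IMPL_C_MACROS, IMPL_INLINE_ASM } ext_impl;
typedef enum { ISA_REFERENCE, ISA_VECTOR_ONLY, ISA_VECTOR_AND_SCALAR } isa_level;

/* Which implementation of the extended instructions is compiled in. */
static ext_impl selected_impl(void)
{
#if defined(SS)
    return IMPL_INLINE_ASM;   /* Figure 4-14 style definitions */
#else
    return IMPL_C_MACROS;     /* Figure 4-5 style definitions */
#endif
}

/* Which level of optimization the workload is built with. */
static isa_level selected_isa(void)
{
#if defined(ORIGINAL)
    return ISA_REFERENCE;
#elif defined(METHOD2)
    return ISA_VECTOR_AND_SCALAR;
#else
    return ISA_VECTOR_ONLY;
#endif
}
```

Under these assumptions, the two result groupings correspond to builds without and with METHOD2 defined.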
[Plot: G.729A Encoder (Vector Only); x-axis: Vector length (VLMAX); series: algthm, fixed, lsp, pitch, tame, test, speech]
Figure 4-15: G.729A Encoder (Vector Only) Results
This essentially means that the vectorized G.729A encoder executes 59.1% fewer
instructions compared to the reference C implementation when the vector ISA comes into
effect, for a VLMAX of 16. In the case of the G.729A decoder, this figure is 60.7%.
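The relationship between the relative dynamic instruction count plotted in the figures and the quoted percentage reduction can be written down as a short C sketch. The function names are hypothetical and the instruction counts in the usage note are made-up round numbers chosen only to reproduce the 59.1% figure; they are not measured values.

```c
#include <assert.h>

/* Relative dynamic instruction count: optimized count over reference count. */
static double relative_count(unsigned long optimized, unsigned long reference)
{
    assert(reference > 0ul);
    return (double)optimized / (double)reference;
}

/* Percentage of instructions removed with respect to the reference run. */
static double reduction_percent(unsigned long optimized, unsigned long reference)
{
    return 100.0 * (1.0 - relative_count(optimized, reference));
}
```

For example, a hypothetical reference run of 1000 instructions that shrinks to 409 instructions has a relative count of 0.409, which corresponds to a reduction of about 59.1%.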
[Plot: G.729A Decoder (Vector Only); x-axis: Vector length (VLMAX); series: algthm, fixed, lsp, pitch, tame, speech, erasure, overflow, parity]
Figure 4-16: G.729A Decoder (Vector Only) Results
The slope of the graphs clearly demonstrates that the most significant performance
benefits are realized at shorter vector lengths, in the range of 2 to 16 16-bit elements,
while no further significant reduction is measured beyond that configuration. This
observation has the benefit of restricting the microarchitecture design space to shorter
vector lengths, as configurations with vector lengths greater than sixteen 16-bit
elements are in practice unrealistic, due to the large silicon overhead incurred by such
a wide datapath and the need for very long cache fill bursts [31].
[Plot: G.729A Encoder (Full Optimization); x-axis: Vector length (VLMAX)]
Figure 4-17: G.729A Encoder (Full Optimization) Results
Figure 4-17 and Figure 4-18 depict the relative algorithmic complexity of the G.729A
encoder and decoder obtained with all the optimizations.
[Plot: G.729A Decoder (Full Optimization); x-axis: Vector Length (VLMAX); series: algthm, fixed, lsp, pitch, tame, speech, erasure, overflow, parity]
Figure 4-18: G.729A Decoder (Full Optimization) Results
In this particular case, both the data-parallel as well as the non-vectorizable sections
of the code were optimized. The achieved performance metric improvement for the encoder
and decoder is 76.2% and 65.9% respectively for a vector length of 16 (256 bits), and no
further improvement appears for larger vector lengths. It is clear from the results that
there is significant improvement in the dynamic instruction count compared to the
original execution. These data indicate that both speech coding standards benefit
substantially from a combined scalar and vector accelerator.
[Plot: G.723.1 Encoder (Vector Only); x-axis: Vector Length (VLMAX)]
Figure 4-19: G.723.1 Encoder Vector Optimization Results
Figure 4-19 and Figure 4-20 illustrate the relative dynamic instruction count reduction
of the G.723.1 encoder and decoder respectively for the vector optimization only.
[Plot: G.723.1 Decoder (Vector Only); x-axis: Vector Length (VLMAX)]
Figure 4-20: G.723.1 Decoder Vector Optimization Results
The reduction in the performance metric for the encoder and decoder is approximately
70% and 67% respectively at a vector length of 16 16-bit elements. This maximum
improvement appears at a vector length of 16 (256 bits) and no significant improvement
emerges beyond that. Performance saturation clearly indicates that wider DLP
configurations are not needed and that most of the inherent DLP of the algorithms can be
exploited by a 256-bit wide vector coprocessor.
[Plot: G.723.1 Encoder (Full Optimization); x-axis: Vector length (VLMAX)]
Figure 4-21: G.723.1 Encoder Full Optimization Results
Figure 4-21 and Figure 4-22 depict the performance metric of the G.723.1 encoder and
decoder with full scalar and vector optimizations. In this case, the dynamic instruction
count is reduced by approximately 79.7% and 73.6% at a vector length of 16 (256 bits).
[Plot: G.723.1 Decoder (Full Optimization); x-axis: Vector length (VLMAX)]
Figure 4-22: G.723.1 Decoder Full Optimization Results
This additional decrease in the dynamic instruction count of both speech codecs shows
considerable improvement with the introduction of the scalar instructions. It is clear
that the combination of scalar and vector optimized code, via the two proposed extended
ISAs, yields better performance metrics, and for this reason the design implementation
includes both coprocessors. The next set of graphs (Figure 4-23 up to Figure 4-34)
illustrates the performance improvement of the most compute-intensive functions as they
appear in Table 4-6 and Table 4-7, for both speech codecs. These results can be used to
see the specific sections of the speech codec which have been improved. Figure 4-23 shows
the results for the G.729A encoder function Cor_h_x under full optimization. This
function computes the correlation of the impulse response with the target vector in the
algebraic codebook (fixed codebook) search procedure [27]. The fractional performance
improvement of the optimized codebook search at a vector length of 16 16-bit elements
(256 bits) is 75.5% and it reaches 78.5% at a vector length of 2048 bits over the range
of reference input bitstreams.
[Plot: Cor_h_x (Full Optimization); x-axis: Vector Length (VLMAX); series: Algthm, Fixed, Lsp, Pitch, Speech, Tame, Test]
Figure 4-23: Cor_h_x (Full Optimization) Results
Figure 4-24 presents the performance metric (relative instruction count) of the G.729A
function Syn_filt. This function implements the 10th order Linear Prediction (LP)
synthesis filter (1/A(z)) [27]. The performance improvement of the synthesis filter at a
vector length of sixteen is 73.5% over the range of the reference input bitstreams. As
can be seen, no further improvement is evident beyond this vector length. This is
explained by the fact that the number of iterations of the internal loop of this function
is 10.
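The saturation argument can be illustrated with a simple count of how many vector passes an inner loop needs. The sketch below is a simplified model that ignores all other instruction overheads; it is not taken from the thesis, and the function name is hypothetical.

```c
#include <assert.h>

/* Number of vector instructions needed to cover `trip_count` loop
 * iterations when each instruction processes up to `vlen` elements. */
static unsigned vector_iterations(unsigned trip_count, unsigned vlen)
{
    assert(vlen > 0u);
    return (trip_count + vlen - 1u) / vlen;   /* ceiling division */
}
```

For an inner loop of 10 iterations, this model gives 5 passes at a vector length of 2, 3 passes at 4, 2 passes at 8 and a single pass at 16. Any longer vector length still needs one pass, so no further reduction can be expected from that loop.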
[Plot: Syn_filt (Full Optimization); x-axis: Vector length (VLMAX); series: Algthm, Fixed, Lsp, Pitch, Speech, Tame, Test]
Figure 4-24: Syn_filt (Full Optimization) Results
Figure 4-25 depicts the relative instruction count of the G.729A function
Pitch_ol_fast under full optimization. This function estimates the open-loop pitch
delay based on the perceptually weighted speech signal. This open-loop delay is used as
an indication by the closed-loop analysis to find the adaptive-codebook delay and gain
[27].
[Plot: Pitch_ol_fast (Full Optimization); x-axis: Vector length (VLMAX); series: Algthm, Fixed, Lsp, Pitch, Speech, Tame, Test]
Figure 4-25: Pitch_ol_fast (Full Optimization) Results
In this case the performance-metric reduction (relative dynamic instruction count) is
approximately 78.7% at a vector length of sixteen 16-bit elements. The next graph in
Figure 4-26 shows the relative performance improvement of the G.729A function
Residu. This function computes the LP residual signal by filtering the input speech
through the LP synthesis filter. The LP residual signal is used to find the target vector
for the adaptive-codebook search [27].
[Plot: Residu (Full Optimization); x-axis: Vector Length (VLMAX); series: Algthm, Fixed, Lsp, Pitch, Speech, Tame, Test]
Figure 4-26: Residu (Full Optimization) Results
In this case the achieved instruction count reduction for this function is of the order
of 75.1%. No further improvement is evident beyond this vector length as the number of
iterations of the internal loop of this function is 10. Figure 4-27 depicts the
performance improvement for the G.729A function Autocorr. This function computes the
autocorrelation of the signal with a 30 ms asymmetric window in order to perform linear
prediction (short-term) analysis. Later the autocorrelation coefficients of the windowed
speech are computed and converted to the LP coefficients using the Levinson algorithm
[27].
[Plot: Autocorr (Full Optimization); x-axis: Vector Length (VLMAX); series: Algthm, Fixed, Lsp, Pitch, Speech, Tame, Test]
Figure 4-27: Autocorr (Full Optimization) Results
This function demonstrates excellent performance stability and experiences a dynamic
instruction count reduction of approximately 93.4% at a vector length of sixteen. The
Autocorr function contains a large number of data-parallel loops that were vectorized
successfully.
Figure 4-28 shows the relative instruction count of the G.729A function
Lsp_pre_select. This function implements the first stage quantizer that quantizes the
difference between the computed and predicted LSF coefficients of the current frame.
This quantizer is a 10-dimensional Vector Quantizer (VQ) that uses a codebook with 128
entries (7 bits) [27].
[Plot: Lsp_pre_select (Full Optimization); x-axis: Vector length (VLMAX); series: Algthm, Fixed, Lsp, Pitch, Speech, Tame, Test]
Figure 4-28: Lsp_pre_select (Full Optimization) Results
The improvement in performance under full optimization is of the order of 80.7%. After
this vector length an expected performance saturation is observed since the number of
iterations of the internal loop of this function is 10.
Figure 4-29 represents the results of the G.729A decoder function Agc. This function
implements the Automatic Gain Control (AGC) procedure that takes the output of the
adaptive post filter and scales it to match the energy of the reconstructed signal [27].
[Plot: Agc (Full Optimization); x-axis: Vector length (VLMAX)]
Figure 4-29: Agc (Full Optimization) Results
The dynamic instruction count reduction ranges between 50.6% and 86.6% at a vector
length of sixteen 16-bit elements over the range of reference input bitstreams.
Figure 4-30 illustrates the results of the full optimization of the G.723.1 Find_Best
function. This function implements the fixed codebook search for the high rate encoder
by performing quantization on the residual signal in the MP-MLQ block [28].
[Plot: Find_Best (Full Optimization); x-axis: Vector length (VLMAX)]
Figure 4-30: Find_Best (Full Optimization) Results
It is interesting to note that this graph only shows results for two workloads: Mixed
Rate and 5.3 kbits/s. This is because the codebook search is only done at lower bit
rates. The quantization process approximates the target vector (residual signal), and
the excitation is made of positive or negative pulses multiplied by a gain, whose
positions can be either all odd or all even. The fractional instruction count improvement
of this codebook search is 66% at a vector length of sixteen (256 bits).
Figure 4-31 illustrates the results for the relative algorithmic complexity of the
G.723.1 function Estim_Pitch. This function implements the open-loop pitch estimation
that is performed twice per frame, once for the first two subframes and once for the last
two. The open-loop estimate is computed using the perceptually weighted speech and is
selected by the cross-correlation maximization method [28].
[Plot: Estim_Pitch (Full Optimization); x-axis: Vector Length (VLMAX); series: Mixed Rate, 53 Rate, 63 Rate]
Figure 4-31: Estim_Pitch (Full Optimization) Results
The overall improvement appears for a vector length of sixteen and full optimization and
is of the order of 92.3%. The next plot in Figure 4-32 shows the architecture-level
results of the G.723.1 encoder function Comp_Lpc that computes the 10th order LPC filter
coefficients for every frame. A Hamming-windowed block is centred on the subframe and
is used to compute the eleven autocorrelation coefficients that are inputs to the
Levinson-Durbin algorithm that generates the LPC coefficients. The produced LPC sets are
used to construct the short-term perceptual weighting filter that performs the synthesis
[28]. This function demonstrates excellent performance scalability and experiences a
reduction in dynamic instruction count of approximately 88% at a vector length of sixteen
(256 bits).
[Plot: Comp_Lpc (Full Optimization); x-axis: Vector Length (VLMAX)]
Figure 4-32: Comp_Lpc (Full Optimization) Results
Figure 4-33 shows the relative algorithmic complexity results of the G.723.1 function
Decod_Acbk that computes the adaptive codebook contribution from the previous
excitation vector in the pitch predictor [28].
[Plot: Decod_Acbk (Full Optimization); x-axis: Vector Length (VLMAX)]
Figure 4-33: Decod_Acbk (Full Optimization) Results
The reduction in the dynamic instruction count is 36% at a vector length of sixteen (256
bits) and full optimization. As can be seen, there is no further improvement beyond a
vector length of 6 (96 bits) as the number of iterations of the internal loop of this
function is 6.
Figure 4-34 depicts the simulation results for the G.723.1 function Comp_Pw that
computes the harmonic noise filter coefficients. The optimal lag for this filter is
searched around the open loop pitch lag that maximises the positive correlation [28].
[Plot: Comp_Pw (Full Optimization); x-axis: Vector Length (VLMAX)]
Figure 4-34: Comp_Pw (Full Optimization) Results
The results show that the improvement in the dynamic instruction count performance
metric is of the order of 87.6% at a vector length of 16. The following table lists the
most compute-intensive functions of G.723.1 in order to give an overall view of the
performance improvement.
4.5 Summary 
This chapter described the optimization methodology and the performance improvements
that were achieved via custom vector and scalar ISA extensions and optimizations of both
speech coding standards. During this process the workloads were profiled over a range of
vector lengths to identify the enhancement the custom ISA extensions have produced.
The architectural results are very promising, demonstrating a reduction in the dynamic
instruction count metric of 58% and 71% for the G.729A and G.723.1 speech coders
respectively when the vector instructions were introduced and a further 18% and 9%
reduction in dynamic instruction count when the scalar instructions were applied. These
results show the potential benefit of applying custom instructions and having associated
coprocessor vector functional units. The overall simulation results indicate that the
area/performance points of interest lie between 64-bit and 256-bit wide configurations.
In addition, both sets of results reveal that the maximum benefit is achieved by a
combination of custom vector and scalar architectures. From this, the microarchitecture
can be designed and attached to a generic RISC CPU. This is explained in more detail in
the next chapters.
4.6 References 
[1] T. Austin, E. Larson, and D. Ernst, "SimpleScalar: An Infrastructure for
Computer System Modeling," in Computer, vol. 35, no. 2, February 2002, pp.
59-67.
[2] S. Dwarkadas, J. R. Jump, and J. B. Sinclair, "Execution-Driven Simulation of
Multiprocessors: Address and Timing Analysis," ACM Transactions on Modeling
and Computer Simulation (TOMACS), vol. 4, pp. 314-338, October 1994.
[3] J. L. Peterson, P. J. Bohrer, et al., "Application of full-system simulation in
exploratory system design and development," vol. 50, pp. 321-332, March 2006.
[4] S. Hangal and M. O'Connor, "Performance Analysis and Validation of the
picoJava Processor," in IEEE Micro, vol. 19, May 1999, pp. 66-72.
[5] T. A. Diep, C. Nelson, and J. P. Shen, "Performance evaluation of the PowerPC
620 microarchitecture," in Proceedings of the 22nd annual international
symposium on Computer architecture, S. Margherita Ligure, Italy, 1995, pp.
163-174.
[6] L. Guerra, J. Fitzner, D. Talukdar, C. Schläger, B. Tabbara, and V. Zivojnovic,
"Cycle and phase accurate DSP modeling and integration for HW/SW
co-verification," in Proceedings of the 36th ACM/IEEE conference on Design
automation, New Orleans, Louisiana, United States, 1999, pp. 964-969.
[7] P. Mishra, N. Dutt, and H. Tomiyama, "Architecture Description Language
driven Validation of Dynamic Behavior in Pipelined Processor Specifications,"
CECS Technical Report #03-25, Center for Embedded Computer Systems,
University of California, Irvine, July 2003.
[8] A. Halambi, P. Grun, V. Ganesh, A. Khare, N. Dutt, and A. Nicolau,
"EXPRESSION: A language for architecture exploration through
compiler/simulator retargetability," in Proceedings of Design Automation and
Test in Europe (DATE), 1999, pp. 485-490.
[9] F. S.-H. Chang, "Fast Specification of Cycle-Accurate Processor Models," in
Proceedings of the International Conference on Computer Design: VLSI in
Computers & Processors, 2001, pp. 488-492.
[10] G. Zimmermann, "The MIMOLA design system a computer aided digital
processor design method," in Proceedings of the 16th ACM IEEE Conference
on Design automation, San Diego, CA, United States, 1979, pp. 53-58.
[11] M. Reshadi and N. Dutt, "Generic Pipelined Processor Modeling and High
Performance Cycle-Accurate Simulator Generation," in Proceedings of the
conference on Design, Automation and Test in Europe, 2005, pp. 786-791.
[12] M. Freericks, "The nML machine description formalism," Technical Report
1991/15, Technische Fachbereich Informatik, Berlin University, Berlin 1991.
[13] G. Hadjiyiannis, S. Hanono, and S. Devadas, "ISDL: An Instruction Set
Description Language for Retargetability," in Proceedings of the 34th annual
conference on Design automation, Anaheim, California, United States, 1997, pp.
299-302.
[14] M. Barbacci, "Instruction Set Processor Specifications (ISPS): The Notation and
Its Applications," IEEE Transactions on Computers, vol. 30(1), pp. 24-40, 1981.
[15] G. Mulley, "Using Ismene to Debug and Predict the Performance of an
Embedded System Device Driver," University of Glamorgan, Technical report
2004.
[16] T. Hoshino, "UDL/I version Two: A New Horizon of HDL Standards," IFIP
Transactions: Proceedings of the 11th IFIP WG 10.2 International Conference on
Computer Hardware Description Languages and their Applications, vol. A-32,
pp. 437-452, 1993.
[17] V. Zivojnovic, S. Pees, and H. Meyr, "LISA - machine description language and
generic machine model for HW/SW co-design," in IEEE Workshop on VLSI
Signal Processing, pp. 127-136, 1996.
[18] W. S. Mong and J. Zhu, "A retargetable micro-architecture simulator," in
Proceedings of the 40th ACM IEEE conference on Design automation, Anaheim,
CA, USA, 2003, pp. 752-757.
[19] G. Maturana, J. L. Ball, J. Gee, et al., "Incas: A Cycle Accurate Model of
UltraSPARC," in Proceedings of the 1995 International Conference on Computer
Design: VLSI in Computers and Processors, Los Alamitos, California, October
1995, pp. 130-135.
[20] J. L. Hennessy and D. J. Patterson, Computer Architecture: A Quantitative
Approach, 2nd ed.: Morgan Kaufman, 1996.
[21] D. Martin, "Vector Extensions to the MIPS-IV Instruction Set Architecture (The
VIRAM Architecture Manual) Revision 3.7.5," March 2000.
[22] T. M. Austin, "SimpleScalar 3.0a pre-release," SimpleScalar LLC:
http://www.simplescalar.com.
[23] V. A. Chouliaras, K. Koutsomyti, T. Jacobs, S. Parr, D. Mulvaney, and R.
Thomson, "SystemC-defined SIMD instructions for high performance SoC
architectures," in 13th IEEE International Conference on Electronics, Circuits
and Systems, Nice, France, December 10-13, 2006.
[24] V. A. Chouliaras and J. L. Nunez, "Scalar Coprocessors for accelerating the
G723.1 and G729A Speech Coders," IEEE Transactions on Consumer
Electronics, vol. 49, pp. 703-710, August 2003.
[25] V. A. Chouliaras, J. Nunez, S. R. Parr, K. Koutsomyti, D. J. Mulvaney, and S.
Data, "Development of custom vector accelerator for high-performance speech
coding," IEE Electronics Letters, vol. 40, pp. 1559-1561, Nov 2004.
[26] K. Asanovic, "Vectorizing SPECint95," in Computer Science Division, vol.
Unpublished manuscript extracted from PhD Thesis, California: Berkeley, March
1998.
[27] ITU-T Recommendation G.729A, "Coding of speech at 8 kbit/s using
conjugate-structure algebraic-code-excited linear-prediction (CS-ACELP)," 3/96.
[28] ITU-T Recommendation G.723.1, "Dual Rate Speech coder for multimedia
communications transmitting at 5.3 and 6.3 kbit/s," 3/96.
[29] R. Allen and K. Kennedy, "Automatic translation of FORTRAN programs to
vector form," ACM Transactions on Programming Languages and Systems
(TOPLAS), vol. 9, pp. 491-542, October 1987.
[30] "http://gcc.gnu.org/onlinedocs/."
[31] K. Koutsomyti, S. R. Parr, V. A. Chouliaras, and J. Nunez, "Applying
Data-Parallel and Scalar Optimizations for the efficient implementation of the
G.729A and G.723.1 Speech Coding Standards," in Proceedings of the 7th IASTED
International Conference, Signal and Image Processing, Honolulu, Hawaii, USA,
August 2005, pp. 40-45.
CHAPTER 5
VECTOR PROCESSOR ARCHITECTURE 
5.1 Vector Architectural State 
The vector-scalar coprocessor is attached to the Sparc-V8 compliant CPU core via a
custom, pipelined coprocessor interface. The accelerator consists of two major
microarchitectures: one parametric microarchitecture that implements the vector ISA and
a second that implements the scalar ISA. The coprocessor attaches to the integer unit of
the Leon CPU in the fifth pipeline stage, which is the memory stage. It was not designed
as a stand-alone AHB coprocessor because, though the workloads perform a lot of work
on blocks of data (samples), there were many more instances where custom assembly
code (scalar) needed to be inserted into irregular (non-iterative) blocks. Therefore a
very tightly-coupled configuration was pursued which accommodates both cases
efficiently [1]. The coprocessor is connected to the memory stage in order to avoid the
majority of the exceptions and interrupts of the Leon CPU and to have enough time to
transfer data to/from the main processor if requested. Therefore, when a valid vector
coprocessor instruction is encountered and there is no exception or pipeline stall, the
vector/scalar instruction along with a valid signal is sent to the first stage (decode)
of the vector coprocessor pipeline for execution. Defining coprocessor extension
instructions instead of a full stand-alone instruction set allows taking advantage of any
developments in the Leon architecture and using the development tools available for the
latter. In addition, the coprocessor can be imported into any other embedded CPU
architecture with very little modification.
The vector pipeline is a SIMD array of functional units (FUs). The functional units are
organised in four groups: addition (vadd), multiplication (vmult), shift (vshift) and
miscellaneous (vmisc). Each group has a parametric number of functional units equal to
half the maximum vector length (VLMAX/2), where VLMAX can take values that are
powers of 2. On every cycle, only one of the aforementioned FU groups is active. The
subdivision of the vector pipeline into the four vector FU groups is detailed in the next
chapter in section 6.4. The VLMAX/2 vector FUs are driven by the corresponding slices
of the operand registers (vector elements), stored in the vector register file. These
slices provide, per unit, two read ports (2 x 32 bits) and a write port (32 bits). Each
functional unit has a dynamically configurable 2-way SIMD or scalar organisation,
depending on whether the instruction produces 2 x 16-bit results or a 32-bit result. The
vector length is located in the vector length (vlen) register and defines the width of
the vector registers and the number of FUs that are utilised to perform an operation. It
does not alter any of the hardware resources. All vector operations are governed by the
current vector length and a vector mask. The current vector length is taken from the vlen
register and the vector mask is implemented by combinational logic that differs for each
specific instruction. The FUs take their source operands from either vector registers,
scalar registers or vector accumulators and can perform both vector and scalar
operations. Each functional unit of the group active in the current cycle accepts 32-bit
source operands and produces a 32-bit result, except for the FUs of the vmult group,
which handle 16-bit input operands and produce a 32-bit result. Figure 5-1 illustrates an
example of a vector operation that is performed on two source vector registers.
Figure 5-1: Example of an operation that is performed on two vector registers with a vector
length of 64 bits. Each functional unit is driven by the pair of the corresponding slices (vector
elements) of the source vector registers. The produced results are stored back to the
corresponding slices (vector elements) of the destination vector register.
In the case of scalar instructions only the first (FU0) functional unit from the active group
operates, whereas the others do not change state (via clock gating and combinational logic
gating) in order to save power. The control/status flags and registers have two uses: to
support predicated execution and to store exception bits that are implicitly set by
instructions that may produce the relevant exceptions [2].
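The dynamically configurable organisation of a functional unit can be illustrated with a minimal Python sketch. It is not the RTL of the coprocessor: the names fu_add and vector_add are hypothetical, and wrap-around arithmetic is assumed purely to keep the example short (saturation and overflow flags are left out). It only shows how one 32-bit slice is treated either as two independent 16-bit lanes or as a single 32-bit value, and how the vector length limits the number of FUs that take part.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def fu_add(opr1: int, opr2: int, simd: bool) -> int:
    """Add two 32-bit operand slices inside one functional unit (hypothetical model)."""
    if simd:
        # 2-way SIMD organisation: two independent 16-bit lanes,
        # no carry propagates from the low lane into the high lane.
        lo = ((opr1 & MASK16) + (opr2 & MASK16)) & MASK16
        hi = ((opr1 >> 16) + (opr2 >> 16)) & MASK16
        return (hi << 16) | lo
    # Scalar organisation: a single 32-bit operation.
    return (opr1 + opr2) & MASK32


def vector_add(src1, src2, active_fus, simd=True):
    """Run fu_add on the first active_fus slices of two source registers."""
    return [fu_add(a, b, simd) for a, b in zip(src1[:active_fus], src2[:active_fus])]
```

In this sketch a 64-bit vector, as in Figure 5-1, corresponds to two slices, so vector_add is called with active_fus set to 2.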
5.2 Programmer's Model
The user programming model is shown in Figure 5-2. Along with the instruction set, it
completes the portion of the architecture that is visible to software. The programmer's
model contains two types of registers: the general-purpose registers and the control/status
registers. The general-purpose registers consist of the vector and scalar register files. The
vector register file contains VREGS vector registers of statically configurable length,
each holding VLMAX scalar 16-bit elements, with two read ports and one write port. The
VREGS configuration constant can take values from 2 to 32; in this instance of the
architecture its value is 16. The 16 vector registers are individually designated by the
symbols VR0, VR1, ..., VR15, as illustrated. The scalar register file contains SREGS
general-purpose scalar registers, each 32 bits wide, and has three read ports and one write
port. The SREGS configuration constant is 16 in the current implementation but can be
any value between 2 and 32. The 16 scalar registers are individually designated by the
symbols SR0, SR1, ..., SR15, as illustrated in the model of Figure 5-2. The scalar
registers can serve a number of purposes, including use as address pointer registers for
scalar memory references, providing data values for vector and scalar operations, and
storing final or intermediate results.
[Figure 5-2 diagram: overflow flag (1 bit), vector overflow register (ovf), predication
register (pred, VLMAX bits), vlen register, vector register file VR0-VR15 (16-bit elements
0 to VLMAX-1), scalar register file SR0-SR15 (32 bits each) and 32-bit elements 0 to
VLMAX/2-1.]
Figure 5-2: Vector and Scalar coprocessor programmer's model
There are also ACC_NUMBER vector accumulators, each consisting of VLMAX/2 scalar
elements (32-bit). The ACC_NUMBER configuration constant in this case is 2 (VACC0,
VACC1) but can be any value in the range 2-32, with the restriction that the long
instructions which access the accumulators, except the multiply-add/sub instructions,
can use only the first two accumulators (VACC0, VACC1) as source operands. There are
special move instructions that exchange data between the vector, scalar and Leon
general-purpose registers and the vector accumulators. The control/status registers
include a vector length register (vlen), a predication register (pred), a vector overflow
register (ovf) and an overflow flag (ow). The vlen register has a maximum value of VLMAX
and defines the width of the data that will be processed by the vector datapath. The
predication register is a type of mask register with VLMAX bits, where each bit
corresponds to a vector element. It is set when a comparison instruction takes place and
is utilised during merge operations to select the appropriate vector elements that comprise
the vector comparison result. The overflow flag (ow) is a single bit and is set whenever an
overflow happens during arithmetic instructions. Internally, multiple overflow flags are
generated, where each such flag corresponds to one vector element, and they are
combined into a single overflow flag by an or-reduce operation. In addition, there is a
vector overflow register (ovf) that is VLMAX/2 bits long, where every bit is the overflow
result of one functional unit of the group that performed the particular operation. The
only vector mask register is the predication register, which is employed for the
comparison and merge operations. All the other masking processes are implemented
on the fly by combinational logic obeying the current vector length value.
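The relationship between the per-unit overflow bits, the vector overflow register and the single overflow flag can be sketched in Python as follows. This is only an illustrative model: the function names are hypothetical, bit i of the packed register is assumed to belong to functional unit i, and the length mask assumes that vlen counts vector elements starting from element 0.

```python
def update_overflow(per_fu_overflow):
    """Pack per-FU overflow bits into ovf and or-reduce them into one flag."""
    ovf = 0
    for fu_index, bit in enumerate(per_fu_overflow):
        if bit:
            ovf |= 1 << fu_index  # assumption: bit i belongs to functional unit i
    ow = 1 if ovf else 0          # or-reduce of all per-unit bits
    return ovf, ow


def length_mask(vlen):
    """Mask with one bit set for each of the first vlen vector elements (assumed layout)."""
    return (1 << vlen) - 1
```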
5.3 Vector Processor Instruction Set Architecture 
The instruction set defines the transformations that software can perform on the
architectural state, including both memory and the register files. Scalar instructions
define one or more operations on a scalar set of data. The vector instruction set, on the
other hand, allows software to express, with a single opcode, multiple independent
operations on arrays of data [3]. This section describes the instruction set that implements
most of the basic DSP operations of the target ITU-T speech coding algorithms, G.729A
and G.723.1. These operations are more complicated than the basic operations of a RISC
architecture and are described in this document in two levels of detail. The first level of
detail is presented in the remainder of the chapter, which is divided into sections that
present and briefly describe groups of instructions of similar types. Each group is
expanded into a more detailed description of each instruction that comprises it. This is
the second level of detail, which is contained in Appendix A and gives, for every
instruction, its format, a short description of the instruction's operation and a software
example. In the proposed processor architecture all the coprocessor instructions are 22
bits wide and include 2- and 3-address formats (1 or 2 source operand registers and the
destination register, all independently specified). The instruction set is divided into two
main categories: the vector instructions and the scalar instructions.
5.3.1 Vector ISA 
The vector instruction set described in this document comprises 43 instructions which are 
divided into groups of instructions of similar types. Every type is detailed by showing 
assembly formats and giving a short description of the instruction's operation. More 
detail is contained in Appendix A, where each instruction is presented separately. The 
vector instructions can be grouped into five categories: load/store, move, arithmetic, shift 
and miscellaneous. The assembly language format of an instruction is written with a
shorthand notation and a few examples of the vector and scalar assembly are given. In
vector mode the coprocessor can process in parallel VLMAX 16-bit operations or
VLMAX/2 32-bit operations.
5.3.1.1 Load/Store Instructions 
Vector load/store instructions are the only instructions that access memory via the Vector 
Load/Store Unit (VLSU) and are illustrated in Table 5-1. This table also includes the 
instruction that loads the vlen register (ldvlen_r) with an immediate even if it is not 
regarded as a load instruction in a typical sense. 
Table 5-1: Vector Load/Store Instructions
No  Instruction  Assembly                  Brief Description
1   ldvlen_r     ldvlen_r(imm)             Load Vector Length Register with immediate
2   vldw         vldw(vrd,srs1)            Load vector register from memory address
3   vldwn        vldwn(vrd,srs1)           Load vector register downward from memory
4   vstw         vstw(vrs2,srs1)           Store vector register back to memory address
5   vstwn        vstwn(vrs2,srs1)          Store vector register downwards to memory
6   vldaccw      vldaccw(vaccd,srs1)       Load vector accumulator from memory address
7   vstacc       vstacc(vacc,velem,srs1)   Store vector accumulator element to memory
Vector load/store operations use a scalar register (srs1) that contains the memory
address that data is loaded from or stored to. For the load instructions the destination
can be a vector register (vldw) or a vector accumulator (vldaccw), while for the store
instructions (vstw or vstacc) these registers are the data sources. The load/store
instructions are strided. A strided load takes a base address, in this case srs1, and a
signed stride, and loads a vector of values starting at the base address, where each
element is separated by the stride amount. The stride is in units of elements, not bytes,
and can take the values 1 (vldw) and -1 for load downward (vldwn). A similar method
applies for the strided store, in which a vector of values is stored starting from the base
address with the elements separated by a stride of 1 (vstw) or -1 for store downward
(vstwn) [2].
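The strided access pattern described above can be sketched in Python. The sketch assumes, for illustration only, that memory is modelled as a list indexed in units of elements and that vlen elements are transferred; the helper names strided_load and strided_store are hypothetical, while vldw, vldwn, vstw and vstwn mirror the instruction mnemonics of Table 5-1.

```python
def _check(memory, address):
    """Reject element addresses that fall outside the modelled memory."""
    if not 0 <= address < len(memory):
        raise IndexError(f"element address {address} outside memory")
    return address


def strided_load(memory, base, vlen, stride):
    """Load vlen elements starting at base, stepping by stride elements."""
    return [memory[_check(memory, base + i * stride)] for i in range(vlen)]


def strided_store(memory, base, values, stride):
    """Store the values starting at base, stepping by stride elements."""
    for i, value in enumerate(values):
        memory[_check(memory, base + i * stride)] = value


def vldw(memory, base, vlen):
    return strided_load(memory, base, vlen, 1)    # stride +1


def vldwn(memory, base, vlen):
    return strided_load(memory, base, vlen, -1)   # stride -1 (downward)


def vstw(memory, base, values):
    strided_store(memory, base, values, 1)


def vstwn(memory, base, values):
    strided_store(memory, base, values, -1)
```

A downward load therefore returns the elements in reverse memory order, which is convenient for the time-reversed operands that appear in convolution and correlation loops.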
Store instructions have a latency of one cycle and are performed in the Vector Register
Access (VREG) stage, where the store data, along with the memory address, are sent to the
VLSU. Load instructions have a latency of two cycles, as the VLSU has a cascade
TAG/DATA configuration. During a load operation the load address is sent to the VLSU
at the VREG stage and the memory data are obtained at the second Vector Datapath
(VDP2) stage. The load/store latency therefore depends on the VLSU implementation.
A parallel TAG/DATA configuration for the VLSU microarchitecture would reduce the
load instruction latency from two cycles to one cycle at the expense of increased power
consumption.
5.3.1.2 Move Instructions 
The vector move instructions are used to exchange data between the vector, scalar and 
Leon general purpose registers as well as the vector accumulators. They comprise move 
instructions (mvvr2gpr or mvgpr2vr) that transfer data between coprocessor's vector 
registers and the main CPU's (Leon) general-purpose registers. 
Table 5-2: Vector Move Instructions
No  Instruction  Assembly                       Brief Description
8   vaccclr      vaccclr(vacc)                  Set the value in the vector accumulator to zero
9   vsplatacci   vsplatacci(vaccd,srs1)         Load vector accumulator with a scalar value
10  vldacceli    vldacceli(vaccd,velem,value)   Load immediate value into vector accumulator
                                                element
11  vsplat_h_r   vsplat_h_r(vrd,srs1)           Splat a 16-bit scalar value to all elements of
                                                vector register
12  vmvacctre    vmvacctre(vrd,vacc1,amount)    Extract high (amount=0) or low (amount=16) data
                                                of the even elements of vector accumulator and
                                                load them to vector register
13  vmvacctro    vmvacctro(vrd,vacc1,amount)    Extract high (amount=0) or low (amount=16) data
                                                of the odd elements of vector accumulator and
                                                load them to vector register
14  vmvrtacce    vmvrtacce(vaccd,vrs1,amount)   Deposit high (amount=16) or low (amount=0) data
                                                of the even elements of vector register to the
                                                vector accumulator
15  vmvrtacco    vmvrtacco(vaccd,vrs1,amount)   Deposit high (amount=16) or low (amount=0) data
                                                of the odd elements of vector register to the
                                                vector accumulator
16  mvgpr2vr     mvgpr2vr(vrd,velem,grs1)       Moves a value (32-bit) from the general purpose
                                                register (Leon) to the vector register element
17  mvvr2gpr     mvvr2gpr(grd,velem,vrs1)       Moves the vector register element to the general
                                                purpose register (Leon)
The splat instruction (vsplat_h_r) "splats" a scalar value across a vector register, and the
deposit instructions (vmvrtacce and vmvrtacco) deposit low or high data from a vector
register into a vector accumulator. The extract instructions (vmvacctre and vmvacctro)
are utilised to extract high or low data from vector accumulators into vector registers.
Finally, the group includes instructions that set a vector accumulator to zero (vaccclr),
splat scalar data into it (vsplatacci) or load an immediate value into one of its elements
(vldacceli). All the move instructions are summarised in Table 5-2.
5.3.1.3 Arithmetic Instructions 
The vector arithmetic instructions include short and long addition, subtraction and
multiplication. All the arithmetic instructions are performed in a single cycle apart from
the multiply-add (vmace/vmaco) and the multiply-sub (vmsue/vmsuo), which take two
cycles. The short addition (vaddh) and subtraction (vitu_sub_r) take as inputs two
vector registers and perform a 16-bit operation. The long addition (vaddacc)
and subtraction (vsubacc) take as inputs vector accumulators and perform a 32-bit
operation. Figure 5-3 shows a vector addition for a vector length of 2 that
specifies two vectors as input operands and produces a vector result by executing the
same operation on each pair of elements from the input arrays. The multiply instructions
are implemented as pairs for the even and odd elements of the vector registers, as the
multiplier of every vector functional unit takes as input two 16-bit operands and produces
a 16-bit (short multiplication) or 32-bit (long multiplication) product. Figure 5-4
illustrates a short multiplication of two vectors with vector length 2.
Figure 5-3: Vector Short Addition
The even multiply instruction takes as inputs the even elements of the pair of vector
registers (elements 0) and the product is placed in the even element of the destination
register.
Figure 5-4: Vector Short Multiplication for even/odd elements
The odd multiply instruction then takes as inputs the odd elements of the pair of vector
registers (elements 1) and the product is placed in the odd element of the destination
register. The short multiplication involves simple multiplication (mult), multiplication
with rounding (mult_r) and integer multiplication (i_mult). All these multiply
instructions perform a signed or unsigned 16 x 16 → 16-bit operation. The long
multiplication performs a signed 16 x 16 → 32-bit operation and, along with the
multiply-add, is executed by the pair of instructions vmace/vmaco but without the
accumulation part. The multiply-add (vmace/vmaco) and the multiply-sub (vmsue/vmsuo)
instructions operate on the even and odd elements respectively of the vector registers
vrs1 and vrs2 and add or subtract the product to or from the even and odd elements of the
vector accumulator vacc.
Figure 5-5: Vector multiply-add/sub
Finally, the vaccaddreduce instruction is used after the execution of the instruction pairs
that involve the accumulator and performs an add-reduce over the elements of the
accumulator. With the use of an adder tree, a 32-bit final result is obtained and placed in
element 0 of the vector accumulator. All the vector arithmetic instructions are summarized
in Table 5-3.
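How the even/odd multiply-add pair and the add-reduce combine into a dot product can be sketched in Python. The sketch relies on several assumptions that are not confirmed by the text: accumulator element i is taken to collect the products of vector elements 2i (vmace) and 2i+1 (vmaco), the L_mac behaviour follows the ITU-T basic operator (product doubled, result saturated to 32 bits), the reduction saturates after every addition, and the accumulator elements other than element 0 are left unchanged by the reduce. The helper vmac and its parity argument are hypothetical.

```python
MAX_32 = 0x7FFFFFFF
MIN_32 = -0x80000000


def sat32(x):
    """Saturate a Python integer to the signed 32-bit range."""
    return max(MIN_32, min(MAX_32, x))


def l_mult(a, b):
    """ITU-style long multiply: 16 x 16 -> 32 bits, product doubled."""
    p = a * b
    return MAX_32 if p == 0x40000000 else p * 2  # only -32768 * -32768 overflows


def l_mac(acc, a, b):
    """ITU-style multiply-accumulate with saturation."""
    return sat32(acc + l_mult(a, b))


def vmac(acc, vrs1, vrs2, parity):
    """parity 0 models vmace (even elements), parity 1 models vmaco (odd elements)."""
    return [l_mac(acc[i], vrs1[2 * i + parity], vrs2[2 * i + parity])
            for i in range(len(acc))]


def vaccaddreduce(acc):
    """Sum all accumulator elements into element 0 (other elements kept, by assumption)."""
    out = list(acc)
    total = 0
    for x in acc:
        total = sat32(total + x)
    out[0] = total
    return out
```

For example, with vrs1 = [1, 2, 3, 4] and vrs2 = [5, 6, 7, 8], applying vmac with parity 0 and then parity 1 to a cleared two-element accumulator, followed by vaccaddreduce, leaves twice the dot product in element 0.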
Table 5-3: Arithmetic Instructions
No  Instruction       Assembly                         Brief Description
18  vaddh             vaddh(vrd,vrs1,vrs2)             Vector short addition (16-bit) of vector
                                                       registers
19  vitu_sub_r        vitu_sub_r(vrd,vrs1,vrs2)        Vector short subtraction (16-bit) of vector
                                                       registers
20  vaddacc           vaddacc(vaccd,vacc1,vacc2)       Vector long addition (32-bit) of vector
                                                       accumulators
21  vsubacc           vsubacc(vaccd,vacc1,vacc2)       Vector long subtraction (32-bit) of vector
                                                       accumulators
22  vaccaddreduce     vaccaddreduce(vacc)              Vector accumulator add-reduce
23  vitu_mult_e_r     vitu_mult_e_r(vrd,vrs1,vrs2)     Vector signed short multiply of the vector
                                                       registers even elements
24  vitu_mult_o_r     vitu_mult_o_r(vrd,vrs1,vrs2)     Vector signed short multiply of the vector
                                                       registers odd elements
25  vitu_mult_r_e_r   vitu_mult_r_e_r(vrd,vrs1,vrs2)   Vector short multiply with rounding of the
                                                       vector registers even elements
26  vitu_mult_r_o_r   vitu_mult_r_o_r(vrd,vrs1,vrs2)   Vector short multiply with rounding of the
                                                       vector registers odd elements
27  vitu_i_mult_e_r   vitu_i_mult_e_r(vrd,vrs1,vrs2)   Vector short integer multiply of the vector
                                                       registers even elements
28  vitu_i_mult_o_r   vitu_i_mult_o_r(vrd,vrs1,vrs2)   Vector short integer multiply of the vector
                                                       registers odd elements
29  vmace             vmace(vacc,vrs1,vrs2)            Vector multiply-add (L_mac) of the vector
                                                       registers even elements
30  vmaco             vmaco(vacc,vrs1,vrs2)            Vector multiply-add (L_mac) of the vector
                                                       registers odd elements
31  vmsue             vmsue(vacc,vrs1,vrs2)            Vector multiply-sub (L_msu) of the vector
                                                       registers even elements
32  vmsuo             vmsuo(vacc,vrs1,vrs2)            Vector multiply-sub (L_msu) of the vector
                                                       registers odd elements
5.3.1.4 Shift Instructions 
The shift instructions implement the 16- and 32-bit ITU shift operations. These operations
can also take negative shift amounts, which result in a positive shift in the opposite
direction. In addition, they saturate the result to the range 0xffff8000-0x00007fff in case
of overflow or underflow. The short (16-bit) shifts are performed on a vector register with
an immediate shift amount or with the shift amount held in a second vector register. The
long (32-bit) shifts are performed on a vector accumulator with an immediate value or with
the amount stored in a vector register. All the shift instructions are summarized in
Table 5-4.
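The behaviour of the short shifts (negative amounts reversing the direction, and saturation to the 16-bit range) can be sketched in Python along the lines of the ITU-T shl and shr basic operators. The sketch works on signed Python integers rather than on register bit patterns, and the per-element helper vshli_model is a hypothetical name used only to show the operation applied across a vector.

```python
MAX_16 = 0x7FFF
MIN_16 = -0x8000


def shl(var1, var2):
    """16-bit arithmetic shift left with saturation; negative var2 shifts right."""
    if var2 < 0:
        return shr(var1, -var2)
    result = var1 << var2
    if result > MAX_16:
        return MAX_16
    if result < MIN_16:
        return MIN_16
    return result


def shr(var1, var2):
    """16-bit arithmetic shift right; negative var2 shifts left (with saturation)."""
    if var2 < 0:
        return shl(var1, -var2)
    return var1 >> var2  # Python's >> on integers is an arithmetic shift


def vshli_model(vrs1, amount):
    """Apply the same immediate left shift to every element of a vector."""
    return [shl(x, amount) for x in vrs1]
```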
Table 5-4: Vector Shift Instructions
Page  Instruction  Assembly                        Brief Description
33    vshli        vshli(vrd,vrs1,amount)          Vector short (16-bit) shift left by amount
34    vshri        vshri(vrd,vrs1,amount)          Vector short (16-bit) shift right by amount
35    vshlr        vshlr(vrd,vrs1,vrs2)            Vector short shift left with register
36    vshrr        vshrr(vrd,vrs1,vrs2)            Vector short shift right with register
37    vlshlacc     vlshlacc(vaccd,vacc1,amount)    Vector long (32-bit) shift left by amount
38    vlshracc     vlshracc(vaccd,vacc1,amount)    Vector long (32-bit) shift right by amount
39    vlshlaccr    vlshlaccr(vaccd,vacc1,vrs1)     Vector long (32-bit) shift left with register
40    vlshraccr    vlshraccr(vaccd,vacc1,vrs2)     Vector long (32-bit) shift right with register
5.3.1.5 Miscellaneous Instructions
The miscellaneous instructions of the vector ISA perform only comparison operations
between vector registers (16-bit) or vector accumulators (32-bit), and comparison with
zero. The compare instruction compares the two operands by subtracting one from the
other. If the result is non-negative (the first operand is greater than or equal to the second
operand register, accumulator or zero) the predication flag (pred) is set to '1'. If the
result is negative (the first operand is less than the second) the predication flag is set to
'0'. Finally, the merge instructions are utilised to select, on a per-element basis, the
vector register or accumulator value that satisfies the given condition.
Table 5-5: Vector Miscellaneous Instructions
Page  Instruction    Assembly                       Brief Description
41    vcmp           vcmp(vacc1,vacc2)              Compare vector accumulators and update
                                                    Predication flag (pred)
42    vrcmp          vrcmp(vrs1,vrs2)               Compare vector registers and update
                                                    Predication flag (pred)
43    vcmp_h_ge      vcmp_h_ge(vrs1)                Check vector register if it is greater than or
                                                    equal to zero and update Predication flag
44    vmerge_t_h_r   vmerge_t_h_r(vrd,vrs1,vrs2)    Merge two vector registers according to the
                                                    Predication flag value
45    vmerge         vmerge(vaccd,vacc1,vacc2)      Merge two vector accumulators according to the
                                                    predication flag value
The merge is a multiplexer-style operation that selects which of two values to pass to the
output result, according to the predication flag value. The miscellaneous instructions are
listed in Table 5-5.
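The compare and merge pair can be sketched in Python as an element-wise maximum. The sketch assumes, without confirmation from the text, that a predication bit of 1 selects the element of the first source operand and a bit of 0 selects the element of the second; overflow of the internal subtraction is not modelled, and the function names are hypothetical.

```python
def compare(vrs1, vrs2):
    """Set one predication bit per element: 1 if vrs1[i] - vrs2[i] is non-negative."""
    return [1 if (a - b) >= 0 else 0 for a, b in zip(vrs1, vrs2)]


def merge(pred, vrs1, vrs2):
    """Multiplexer-style select: pred bit 1 takes vrs1[i], 0 takes vrs2[i] (assumed polarity)."""
    return [a if p else b for p, a, b in zip(pred, vrs1, vrs2)]


def elementwise_max(vrs1, vrs2):
    """Compare followed by merge yields the larger element of each pair."""
    return merge(compare(vrs1, vrs2), vrs1, vrs2)
```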
5.3.2 Scalar ISA 
The scalar instruction set comprises 36 instructions which are grouped into five
categories: load/store, move, arithmetic, shift and miscellaneous. Each category is
presented in the following sections, whereas a more detailed description of every scalar
instruction is given in Appendix A. In scalar mode the coprocessor can accommodate one
16-bit or 32-bit operation.
5.3.2.1 Load/Store Instructions 
The scalar load/store instructions access memory via the VLSU. The load
instructions can load 16- or 32-bit data from the memory location that is contained in a
scalar register (srs1) into the destination register (srd). The store instructions store the
16- or 32-bit data of a scalar register (srs2) into the memory location stored in a scalar
register (srs1). All the scalar load/store instructions are summarized in Table 5-6.
Table 5-6: Scalar Load/Store Instructions
Page  Instruction  Assembly             Brief Description
46    m2sld16      m2sld16(srd,srs1)    Load scalar register with 16-bit from memory
47    m2sld32      m2sld32(srd,srs1)    Load scalar register with 32-bit from memory
48    m2sst16      m2sst16(srs2,srs1)   Store 16-bit word of scalar register to memory
49    m2sst32      m2sst32(srs2,srs1)   Store 32-bit word of scalar register to memory
5.3.2.2 Move Instructions
The scalar move instructions offer a flexible way to transfer data between the
coprocessor's scalar registers and the main CPU's (Leon) general-purpose register file.
These instructions specify the address of the source register (srs1 or grs1) and the
address of the destination register (grd or srd) and are listed in Table 5-7.
Table 5-7: Scalar Move Instructions
Page  Instruction  Assembly              Brief Description
50    mvgpr2sr     mvgpr2sr(srd,gpr1)    Moves contents from general purpose register to
                                         scalar register
51    mvsr2gpr     mvsr2gpr(gprd,srs1)   Moves contents from scalar register to general
                                         purpose register
5.3.2.3 Arithmetic Instructions 
The scalar arithmetic instructions include short and long addition, subtraction and
multiplication. All these arithmetic instructions take as inputs two scalar registers and
perform a 16- or 32-bit operation. When the result exceeds the range 0x80000000-0x7fffffff
an overflow bit is produced. In the case of multiply-add and multiply-sub the
role of the accumulator is played by a third scalar register that is used both as a source
and as a destination register. This is also the reason why the scalar register file has three
read ports instead of the two of the main vector register file. The scalar arithmetic
instructions are listed in Table 5-8.
Table 5-8: Scalar Arithmetic Instructions
Page  Instruction  Assembly                    Brief Description
52    m2sladd      m2sladd(srd,srs1,srs2)      Scalar long (32-bit) addition
53    m2slsub      m2slsub(srd,srs1,srs2)      Scalar long (32-bit) subtraction
54    m2sadd       m2sadd(srd,srs1,srs2)       Scalar short (16-bit) addition
55    m2ssub       m2ssub(srd,srs1,srs2)       Scalar short (16-bit) subtraction
56    m2slmac      m2slmac(srd,srs1,srs2)      Scalar multiply-accumulate (L_mac)
57    m2slmsu      m2slmsu(srd,srs1,srs2)      Scalar multiply-subtract (L_msu)
58    m2slmult     m2slmult(srd,srs1,srs2)     Scalar long (32-bit) multiplication
59    m2smult      m2smult(srd,srs1,srs2)      Scalar short (16-bit) multiplication
60    m2smult_r    m2smult_r(srd,srs1,srs2)    Scalar multiplication with rounding
61    m2simult     m2simult(srd,srs1,srs2)     Scalar short integer multiplication
5.3.2.4 Shift Instructions
Shift instructions are used to shift the contents of a scalar register left or right by a given 
amount. The shift amount can be specified by a constant (amount) in the instruction or 
by the contents of a scalar register (srs2). As with the vector shift instructions, short and 
long scalar shifts are supported. The scalar shift instructions are summarized in Table 5-9. 
Table 5-9: Scalar Shift Instructions
Page  Instruction  Assembly                     Brief Description
62    m2slshl      m2slshl(srd,srs1,amount)     Scalar long 32-bit shift left by immediate
63    m2slshr      m2slshr(srd,srs1,amount)     Scalar long shift right by immediate
64    m2slshl_rg   m2slshl_rg(srd,srs1,srs2)    Scalar long shift left with register
65    m2slshr_rg   m2slshr_rg(srd,srs1,srs2)    Scalar long shift right with register
66    m2sshl       m2sshl(srd,srs1,amount)      Scalar short shift left by amount
67    m2sshr       m2sshr(srd,srs1,amount)      Scalar short shift right by amount
68    m2sshl_rg    m2sshl_rg(srd,srs1,srs2)     Scalar short shift left with register
69    m2sshr_rg    m2sshr_rg(srd,srs1,srs2)     Scalar short shift right with register
5.3.2.5 Miscellaneous Instructions 
The miscellaneous instructions implement the remaining basic operations of the ITU
standard algorithms. They include short and long negate, absolute value, normalization,
deposit, extract and rounding.
Table 5-10: Scalar miscellaneous instructions
Page  Instruction     Assembly                  Brief Description
70    m2slnegate      m2slnegate(srd,srs1)      Scalar long negate (L_negate)
71    m2slabs         m2slabs(srd,srs1)         Scalar long absolute value (L_abs)
72    m2snorm_l       m2snorm_l(srd,srs1)       Scalar long normalisation (norm_l)
73    m2sldeposit_l   m2sldeposit_l(srd,srs1)   Deposits 16 LSB into the LSB of scalar register;
                                                the remaining bits are sign extended
74    m2sldeposit_h   m2sldeposit_h(srd,srs1)   Deposits 16 LSB into the MSB of scalar register;
                                                the remaining bits are zero extended
75    m2snegate       m2snegate(srd,srs1)       Scalar short negate (negate)
76    m2sabs_s        m2sabs_s(srd,srs1)        Scalar short absolute value (abs_s)
77    m2sextract_h    m2sextract_h(srd,srs1)    Extracts the 16 MSB from scalar register
78    m2sextract_l    m2sextract_l(srd,srs1)    Extracts the 16 LSB from scalar register
79    m2sround        m2sround(srd,srs1)        Rounds a 32-bit value to 16-bit
These instructions use one scalar register (srs1) as the source operand and place the
calculated result into the destination register (srd). Table 5-10 lists all the
miscellaneous instructions.
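The deposit, extract and rounding operations of Table 5-10 can be sketched in Python, assuming that they follow the corresponding ITU-T basic operators (L_deposit_h, L_deposit_l, extract_h, extract_l and round). The mapping of each coprocessor instruction to one of these operators is an assumption made for illustration; the function round32 is named so as not to shadow Python's built-in round.

```python
MAX_32 = 0x7FFFFFFF
MIN_32 = -0x80000000


def to_s16(x):
    """Interpret the low 16 bits of x as a signed 16-bit value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def to_s32(x):
    """Interpret the low 32 bits of x as a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def l_deposit_h(var1):
    """Place a 16-bit value in the upper half; the lower half is zero."""
    return to_s32((var1 & 0xFFFF) << 16)


def l_deposit_l(var1):
    """Place a 16-bit value in the lower half, sign extended to 32 bits."""
    return to_s16(var1)


def extract_h(l_var1):
    """Return the 16 most significant bits of a 32-bit value."""
    return to_s16(l_var1 >> 16)


def extract_l(l_var1):
    """Return the 16 least significant bits of a 32-bit value."""
    return to_s16(l_var1)


def round32(l_var1):
    """Round a 32-bit value to 16 bits: add 0x8000 with saturation, keep the upper half."""
    return extract_h(max(MIN_32, min(MAX_32, l_var1 + 0x8000)))
```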
5.4 Leon3 CPU 
Leon3 is an open-source synthesisable VHDL model of a 32-bit processor core
implementing the SPARC V8 architecture (standard IEEE-1754) [4]. The model is highly
configurable, and particularly suitable for system-on-a-chip (SoC) designs. It is designed
for embedded applications that require a high performance, low complexity and low
power consumption programmable engine. The Leon3 CPU has a 7-stage pipelined
integer unit with a pseudo-Harvard architecture (separate instruction and data caches):
• Fetch Stage: In this stage the instruction is fetched from the instruction cache if it
is enabled; otherwise a request is sent to the memory controller. In addition, the value
of the program counter is updated. At the end of this stage the valid instruction and
the value of the program counter are latched to the next stage.
• Decode Stage: The instruction is decoded and the addresses of both source operands
and the destination operand are extracted. This stage also generates the addresses for
branch and CALL instructions and the control signals for the next stages.
• Register Access Stage: The source operands are read from the register file or
from bypassed intermediate results.
• Execute Stage: All the arithmetic, shift and miscellaneous operations are
performed. For memory load or store and jump/return operations the address is
generated and sent to the memory unit.
• Memory Stage: At this stage the data cache is accessed and the store operation is
performed.
• Exception Stage: All the trap and interrupt signals are processed and the data
are aligned in the case of a data cache load.
• Write Back Stage: The result from any arithmetic, logical, shift or cache
operation is written back to the register file.
The Leon3 has an on-chip debug support unit and interfaces to a floating-point unit (FPU)
and a custom coprocessor. The Leon3 processor implements the full SPARC V8 Reference
Memory Management Unit (SRMMU) and its interrupt model recognises and handles 15
asynchronous interrupts. The number of register windows in the register file is
configurable within the range of 2 to 32, with a default value of 8. The cache system is
highly configurable as well and is connected to two independent cache controllers for the
instruction and data caches respectively (icache.vhd and dcache.vhd) [4]. In addition,
there is an interface between the two cache controllers and the Amba AHB bus
(acache.vhd). Both caches can be configured to be direct-mapped or multi-set with a set
associativity of 2-4 sets, where every set can be 1-256 Kbytes and is divided into cache
lines (blocks) of 16-32 bytes each. The Leon3 includes a hardware multiplier, with an
optional 16x16-bit MAC and 40-bit accumulator, and a divider. In this research, we will
consider the integer unit of the Leon3 processor, to which the vector processor is attached
in a closely coupled configuration. Leon3 can be configured to provide a generic interface
to a user-defined coprocessor. The interface allows the coprocessor to operate in parallel
with the integer unit, thereby increasing performance. The vector coprocessor is a
hardware component that runs in parallel with the Leon3 and exchanges data with it. To
achieve this, the coprocessor-allocated opcodes must be ignored by the decode
logic of the Leon3 pipeline. This means that the Leon3 should treat these instructions
in a benign way, as it would a nop instruction. According to the SPARC architecture
manual, instructions are encoded in three major 32-bit formats, as illustrated
in Figure 5-6.
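Format 1 (op = 1): CALL
  op [31:30] | disp30 [29:0]
Format 2 (op = 0): SETHI & Branches (Bicc, FBfcc, CBccc)
  op [31:30] | rd [29:25] | op2 [24:22] | imm22 [21:0]
  op [31:30] | a [29] | cond [28:25] | op2 [24:22] | disp22 [21:0]
Format 3 (op = 2 or 3): Remaining instructions
  op [31:30] | rd [29:25] | op3 [24:19] | rs1 [18:14] | i=0 [13] | asi [12:5] | rs2 [4:0]
  op [31:30] | rd [29:25] | op3 [24:19] | rs1 [18:14] | i=1 [13] | simm13 [12:0]
  op [31:30] | rd [29:25] | op3 [24:19] | rs1 [18:14] | opf [13:5] | rs2 [4:0]
Figure 5-6: Instruction Formats of Leon3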
The format that can be used for the vector coprocessor, and that demands only a few
modifications of the Leon3 decode logic, is the unimplemented instruction (UNIMP). The
values of the UNIMP instruction are not reserved by the architecture for any future use
and the const22 value is ignored by the hardware [5]. The UNIMP instruction is an
instruction with an unimplemented opcode that causes an illegal_instruction trap; its
format is shown in Figure 5-7.
Format 2 (op = 0): UNIMP
  00 [31:30] | reserved [29:25] | 000 [24:22] | const22 [21:0]
Figure 5-7: Unimplemented Instruction
Because the UNIMP instruction causes an illegal_instruction trap at the exception
detection stage, additional decode logic and modifications to the existing decode logic
prevent the exception process from setting the illegal_inst signal. Furthermore, the Leon3
was modified to perform an add with zero when the allocated opcode is decoded. In this
way, the Leon3 ID performs a nop instruction while the 22 bits of the UNIMP opcode
(const22) are sent for further decoding in the vector coprocessor. The available
22 bits are therefore utilised for encoding the vector and scalar instruction set. A more
detailed description of the Leon3 modifications and of the way the vector coprocessor is
attached to it is given in Chapter 6.
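The recognition of a coprocessor instruction inside a 32-bit SPARC word can be sketched in Python. The field positions follow the Format 2 layout of the SPARC V8 manual (op in bits 31:30, op2 in bits 24:22, const22 in bits 21:0); the function names are hypothetical and the sketch says nothing about how the coprocessor subdivides the 22-bit payload, which is described in Chapter 6 and Appendix A.

```python
def decode_format2(word):
    """Split a 32-bit instruction word into the op, op2 and const22 fields."""
    op = (word >> 30) & 0x3
    op2 = (word >> 22) & 0x7
    const22 = word & 0x3FFFFF
    return op, op2, const22


def coprocessor_payload(word):
    """Return the 22-bit payload if the word uses the UNIMP encoding, else None."""
    op, op2, const22 = decode_format2(word)
    if op == 0 and op2 == 0:
        # UNIMP encoding: the host treats it as a nop and forwards const22.
        return const22
    return None
```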
As mentioned, the UNIMP instruction causes an illegal_instruction trap. Traps are
vectored transfers of program control caused by events that should not occur during
normal program execution. Traps can be induced by an exception related to an instruction
or by an external interrupt. If a defined trap condition occurs, the system trap handler is
invoked to handle the program interruption through a special trap table. The base address
of the table is defined in the trap base register (TBR) and the displacement within the
table is calculated from the trap ID. There are three trap categories: the precise
trap, which is caused by a particular instruction and takes place before any
program-visible state is altered; the deferred trap, which is like the precise one but occurs
after the program-visible state changes; and the interrupting trap, which is induced by an
external interrupt request. The default trap model that is implemented in Leon3 comprises
precise traps, apart from the FPU or coprocessor traps and the "Non-resumable
machine-check" exceptions. The 3-bit field (op2) that encodes the format 2
instructions is shown in Table 5-11.
Table 5-11: Enhanced op2 Encoding (Format 2)
op2   Instructions    Description
0     UNIMP           Vector Processor Instruction
1     unimplemented   unimplemented
2     Bicc            Branch on Integer Condition Codes
3     unimplemented   unimplemented
4     SETHI           Set High 22 bits of an r register instruction
5     unimplemented   unimplemented
6     FBfcc           Branch on Floating-point Condition Codes
7     CBccc           Branch on Coprocessor Condition Codes
5.5 Overall System Architecture 
The vector coprocessor microarchitecture is currently being implemented in RTL VHDL
as a tightly coupled coprocessor for the Leon Sparc-V8 CPU. It has private vector and
scalar register files as this method promises significantly better performance. Detailed
microarchitecture analysis followed by trial synthesis confirmed that all instructions can
fit in a single high frequency cycle, resulting in a latency of 1 and an initiation rate of 1.
Exceptions to this are the multiply-add/subtract instructions and the short divide, with
latency/initiation rates of 2/1 and 17/17 respectively. In particular, it was decided that,
due to the very low improvement, the iterative divider block would not be utilized [6]. The
overall system architecture is depicted in Figure 5-8.
[Figure 5-8: block diagram. Blocks shown: PCI host, DMA unit, SoC I/F, processing unit
(coprocessor with VLSU and dcache; Leon with icache and dcache), AHB, PCI I/F, memory
controller (SDRAM, SRAM), APB bridge, timers, I/O and system registers.]
Figure 5-8: Overall system architecture
It consists of the backbone interconnect (32-bit AHB bus), a configurable number of
processor-coprocessor units, a DMA (Direct Memory Access) unit, a PCI I/F (Peripheral
Component Interconnect Interface), the external memory controller and a low-speed
(non-streaming) peripheral bus (APB) subsystem which houses miscellaneous units such as
timers, interrupt controllers, I/O and memory-mapped registers.
5.5.1 Processor-coprocessor programmable unit
The main processing unit is the vector processor (Leon3/vector coprocessor
combination). This unit has two AHB taps, one used for refilling the scalar processor
caches (Instruction, Data) and the second for refilling the coprocessor data cache. The
main processor and coprocessor caches remain consistent by i) using a write-through
configuration and ii) using a write-invalidate mechanism which ensures that a write to a
cache block from either processor invalidates the same block in the other processor. Thus
the latter processor will have to go to the main memory if it accesses that location and
recover the up-to-date contents instead of using its own stale data.
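
The consistency scheme described above can be illustrated with a small behavioural model. This is only a sketch under simplifying assumptions (one word per cache block, a dictionary standing in for main memory); the class and attribute names are hypothetical and do not correspond to the RTL.

```python
class WriteThroughCache:
    """Toy write-through cache with one word per block (an assumption)."""

    def __init__(self, memory):
        self.memory = memory   # shared main memory, modelled as a dict
        self.lines = {}        # address -> cached value
        self.peer = None       # the other processor's cache

    def read(self, addr):
        # On a miss the block is refilled from main memory.
        if addr not in self.lines:
            self.lines[addr] = self.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value              # write-through to main memory
        if self.peer is not None:
            self.peer.lines.pop(addr, None)    # write-invalidate the peer's copy


memory = {}
scalar_dcache = WriteThroughCache(memory)   # stands in for the Leon3 data cache
vector_dcache = WriteThroughCache(memory)   # stands in for the coprocessor data cache
scalar_dcache.peer = vector_dcache
vector_dcache.peer = scalar_dcache
```

After a write by one cache, a read of the same address by the other misses and is served from main memory, which is the behaviour the paragraph describes.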
5.5.2 DMA taps 
These are the input ports to the SoC. An external agent requests the DMA unit to
transfer PCM data (frames) into the SoC address space. The DMA unit has AHB
mastering capability and is also used to transfer the compressed bitstream (processed
frames) from the SoC address space to the environment.
5.5.3 PCI I/F
An Opencores [7] PCI I/F is used to transfer data between the host system (host PC) and
the FPGA board.
5.5.4 External Memory Controller 
This unit is responsible for all memory accesses in the SoC address space. It directly
interfaces to a 133MHz DDR (Double Data Rate) memory component and a standard
asynchronous RAM component. These external memories are address-range enabled
(0x60000000 for SDRAM, 0x40000000 for SRAM). The optimized speech coder and the
frames to be processed are transferred with DMA from the host PC to the SDRAM
memory of the RISC/Coprocessor FPGA board. After that, the RISC CPU/coprocessor
combination processes the frames and stores the compressed frames in local memory
(SDRAM). The compressed frames are transferred back to the PC memory for
comparison with the ITU-T test vectors [6].
5.5.5 APB Subsystem 
The final subsystem includes all non-streaming components (internal and external) such
as timers, I/O ports, interrupt controllers and UARTs. This subsystem also houses the
memory-mapped registers.
5.6 Summary 
This chapter introduced the architectural state and programmer's model of the vector 
processor. The vector and scalar instruction extensions were presented, divided into 
groups of instructions of similar types. Every type was detailed by showing assembly 
formats and giving a short description of the instruction's operation. More details of the 
instructions are contained in Appendix A. Finally a description for the overall system 
architecture was given. 
5.7 References 
[1] K. Koutsomyti, S. R. Parr, V. A. Chouliaras, J. Nunez, D. J. Mulvaney, and S.
Data, "Scalar and parametric vector accelerators for the G.729A speech coding
standards," in Proceedings of IEE/ACM SoC Design, Test and Technology
Postgraduate Seminar, Loughborough University, September 2004, pp. 53-57.
[2] D. Martin, "Vector Extensions to the MIPS-IV Instruction Set Architecture (The
VIRAM Architecture Manual) Revision 3.7.5," March 2000.
[3] C. Kozyrakis, "Scalable Vector Media-processors for Embedded Systems," in
Computer Science, University of California: Berkeley, 2002.
[4] "GRLIB IP Core User's Manual, Version 1.0.7," Gaisler Research, February
2006.
[5] "The SPARC Architecture Manual Version 8", www.sparc.com.
[6] V. A. Chouliaras and J. L. Nunez, "Scalar Coprocessors for accelerating the
G723.1 and G729A Speech Coders," IEEE Transactions on Consumer
Electronics, vol. 49, pp. 703-710, August 2003.
[7] http://www.opencores.org/.
CHAPTER 6 
VECTOR PROCESSOR IMPLEMENTATION 
6.1 Overview 
This chapter describes the vector processor along with a number of implementation
details and the general principles of its operation. In addition, it details the way that the
vector speech coprocessor is attached to the main Leon3 scalar processor. The vector
processor consists of the Vector Datapath (VDP) and the Vector Load/Store Unit
(VLSU). In the sections that follow only the Vector Datapath is discussed in detail as the
VLSU is addressed as part of another thesis [1]. The vector processor fully implements
the Vector and Scalar ISAs that were described in the previous chapter. The vector
pipeline has four stages: the Vector Decode Stage (VDEC), the Vector
Register Access Stage (VREG) and the Vector Datapath Stage (VDP), which is itself a
two-stage pipeline (VDP1 and VDP2). All vector/scalar instructions are fully pipelined
with a latency of one and an initiation rate of one instruction per cycle, with the exception
of the multiply-add and multiply-sub instructions which have a latency of two cycles and an
initiation rate of one.
The organization of the speech coprocessor with the 4-stage pipeline is depicted in Figure
6-1. The vector coprocessor is parameterised along both the architecture and the
microarchitecture axes. The architectural parameterisation refers to the number of
registers, including accumulators, and the extensible vector ISA. The microarchitectural
parameterisation refers to the extensible, non-programmer-visible state of the processor.
This includes the number of scalar datapaths (functional units), maximum data width and
internal flop-based state. This parameterisation is defined by a number of compile-time
parameters that specify the various architectural and microarchitectural characteristics of
the coprocessor.
[Figure 6-1: pipeline diagram. The opcode and opcode-valid inputs from the Leon3 feed the
VDEC stage, which is followed by the VREG, VDP1 and VDP2 stages, with bypass paths
returning to the VREG stage.]
Figure 6-1: The vector speech coprocessor microarchitecture with the four-stage pipeline:
Vector Decode Stage (VDEC), Vector Register Access Stage (VREG) and two stages for the
Vector Datapath Stage (VDP1 and VDP2)
The choice of compile-time configuration puts the combined processor/vector
coprocessor firmly in the domain of configurable, extensible CPUs. The compile-time
parameters are listed in Table 6-1.
This table indicates the valid values and the maximum number of the vector/scalar
registers, the accumulators and the vector units (VLMAX/2). Exceeding these limits or
choosing values other than the valid ones will generate errors during the RTL simulation.
Table 6-1: Compile-time vector processor parameters for its architectural and
microarchitectural state that are contained in the gxx_config.vhd file
Parameter    Allowed range              Default        Description
VLMAX        2, 4, 8, 16, 32, 64, 128   2              Maximum Vector Length
VREGS        4, 8, 16, 32               16             Number of Vector Registers
SREGS        4, 8, 16, 32               16             Number of Scalar Registers
ACC_NUMBER   2, 4, 8, 16, 32            2              Number of Vector Accumulators
ACC_WIDTH    (VLMAX/2)*32               (VLMAX/2)*32   Width of Vector Accumulator
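
The range checks implied by Table 6-1 can be sketched as follows. The parameter names and values are taken from the table; the function name and the use of Python exceptions are illustrative assumptions, since in the actual design the checks take effect as errors during RTL simulation.

```python
# Allowed values and defaults copied from Table 6-1.
ALLOWED = {
    "VLMAX": (2, 4, 8, 16, 32, 64, 128),
    "VREGS": (4, 8, 16, 32),
    "SREGS": (4, 8, 16, 32),
    "ACC_NUMBER": (2, 4, 8, 16, 32),
}
DEFAULTS = {"VLMAX": 2, "VREGS": 16, "SREGS": 16, "ACC_NUMBER": 2}


def make_config(**overrides):
    """Return a validated parameter set (hypothetical helper, not part of the RTL)."""
    cfg = dict(DEFAULTS)
    cfg.update(overrides)
    for name, value in cfg.items():
        if name not in ALLOWED:
            raise KeyError(f"unknown parameter {name}")
        if value not in ALLOWED[name]:
            raise ValueError(f"{name}={value} is not one of {ALLOWED[name]}")
    # ACC_WIDTH is derived from VLMAX rather than chosen independently.
    cfg["ACC_WIDTH"] = (cfg["VLMAX"] // 2) * 32
    return cfg
```

For example, `make_config(VLMAX=8)` yields an accumulator width of 128 bits, while `make_config(VREGS=5)` is rejected.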
The code is parameterised so that it can easily target a number of technologies. This has
been achieved through the use of fully technology-independent VHDL constructs as well as
generic RAM components. The allowed silicon technologies are listed in Table 6-2.
Table 6-2: The allowed silicon technologies that are used for synthesis and place and route,
contained in the gxx_config.vhd file
Parameter   Description
GEN         Technology independent RAM macros
XST         Xilinx FPGA Technology (Spartan3)
TSMC018     Taiwan Semiconductor Manufacturing Company (TSMC) 0.18µm standard-cell technology
TSMC013     TSMC 0.13µm standard-cell technology
6.2 Vector Decode Stage (VDEC) 
This is the first stage of the pipelined vector coprocessor datapath. In this stage the
instruction from the Leon3 opcode register is decoded and all the datapath control signals
for the following pipeline stages are produced. The instruction to be decoded is pipelined,
along with a few control signals, from the Decode stage of the Leon3 to the Memory stage,
where the coprocessor is attached. More specifically, the following
operations are performed in this stage:
• The opcode is decoded and control signals are produced ready to be pipelined in 
subsequent stages. 
• The addresses for the source and destination register operands are produced and 
access of the vector and the scalar register files starts (split over two stages). 
• The write enables for all the pipeline registers of the vector pipeline are 
produced. 
The electrical interface of the VDEC stage is depicted in Figure 6-2. The input signals
come from the Leon3 processor and the vector load/store unit (VLSU). The output signals,
which are of vdec2vregs type, go to the input of the VREG stage.
[Figure 6-2: VDEC block with inputs clk, reset, gxx_hold, opc, opc_valid, kill, vlsu2vdp
and leon_din, and output vdec2vregs.]
Figure 6-2: The electrical interface of the VDEC Stage
As mentioned in the previous chapter, the selected instruction format for the vector
processor is carried in the Unimplemented Instruction [2] of the Sparc V8 architecture
and it is depicted in Figure 6-3. This instruction is architecturally not implemented and
generates an exception if encountered.
In the Leon3 the const22 bitfield is completely ignored by the decoding logic of the
processor. Additional combinational logic has been inserted in Leon3 to extract the
const22 field and send it to the vector coprocessor decode unit as the input vector opcode.
Format 2 (op = 0): UNIMP 
| 00 | reserved | 000 | const22 |
31 29 24 21 0
Figure 6-3: The Unimplemented instruction format of the Sparc V8 architecture
In the decode stage of the coprocessor the opcode-valid signal is asserted if the 22 bits are
a valid vector instruction, and datapath control signals, addresses for the vector/scalar
register operands and enables are produced. In the case of a 3-address format (Figure 6-4)
the extracted address fields along with the produced read enable signals are used to
access the synchronous register file, in parallel with the decoding of the latched
instruction. In this way, the depth of the pipeline of the coprocessor is reduced by one
stage compared to a purely cascaded decode/register access organisation and this has an
additional beneficial effect during the transfer of data from the coprocessor to and from 
the scalar processor. 
3-address format
| opcode | rd | rs1 | rs2 |
21 14 9 4 0
4-address format (4th is the accumulator, implicitly)
| opcode | vaccd | vrs1 | vrs2 |
21 14 9 4 0
2-address format with immediate data
| opcode | vaccd | vrs1 | amount |
21 14 9 4 0
3-address format with vector element 
I opcode vrd vaccelem I 9r51 I 
21 14 9 4 0 
Figure 6-4: Different types of instruction formats of the vector processor ISA 
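
The field extraction for the 3-address format can be sketched as below. The [9:5] and [4:0] fields are stated in the text; the opcode and destination boundaries ([21:15] and [14:10]) are inferred from the bit markers in Figure 6-4 and should be treated as assumptions, as should the function names. The op and op2 positions in the 32-bit word follow SPARC V8 format 2.

```python
def extract_const22(instr):
    """Return const22 if the 32-bit word is a format 2 UNIMP encoding, else None."""
    op = (instr >> 30) & 0x3
    op2 = (instr >> 22) & 0x7
    if op == 0 and op2 == 0:
        return instr & 0x3FFFFF
    return None


def decode_fields(const22):
    """Split the 22-bit coprocessor opcode into 3-address format fields."""
    if not 0 <= const22 < (1 << 22):
        raise ValueError("const22 must fit in 22 bits")
    return {
        "opcode": (const22 >> 15) & 0x7F,  # bits [21:15], assumed boundary
        "rd": (const22 >> 10) & 0x1F,      # bits [14:10], assumed boundary
        "rs1": (const22 >> 5) & 0x1F,      # bits [9:5]
        "rs2": const22 & 0x1F,             # bits [4:0], or the immediate data
    }
```

The same [4:0] slice carries the immediate data in the 2-address format, and [9:5] carries the element selector in the vector-element format.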
A similar process is performed in the case of the multiply-add and multiply-sub instructions,
which are 4-address format instructions (the accumulator is an implicit source and
destination operand). In this case the accumulator address and read enable are sent during
the decoding of the latched instruction in order to obtain the accumulator operand for the
next stage. Another difference for these instructions is that they are always implemented
in pairs of even and odd elements. Combinational logic (evod16_en) asserts the
appropriate read enable bits for the even or odd operands. In the case that one of the
source operands is immediate data, this is included in the instruction field [4:0], which is
extracted and sent to the next stage where it is zero extended to 32 bits. A detailed
description of the extension process will be given in the VREG stage. In the case where
one register operand is used to select a vector element (vaccelem) for load or store
operations, the instruction field [9:5] is extracted and used to calculate the write or
read enables respectively for the specific vector element (16-bit word). The same method
is followed in the case of the move from or to Leon3 instructions, to or from an element of
a vector register. When a move from Leon3 instruction is performed, the operand
comes from the main scalar CPU register file and is pipelined to the next stage
(VREG). At that stage it is selected as a source operand and enters the appropriate lane of
the coprocessor vector pipeline to finally commit to either the coprocessor scalar or
vector register file. In the case of a move to Leon3 instruction, the selected 16-bit element
of the vector register is zero extended to 32 bits and sent to the Leon3 register file.
A more detailed description is given in the VREG stage. The custom instruction formats for
the previously mentioned cases are depicted in Figure 6-4. All the datapath data and
control signals are latched at the end of the VDEC stage into the set of registers of type
vdec2vregs. The pipeline enable (reg_en1) of these registers is asserted when the
following conditions are true:
• the main CPU is not halted (holdn='1')
• no exception takes place in the main CPU (ikill='0')
• there is no cache miss in the VLSU (vlsu2vdp.hold='0')
• the coprocessor instruction is valid (opcvalid='1')
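
The four conditions listed above combine into a single enable. A minimal sketch follows, using the signal names from the list with single-bit values modelled as the integers 0 and 1 (a modelling assumption):

```python
def reg_en1(holdn, ikill, vlsu_hold, opcvalid):
    """Pipeline enable for the vdec2vregs registers, per the four listed conditions."""
    return (
        holdn == 1          # main CPU is not halted
        and ikill == 0      # no exception in the main CPU
        and vlsu_hold == 0  # no cache miss in the VLSU
        and opcvalid == 1   # the coprocessor instruction is valid
    )
```

Any single failing condition freezes the VDEC output registers for that cycle.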
6.3 Vector Registers Stage (VREG) 
The Vector Registers Access Stage selects the source operands from the vector/scalar
register files, from the accumulator file or from the bypassed results of the first and
second stages of the vector datapath. In the latter case, the results are made available from
any of the downstream stages to the VREG stage so that they can be used as source
operands if required. This bypassing of intermediate results is established practice
in CPU architecture [3] and is the only way to resolve data dependences without stalling
the pipeline. As mentioned in Chapter 3, data dependences occur when an instruction
needs to use the result of a previous instruction before it is committed to the register file [4].
In addition, this is the stage where the store instruction takes place and where the memory
address for the load instruction is sent to the VLSU so that the load data are ready to be
sent to the second stage of the VDP. Furthermore, the vector length (vlen_r) register and the
overflow and predication (pred) registers are updated. The detailed schematic of the
VREG stage is illustrated in Figure 6-5.
[Figure 6-5: schematic. Elements shown: the decode logic, the vector (VRF) and scalar (SRF)
register files with their read outputs, the vector and scalar bypass results from vdp2vregs,
the resolved operands, the splat data and imm_value inputs, the two final operand
multiplexers, the reg_en2 enable and the vregs2vdp output registers.]
Figure 6-5: Vector Register Access Stage (VREG) microarchitecture
6.3.1 Reverse Data Process 
When a load or store instruction with a negative stride is performed, a special control
signal (vdec2vregs.lst_neg_r) is asserted. In the case of a negative stride store
(vstwn) the data to be written to the memory, which come from the bypass logic of the
second register file read port (resv_vopr2_i), need to be reversed.
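[Figure 6-6: the four 16-bit elements of a 64-bit vector register (element 3 in bits 63..48
down to element 0 in bits 15..0) are placed in reverse order in the vector operand for the
store in the VLSU.]
Figure 6-6: Reverse Data Process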
This is performed with the reverse data function logic (reverse_data), which
swaps the order of the elements as they are placed within the final vector, from the most
significant element to the least significant element. The output of this function is sent to
the input of the VLSU (sregs2vlsu.data_in) as the data that will be written to the
vector data cache. The data-reversing process is shown in Figure 6-6. The reverse process
for the load instruction is the same but it is performed in the VDP2 stage of the vector
processor. For this reason it is described in section 8.4.5.
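
The element reversal can be modelled on a vector held as a Python integer. This is a behavioural sketch only; the argument names and the choice of passing the element count explicitly are assumptions, not the signature of the VHDL reverse_data function.

```python
def reverse_data(vector, num_elements, elem_bits=16):
    """Reverse the order of the elements packed in 'vector' (element 0 in the low bits)."""
    elem_mask = (1 << elem_bits) - 1
    out = 0
    for i in range(num_elements):
        elem = (vector >> (i * elem_bits)) & elem_mask
        # Element i moves to position num_elements - 1 - i.
        out |= elem << ((num_elements - 1 - i) * elem_bits)
    return out
```

Applying the function twice returns the original vector, which is a convenient sanity check for the store and load paths.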
6.3.2 Splat Data Process
Some instructions need to replicate a 16-bit (vsplat_h_r) or 32-bit
(vsplatacci) scalar value to all the elements of a vector register or accumulator
respectively. This "splat" operation is performed by the splat function (splat_data) in
the VREG stage. The splat logic takes as inputs a 32-bit word and the width of the vector
operand into which the value will be copied. If the value that is to be "splated" is 16-bit, it
is duplicated in order to produce the necessary 32-bit value that acts as the 32-bit input of
the function. The resulting vector is sent to the multiplexer responsible for the first
operand selection in the VREG stage. The schematic for this function is depicted in
Figure 6-7.
[Figure 6-7: a 16-bit splat value (bits 15..0) is replicated into every 16-bit element of a
64-bit vector (bits 63..0).]
Figure 6-7: Splat Data Process
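
A behavioural sketch of the splat operation is given below. Expressing the operand width as a count of 32-bit words and selecting the 16-bit case with a flag are assumptions made for illustration; the text only states that the function receives a 32-bit word and the operand width.

```python
def splat_data(value, width_words, is_16bit=False):
    """Replicate a scalar across a vector operand of 'width_words' 32-bit words."""
    if is_16bit:
        half = value & 0xFFFF
        word = (half << 16) | half   # duplicate the 16-bit value into a 32-bit word
    else:
        word = value & 0xFFFFFFFF
    out = 0
    for i in range(width_words):
        out |= word << (32 * i)      # copy the word into every 32-bit slot
    return out
```

With a 64-bit operand (two words), a 16-bit value therefore appears in all four 16-bit elements, matching Figure 6-7.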
6.3.3 Masking Process 
There are two masking processes that are implemented in the VREG stage: these are
mask_width and mask_extract. The mask_width logic takes as an input a value
that indicates the width of the vector to be processed and produces a mask bit-vector that
is VLMAX*16 bits long. The produced mask defines a set of bits that are used as a
selector in order to extract the desired scalar elements from the vector to which the mask is
applied. The input value (width) gives the number of the mask's bits that will be '1' while
the remaining bits will be '0'. The functionality of this masking operation is depicted in
Figure 6-8. This type of mask is used in the bypass logic for the selection and formulation 
of the input operands to the vector ALU stage. 
[Figure 6-8: the unmasked vector (bits VLMAX*16 down to 0) is ANDed with the mask_width
output, which is '0' from bit VLMAX*16 down to bit vlen_value*16 and '1' below that, to
give the masked vector.]
Figure 6-8: Mask width function
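
The mask_width operation can be sketched with integers standing in for bit-vectors. The argument names are illustrative; the width is taken here as a number of bits, following the statement that it gives the number of mask bits set to '1'.

```python
def mask_width(width_bits, vlmax):
    """Return a VLMAX*16-bit mask whose low 'width_bits' bits are 1."""
    total_bits = vlmax * 16
    if not 0 <= width_bits <= total_bits:
        raise ValueError("width exceeds the VLMAX*16-bit vector")
    return (1 << width_bits) - 1


def apply_mask(vector, mask):
    """AND the vector with the mask, as in Figure 6-8."""
    return vector & mask
```

For VLMAX = 4 and a vector length of two elements (32 bits), only the two least significant elements survive the AND.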
The mask_extract logic (function) takes as an input a value that indicates which 32-bit
vector element should be selected and produces a mask that is VLMAX*16 bits long. This
second mask comprises sets of '0's and '1's that are structured in a way to extract the
desired 32 bits from a given input vector. The input value to this function resembles a
"read-enable" that selects the 32-bit element that will be extracted from the vector to
which the mask is applied. The mask_extract functionality is illustrated in Figure 6-9.
[Figure 6-9: the unmasked vector (bits VLMAX*16 down to 0) is ANDed with the mask_extract
output, which is '1' in bits (i+1)*32-1 down to i*32 and '0' elsewhere, to give the masked
vector.]
Figure 6-9: Mask extract function
This type of mask is used to select a scalar element from an accumulator register for load 
or store operations. 
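
A matching sketch for mask_extract is shown below, again with integers as bit-vectors. Passing the element position as an index i is an assumption; the text describes the input only as resembling a read enable.

```python
def mask_extract(index, vlmax):
    """Return a VLMAX*16-bit mask with ones in bits (index+1)*32-1 down to index*32."""
    total_bits = vlmax * 16
    if index < 0 or (index + 1) * 32 > total_bits:
        raise ValueError("element index outside the vector")
    return 0xFFFFFFFF << (index * 32)
```

ANDing an accumulator value with this mask and shifting right by `index * 32` yields the selected 32-bit element.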
6.3.4 Bypass process 
The bypass process is critical for the efficient operation of pipelined processors. In the
VREG stage it selects the operands from the vector/scalar register files, the vector
accumulators or the intermediate results produced in the vector datapath (before they are
written to the register files) by the first and second stages of the VDP. There are
actually two bypass processes: the vector_bypass and the scalar_bypass. In the
vector_bypass process, the two vector-operand read addresses
(vdec2vregs.vrs1_rdaddr_a, vdec2vregs.vrs2_rdaddr_a) for the vector
register file are each compared with the write address
(vdp2vregs.ctrl.vbpass1_vwr_addr_r) of the instruction currently executing at
the first VDP stage (VDP1). If either of them is equal to the VDP write-back address
and the valid signal (vdp2vregs.ctrl.vbpass1_valid) of the bypass result is
asserted, the vector length of the bypassed result (vdp2vregs.data.vlen_cvalue_r)
is compared with the current (architected) vector length (vlen_r) of the coprocessor,
which is held in the vlen register. If the result from the VDP1 stage has a vector
length smaller than the vector length of the resolved operand, then bits 0 to
vdp2vregs.data.vlen_cvalue_r*16-1 are taken from the bypassed result
(vdp2vregs.data.vbpass1_res) while the remaining bits up to vlen_r*16 are filled
with the output of the corresponding read port (vrs1_dout_i or vrs2_dout_i) of the
vector register file. If the vector length of the bypassed result is larger, then the operand's
(resv_vopr1_i or resv_vopr2_i) bits are filled with one of the outputs of the read
ports (vrs1_dout_i or vrs2_dout_i) from bits 0 to vlen_r*16-1. If both
vector lengths are equal, the resolved operand comprises the bypassed result of the first
VDP stage. The same process is followed for the bypassed result of the second VDP stage
(VDP2) and both read ports of the vector register file when the target register of the
first VDP stage does not match the source register in the VREG stage, so that
the appropriate operands are selected. The formulation of the resolved operands is
always performed with the use of the masking process (mask_width). The schematic for
the vector bypass process for one of the vector source operands and the intermediate
result of one of the two VDP stages is illustrated in Figure 6-10.
[Figure 6-10: the bypassed result and the register file output are each masked according to
vlen_r and vlen_cvalue_r, with the unused bits zeroed, and the masked parts are ORed to
form the resolved operand.]
Figure 6-10: Vector bypass process for one of the vector source operands and the
intermediate result of one of the two VDP stages
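
The three vector-length cases of the vector bypass can be written out as a behavioural sketch for a single read port and a single VDP stage. It follows the text literally, including the case where the bypassed result is longer than the architected vector length; the function and argument names are hypothetical, vector lengths are counted in 16-bit elements, and masking the register file output when no bypass applies is an assumption.

```python
def low_mask(elements):
    """Mask covering the low 'elements' 16-bit elements."""
    return (1 << (elements * 16)) - 1


def resolve_vector_operand(rf_data, rd_addr, bp_valid, bp_addr, bp_data, bp_vlen, vlen):
    """Select between register file data and one bypassed result."""
    if not (bp_valid and bp_addr == rd_addr):
        return rf_data & low_mask(vlen)              # no bypass: register file data
    if bp_vlen < vlen:
        low = bp_data & low_mask(bp_vlen)            # bits 0 .. bp_vlen*16-1 from bypass
        high = rf_data & (low_mask(vlen) & ~low_mask(bp_vlen))
        return low | high                            # remaining bits from the read port
    if bp_vlen > vlen:
        return rf_data & low_mask(vlen)              # as described in the text
    return bp_data & low_mask(vlen)                  # equal lengths: bypassed result
```

The OR of two complementary masked values in the first case mirrors the structure shown in Figure 6-10.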
The scalar bypass process is much simpler than the vector bypass as there is no need for
masking of the operands. The three scalar read addresses (srs1_rdaddr_a,
srs2_rdaddr_a and srs3_rdaddr_a) for the scalar register file are each compared
with the write address of the bypassed scalar result
(vdp2vregs.ctrl.sbpass1_swr_addr_r) of the first VDP stage. If they are the same
and the valid signal (vdp2vregs.ctrl.sbpass1_valid) of the result is asserted, the
corresponding resolved operand (resv_sopr1_i, resv_sopr2_i, resv_sopr3_i) is
assigned the scalar bypassed result (vdp2vregs.data.sbpass1_res); otherwise it is
assigned the output of the corresponding read port of the scalar register file
(srs1_dout_i, srs2_dout_i or srs3_dout_i). The same process is followed for the bypassed
scalar result of the second VDP stage and the three read ports of the scalar register file.
Figure 6-11 depicts the scalar bypass process for one of the scalar operands.
[Figure 6-11: multiplexer selecting between the scalar register file output and the
bypassed scalar result (sbypass1_res).]
Figure 6-11: Scalar bypass process for the selection of one of the scalar operands (first)
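
The scalar bypass reduces to an address compare per stage. In the sketch below each bypass source is a (valid, write address, result) tuple; checking VDP1 before VDP2, so that the more recent result wins, is an assumption about priority, and the names are hypothetical.

```python
def resolve_scalar_operand(rd_addr, rf_data, bypass_vdp1, bypass_vdp2):
    """Return the bypassed result on an address match, else the register file output."""
    for valid, wr_addr, result in (bypass_vdp1, bypass_vdp2):
        if valid and wr_addr == rd_addr:
            return result
    return rf_data
```

The same function would be instantiated once for each of the three scalar read ports.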
6.3.5 Operands Selection 
The two source operands (vector or scalar) are selected after the bypass process, prior to
the end of the VREG stage, and committed to the output registers of vregs2vdp type. In
the case of a move-from-coprocessor instruction, the 32-bit data requested by the
Leon3 are extracted from the selected first source operand prior to committing and sent
back to the main CPU write-back stage via the coprocessor-CPU custom interface. The
third operand, which is always scalar, is driven directly from the output of the third read
port (srs3_dout_i) of the scalar register file to the corresponding output register. The
selection of the two operands is performed via two large multiplexers, as depicted in
the detailed schematic in Figure 6-5. The first operand (final_vopr1_i) can be the 16-bit
(vdec2vregs.sel_width_r = '1') or 32-bit output of the scalar bypass process
(resv_sopr1_i) or the output of the vector bypass process (resv_vopr1_i). It can
also be one of the vector accumulator file hardwired read ports
(vdp2vregs.data.vacc1_r or vdp2vregs.data.vacc2_r), the Leon3 general
purpose registers (gpdata) or the "splated" data formulated by the splat function using
data from a scalar register. Similarly, the second operand can come from the output (16-bit
or 32-bit) of the scalar bypass process (resv_sopr2_i), the output of the vector
bypass process (resv_vopr2_i), one of the hardwired read ports of the accumulator
file or the immediate data (imm_value) that have been extracted from the coprocessor
instruction at the decode stage.
6.3.6 Register enable 
The register enable (reg_en2) for the output registers of the VREG stage is asserted
when neither of the hold signals coming from the Leon3 (hold) and the VLSU
(vlsu2vdp.hold) is asserted. In addition, the pipelined register enable
(vdec2vregs.reg_en2_r) should be asserted and the signal vdec2vregs.sel_st_r
must be set to zero in order to prevent any store instruction from taking place. The latter
condition is necessary because the store instruction is performed and completed at the
VREG stage, so the next stages are not used for this instruction. Therefore, when the store
is completed no change in the state of the flip-flops of the following datapath stages should
take place, in order to avoid unnecessary power consumption. The reg_en2 is the pipeline
enable of the output registers (vregs2vdp) in which all the datapath data and control
signals of the VREG stage are latched.
6.3.7 Vector Register File (gxx_vreg_file)
The coprocessor stores the results of the vector computations in a vector register file that
is a two-dimensional storage array where every row holds all the scalar elements of a
single vector. The vector register file is parametric, so its dimensions are specified by
compile-time parameters in both axes. The width is defined by the number of vector
elements (16 bits each) and is equal to the maximum vector length (VLMAX), while the
number of such entries is VREGS, equal to the number of architectural vector registers. The
vector register file provides two read ports and one write port, which translates to two
vector read operations and one vector write operation per cycle.
6.3.7.1 Parameterisation 
The vector register file is a fully configurable design. The number of register windows
(VREGS) is within the range of 2 to 32, with a default setting of 16. These parameters are
specified in gxx_config.vhd and are shown in Table 6-3:
Table 6-3: Compile-time vector register file parameters for its architectural and
microarchitectural state that are contained in the gxx_config.vhd file
Parameter    Default
VLMAX        2, 4, 8, 16, 32, 64, 128
VREGS        16
Technology   GEN, TSMC013
6.3.7.2 The vector register file implementation
The electrical interface of the vector register file is shown in Figure 6-12.
[Figure 6-12: gxx_vreg_file block. Inputs: clk, reset, vrs1_rdaddr and vrs1_rden (from
vdec2vregs.vrs1_rdaddr_a and vdec2vregs.vrs1_rden_a), vrs2_rdaddr and vrs2_rden (from
vdec2vregs.vrs2_rdaddr_a and vdec2vregs.vrs2_rden_a), the write address and write enable
(from vdp2vregs.ctrl.vrd_waddr_r and vdp2vregs.ctrl.vrd_wen_r) and the write data (from
vdp2vregs.data.vdp_vres). Outputs: vrs1_dout (vrs1_dout_i) and vrs2_dout (vrs2_dout_i).]
Figure 6-12: Electrical Interface of Vector Register File
It has
two read address ports (vdeC2vregs.vrsl_rdaddr_a, 
vdec2vregs. vrs2_rdaddr_a) that are dri ven unlatched from the vector decode s tage 
in paralle l wi th the decoding of the latched opcode, in order to in itiate the register fi le 
access which in mrn, will return the operands before the end of the VR EG stage. The 
write address port (vdp2vregs. ctrl. vrd_waddr_r) i coming pipelined frolll the 
end of the second stage of the VDP in order to commit the vector re ult. The register fi le 
is technology-i ndependent and allows two reads and one write to be performed on the 
ame cyc le. In the case where a read of a register is req uired at the same cyc le that it is 
written, a RfW connict occurs. When thi s condit ion i detected the read-port is disabled 
and the data are bypassed from the wri te-port write-data. This ensures that the memory 
cell does not get corrupted when doing a si mu ltaneous RJW operati on at the same 
ad dress. This behaviour ha been observed in the TSMC O. 1 3~Lm dual-port RAMs and the 
above solution en ures that thi s extreme case never cau es corrupt ion of data. Figure 6- 13 
details th organisation of the vector register fi le with RJW connict avoidance. This 
performed by the conflictyrocess logic in which each read add. e s is compared 
with the incoming write address and if are the same and any of the bits of the read or 
wriLe enable signals are asserted then a connict signal is produced. Becau e there are two 
read ports there are two connict signals (conflictl_i, conflict2_i) and when one of 
them is asserted the output data are comi ng from the write-port data via the output 
multiplexer. 
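The R/W conflict avoidance described above can be illustrated with a small behavioural model. The following Python sketch is not the VHDL of the coprocessor; the class name, the default register count and the use of one enable bit per port (instead of a byte-granular bit vector) are assumptions made only for illustration, and a conflict is modelled as requiring both enables to be active.

```python
class RegFile2R1W:
    """Behavioural sketch of a 2-read/1-write register file with R/W bypass.

    Hypothetical model: one enable bit per port, integer register contents.
    """

    def __init__(self, num_regs=16):
        self.mem = [0] * num_regs

    def cycle(self, rd_addrs, rd_ens, wr_addr, wr_en, wr_data):
        """Evaluate one clock cycle and return the data seen on each read port."""
        outs = []
        for addr, en in zip(rd_addrs, rd_ens):
            # Conflict: this port reads the address that is written in this cycle.
            conflict = en and wr_en and addr == wr_addr
            if conflict:
                outs.append(wr_data)         # read port disabled, data bypassed
            elif en:
                outs.append(self.mem[addr])  # normal read of the stored value
            else:
                outs.append(None)            # port idle
        if wr_en:
            self.mem[wr_addr] = wr_data      # write commits at the end of the cycle
        return outs


rf = RegFile2R1W(num_regs=4)
rf.cycle([0, 1], [True, True], wr_addr=2, wr_en=True, wr_data=0xAB)
print(rf.cycle([2, 0], [True, True], wr_addr=2, wr_en=True, wr_data=0xCD))
```

In the second cycle the first read port targets the register being written, so it returns the new write data rather than the stored value.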
Figure 6-13: Detailed microarchitecture of the Vector Register File with R/W conflict
avoidance
In addition, there are two read enable ports (vdec2vregs.vrs1_rden_a,
vdec2vregs.vrs2_rden_a) that come from the decode stage at the same time as
the read addresses. The read enable signal is a bit vector in which every bit enables the
read operation at byte granularity from the selected register. In this specific case every 2
bits of the read enable signal correspond to a 16-bit element of the source vector
register. When the read addresses are valid and the read enables are set to '1', the
corresponding data are read and sent to the outputs of the register file (vrs1_dout_i,
vrs2_dout_i). The write enable strobes (vdp2vregs.ctrl.vrd_wen_r) arrive
pipelined from the end of the VDP2 stage and are based on the same principle as the read
enable strobes. When the write address is valid and the write enable is asserted the input
data (vdp2vregs.data.vdp_vres) are written to the selected register.
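The byte-granular enables can be sketched as follows. This is a minimal Python illustration, assuming that the active elements are the lowest-numbered ones and that a register is represented as a plain integer; both function names are hypothetical and do not appear in the design.

```python
def element_enable_mask(num_elements):
    """Byte-enable bit vector with 2 bits set per active 16-bit element."""
    return (1 << (2 * num_elements)) - 1


def masked_write(old, new, byte_en, width_bytes):
    """Write only the bytes of `new` whose enable bit is set; others keep `old`."""
    result = 0
    for b in range(width_bytes):
        src = new if (byte_en >> b) & 1 else old
        result |= src & (0xFF << (8 * b))
    return result


# Write the two lowest 16-bit elements of a 64-bit register, keep the rest.
mask = element_enable_mask(2)
print(hex(masked_write(0x1111222233334444, 0xAAAABBBBCCCCDDDD, mask, 8)))
```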
6.3.8 Scalar Register File (gxx_sreg_file)
The scalar operands are stored in the scalar register file, which is again a two-dimensional
storage array. It contains sixteen 32-bit registers and supports three read operations and
one write operation per cycle.
6.3.8.1 Parameterisation
The scalar register file has SREGS registers, where SREGS can range from 2 to 32 with a
default setting of 16. The compile-time parameters that specify the structure of the scalar
register file are shown with their default values in Table 6-4:
Table 6-4: Compile-time scalar register file parameters for its architectural state that are
contained in gxx_config.vhd file
Parameter     Default
SREGS         16
Technology    GEN, TSMC013
6.3.8.2 Scalar register file implementation 
The electrical interface of the scalar register file is depicted in Figure 6-14.
Figure 6-14: Electrical Interface of Scalar Register File
The scalar register file has three read address ports (vdec2vregs.srs1_rdaddr_a,
vdec2vregs.srs2_rdaddr_a, vdec2vregs.srs3_rdaddr_a) and three read enable ports
(vdec2vregs.srs1_rden, vdec2vregs.srs2_rden, vdec2vregs.srs3_rden) that come
unlatched from the vector decode stage (VDEC). This allows the scalar
register file to be accessed during decoding and to produce the scalar operands before
the end of the VREG stage, in time for bypassing. An additional
read port was attached to the scalar register file because a third operand was needed to act
as the accumulator for the multiply-add/sub instructions.
Figure 6-15: Detailed microarchitecture of the Scalar Register File with R/W conflict
avoidance
The write address (vdp2vregs.ctrl.srd_waddr_r) and the write enable
(vdp2vregs.ctrl.srd_wen_r) arrive pipelined from the end of the VDP2 stage. The
scalar register file is described in a technology-independent way and supports three read
operations and one write operation per cycle. R/W conflict avoidance uses three conflict
signals (conflict1_i, conflict2_i, conflict3_i), produced by conflict_process
1, 2 and 3, to prevent a read and a write operation from accessing the
same register address simultaneously. In this case the particular read port is disabled and
the data are instead bypassed from the write-port data. The detailed schematic is depicted in
Figure 6-15.
6.3.9 Vlen register 
The degree of data-level parallelism that the vector coprocessor can exploit on every
cycle is defined by the vector length register (vlen_r). This control register stores the
value of the dynamic vector length, which determines the number of 16-bit elements on
which the vector operations will be performed. For example, a vector short addition with
a vector length of four will only add the first four pairs (4x16-bit) of elements of the input
vector registers and will ignore the rest. The value of the vector length is stored in the
vlen_r register before any other instruction takes place in order to reconfigure the
hardware. The vector length can take any value that is a multiple of two, such as 2, 4, 16,
32 or 64, up to the maximum vector length (VLMAX), which in this case is 128 (2048 bits).
If the data-level parallelism in a particular loop of the speech algorithm, which corresponds
to the number of times the loop body is executed, is greater than VLMAX, then vlen_r is
loaded with the maximum value and the sequence of identical operations that
comprise the loop is performed. At the end, vlen_r is loaded with the remainder of the
division of the number of loop repetitions by VLMAX (loop strip mining)
and one more iteration of identical operations is performed, this time with a
shorter-than-VLMAX vector length. The instruction responsible for loading the vlen_r
register with a vector length value is ldvlen_r (value). When this instruction is
encountered, the value is extracted from the instruction opcode and the vlen_r
write-enable (vdec_i.vlen_wen) is set at the decode stage. Subsequently they are latched
into the VREG stage as vdec2vregs.vlen_nvalue_r and vdec2vregs.vlen_wen_r
respectively and pipelined to the following stages of the vector datapath coprocessor. At
the VREG stage the pipelined write-enable (vdp2vregs.ctrl.vlen_wen_r) is
checked and, if it is asserted, the pipelined value of the vector length
(vdp2vregs.data.vlen_nvalue_r) from the VDP2 stage is committed to vlen_r. In
every cycle vlen_r is read and the current value
(vregs2vdp.data.vlen_cvalue_r) is pipelined to the next stages to dynamically
reconfigure the vector pipeline [5].
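The loop strip mining scheme described above can be sketched in Python as follows. The function names are hypothetical, the element-wise addition stands in for an arbitrary vector operation, and the sketch does not enforce the multiple-of-two restriction on the vector length nor 16-bit wrap-around.

```python
VLMAX = 128  # maximum vector length in 16-bit elements, as stated in the text


def strip_mine(trip_count, vlmax=VLMAX):
    """Return the sequence of vector lengths that would be loaded into vlen_r."""
    full_strips, remainder = divmod(trip_count, vlmax)
    lengths = [vlmax] * full_strips
    if remainder:
        lengths.append(remainder)  # final, shorter-than-VLMAX strip
    return lengths


def vector_add_short(a, b, vlmax=VLMAX):
    """Element-wise addition executed strip by strip (plain integers)."""
    out = []
    for vlen in strip_mine(len(a), vlmax):
        start = len(out)
        out.extend(x + y for x, y in zip(a[start:start + vlen], b[start:start + vlen]))
    return out


print(strip_mine(300))  # two full strips followed by the remainder strip
```

A loop with 300 iterations is therefore covered by two strips of VLMAX elements and one strip of 44 elements.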
6.3.10 Overflow and Pred Flags 
When an arithmetic instruction produces a result that is greater than the value a register
can store or represent, an overflow bit is asserted and written to the overflow flag.
The overflow flag is set to indicate a problem so that the software can be aware of this
condition and act accordingly to compensate for or mitigate the error. More specifically, both
ITU-T speech coding algorithms that execute on the coprocessor deal with this problem
by using the saturation instruction, which limits the output to the allowed range for 16-bit or
32-bit numbers. The coprocessor has a vector overflow register (ovf) and an overflow
flag (w). The vector overflow register (ovf) is VLMAX/2 bits long, one overflow bit per
32 bits of the vector length, and it is updated at the VREG stage when the overflow
enable (vdp2vregs.ctrl.ovf_wen) that is forwarded from the end of the second stage
of the VDP is set. In this case, the vector overflow register takes the new value
(vdp2vregs.data.ovf_r) that arrives pipelined from the VDP2 stage. The
overflow flag (w) is 1 bit wide and it changes when the same overflow enable is
asserted. The new value of the overflow flag is the or-reduce result of the vector overflow
value (vdp2vregs.data.ovf_r). In the case of a vector comparison instruction the
predicate bits are set according to the result of the comparison and written to the pred
flag. The comparison is performed on pairs of vector operand elements and every
predicate bit produced corresponds to a 16-bit comparison. The pred register is VLMAX
bits long. It is updated at the VREG stage when the predicate write-enable
(vdp2vregs.ctrl.pred_en) that arrives pipelined from the end of the
VDP2 stage is set, and it takes the resulting predicate bits
(vdp2vregs.data.pred_r) from the comparison.
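The relationship between saturation, the per-lane overflow bits, the or-reduced overflow flag and the predicate bits can be illustrated with the following Python sketch. It assumes that the overflow bit of a 32-bit lane is the OR of the overflow bits of its two 16-bit elements and uses a greater-than comparison as an example; these choices and all function names are illustrative assumptions rather than the documented behaviour of the coprocessor.

```python
MAX16, MIN16 = 0x7FFF, -0x8000


def sat_add16(a, b):
    """Saturating 16-bit addition; returns (result, overflow_bit)."""
    s = a + b
    if s > MAX16:
        return MAX16, 1
    if s < MIN16:
        return MIN16, 1
    return s, 0


def vector_sat_add(va, vb):
    """Short vector add on an even number of elements.

    Returns (results, one overflow bit per 32-bit lane, or-reduced flag).
    """
    pairs = [sat_add16(a, b) for a, b in zip(va, vb)]
    results = [r for r, _ in pairs]
    elem_ovf = [o for _, o in pairs]
    # Assumption: a lane overflows if either of its two 16-bit elements does.
    lane_ovf = [elem_ovf[i] | elem_ovf[i + 1] for i in range(0, len(elem_ovf), 2)]
    flag = int(any(lane_ovf))  # or-reduce of the vector overflow value
    return results, lane_ovf, flag


def vector_cmp_gt(va, vb):
    """One predicate bit per 16-bit comparison."""
    return [int(a > b) for a, b in zip(va, vb)]


print(vector_sat_add([30000, 1, 2, 3], [10000, 1, 2, 3]))
```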
6.4 Vector Load/Store Unit (gxx_vlsu)
At the VREG stage the Vector Load/Store Unit (VLSU) is accessed: the load/store
instruction, along with the store data (in the case of a store) and the control information, is
sent from the vector coprocessor to the VLSU. The electrical interface of the VLSU is
illustrated in Figure 6-16.
Figure 6-16: VLSU Electrical Interface
In the case of a load instruction (sregs2vlsu.read='1'), the VLSU takes the read
address (sregs2vlsu.addr) for the memory along with the valid signal
(sregs2vlsu.addr_valid) and the vector length (sregs2vlsu.vlen) that
determines the width of the vector data, in order to prepare that vector and return it to the
second stage of the VDP. When a store instruction (sregs2vlsu.read='0') is performed,
the address and the control signals are again sent from the VREG stage to the VLSU
along with the data to be stored (sregs2vlsu.data_in). The VLSU has a cascade
TAG/DATA configuration resulting in one latent Load-Use cycle through the bypass
logic of the vector coprocessor. This means that the TAG array is checked one cycle
before the DATA array is accessed, so that the load data are
ready at the second stage of the VDP. Even though this configuration has a higher
latency than the more traditional parallel TAG/DATA organization, it leads to
substantially lower power consumption: in a multi-way parallel TAG/DATA
configuration, all TAG and all DATA arrays are powered up concurrently, whereas in the
cascade TAG/DATA configuration all TAG RAMs are powered up during cycle 1 but
only the selected way of the DATA RAM is powered up in cycle 2.
Figure 6-17: Parallel TAG/DATA configuration and Cascade TAG/DATA configuration
caches
For example, a cascade 4-way set associative data cache has four TAG RAMs (cycle 1)
and one DATA RAM (cycle 2) powered up, while a parallel data cache has four
TAG RAMs and four DATA RAMs, i.e. eight RAMs in total, powered up. Figure
6-17 depicts the parallel TAG/DATA and cascade TAG/DATA cache configurations.
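The RAM activation count of the two organisations can be expressed as a small Python sketch. The function name and the returned dictionary layout are hypothetical; the counts follow the 4-way example in the text and assume a single access that hits in the cache.

```python
def rams_powered(ways, organisation):
    """Number of RAMs powered up per cycle for one access to a `ways`-way cache."""
    if organisation == "parallel":
        # All TAG and all DATA arrays are accessed concurrently.
        return {"cycle1": 2 * ways, "cycle2": 0}
    if organisation == "cascade":
        # All TAG arrays first, then only the selected DATA way.
        return {"cycle1": ways, "cycle2": 1}
    raise ValueError("organisation must be 'parallel' or 'cascade'")


for org in ("parallel", "cascade"):
    counts = rams_powered(4, org)
    print(org, counts, "total:", sum(counts.values()))
```

For four ways the parallel organisation activates eight RAMs and the cascade organisation five, at the cost of one extra cycle of latency.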
Figure 6-18: Microarchitecture of VLSU in cascade TAG/DATA configuration
The microarchitecture of the VLSU is parameterised and is depicted in Figure 6-18.
A number of compile-time parameters specify the number of ways and the
size of each cache; they are defined in the coprocessor configuration file (gxx_config.vhd).
In addition, there is a Finite-State-Machine (FSM) that handles the communication with
the AMBA bus [6].
6.5 Vector Datapath Stage (VDP) 
The Vector Datapath Stage forms the execution core of the vector coprocessor and is the
most complicated piece of logic. The VDP is divided into two stages. Stage one performs
all the arithmetic, shift and miscellaneous vector operations that have single-cycle
latency, and the multiplication part of the multiply-add and multiply-sub instructions.
Figure 6-19: Microarchitecture of the VDP stage
The second stage accepts returning loads from the VLSU and performs the
addition/subtraction part of the multiply-add/sub as well as the setting-up of the write data
to the register files. At the end of each of the VDP stages, right before the results are latched
into the corresponding output (VDP1 stage) or architectural (VDP2 stage) registers, they
are bypassed to the VREG stage in order to be available to dependent instructions
and avoid stalls due to Read-After-Write (RAW) dependences. The detailed schematic of
the VDP stage is shown in Figure 6-19. Stage one consists of four vector datapath units:
the vector adder (vadd), the vector multiplier (vmult), the vector shifter (vshift) and
the vector miscellaneous (vmisc) unit. Each such vector unit consists of VLMAX/2
replications of its corresponding scalar unit, which produces a 32-bit result. At the input of
the vector units there are multiplexers that select which vector unit will accept the input
operands and the control signals coming from the output registers of the VREG
stage. The vector units not participating in the current computation cycle execute a nop
instruction. This input operand gating eliminates redundant switching activity
in the multiple functional units of the vector datapath: unused functional
units are kept in a quiescent state by maintaining constant inputs, which minimizes
switching activity and, as a result, dynamic power consumption. When the
coprocessor is performing a scalar instruction, the scalar operands along with the signal
(vregs2vdp.ctrl.sel_vs_r='0') that indicates a scalar operation are the
inputs to the vector units. Special logic activates the scalar lane (lower 32 bits) of the
particular unit in the selected vector path instead of implementing a
dedicated scalar datapath. This results in reduced silicon area and control logic overheads
and also in less verification effort [6]. At the output of the vector units there is another
multiplexer that selects the vector result (vbpass1_res) or the scalar result
(sbpass1_res) to be passed on to the VDP2 stage, depending on the operation. At the same
time, the result is bypassed to the VREG stage as an intermediate result and is also
written to the output registers of the first stage (reg_st1). At the second VDP stage the
latched result from stage one (stage1_res_r) or the load data (vlsu_res) returned
from the VLSU are sent directly for writing back at the end of the cycle. In the case of
the multiply-add/sub instruction the latched result from stage one (stage1_res_r) is
used as the second input operand to the vector adder unit (vadd_snd_stage) of the
second stage for the addition/subtraction part of the operation. When other instructions
that employ accumulators occur, the registered result (stage1_res_r) is driven as the
input data to the accumulator file, with the exception of the vaccreduce instruction,
where the latched result of stage one is sent as an input to the adder tree (reduction unit).
At the end of the second VDP stage, there is a multiplexer that selects the final result
from the vector adder, the adder tree, the vector accumulator file, the load data from the
VLSU or the pipelined result from the previous stage to commit to the vector/scalar
register files. The final result is bypassed, as discussed previously, to the VREG stage so
that dependent instructions do not stall the vector pipeline.
6.5.1 Vector Adder Unit (gxx_vadd_dp)
The vector adder unit (vadd) is an array of VLMAX/2 identical units, where every such
functional unit takes two 32-bit operands and produces a 32-bit result. The vadd unit can
perform short (16-bit) or long (32-bit) addition, subtraction and comparison, or a 32-bit to
16-bit round operation. The electrical interface of the vector adder unit is depicted in Figure
6-20. As shown, every such functional unit takes as operands the corresponding elements
of the input vectors and produces a 32-bit vector result along with the overflow and
predicate bits.
Figure 6-20: Electrical interface of the vector adder unit
The vadd unit comprises two mirrored combinational logic blocks that are called the
"low" and "high" parts of the unit. The low part calculates the least-significant 16 bits of
the 32-bit result and the high part calculates the most-significant 16 bits. Additional logic
between the low and high parts combines them in order to perform a long (32-bit)
instruction. When a short operation is performed (vadd_in.SIMD(i)='0') the two
blocks work in parallel and produce two 16-bit results along with separate overflow and
predicate bits. When a long operation takes place the two blocks are linked together,
e.g. the carry out of the low part is driven to the carry in of the high part of the functional
unit. A detailed schematic of the vadd functional unit's microarchitecture is illustrated in
Figure 6-21. The remaining control signals that define which operation the vector adder
unit will execute are described in more detail in Appendix B.
Figure 6-21: Microarchitecture of a functional unit of the vector adder
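The linking of the low and high parts through the carry chain can be sketched as follows. This Python model covers plain wrap-around addition only; subtraction, comparison, rounding, saturation and the overflow/predicate outputs of the real unit are omitted, and the function names are hypothetical.

```python
MASK16 = 0xFFFF


def add16(a, b, carry_in=0):
    """16-bit adder block; returns (sum, carry_out)."""
    s = (a & MASK16) + (b & MASK16) + carry_in
    return s & MASK16, s >> 16


def vadd_unit(opr1, opr2, simd):
    """One 32-bit functional unit: simd=0 gives two short adds, simd=1 one long add."""
    low, carry = add16(opr1, opr2)
    # For a long operation the carry out of the low part feeds the high part.
    carry_into_high = carry if simd else 0
    high, _ = add16(opr1 >> 16, opr2 >> 16, carry_into_high)
    return (high << 16) | low


print(hex(vadd_unit(0x0001FFFF, 0x00010001, simd=0)))  # two independent 16-bit sums
print(hex(vadd_unit(0x0001FFFF, 0x00010001, simd=1)))  # one 32-bit sum
```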
6.5.2 Vector Multiplier Unit (gxx_vmult_dp)
The vector multiplier unit (vmult) is an array of VLMAX/2 identical datapath units,
where each such datapath takes two 16-bit operands and produces a 32-bit result. The
vmult can execute all kinds of multiplications that the target speech coding workloads
require. The electrical interface of the vmult is illustrated in Figure 6-22.
Figure 6-22: Electrical interface of the vector multiplier unit
Every functional unit takes as inputs the corresponding 16-bit elements, even or odd, of
the full input vector operands and produces a 32-bit result for long multiplication or a
zero-extended 16-bit result for result consistency with the other types of multiplication,
along with an overflow bit. This is because every such unit comprises a 16x16 signed
multiplier and every multiplication operation executes in instruction pairs for the even
and odd elements of the input vector operands. The reason behind this choice comes from
previous simulation studies, which showed an improvement in the dynamic instruction
count metric of the order of only 2-4% when multiplying the even and odd elements of the
input operands in parallel. Therefore, by utilising a single multiplier per 32-bit scalar
datapath, the number of multipliers is halved while the performance
penalty is very small. The complete schematic of the microarchitecture of a functional unit
of the vector multiplier is shown in Figure 6-23.
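The even/odd instruction pairing can be illustrated with a short Python sketch. It assumes plain signed integer products and ignores the fractional scaling, saturation and overflow handling of the actual multiply instructions; the function names are hypothetical.

```python
def vmult_pass(va, vb, parity):
    """One multiply instruction: each 32-bit lane multiplies its even (parity=0)
    or odd (parity=1) 16-bit element pair with its single 16x16 multiplier."""
    return [va[i] * vb[i] for i in range(parity, len(va), 2)]


def vmult_all(va, vb):
    """Multiply every element pair using an even pass followed by an odd pass."""
    out = [0] * len(va)
    out[0::2] = vmult_pass(va, vb, 0)
    out[1::2] = vmult_pass(va, vb, 1)
    return out


print(vmult_all([1, 2, 3, 4], [5, 6, 7, 8]))
```

Two instructions are needed to cover all elements, but only half as many multipliers are instantiated as there are 16-bit elements.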
Figure 6-23: Microarchitecture of a functional unit of the vector multiplier
6.5.3 Vector Shifter Unit (gxx_vshift_dp) 
The vector shifter unit (vshift) is an array of VLMAX/2 identical units, where each
such functional unit takes as inputs two 32-bit operands and produces a 32-bit result and
an overflow bit. The vshift can perform short and long shift left and shift right operations
on all or on the even/odd elements of the input vector operand. The electrical interface of the
vector shifter unit is depicted in Figure 6-24.
Figure 6-24: Electrical interface of the vector shifter unit
Every functional unit comprises two mirrored combinational logic blocks for the "low"
and "high" parts of the input vectors. The corresponding 16-bit elements of the input
vector operands drive each of them respectively. Additional logic links the two parts of
the unit in order to execute the long shift (SIMD_i='1'). Each of the logic blocks
contains a specialised barrel shifter that implements the core functionality of the ITU-T
shift operations. A barrel shifter is a common digital circuit that can shift or even rotate a
data word by any number of bits in a single cycle [7]. For this particular design, a 16-bit
bi-directional barrel shifter is implemented that comprises a network of multiplexers and
can shift up to 15 positions in either direction. Each functional unit contains two such
16-bit barrel shifters that are connected in series in order to execute two short shifts or one
long shift. Figure 6-25 shows the connections of the ports of the two barrel shifters.
Figure 6-25: Two Barrel Shifters connected in series for short or long shift operations
The right shift output (rso) of the "low" barrel shifter and the left shift output (lso) of
the "high" barrel shifter are left open as no rotation is specified in the coprocessor ISA.
For the same reason the right shift input (rsi) of the "low" shifter and the left shift
input (lsi) of the "high" shifter are permanently tied to zero. The remaining ports
are linked together in order to execute the long shifts. The additional combinational logic
around the barrel shifter saturates the shift result in case of underflows or overflows and
checks if the shift amount is negative in order to perform the opposite-direction shift. A
detailed schematic of the microarchitecture of a functional unit of the vector shifter is
shown in Figure 6-26.
Figure 6-26: Microarchitecture of a functional unit of the vector shifter
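The saturation and the handling of negative shift amounts can be sketched for the short (16-bit) case as follows. This Python model assumes semantics in the style of saturating 16-bit shift operators (left shifts saturate, right shifts are arithmetic); it is an illustration and not a bit-exact description of the unit, and the function names are hypothetical.

```python
MAX16, MIN16 = 0x7FFF, -0x8000


def shl16(x, n):
    """Saturating 16-bit left shift; returns (result, overflow_bit).

    A negative shift amount performs the opposite-direction (right) shift.
    """
    if n < 0:
        return shr16(x, -n)
    r = x << n  # Python integers do not wrap, so the range can be checked directly
    if r > MAX16:
        return MAX16, 1
    if r < MIN16:
        return MIN16, 1
    return r, 0


def shr16(x, n):
    """Arithmetic 16-bit right shift; a negative amount shifts left instead."""
    if n < 0:
        return shl16(x, -n)
    # Shifting by 15 or more leaves only the sign: 0 or -1.
    return x >> min(n, 15), 0


print(shl16(0x4000, 1))  # overflows and saturates to the maximum value
print(shl16(16, -2))     # negative amount: shifts right by two
```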
6.5.4 Vector Miscellaneous Unit (gxx_vmisc_dp) 
The vector miscellaneous unit (vmisc) contains the logic that implements the
miscellaneous vector operations of the coprocessor ISA. Every functional unit of the
vmisc accepts two 32-bit input vector operands in the case of vector operations or one 32-bit
scalar operand for scalar operations, and produces a 32-bit result (or a 16-bit result
zero-extended to 32 bits). The electrical interface of the vmisc is depicted in Figure 6-27.
Figure 6-27: Electrical interface of the vector miscellaneous unit
6.5.5 Reverse Data Logic 
As previously mentioned for the case of store operations with a negative stride, the store
data are reversed in the VREG stage prior to sending them to the VLSU. The same
operation is performed for data that return from the VLSU (vlsu2vdp.data) in the case
of a load instruction with negative stride (vldwn). This is achieved with the reverse data
logic (function). The output of the function (reverse_data) drives a multiplexer which
selects between the reversed (load with negative stride) and non-reversed (standard load)
data for the returning result at the end of the second VDP stage.
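A minimal Python sketch of the reverse data logic and the selecting multiplexer is given here. It assumes that the reversal is applied to the first vlen 16-bit elements and leaves any remaining elements untouched; the granularity and the name load_result are assumptions made for illustration.

```python
def reverse_data(elements, vlen):
    """Reverse the order of the first `vlen` elements; the rest are unchanged."""
    return elements[:vlen][::-1] + elements[vlen:]


def load_result(vlsu_data, vlen, negative_stride):
    """Multiplexer between reversed and non-reversed returning load data."""
    return reverse_data(vlsu_data, vlen) if negative_stride else vlsu_data


print(load_result([1, 2, 3, 4, 0, 0], vlen=4, negative_stride=True))
```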
6.5.6 Masking Process Logic 
The masking process logic, which is implemented in the second VDP stage, selects the
appropriate elements of a vector input and places them in the target vector accumulator.
The input control signal (reg_st1.ctrl.sel_evod_r) defines which elements, even
or odd, should be extracted from the vector input and placed at the corresponding even
or odd elements of the vector accumulator. The second input operand
(reg_st1.ctrl.vrf_opr2_r) determines whether the sixteen bits of the even or odd
elements of the vector input should be placed at the MSB (deposit high operation) or LSB
(deposit low operation) of the 32-bit elements of the vector accumulator.
Figure 6-28: Masking process logic for low (vrf_opr2_r='1') or high (vrf_opr2_r='0') deposit
for the even elements of the input vectors to the accumulator
This is implemented by shifting left, by the amount specified in the second operand, the
16-bit scalar elements (within the vector input) that will be placed inside the
corresponding 32-bit elements of the vector accumulator; the amount can take only the
values of zero or sixteen. The remaining bits of each element of the accumulator are filled
with zeros. Figure 6-28 depicts the masking process for the even elements of a vector
value with the second operand being zero and sixteen respectively.
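The deposit operation can be sketched in Python as follows. The sketch assumes that each 32-bit accumulator element receives the selected (even or odd) 16-bit element of its own lane, shifted left by zero or sixteen bits and zero-filled elsewhere; this lane mapping and the function name are assumptions made for illustration.

```python
def deposit(vec16, sel_odd, shift):
    """Place the even or odd 16-bit elements into 32-bit accumulator elements.

    shift = 0 deposits into the low half, shift = 16 into the high half;
    the remaining bits of each accumulator element are zero.
    """
    if shift not in (0, 16):
        raise ValueError("shift amount must be 0 or 16")
    start = 1 if sel_odd else 0
    return [((e & 0xFFFF) << shift) & 0xFFFFFFFF for e in vec16[start::2]]


vec = [0x1111, 0x2222, 0x3333, 0x4444]
print([hex(x) for x in deposit(vec, sel_odd=False, shift=0)])
print([hex(x) for x in deposit(vec, sel_odd=True, shift=16)])
```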
6.5.7 Bypassing network of the first VDP stage
At the end of the first stage and prior to clocking the results into the VDP2 input registers,
the intermediate vector and scalar results from the first execution stage
(vdp2vregs.data.vbpass1_res and vdp2vregs.data.sbpass1_res) are
forwarded to the VREG stage as inputs to the bypass logic process for the source vector
and scalar operands selection. In addition, the destination write-addresses for the vector
(vdp2vregs.ctrl.vbpass1_vwr_addr_r) and the scalar
(vdp2vregs.ctrl.sbpass1_swr_addr_r) register files are sent along with the valid
signals. The bypass-valid vector and scalar signals
(vdp2vregs.ctrl.vbpass1_valid, vdp2vregs.ctrl.sbpass1_valid) are
asserted in the same way as the register enable signals. For the vector bypass result, the
current vector length (vdp2vregs.data.vbpass1_vlen_r) is also sent to the bypass
logic to determine the extent to which the intermediate result will comprise the source
operand. The bypass logic for the second VDP stage is described in section 6.4.15.
6.5.8 Register Enable for the input VDP2 registers 
The register enable (reg_en3) for the registers of the first VDP stage is asserted when
both hold signals, from the Leon3 (hold) and from the VLSU unit
(vlsu2vdp.hold), are set to zero. In addition, the latched register enable of the previous
stage (vregs2vdp.ctrl.reg_en3_r) must be asserted. The registers at the end of the
first VDP stage are of reg_st1 type.
6.5.9 Second stage adder 
When a multiply-add or a multiply-sub is executed, the multiplication is performed at the
first VDP stage while the addition or subtraction part of the instruction is performed in
the vector adder in the second VDP stage.
Figure 6-29: Electrical interface of the second VDP stage vector adder
This vector adder is identical to the vector adder unit of the previous stage apart from the
fact that the control signals are pre-set to perform long addition or subtraction. This
pipeline scheme was chosen to allow single-cycle operations in the VDP1 stage and at the
same time to allow compound operations such as multiply-add and multiply-sub to be
fully pipelined, by using the multiplier in the first stage and the second instance of the
vector adder in the second VDP stage; the vector adder unit is reasonably cheap. The
first input operand comes from a vector accumulator (vacc_opr1) or a scalar register
(reg_st1.data.srf_opr3_r), depending on whether the instruction is a vector or a scalar
one. The second input operand is the registered multiplication result from the previous stage
(stage1_res_r). The result from the vector adder (vadd2_out.vres) is written to the
target vector accumulator and is also bypassed at the end of the second stage for the
dependent instructions.
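The two-stage execution of a multiply-add can be illustrated with a small cycle-based Python sketch for a single lane. It assumes 32-bit wrap-around arithmetic instead of saturation and, for multiply-sub, that the product is subtracted from the accumulator; apart from stage1_res_r, the names are hypothetical.

```python
def to_s32(x):
    """Wrap an integer to the signed 32-bit range."""
    return ((x + 2**31) % 2**32) - 2**31


def run_mac(pairs, acc=0, sel_sub=False):
    """Issue one multiply-add (or multiply-sub) per cycle through a 2-stage pipe.

    Stage 1 multiplies; stage 2 adds the registered product (stage1_res_r)
    to the accumulator. One extra cycle drains the pipeline.
    """
    stage1_res_r = None  # pipeline register between VDP1 and VDP2
    for pair in list(pairs) + [None]:
        # Stage 2 consumes the product latched in the previous cycle.
        if stage1_res_r is not None:
            delta = -stage1_res_r if sel_sub else stage1_res_r
            acc = to_s32(acc + delta)
        # Stage 1 produces a new product for the next cycle.
        stage1_res_r = pair[0] * pair[1] if pair is not None else None
    return acc


print(run_mac([(1, 2), (3, 4), (5, 6)]))  # accumulates 2 + 12 + 30
```

Because the multiplier and the second adder work on different instructions in the same cycle, a new multiply-add can be issued every cycle.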
6.5.10 Vector Accumulator File (gxx_vaccs) 
When the coprocessor is executing long operations (32-bit elements) or instructions that
access the accumulator, one or two vector operands are read from the vector accumulator
file. The vector accumulator file is a two-dimensional storage array parameterised as to
the number of accumulators and their width. Each accumulator always holds a number of
32-bit elements equal to half the maximum vector length (VLMAX), and there
are ACC_NUMBER vector accumulators. In this particular instance of the architecture
the number of vector accumulators is set to 2 but can increase up to 32, as the available
opcode bits allow, for the multiply-add and multiply-sub operations.
Figure 6-30: Electrical Interface of Vector Accumulator File
The only restriction is that the remaining long operations can use only accumulator zero
and accumulator one as source operands, as these are hardwired to the VREG stage for
source operand selection. The vector accumulator file implementation is flip-flop-
based and has two asynchronous read ports and one synchronous write port. In addition,
there are two extra hardwired read ports on the accumulators that are used in the
VREG stage to retrieve the source operands when an accumulator source is specified. The
accumulator file is physically located in the second stage of the VDP and its electrical
interface is depicted in Figure 6-30.
6.5.10.1 Parameterisation
The vector accumulator file is a fully-configurable design. The number of accumulators
(ACC_NUMBER) is within the range of 2 to 32, with a default setting of 2. The
accumulator width (ACC_WIDTH) is always equal to half the maximum vector length
(VLMAX), in 32-bit elements. The compile-time parameters that specify the structure of the
vector accumulator, together with their default values, are defined in gxx_config.vhd and are
listed in Table 6-4.
Table 6-4: Compile-time vector accumulator file parameters for its architectural and
microarchitectural state that are contained in the gxx_config.vhd file
Parameter      Default
VLMAX          2, 4, 8, 16, 32, 64, 128
ACC_NUMBER     2
ACC_WIDTH      (VLMAX/2)*32
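As a worked example of the relationships in Table 6-4, the small C sketch below derives the accumulator geometry from VLMAX and ACC_NUMBER. The function names are hypothetical, and the validity check assumes that the VLMAX values listed in the table (powers of two from 2 to 128) are the only legal settings.

```c
/* Returns 1 if vlmax is one of the values listed in Table 6-4
 * (assumed here to be the complete set of legal settings). */
int vlmax_is_valid(unsigned vlmax)
{
    return vlmax >= 2 && vlmax <= 128 && (vlmax & (vlmax - 1)) == 0;
}

/* Number of 32-bit elements held by one accumulator. */
unsigned acc_elements(unsigned vlmax)
{
    return vlmax / 2;
}

/* ACC_WIDTH in bits: (VLMAX/2)*32. */
unsigned acc_width_bits(unsigned vlmax)
{
    return (vlmax / 2) * 32;
}

/* Total flip-flop storage of the accumulator file in bits. */
unsigned acc_file_bits(unsigned vlmax, unsigned acc_number)
{
    return acc_number * acc_width_bits(vlmax);
}
```

For VLMAX 16 and the default ACC_NUMBER of 2, each accumulator holds eight 32-bit elements (256 bits) and the file stores 512 bits in total.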
6.5.10.2 The vector accumulator implementation
In the vector accumulator file, both read addresses (reg_st1.ctrl.vacc1_rdaddr_r,
reg_st1.ctrl.vacc2_rdaddr_r) and read-enable strobes
(reg_st1.ctrl.vacc1_rden_r, reg_st1.ctrl.vacc2_rden_r) arrive
pipelined from the vector decode stage, as does the write address
(reg_st1.ctrl.vacc_waddr_r). The write enable (vacc_wen) is set when the
register enable (reg_en) of this stage is set, in order to implement the synchronous write
when the result is ready at the end of the second VDP stage. The accumulator write data
(vacc_data) come from the second vector add unit when a multiply-add/sub
operation is performed, from the VLSU in the case of a load, or from the adder tree.
Figure 6-31: Write data and write-enable selection logic for the vector accumulator file
Otherwise, the write data originate from the masked vector output (st1_result) of
VDP1. The masked value is formulated with the use of masking logic to implement the
pair of instructions, even and odd, for the deposit high (amount 16) and deposit low (amount
0) operations, and it is described in more detail in section 6.5.6. In the case of other long
instructions the vector output is unchanged. Figure 6-31 illustrates the selection logic for
the write data and write-enable for the vector accumulator file.
6.5.11 Vector Adder Tree (gxx_adder_tree)
The adder tree is utilised in the vaccreduce instruction, in which all the scalar elements
in an accumulator are add-reduced to a final 32-bit result. The adder tree is a
parameterised two-dimensional matrix of adders, log2(VLMAX) rows deep, with
VLMAX/2 adders at row zero, a number that is halved in every subsequent row. Figure 6-32
shows an adder tree configuration for a vector length of 256 bits (VLMAX 16). At
the beginning (row zero) there are four (VLMAX/4) adders that perform 32-bit addition.
The 33-bit results are added in pairs by the two adders that comprise the second row
(row 1). The two 34-bit results are added to each other to form the final 35-bit result,
whose least-significant 32 bits are passed to the output (adder_tree_out) of the adder
tree.
Figure 6-32: Adder tree configuration for VLMAX 16
The input operand (adder_tree_in.data.vrf_opr1_r) is masked to the current
vector length (adder_tree_in.data.vlen_cvalue_r) so that only the necessary
vector elements are processed, as the remaining vector elements up to VLMAX are set to
zero. This is done to reduce power consumption, as the unused flops do not
switch.
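The masking and row-by-row reduction described above can be modelled in C as follows. This is a behavioural sketch under stated assumptions rather than the RTL: the accumulator is assumed to hold a power-of-two number of 32-bit elements (eight for VLMAX 16, matching the four adders of row zero), the intermediate sums are kept in 64-bit variables to stand in for the growing 33-, 34- and 35-bit adder outputs, and the names adder_tree_reduce and MAX_ELEMS are hypothetical.

```c
#include <stdint.h>

#define MAX_ELEMS 64u  /* assumed upper bound: VLMAX/2 for VLMAX = 128 */

/* acc:    32-bit accumulator elements
 * nelems: number of elements in the accumulator (power of two, <= MAX_ELEMS)
 * active: number of elements inside the current vector length
 * Returns the least-significant 32 bits of the full-precision sum. */
uint32_t adder_tree_reduce(const int32_t *acc, unsigned nelems, unsigned active)
{
    int64_t row[MAX_ELEMS];

    if (nelems == 0 || nelems > MAX_ELEMS)
        return 0;

    /* Mask to the current vector length: elements beyond it are forced to zero. */
    for (unsigned i = 0; i < nelems; i++)
        row[i] = (i < active) ? (int64_t)acc[i] : 0;

    /* Each pass of the outer loop models one row of adders; the number of
     * adders halves from one row to the next. */
    for (unsigned width = nelems; width > 1; width /= 2)
        for (unsigned i = 0; i < width / 2; i++)
            row[i] = row[2 * i] + row[2 * i + 1];

    /* Only the low 32 bits of the final sum reach adder_tree_out. */
    return (uint32_t)(uint64_t)row[0];
}
```

With eight elements the outer loop runs three times (four, two and one additions), which corresponds to rows 0 to 2 of Figure 6-32.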
6.5.12 VLSU unit interface with VDP2 
When a load instruction is performed (reg_st1.ctrl.sel_vu_r=vload) the
requested data from the memory is returned, if valid (vlsu2vdp.data_valid='1'), by
the VLSU in the second VDP stage as shown in Figure 6-14. As previously mentioned,
the VLSU has a cascade TAG/DATA configuration which translates to a minimum
2-cycle load/use latency if no cache miss takes place. The returned data (vlsu_res) has a
vector length of VLMAX*16 bits for a vector load, or 32 bits zero-extended to VLMAX*16
bits for a scalar load.
6.5.13 Overflow and Predicate Flags 
At the end of the first VDP stage a multiplexer selects the result overflow (ovf_st1)
from the vector unit that executed the coprocessor operation. In the case of a
miscellaneous or shift operation, the overflow takes the pipelined value of the overflow
register (vregs2vdp.data.vrf_ovf_r), as no new overflow value is produced by
either operation. In the second VDP stage another multiplexer selects the overflow
(ovf_st2) from either the latched overflow of the previous stage
(reg_st1.data.vrf_ovf_r) or, in the case of a multiply-add/sub operation, the overflow
produced by the second vector adder (vadd2_out.vovf). The write enable signal
(vdp2vregs.ctrl.ovf_wen) for the overflow flag/register is asserted when the
instruction is valid and no exception is detected (reg_en='1'), and it is pipelined along
with the overflow value to the VREG stage to update the overflow flag/register. A
predicate value is produced only in the first VDP stage, by the vector adder unit (vadd),
in the case of a comparison instruction. At the end of the first stage a multiplexer selects
the predicate value (pred_st1) from either the pipelined value of the predication register
(vregs2vdp.data.pred_r) or the predicate result of the vadd. In the second stage the
latched predicate value (reg_st1.data.pred_r), along with the write enable
(vdp2vregs.ctrl.pred_wen), is sent to the VREG stage in order to update the
predicate register. The write enable of the predicate register is controlled by the same
conditions that apply to the overflow flag/register write enable signal.
6.5.14 Bypassing network of the second stage 
Prior to being written back to the register files, the results (vdp2vregs.data.vbpass2_res
and vdp2vregs.data.sbpass2_res) are forwarded again to the VREG stage as inputs
to the bypass process for source operand selection. The valid signals
(vdp2vregs.ctrl.vbpass2_valid and vdp2vregs.ctrl.sbpass2_valid) that
are sent along with the bypassed results and the target register write addresses are
asserted when the instruction is valid and no exception has been detected in the previous
stages (reg_en='1'). Again, the current vector length
(vdp2vregs.data.vbpass2_vlen_r) is sent to determine which part of the source
operand will contain the forwarded vector result.
6.5.15 Write Back 
This is the final stage prior to committing a result to the vector or scalar register files or
the vector accumulators. This stage is actually incorporated at the end of the second VDP stage.
It includes combinational logic that selects among the results of the operations that took place in
the second stage of the vector datapath and the results from the previous stage. The
result can thus be derived from the VLSU (load operation), from the accumulator
file (L_mac/L_msu operation), from the adder tree unit (vaccareduce operation) or from the
registered result of the first stage of the vector datapath. The pipelined addresses and
write enables are sent to the VREG stage to select the destination registers for the results.
6.6 Output Register Bunch 
At the end of each stage of the vector coprocessor pipeline there are output registers
which hold the control and data signals that enter the following stage. The signals
for all the pipeline stages are listed in detail in Appendix B.
6.7 Leon3 
As discussed in the previous chapter, the vector coprocessor is tightly-coupled to the
Leon3 32-bit CPU, which was chosen as the base-case CPU. A number of modifications
were made to the Leon3 pipeline in order to attach the vector coprocessor and its control
and data channels. These changes are described in the following paragraphs, in the order
in which they appear in the stages of the Leon3 pipeline.
6.7.1 Decode Stage 
In the Decode stage the latched instruction from the Fetch stage (de_inst) is decoded in
parallel by both the CPU and additional combinational logic that checks whether or not the
current instruction is for the vector processor. As mentioned above, the instruction
opcode that is targeted at the vector processor is the one embedded in the lower 22 bits
of the UNIMP instruction (Figure 6-3). The additional combinational logic checks bits
31:30 and 24:21 of the latched instruction and, if they are equal to zero, an opcode valid signal
(v.a.opc_valid) is asserted and bits 21:0 are pipelined as the coprocessor opcode
(v.a.opc). In the case of a move data instruction from Leon3 to the coprocessor (mvsr2gpr
or mvvr2gpr) a data enable signal (v.a.vcop_data_en) is also asserted. Additionally,
in this stage the addresses of the source and destination operands are extracted from
the latched instruction in parallel with decoding. This allows concurrent access to the
register file in order to prepare the operands for the next stage. When the address of the
first source operand is calculated, additional logic selects the field from which to extract it,
depending on whether or not the instruction is a move from the main CPU to the coprocessor.
Similar combinational logic selects the destination address and sets the write enable
signals according to whether or not a move instruction from the coprocessor to the CPU has
been decoded.
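The opcode-detection step can be illustrated with a short C sketch. It assumes the standard Sparc V8 format-2 field layout, in which the op field occupies bits 31:30, the op2 field occupies bits 24:22 and the 22-bit constant occupies bits 21:0; UNIMP is the encoding with both op and op2 equal to zero. The type and function names are hypothetical, and the exact bit range tested in the RTL may differ from the one used here.

```c
#include <stdint.h>

/* Hypothetical result of the coprocessor pre-decode. */
typedef struct {
    int      opc_valid;  /* 1 when the instruction has the UNIMP format */
    uint32_t opc;        /* 22-bit coprocessor opcode (bits 21:0) */
} vcop_decode_t;

vcop_decode_t vcop_decode(uint32_t inst)
{
    vcop_decode_t d;
    uint32_t op  = (inst >> 30) & 0x3u;  /* Sparc V8 op field  */
    uint32_t op2 = (inst >> 22) & 0x7u;  /* Sparc V8 op2 field */

    d.opc_valid = (op == 0u && op2 == 0u);
    d.opc = d.opc_valid ? (inst & 0x3FFFFFu) : 0u;
    return d;
}
```

For example, the word 0x00012345 is recognised as a coprocessor instruction with opcode 0x12345, whereas 0x01000000 (a SETHI encoding, op2 = 4) is left to the Leon3 decoder.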
6.7.2 Register Access stage 
In this stage the operands are read from the register file or from intermediate data bypass
networks. When a coprocessor instruction is performed, the default operation selected in
Leon3 is addition. This, in combination with the zero operands passed to the next
stage, causes Leon3 to perform a NOP operation when the executed instruction
targets the coprocessor. However, this is still a valid instruction packet and can be
interrupted like any other Sparc V8 instruction. In the case of a move instruction from the
CPU to the coprocessor, the first source operand (v.e.op1) is pipelined as data input
(v.e.leon_data) to the latter. The opcode (v.e.opc), the opcode valid
(v.e.opc_valid) and the data enable (v.e.vcop_data_en) signals for the
coprocessor are pipelined to the next stage if there is no exception and the opcode valid of
the previous stage (r.a.opc_valid) is asserted. In addition, in the exception_detect
process, VCOP logic was added to deactivate the illegal_inst signal when the
Leon3 decoder detects the UNIMP format, which is the case for a coprocessor instruction.
This ensures that all UNIMP opcodes are "hijacked" and passed to the coprocessor for
execution.
6.7.3 Execute Stage 
In the Execute stage all the arithmetic, logical, shift and miscellaneous operations are
performed, along with the load/store address calculation. When a coprocessor instruction
is executed the source operands are set to zero. Therefore, the Leon3 performs an
addition with zero operands, which emulates a nop instruction. As in the
previous stage, the coprocessor signals (v.m.opc, v.m.opc_valid,
v.m.vcop_data_en) are pipelined to the next stage if there is no exception and the
opcode valid of the previous stage (r.e.opc_valid) is asserted.
6.7.4 Memory Stage 
In the Memory stage the data cache is accessed and the store operation is performed. It is
at this stage that the vector coprocessor is attached to the Leon3 pipeline, in order to avoid
the majority of the exceptions and interrupts of the Leon3 and to have enough time to
transfer data to/from the main processor (write stage) if requested. Therefore, when a
coprocessor instruction is performed, there is no exception and the opcode valid of the
previous stage (r.m.opc_valid) is asserted, the vector/scalar instruction
(iu2vcop_opc) along with the valid signal (iu2vcop_opc_valid) is sent to the decode
stage of the vector coprocessor. In addition, the other control signals (v.x.opc_valid
and v.x.vcop_data_en) are pipelined to the next stage.
6.7.5 Exception Stage 
In this stage, all the traps and interrupts are resolved and the data are aligned for the data
cache read. Even though the full functionality of Leon3 supports a single-issue, seven-stage
pipeline, in this application the write back (7th stage) is not implemented and the outputs
from the exception stage go straight to the register file.
Figure 6-33: Leon3 integer unit and vector coprocessor datapath diagram
The write data, prior to being committed to the register file (rfi.wdata), is the output of a final
multiplexer which selects the result from the main CPU (xc_result) or, in the case of a
move instruction from the VCOP to the Leon3, the coprocessor data (leon_din) from the
VREG stage. The latter is selected only if there is no exception and the opcode valid of
the previous stage (r.x.opc_valid) is asserted. The detailed schematic of the Leon3
with the attached vector coprocessor is illustrated in Figure 6-33.
The vector processor was added to the proc3 hierarchy and connected to the interface
of the integer unit. This hierarchy contains the Leon3 processor core with the integer unit and
the complete cache sub-system with controllers and RAMs. It also comprises
the multiply and divide unit hardware. Figure 6-34 depicts the proc3 hierarchy that
includes the vector processor.
Figure 6-34: Leon3 processor core block diagram
6.8 Summary
In this chapter, the design and implementation of the vector datapath were described. The
pipeline organization and its constituent components were presented along with a brief
description of the VLSU. In addition, the modifications to the Leon3 pipeline to enable its
tight coupling to the vector processor were detailed.
6.9 References 
[1] S. R. Parr, "High Performance Load/Store Unit for a highly configurable,
embedded vector processor," in Electronic and Electrical Engineering:
Loughborough, 2007.
[2] "The Sparc Architecture Manual Version 8", www.sparc.com.
[3] J. L. Hennessy and D. A. Patterson, "Computer Architecture: A Quantitative
Approach," 3 ed: Morgan Kaufmann, 2003.
[4] S. Furber, "ARM: System-on-Chip Architecture," Second ed: Addison-Wesley,
2000, pp. 80-81.
[5] C. Kozyrakis, "Scalable Vector Media-processors for Embedded Systems," in
Computer Science University of California: Berkeley, 2002.
[6] S. R. Parr, K. Koutsomyti, and V. A. Chouliaras, "A High Bandwidth
Configurable Load/Store Unit for an Embedded Vector Processor," in
Postgraduate Workshop on Embedded Systems Birmingham, UK, 2006.
[7] P. A. Beerel, S. Kim, P.-C. Yeh, and K. Kim, "Statistically optimized
asynchronous barrel shifters for variable length codecs," in International
symposium on Low power electronics and design, San Diego, California, 1999,
pp. 261-263.
CHAPTER 7 
VECTOR PROCESSOR VLSI IMPLEMENTATION 
7.1 Design Verification 
The vector datapath was verified using test vectors that were produced by recording the
input operands, the state of the global overflow flag and the output results from each of the
C macros that implement the basic operations. The recording was performed by
inserting pre-processor directives into every basic operation in their definition file, as
shown in the code snippet of Figure 7-1. The figure depicts the C macro of the basic
operation L_mult; as can be seen, the pre-processor directive (#ifdef
GEN_TVEC_L_MULT) uniquely identifies the operation under test and
selects the inputs and the outputs for recording, which are then piped to a file. The whole
process was controlled by a Perl script.
Word32 L_mult(Word16 var1, Word16 var2)
{
    Word32 L_var_out;
#ifdef GEN_TVEC_L_MULT
    int Overflow_in = Overflow;   /* overflow flag on entry, recorded as an input */
#endif
    L_var_out = (Word32)var1 * (Word32)var2;
    if (L_var_out != (Word32)0x40000000L) {
        L_var_out *= 2L;
    }
    else {
        /* the product cannot be doubled: saturate and set the global flag */
        Overflow = 1;
        L_var_out = MAX_32;
    }
#ifdef GEN_TVEC_L_MULT
    /* record inputs, overflow-in, result and overflow-out as one test vector */
    fprintf(tv, "%x,%x,%x,%x,%x\n", var1, var2, Overflow_in, L_var_out, Overflow);
#endif
    return (L_var_out);
}
Figure 7-1: Example of recording the inputs and the outputs of the L_mult operation C
macro
The two scripts for producing test vectors for the speech coding algorithms run the
workloads on the architecture-level simulator for all the ITU-supplied bitstreams.
The produced test vectors are subsequently applied to the vector datapath via an FLI-based
testbench. The testbench is a self-contained VHDL model in a testing system and is
designed to perform an automatic sequence of operations to validate the functionality of a
design-under-test. The latter is instantiated and driven with a long sequence of
test vectors created during the normal execution of the ITU-T workloads. These vectors are
imported into the testbench and read by a Foreign Language Interface (FLI)-based
stimulus process. The FLI provides a way for software components written in a high-level
language, such as C, to interact with components written in VHDL or Verilog. In this
particular case, the FLI allows the C code, which reads in the test vectors from the
stimulus file, to be used within the VHDL simulation environment, and allows each variable
in the test vector to drive the correct signals of the VHDL testbench. The designs were
simulated and verified with Mentor Graphics' ModelSim [2]. This software package
allows for the event-driven simulation of a VHDL or Verilog design and performs a direct
comparison between the outputs from the design-under-test and the expected (golden)
results, which are also stored in the stimulus file. From the comparison an error report is produced
that is used to validate the functionality of the design-under-test.
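The reading and checking of one test vector can be sketched in plain C as follows. The sketch assumes the five-field hexadecimal line format written by the fprintf call of Figure 7-1 (operand 1, operand 2, overflow in, result, overflow out); the structure and function names are hypothetical, and no FLI calls are shown, since the actual interface code (tdp_init.c) is not reproduced here.

```c
#include <stdio.h>
#include <stdint.h>

/* One recorded test vector, kept as raw bit patterns. */
typedef struct {
    uint32_t var1, var2;  /* input operands */
    uint32_t ovf_in;      /* overflow flag before the operation */
    uint32_t result;      /* expected (golden) result */
    uint32_t ovf_out;     /* expected (golden) overflow flag */
} tvec_t;

/* Parse one "%x,%x,%x,%x,%x" line; returns 1 on success, 0 on a malformed line. */
int tvec_parse(const char *line, tvec_t *tv)
{
    unsigned v1, v2, oin, res, oout;

    if (sscanf(line, "%x,%x,%x,%x,%x", &v1, &v2, &oin, &res, &oout) != 5)
        return 0;
    tv->var1 = v1;
    tv->var2 = v2;
    tv->ovf_in = oin;
    tv->result = res;
    tv->ovf_out = oout;
    return 1;
}

/* Compare the design-under-test outputs with the golden values;
 * returns 1 on a match, 0 on a mismatch (which would go to the error report). */
int tvec_check(const tvec_t *tv, uint32_t dut_result, uint32_t dut_ovf)
{
    return tv->result == dut_result && tv->ovf_out == dut_ovf;
}
```

A stimulus process built on these helpers would parse a line, drive var1, var2 and ovf_in onto the unit inputs, and pass the observed outputs to tvec_check.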
Figure 7-2: Testbench for the vector mult unit of the vector datapath
The functionality is based on the specifications imposed on the design and can be
confirmed by producing the expected results. The vector datapath testbench consists of
four testbenches, one for each vector unit. Each such testbench was designed for a
particular datapath block and its functionality was validated on a per-workload basis.
Figure 7-2 shows the configuration of the testbench for the vmult unit. It comprises the
clock process, the stimulus process (tester_mult) that reads the test vector from the
stimulus file via the FLI (tdp_init.c), the vmult unit (design-under-test:
gxx_mult_dp) and the check process that compares the outputs of the
vmult unit with the expected golden results from the FLI. The testbenches for the other
vector units of the datapath have similar configurations. This kind of verification is called
block-level verification. After the block-level verification, system-level verification was
performed. Figure 7-3 depicts the configuration of the testbench for the overall design of
the vector coprocessor.
Figure 7-3: Vector coprocessor testbench configuration
The full coprocessor testbench consists of the clock process that generates the periodic 
clock signal, a hardwired stimulus process and the VHDL simulator output. The 
hardwired stimulus process drives the inputs of the vector coprocessor interface and the 
produced response is observed on the VHDL simulation environment. 
7.2 Synthesis and Place & Route Design Flow 
The design flow of the vector coprocessor is completed via a fully-automated
synthesis/place-and-route campaign. This process is driven by a grand (master)
script whose pseudocode is depicted in Figure 7-4. This script runs Design Compiler for
logical synthesis (statement S9), Cadence SoC Encounter for place and route (S10) and
again DC (S11) for statistical power analysis. These are performed for different vector
lengths (VLMAX) and different periods in order to obtain a complete view of the vector
coprocessor.
Main driver script
{
S3   for each VLMAX
     {
S5       for each period
         {
             Change period;
S8           Modify processor configuration;
S9           DC run1: Logical Synthesis;
S10          Encounter run: Place and Route;
S11          DC run2: Power Analysis;
         }
     }
} end;
Figure 7-4: Pseudocode of the script for the design flow of the vector coprocessor
These steps of the master script and the produced results are described in more detail in 
the following sections. 
7.2.1 Design Compiler Stage (Logical Synthesis) 
After design verification the next step is the synthesis phase. Synthesis is the
automatic transformation of a Register Transfer Level (RTL) design description into a
gate-level netlist implementation. The synthesis process takes as inputs the RTL HDL
description, timing constraints and attributes for the design, and a technology library, and
produces a fully-mapped gate-level netlist. Synthesis is an iterative process that starts by
defining the constraints for each RTL block of the design and optimising the gate-level
netlist for area, timing and power [1]. The synthesis tool that was used for the coprocessor
design is the industry-standard Synopsys Design Compiler (DC) [3]. The target
technology chosen for the design was Taiwan Semiconductor Manufacturing
Company's (TSMC) 0.13µm standard-cell library (1 Poly, 8 Copper) [4]. Using this
technology, each design was synthesised varying both VLMAX and target clock
frequency (period). The design constraints that contain the timing and the area
information are defined in the Design Compiler's TCL (Tool Command Language)-driven
script. This script is used to guide the synthesis and optimization process of the design
with the ultimate aim of meeting the user-specified constraints. The outputs of the DC run
are the design timing constraints in Synopsys Design Constraints (*.sdc) format, in addition to
the new netlist representing the mapped and optimised design.
7.2.2 SoC Encounter script Stage (Place and Route) 
After logical synthesis with Design Compiler, the place-and-route
Encounter script is run. This script drives the place and route process, which produces the
necessary files for statistical power analysis. The script starts by running Cadence First
Encounter (FE) [5] in batch mode and by reading in the physical view of the RAMs and
the library along with their timing view. The optimised Verilog netlist (*.v) from the
previous stage and the Synopsys Design Constraints (*.sdc) file that specifies the timing
constraints are then imported. The place and route tool performs floor planning, power
grid specification (power/ground ring and stripes), placement of RAM macros and
standard cells, and clock tree synthesis. These are followed by global and detail routing
(multithreaded mode), extraction of RC data and post-clock-tree-synthesis timing
optimization to fix the setup time. This is achieved by the tool inserting buffers or
inverters into the setup-violating paths and performing gate resizing (including flip-flops) and
instance cloning. After that step, filler cells (dummy cells) are added to fill the area
between the placed and routed standard cells and to connect their VDD and VSS rails to the
power ring. When the final layout is ready it needs to be checked against the Verilog
netlist (Layout vs. Schematic, LVS). In addition, a Design Rule Check (DRC) takes place,
which verifies that the final layout obeys the design rules of the technology library. The
outputs from this stage include area and maximum frequency reports for the design, along
with path delays, timing constraint values, interconnect delays in a standard delay format
(*.sdf) file, a standard parasitic extraction format (*.spef) file and a new gate-level netlist
representing the final placed-and-routed design. These outputs are then read back
into DC for the final stage of statistical power analysis.
7.2.3 Statistical Power Analysis Stage (Design Compiler) 
Power analysis in statistical mode is run immediately after the end of place and route. The
post-place-and-route Verilog netlist is loaded along with the timing constraints (*.sdc)
and the standard parasitic extraction format (*.spef) file. When the statistical power analysis
is performed, several files are created which report the average power dissipation, area,
worst IR drop, etc. The final results for the vector datapath, the coprocessor and the
overall system are presented in the following sections.
7.3 Implementation Campaign for Vector Datapath 
For the vector datapath three different metrics were obtained, namely power, area and
maximum operating frequency (fmax). Figure 7-5 depicts the statistical power consumption
observed for varying VLMAX and clock period of the vector datapath design. Here each
requested period is plotted against its corresponding power. From this set of results it
is clear that the curves have a similar general shape. It can be seen that as
VLMAX increases the power increases proportionally. This is due to the fact
that the higher the VLMAX, the higher the physical number of gates placed on
the silicon. This rise in the number of gates inevitably leads to an increase in
power consumption. In addition, this plot reveals how the consumed power has a direct
relationship with the speed at which the design can operate. As the design is pushed to
operate at higher frequencies, the power dissipated increases as
well. This is an expected result, as the design's clock frequency affects the number of
switching gates, thus leading to a rise in dynamic power.
[Plot: Statistical Power Results; x-axis: Requested Frequency (MHz); series: vlmax4, vlmax8, vlmax16, vlmax32]
Figure 7-5: Statistical power results of vector datapath for different vector lengths
Another interesting observation is the significant difference in power consumption that is
observed across all VLMAX values and maximum frequencies. At a vector length of 4 and a
frequency range of 100 to 333 MHz the vector datapath power consumption ranges from
9.94 to 82.17 mW, whereas for vector length 32 over the same frequency range the power
consumption ranges from 33.69 to 272.96 mW. This can be seen as a fairly constant
threefold increase in power consumption. In addition to studying the power consumption
of each design methodology, the physical area of each design was recorded. Figure 7-6
shows how this area changes for different values of VLMAX and at different frequencies
(periods) for the vector datapath design.
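The threefold figure quoted above for the power consumption can be checked directly from the four values given in the text. The short C sketch below does this arithmetic; it assumes that the lower and upper bounds of each quoted power range correspond to 100 MHz and 333 MHz respectively, and the type and function names are hypothetical.

```c
/* Power range of the vector datapath for one vector length (mW),
 * taken from the values quoted in the text. */
typedef struct {
    double p_100mhz_mw;  /* assumed to be the power at 100 MHz */
    double p_333mhz_mw;  /* assumed to be the power at 333 MHz */
} power_range_t;

static const power_range_t VLMAX4  = {  9.94,  82.17 };
static const power_range_t VLMAX32 = { 33.69, 272.96 };

/* Ratio of VLMAX 32 power to VLMAX 4 power at the low end of the range. */
double power_ratio_low(void)
{
    return VLMAX32.p_100mhz_mw / VLMAX4.p_100mhz_mw;
}

/* Ratio of VLMAX 32 power to VLMAX 4 power at the high end of the range. */
double power_ratio_high(void)
{
    return VLMAX32.p_333mhz_mw / VLMAX4.p_333mhz_mw;
}
```

Both ratios come out between 3.3 and 3.4, which supports describing the increase from VLMAX 4 to VLMAX 32 as fairly constant and roughly threefold.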
[Plot: Post-Synthesis Area (no wireload); x-axis: Requested Frequency (MHz); series: vlmax4, vlmax8, vlmax16, vlmax32]
Figure 7-6: Statistical area results of vector datapath for different vector lengths
As can be observed from the graph, the required area for a given VLMAX shows a
marginal change over the studied frequency range. The vast majority of the silicon area
within the chip is used by the logic gates that implement the functionality of the design. As
the frequency requirement increases, various synthesis optimization methods are
automatically applied to allow the design to operate at this higher frequency. These
methods often lead to an increase in silicon area, as they employ faster and larger buffers
for timing optimization of the critical paths and consequently affect the whole system
layout. All these methods for pushing the design to achieve ever-increasing speeds have
an adverse effect on both power and area. Another observation that can be made from the
above graph is that the area of the device is directly related to the vector length (VLMAX).
This is due to the effect of the vector length on the quantity of design logic, as each
increase in VLMAX involves additional vector element instantiations. The increase in area
required for a higher vector length completely overshadows the increase due to
operation at higher frequencies. The effect of the operating frequency on the area of a
device is much less apparent than the effect of the dramatic increase in logic required for each
change in VLMAX. For these reasons the graph shows a near-parallel set of lines for
the vector datapath.
[Plot: Post-Synthesis Frequency Results; x-axis: Requested Frequency (MHz); series: vlmax4, vlmax8, vlmax16, vlmax32]
Figure 7-7: Frequency results of vector datapath for different vector lengths
Figure 7-7 illustrates the maximum achievable frequency against the requested frequency
for different vector lengths. It is observed that the relationship between the achieved
frequency and the requested frequency is near-linear for frequencies up to 333 MHz and vector
lengths from 4 to 16. For VLMAX 32 and frequencies below 200 MHz the same
near-linear relationship is observed. As the requested frequency is increased above 200 MHz,
the achieved frequency becomes more unpredictable due to the enormous size of the
netlist optimised by DC in a top-down mode.
7.4 Implementation Campaign for Vector Coprocessor 
The statistical power analysis was performed for the vector coprocessor as a whole. This
includes the vector datapath (previous section) and the VLSU, which is the subject of a
parallel project addressing the design of the Vector Load/Store Unit of the processor. Figure
7-8 illustrates the power consumption for different VLMAX and periods (frequencies).
From the results it can be seen that as the requested frequency increases (that is, as the
requested period decreases) the amount of power dissipated increases proportionally. This
direct relationship is due to the number of switching gates and their size, as the latter is
affected by increasing the requested clock frequency. The higher the frequency, the more
often the capacitive load is switched, which leads to the rise in the dynamic power.
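The proportionality described above is what the standard first-order model of CMOS dynamic power, P = alpha * C * Vdd^2 * f, predicts. The following is a minimal illustrative sketch, not part of the original analysis: the activity factor, switched capacitance and supply voltage are placeholder assumptions, and only the ratio between the two operating points is meaningful. With everything else held constant, moving from a 20 ns period (50 MHz) to a 3 ns period (333 MHz) scales dynamic power by the ratio of the frequencies, which is close to the 6.6-fold increase reported for the coprocessor.

```python
def dynamic_power(alpha, c_farads, vdd, f_hz):
    """First-order CMOS dynamic power in watts: P = alpha * C * Vdd^2 * f."""
    return alpha * c_farads * vdd ** 2 * f_hz


def period_ns_to_hz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in hertz."""
    return 1.0e9 / period_ns


# Placeholder values (assumptions for illustration, not measured on the coprocessor).
ALPHA = 0.1        # average switching activity factor
C_TOTAL = 1.0e-9   # total switched capacitance in farads
VDD = 1.2          # supply voltage in volts

p_slow = dynamic_power(ALPHA, C_TOTAL, VDD, period_ns_to_hz(20.0))  # 50 MHz
p_fast = dynamic_power(ALPHA, C_TOTAL, VDD, period_ns_to_hz(3.0))   # about 333 MHz
ratio = p_fast / p_slow

print(f"dynamic power ratio, 3 ns vs 20 ns period: {ratio:.2f}")
```

The model deliberately ignores leakage and the extra buffering that synthesis inserts at tighter timing targets, so it should be read as a lower bound on how power grows with the requested frequency.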
[Figure: Statistical Power Results; series: vlmax4, vlmax8, vlmax16]
Figure 7-8: Statistical power results of vector coprocessor for different vector lengths
For different VLMAX the graph shows a marginal change in the dissipated power. This
is because the VLSU is much larger than the vector datapath and
consequently dominates the power dissipated. At low requested periods the difference in
statistical power between VLMAX 4 and VLMAX 16 is approximately 12.3%, which seems
constant over the frequency range. The statistical power results were obtained up to
VLMAX 16 and reveal that the power consumption increase is a fairly constant 6.6-fold
for the period range of 20 ns (50 MHz) to 3 ns (333 MHz). No results were obtained for
higher VLMAX as the design was too large for the synthesis run to complete
successfully. Apart from the power consumption, the physical area of the vector
coprocessor design was also recorded. Figure 7-9 depicts the area for different values of
VLMAX and for various requested periods (frequencies) for the vector coprocessor
design. Again the required area for a given VLMAX shows a marginal change over the whole
frequency range, as the silicon area is proportional to the number of gates that perform
the functionality of the design. As the frequency increases, however, there is a slight rise in
the silicon area as the timing optimization methods affect the whole design layout.
[Figure: Post-Synthesis Area (no wireload); x-axis: Requested Frequency (MHz); series: vlmax4, vlmax8, vlmax16]
Figure 7-9: Statistical area results of vector coprocessor for different vector lengths
Additionally, from the graph it can be seen that the area is directly related to the vector
length (VLMAX). This was expected, as the vector length affects the number of
functional units in the vector datapath along with the size of the register files and the
VLSU. The last graph, in Figure 7-10, illustrates the achievable frequency against the
requested frequency for different vector lengths of the whole vector coprocessor. As
can be seen, the achievable frequency matches or even exceeds the requested frequency for
frequencies up to 200 MHz and vector lengths from 4 to 16. For higher frequencies,
however, logical synthesis is unable to achieve the requested frequency, an effect
exacerbated at higher vector lengths. This is due to the increased design size, which cannot
be handled efficiently by the synthesis tool.
[Figure: Post-Synthesis Frequency Results; x-axis: Requested Frequency (MHz); series: vlmax4, vlmax8, vlmax16]
Figure 7-10: Frequency results of vector coprocessor for different vector lengths
From the graph it can be observed that the maximum operational frequency for the vector
coprocessor is 256 MHz for vector lengths up to 8 and 208 MHz for a vector length of 16.
These figures fall well within the acceptable range of high performance industrial-level
ASIC design for the given silicon technology.
7.5 VLSI Layout 
The following sections present the resulting VLSI macrocells along with their physical
characteristics for the Vector Datapath and the Vector Processor designs respectively.
7.5.1 Vector Datapath Layout for VLMAX 16
The vector datapath with VLMAX=16 (256-bit length) was taken through the full front
end (logical synthesis) and back end (place and route) flows. The design was read
into Synopsys Design Compiler and synthesized for a target frequency of 250 MHz,
targeting the TSMC 0.13 µm (1 Poly, 8 Copper) process. A top-down flow and no
wireload models were used. This flow was chosen as our experience shows that the back
end tool (Cadence SoC Encounter) is capable of very advanced netlist re-synthesis, thus
making the use of a front end wireload model unnecessary. After synthesis the optimized
netlist of the vector datapath with length 256 bits was imported into SoC Encounter and
the flat physical flow was carried out. The physical characteristics of the VLSI cell are
given in Table 7-1.
Table 7-1: VLSI layout physical parameters for VDP with VLMAX 16
Parameter        Value
X dim (µm)       1010
Y dim (µm)       1010
Area (mm sq)     1.02
Cells (RAMs)     63945 (7)
Cell rows        279
Speed (MHz)      186.2
The VLSI results show a worst case (0.9 V, 125 C) maximum frequency of 186.2 MHz
post-route. The achieved frequency is well within the domain of high performance
implementations of wide parallel processors.
It is anticipated that further work at the back end will result in a substantially faster cell.
The power consumption of the cell, based on statistical activity (not workload-based), is
also moderate, at 61.3 mW when optimized for a 4 ns period. The design includes
approximately 64K gates and 7 RAM macros in 279 standard cell rows. The cell area is
1.01 mm by 1.01 mm. The resulting VLSI macrocell is shown in Figure 7-11.
[Figure 7-11]
7.5.2 Vector Datapath Layout for VLMAX 32 
The same methodology was followed for the vector datapath with VLMAX=32 (512-bit
length). The design was synthesized for a target frequency of 200 MHz, targeting the
TSMC 0.13 µm (1 Poly, 8 Copper) process. The physical characteristics of the VLSI cell
are given in Table 7-2.
Table 7-2: VLSI layout physical parameters for VDP with VLMAX 32
Parameter        Value
X dim (µm)       1801
Y dim (µm)       1800
Area (mm sq)     3.24
Cells (RAMs)     209809 (11)
Cell rows        453
Speed (MHz)      126.7
Again no wireload models were used and the physical flow was carried out for the vector
datapath with length 512 bits. The resulting VLSI macrocell is shown in Figure 7-12.
The design achieved a much lower worst case (0.9 V, 125 C) maximum frequency of
126.7 MHz post-route when optimised for a 5 ns period. This discrepancy between
logical synthesis (200 MHz) and final post-route speed (126.7 MHz) is attributed to the very
wide datapath (512 bits), which resulted in a substantially congested VLSI macro. The
VLSI macro includes approximately 210K gates and 11 RAM macros in 453 standard cell
rows. The cell area is 1.8 mm by 1.8 mm.
7.5.3 Vector Processor Layout for VLMAX 16 
Finally, the full vector processor (incorporating the Vector Datapath and the VLSU) with
VLMAX=16 and a Vector Data Cache configuration of 4-way, 8 Kbytes, 128-byte block
length and 2 sub-blocks per block was taken through the full front end (logical synthesis)
and back end (place and route) flows. The design was synthesized for a target frequency
of 200 MHz, targeting the TSMC 0.13 µm (1 Poly, 8 Copper) process.
Table 7-3: VLSI layout physical parameters for VCOP with VLMAX 16
Parameter        Value
X dim (µm)       1802
Y dim (µm)       3491
Area (mm sq)     6.29
Cells (RAMs)     257308 (22)
Cell rows        921
Speed (MHz)      182
A top-down flow and no wireload models were used. After synthesis the optimized netlist
was imported into SoC Encounter and the physical flow was carried out with the two
major partitions being the vector datapath and the Vector Load/Store Unit (VLSU). The
resulting VLSI macrocell is shown in Figure 7-13.
Figure 7-13: Layout for the whole vector processor (vector datapath and VLSU unit)
The physical characteristics of the VLSI cell are given in Table 7-3. The design includes
approximately 257K gates and 22 RAM macros in 921 standard cell rows. The cell area is
1.8 mm by 3.5 mm. The design achieved a worst case (0.9 V, 125 C) maximum frequency of
182 MHz post-route. This is close to the 186.2 MHz achieved by the standalone vector
datapath, which indicates that the critical path lies within the Vector Datapath.
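As a quick consistency check on the three layout tables, the sketch below recomputes each macro area from its X and Y dimensions and expresses the post-route speed as a fraction of the synthesis target. The numbers are copied from Tables 7-1 to 7-3 and the accompanying text; the dictionary layout and the function names are illustrative choices made here, not part of the original flow.

```python
# Values copied from Tables 7-1 to 7-3 and the synthesis targets quoted in the text.
LAYOUTS = {
    "VDP, VLMAX 16": {"x_um": 1010, "y_um": 1010, "target_mhz": 250.0, "post_route_mhz": 186.2},
    "VDP, VLMAX 32": {"x_um": 1801, "y_um": 1800, "target_mhz": 200.0, "post_route_mhz": 126.7},
    "VCOP, VLMAX 16": {"x_um": 1802, "y_um": 3491, "target_mhz": 200.0, "post_route_mhz": 182.0},
}


def area_mm2(x_um, y_um):
    """Rectangle area in square millimetres from side lengths in micrometres."""
    return x_um * y_um / 1.0e6


def summarize(layouts):
    """Return (name, area in mm^2, achieved/target frequency) rounded to two decimals."""
    rows = []
    for name, p in layouts.items():
        area = round(area_mm2(p["x_um"], p["y_um"]), 2)
        fraction = round(p["post_route_mhz"] / p["target_mhz"], 2)
        rows.append((name, area, fraction))
    return rows


rows = summarize(LAYOUTS)
for name, area, fraction in rows:
    print(f"{name}: area {area} mm^2, post-route speed is {fraction:.0%} of target")
```

The recomputed areas agree with the tabulated 1.02, 3.24 and 6.29 mm sq, and the frequency fractions make the congestion effect visible: the 512-bit datapath retains the smallest share of its synthesis target.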
7.6 ESL Implementation 
This section briefly discusses the SystemC-based methodology, which automatically
generates a technology independent Verilog netlist from the vector instructions of a
vectorized application. This application involves both ITU-T speech coders, G.729A and
G.723.1. The vector instruction set extensions, which were described in Chapter 5, were
formed by C-source vector macro-opcodes and were introduced to a next-generation
multi-parallel, configurable application-specific processor known as SS_SPARC. The
SS_SPARC platform, along with the ESL methodology and the statistical power analysis
results obtained from the SystemC-accelerator synthesis and the hand-coded RTL
synthesis, are presented in the following sections [6].
7.6.1 SS_SPARC Platform
SS_SPARC is a configurable, extensible chip multi-processor where each processor is a
5-issue, simultaneous multithreaded vector processor [6]. A high-level view of a 3-instance
SS_SPARC kernel is depicted in Figure 7-14.
[Figure: block diagram showing a configurable number of SS_SPARC SMT cores, a banked L2 cache and the system memory port]
Figure 7-14: High level view of a 3-instance SS_SPARC kernel
The SS_SPARC platform consists of a configurable number of SMT processing units, a
number of user-defined, loosely-coupled coprocessors, a pipelined switch matrix, and a
multi-banked, level-2 memory system with a standard AHB interface. Additionally, a
generic, transaction-level-pipelined memory interface which connects to the next
generation AMBA 3 Advanced eXtensible Interface (AXI) [7] standard is available. The
design is parameterized as to the number of SMT processing units, the number of
contexts per processor unit, the vector infrastructure, the instruction and data cache
configuration and buffering schemes, and the switch matrix configuration [6]. Figure 7-15
illustrates the schematic diagram of the superscalar pipeline of an SMT processing unit. It
comprises the Instruction Front-End (IFE), the scalar core (SCORE), the vector core
(VCORE) and the load/store unit (LSU).
Figure 7-15: Superscalar SMT pipeline organisation
The IFE consists of a configurable, multi-way instruction cache (ICache) and supplies an
instruction block (5 instructions) per cycle to the per-context instruction buffers. A
programmable arbitration mechanism is employed to select one of the non-blocked
contexts. The ICache services one block request per cycle and supports pipelined
transactions to the main memory. In case of a cache miss only the particular context is
blocked while the remainder are allowed to proceed. The employed branch predictor is
configurable as to the number of branches it can predict per cache block; it is
relatively simple, with a good prediction rate in the computationally intensive,
loop-dominated workloads of the telecom domain. After the instruction buffers there is
dispatch logic which checks the buffered instructions per processing unit to resolve data
dependences and prepare the instruction packet for execution. The instruction packet is
dispatched to the register/bypass stage in the SCORE block, for subsequent context
prioritization and transfer to the execution block [6].
The scalar core (SCORE) block consists of a number of microarchitectural units equal to the
number of supported contexts, the context selection unit (CCU) and a 3-stage pipeline
that implements the Sparc V8 ISA [8]. The instructions that were dispatched in the
previous cycle access the per-context register files. These instructions are prioritized by
the CCU and progress to the registers of the execution datapath. The datapath comprises
two 32-bit integer ALUs in a cascade configuration. Figure 7-16 illustrates the SCORE
pipeline organization [6].
Figure 7-16: Scalar core (SCORE) pipeline organization
The dual-pipeline vector core (VCORE) is highly configurable and extensible in both the
architecture (programmer's model and ISA) and the microarchitecture (width of
vector registers, number of stages of the vector pipeline, bypassing etc.), and it is the
primary DSP engine [6]. In the first pipeline, custom instructions can be easily inserted
as 'plug-in datapaths' in the vector core by using the exposed interface of the latter. The
second pipeline is dedicated to returning vector loads from the high-bandwidth LSU and
is not accessible to the system architect. As shown in Figure 7-17, the vector core
comprises the architected state (one per context), the vector bypass logic, and a
configurable number of vector execute stages for the custom datapath. In a multi-context
configuration, multiple threads access the architected state of the processor. When
there are no register or resource dependences, multiple contexts are ready to be
dispatched to the single-issue vector datapath. The CCU arbitrates between the ready CPU
contexts using a context arbitration algorithm and issues one to the vector pipeline. The
results are made available (via bypassing) to dependent vector instructions. The exposed
microarchitecture allows the system architect to design and implement custom
instructions using a number of methodologies, including RTL-based and ESL-based ones. The
interfaces that facilitate this are:
a) the Dispatch IF, which is the input interface to the user-defined vector datapath;
b) the Bypass IF, which consists of the vector result buses (one per stage), vector masks and
valid strobes that determine the bypass paths;
c) the LSU return path IF, which is the entry point of the returning vector load from the LSU;
d) the write-back IF, which is the point where the vector results produced by the vector
datapath (two per cycle) are passed to the vector register file of the specific context for
writing [6].
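The text does not specify which context arbitration algorithm the CCU uses. As a minimal sketch, assuming a simple round-robin policy (an assumption made purely for illustration; the class name `RoundRobinArbiter` and the method `select` are hypothetical and do not come from the SS_SPARC design), the selection of one ready context per cycle could be modelled as follows.

```python
class RoundRobinArbiter:
    """Behavioural model of a round-robin arbiter over a fixed number of contexts."""

    def __init__(self, num_contexts):
        if num_contexts <= 0:
            raise ValueError("num_contexts must be positive")
        self.num_contexts = num_contexts
        # Start so that context 0 has the highest priority on the first cycle.
        self.last_granted = num_contexts - 1

    def select(self, ready):
        """Return the index of the granted context, or None if no context is ready.

        `ready` holds one boolean per context: True means the context has no
        outstanding register or resource dependence this cycle.
        """
        if len(ready) != self.num_contexts:
            raise ValueError("ready must have one entry per context")
        # Search starts just after the most recently granted context.
        for offset in range(1, self.num_contexts + 1):
            candidate = (self.last_granted + offset) % self.num_contexts
            if ready[candidate]:
                self.last_granted = candidate
                return candidate
        return None


arbiter = RoundRobinArbiter(4)
# Context 1 is blocked (for example by a cache miss); the others are ready.
grants = [arbiter.select([True, False, True, True]) for _ in range(4)]
print(grants)
```

Round-robin is chosen here only because it is the simplest policy that guarantees no ready context is starved; a programmable or priority-based scheme would replace the search order inside `select`.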
Figure 7-17: Dual-pipeline vector unit organization
7.6.2 ESL Methodology 
The input to the flow of the developed methodology is the vectorized source code of the
ITU-T G.729A and G.723.1 speech coders. The vectorization was performed by using a
number of assembly-like C macros. The C-level macros define precisely the vector
instruction set extensions that were described in Chapter 5. The custom flow parses these
C macros and creates a SystemC module that instantiates these SIMD instructions. The
SystemC model is verified by using the test vectors that were produced by running the
vectorized algorithm, in order to ensure that this "packing" of the SIMD ISA has not
changed the functionality of the operations. A number of pipeline registers and the bypass
taps are specified in the synthesis tool. The SystemC datapath is then synthesized to
technology-independent RTL VHDL using a commercial SystemC synthesizer.
Afterwards the RTL model is validated again with the same test vectors
to ensure that the SystemC-to-RTL transformation was successful. The resulting
RTL datapath is instantiated in the exposed vector unit of the SS_SPARC processor and
further decoding logic is added to the core processor to enable the execution of these
extensions [6]. The combined RTL (vector extensions and SS_SPARC platform) goes through
the standard design flow which was described in sections 7.2.1 to 7.2.3. The
statistical power analysis results for both the SystemC-accelerator and the RTL-accelerator
synthesis, along with a VLSI layout, are presented in the following section.
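The verification step in this flow amounts to checking that a packed SIMD operation gives the same per-element result as the scalar reference. The sketch below illustrates that idea in Python rather than SystemC. It assumes 16-bit lanes and uses a saturating addition in the style of the ITU-T basic operator add(); the function names and the test vectors are hypothetical and are not taken from the thesis flow.

```python
MAX_16 = 32767
MIN_16 = -32768
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1


def sat_add16(a, b):
    """Scalar reference: 16-bit signed addition with saturation."""
    return max(MIN_16, min(MAX_16, a + b))


def pack(lanes):
    """Pack signed 16-bit lanes into one wide integer, lane 0 in the low bits."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & LANE_MASK) << (LANE_BITS * i)
    return word


def unpack(word, num_lanes):
    """Unpack a wide integer into signed 16-bit lanes."""
    lanes = []
    for i in range(num_lanes):
        raw = (word >> (LANE_BITS * i)) & LANE_MASK
        lanes.append(raw - (1 << LANE_BITS) if raw & 0x8000 else raw)
    return lanes


def vadd_sat(word_a, word_b, num_lanes):
    """Model of a packed SIMD saturating add over num_lanes lanes."""
    a = unpack(word_a, num_lanes)
    b = unpack(word_b, num_lanes)
    return pack([sat_add16(x, y) for x, y in zip(a, b)])


def matches_reference(xs, ys):
    """True if the packed operation reproduces the scalar results lane by lane."""
    n = len(xs)
    packed_result = unpack(vadd_sat(pack(xs), pack(ys), n), n)
    return packed_result == [sat_add16(x, y) for x, y in zip(xs, ys)]


# Hypothetical test vectors covering both saturation limits and ordinary values.
xs = [32767, -32768, 100, -1]
ys = [1, -1, 200, 1]
print(matches_reference(xs, ys), unpack(vadd_sat(pack(xs), pack(ys), 4), 4))
```

In the actual flow the same comparison is made twice, once for the SystemC model and once for the synthesized RTL, so a single set of test vectors guards both transformations.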
7.6.3 Micro-Architecture Results 
In this work the statistical power consumption and the area were obtained for the
SystemC-defined accelerators as well as the RTL accelerators. Figure 7-18 depicts the
power consumption of both implementations for all the configurations: vector length
256-bit (VLMAX 16) and 512-bit (VLMAX 32), and vector contexts 1, 2, 4 and 8, for different
clock periods. In this figure each requested period is plotted against its corresponding
power. From the set of results it is evident that the general shape of the graphs is
similar.
[Figure: ITU Vector Engine Power Consumption]
Figure 7-18: ITU VCore Power Results
The SystemC accelerators show a pre-route power overhead of 3% to 15% compared to the
hand-coded (RTL) designs over the synthesis campaign. These results demonstrate that
SystemC synthesis is fairly reliable and can achieve power consumption close to that of
traditional RTL synthesis [6]. Additionally, from the RTL results it can be seen that the
power consumption is affected significantly by the vector length (VLMAX). The
power consumption shows a fourfold increase for context 1 between VLMAX 16 and
VLMAX 32, whereas the increase for context 8 is twofold between these vector lengths.
Figure 7-19 shows the pre-route area of both sets of accelerators, also for all
configurations. In this case, the SystemC implementation exhibited even better area usage
characteristics, with a reduction in the range of 2% to 18% compared to the hand-coded
(RTL) designs. This is due to the fact that the SystemC synthesizer makes more
intelligent resource allocation than the traditional RTL design flow [6].
Additionally, from the graph it can be seen that the area is directly related to the vector
length (VLMAX). The results show a fairly constant twofold
increase in area between VLMAX 16 and VLMAX 32 across the whole range of
contexts. This was expected, as the vector length affects the number and the width of the
vector datapaths, which for VLMAX 32 are doubled.
[Figure: ITU Vector Engine Area vs Delay; x-axis: Configuration]
Figure 7-19: ITU VCore Area-Delay Results
The SystemC-defined datapath configuration (VLMAX=32, target 250 MHz) was taken
through the entire flow to a VLSI macro. The resulting VCORE (including the datapath,
the vector contexts, all multiplexing/bypassing and the LSU return path) is shown in
Figure 7-20. The design includes approximately 70K gates and six 16x128-bit dual-port
RAM macros, three for each vector register file of the two CPU contexts. A two-stage
pipelined architecture was specified, which resulted in a worst-case (0.9 V, 125 C)
maximum frequency of 213 MHz [6].
Figure 7-20: Two-context, 256-bit ITU vector engine
7.7 Summary 
This chapter discussed the verification methodology used to validate the vector processor
and its associated units, along with the synthesis and back-end flow of the vector datapath.
Statistical power/area/frequency results were presented for the vector datapath and the
vector coprocessor as a whole for different configurations (VLMAX, frequency) after a
scripted synthesis/place-and-route campaign. The VLSI layouts of the vector datapath and
the vector processor, and their physical parameters, were also illustrated. This was
followed by the description of the SS_SPARC ASIC platform, the SystemC modelling of
the vector instruction set extensions and their subsequent synthesis to low-level RTL. The
ESL implementation of the vector extensions was then inserted into the exposed vector
engine of the SS_SPARC processor, and statistical power analysis results for both the
SystemC-accelerator and the RTL-accelerator datapaths were presented and compared.
7.8 References 
[1] S. Akella, "Guidelines For Design Synthesis Using Synopsys Design Compiler,"
Department of Computer Science Engineering, University of South Carolina,
Columbia, December 2000.
[2] G. R. Beck, D. W. L. Yen, and T. L. Anderson, "The Cydra 5
minisupercomputer: Architecture and implementation," The Journal of
Supercomputing, vol. 7, pp. 143-180, May 1993.
[3] "Design Compiler 2003.06," Synopsys Inc., 2003.
[4] "Advanced Logic Technology - 0.13µ," Taiwan Semiconductor Manufacturing
Company, 2006.
[5] C. Kozyrakis, "A Media-Enhanced Vector Architecture for Embedded Memory
Systems," Technical Report CSD-99-1059, University of California at Berkeley,
1999.
[6] V. A. Chouliaras, K. Koutsomyti, T. Jacobs, et al., "SystemC-defined SIMD
instructions for high performance SoC architectures," in 13th IEEE
International Conference on Electronics, Circuits and Systems, Nice, France,
2006, pp. 822-825.
[7] "AMBA AXI Specification," http://www.arm.com/armtech/AXI.
[8] "The Sparc Architecture Manual Version 8," www.sparc.com.
CHAPTER 8 
CONCLUSIONS 
The aim of this thesis was to study the potential acceleration of two speech coding
algorithms, namely G.729A and G.723.1, through their efficient implementation on a
configurable extensible vector embedded CPU architecture. The outcome of this work
was the optimization of both C reference codes and the design and implementation of a
parametric (configurable) vector processor, to explore the effects of different
configurations (VLMAX, number of registers and accumulators) and thus probe the
microarchitecture space. The optimized reference codes and the vector architecture were
fully validated with the use of the ITU-supplied test vectors. This chapter presents the
main contributions of this research and proposes further work which leads on from this
project.
8.1 Contribution of this thesis 
At the beginning of this work, in order to investigate the potential acceleration of both
speech codecs, both C reference codes were profiled to identify the
computational workload distribution. This revealed that the most CPU-intensive parts of
the codes were the DSP emulation functions of the reference implementations (e.g. 66.7%
of the total machine instructions in the G.723.1 decoder). Additionally, these
algorithms exhibited a large amount of data-level parallelism. Therefore it was decided
that an efficient implementation of these basic operations in the form of a configurable
vector processor with a targeted, data-parallel architecture could achieve a leading
area/power/cost result.
An optimization methodology was developed in which custom vector and scalar ISA
extensions were identified and inserted into both reference codes in place of the DLP
loops and other non-vectorizable parts of the codes respectively. The optimized codes
were verified and run on the SimpleScalar tool set for all ITU-T test vectors, over a range
of vector lengths, to evaluate the performance of the vector architecture prior to its
implementation in hardware. For this purpose the simulator was modified and extended to
include the added state (coprocessor scalar and vector state) and the scalar and vector
extensions.
The architectural results were very promising, demonstrating a reduction in the dynamic
instruction count of 58% and 71% for the G.729A and G.723.1 speech coders
respectively when the vector instructions were introduced, and a further 18% and 9%
reduction in dynamic instruction count when the scalar instructions were applied. The
overall simulation results indicated that the area/performance points of interest lie
between the 64-bit (VLMAX 4) and 256-bit (VLMAX 16) wide configurations, as there was
little further improvement beyond a vector length of 16 (256 bits) due to the size of
the speech frames. These speech codecs operate on frames (blocks) of 240 samples, and
these frames are further divided into subframes of 60 samples; hence a rapid performance
improvement is seen at lower vector lengths. At a vector length of 4, the coprocessor
would save 71.6% of the dynamic instruction count of the G.729A encoder and almost
75% for the G.723.1 encoder. At a vector length of 16, the coprocessor would only save
another 4.4% and 5% for G.729A and G.723.1 respectively, and no significant
improvement emerges beyond that. In addition, both sets of results revealed that the
maximum benefit is achieved by the combination of custom vector and scalar
architectures. These results conclusively showed the potential benefit of applying custom
instructions and having associated coprocessor vector functional units.
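The diminishing return at larger vector lengths can be illustrated with a first-order strip-mining count. The sketch below is an illustration only and is not how the instruction counts in this work were measured: it counts how many vector operations are needed to cover one 60-sample subframe for a range of vector lengths, ignoring all scalar overhead. The function name is hypothetical.

```python
def vector_iterations(num_samples, vlmax):
    """Number of vector operations needed to cover num_samples elements."""
    if num_samples < 0 or vlmax <= 0:
        raise ValueError("num_samples must be >= 0 and vlmax must be > 0")
    return (num_samples + vlmax - 1) // vlmax  # ceiling division


SUBFRAME = 60  # subframe size quoted in the text
counts = {vl: vector_iterations(SUBFRAME, vl) for vl in (1, 4, 8, 16, 32, 64)}

for vl, iterations in counts.items():
    print(f"VLMAX {vl:2d}: {iterations:2d} vector operations per subframe")
```

Going from scalar execution to VLMAX 4 removes 45 of the 60 loop iterations, VLMAX 16 removes a further 11, and VLMAX 32 only 2 more, which is the same flattening trend that the simulation results show.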
Another aspect of this work was the SystemC modelling of the vector instruction set
extensions and their subsequent synthesis into low-level RTL. This work was undertaken
to explore faster routes to silicon for SIMD extensions, compared to the established RTL
flow. These ESL-implemented vector extensions were inserted into the exposed vector
engine of the SS_SPARC ASIC processor and statistical power analysis results, for both
the SystemC-accelerator and the RTL-accelerator datapaths, were presented and
compared. The synthesis results showed that SystemC synthesis was fairly
reliable and achieved power consumption close to that of traditional RTL synthesis.
The main contribution of this research project was the full design and implementation of
the proposed vector datapath of the vector processor. The vector pipeline is a SIMD array
of functional units with a configurable 2-way SIMD or scalar organization. It has a
four-stage pipeline organization and is parameterised along both the architecture and the
microarchitecture axes. Only a few modifications were made to the Leon3 pipeline to enable
its tight coupling to the vector processor.
The vector datapath was verified by using an FLI-based testbench that applied the
ITU-supplied test vectors. Finally, statistical power/area/frequency results were obtained for
the vector datapath and the vector coprocessor as a whole for different configurations
(VLMAX, frequency) after a scripted synthesis/place-and-route campaign. In addition,
the VLSI layouts of the vector datapath and the vector processor, and their physical
parameters, were obtained. From these results, the vector datapath with the VLMAX=16
configuration showed a worst case (0.9 V, 125 C) maximum frequency of 186.2 MHz, an area
of 1.02 mm² and a power of 61.3 mW. The whole vector coprocessor with VLMAX=16 and a
vector data cache configuration of 4-way, 8 Kbytes, 128-byte block length and 2 sub-blocks
per block achieved a maximum frequency of 182 MHz, an area of 6.29 mm² and a power of
74.97 mW.
8.2 Suggestions for future research 
The vector processor was developed to efficiently execute the G.729A and G.723.1
speech coding standards in an embedded application. Since its vector and scalar ISAs are
based on the basic operations of these algorithms, all the ITU G.7xx speech coding
standards which share the same emulation operations (or a subset of them), such as G.711,
G.726, G.727, G.728 and G.729, can also be accelerated by adapting them for this vector
processor. This adaptation involves optimization with the insertion of vector and scalar
extensions.
The developed vector processor can be attached to any scalar CPU with very few
modifications to its interface. This gives it the great advantage of being able to interface
to different architectures and ASIC platforms. This allows further research on novel
multimedia architectures that incorporate VoIP/speech coding functionality.
Since the VLSU is also parametric, several different configurations can be
implemented and their performance in terms of area and power dissipation investigated.
In addition, entirely different VLSUs, with cascade or parallel TAG/DATA organization,
can be attached to the vector datapath with few modifications to their interface with the
vector datapath. As the current VLSU has a cascade TAG/DATA organization, supporting a
parallel organization requires an extra signal to be added to the output multiplexer of the
VDP1 stage. This signal will select the return load data from the VLSU at the end of the
VDP1 stage, as the load takes only one cycle for a parallel TAG/DATA configuration instead
of the two needed for the cascade configuration.
As already discussed, the vector coprocessor implementation is technology independent
and can therefore be re-targeted to different silicon technologies. The multiple
configurations (VLMAX, number of registers and accumulators) lead to different
statistical power/area/frequency points, thus covering a large part of the implementation
spectrum.
Another area of research would be to investigate the benefits of ESL techniques instead
of programmable architectures by coupling the ESL-implemented vector datapath to other
ESL-defined architectures.
As multimedia applications consist of more than one time-critical execution thread, there
is a significant amount of coarse-grained parallelism. Therefore, attaching the vector
coprocessor to a multithreaded architecture could further accelerate multimedia-rich
applications that incorporate speech coding [1].
Another interesting approach would be an architecture that combines the best of ILP and
DLP techniques for an optimal implementation. This architecture would combine vector
instructions with out-of-order execution, register renaming and even simultaneous
multithreaded execution. Such implementations are very promising according to Espasa
[2] and Quintana [3], and the Tarantula project [4], in which a vector unit is attached to the
superscalar Alpha engine. This is also the domain of the SS_SPARC processor [5].
8.3 References
[1] K. Diefendorff and P. Dubey, "How Multimedia Workloads Will Change
Processor Design," in IEEE Computer, vol. 30, September 1997, pp. 43-45.
[2] R. Espasa and M. Valero, "Exploiting Instruction- and Data-Level Parallelism,"
in IEEE Micro, vol. 17, September 1997, pp. 20-27.
[3] F. Quintana, R. Espasa, and M. Valero, "A Case for Merging the
ILP and DLP Paradigms," in 6th Euromicro Workshop on Parallel and
Distributed Processing, Madrid, Spain, 1998, pp. 217-224.
[4] R. Espasa, F. Ardanaz, J. Gago, et al., "Tarantula: A Vector Extension to the
Alpha Architecture," in Proceedings of the 29th Annual International
Symposium on Computer Architecture (ISCA '02), Anchorage, Alaska, 2002, pp.
281-292.
[5] V. A. Chouliaras, K. Koutsomyti, T. Jacobs, et al., "SystemC-defined SIMD
instructions for high performance SoC architectures," in 13th IEEE International
Conference on Electronics, Circuits and Systems, Nice, France, December 2006,
pp. 822-825.
APPENDIX A VECTOR AND SCALAR ISA 
ldvlen_r
Instruction Format
ldvlen_r format
| 0000000  | 0        | imm      |
21         15         8          0
Syntax
ldvlen_r(imm)
where:
imm      is a numeric constant
Description
The ldvlen_r instruction loads an immediate into the Vector Length Register.
Example
ldvlen_r(16); //Vector Length Register is set to 16
vldw
Instruction Format
vldw format
| 0000001  | vrd      | srs1     | 0        |
21         15         10         5          0
Syntax
vldw(vrd, srs1)
where:
vrd      is the destination vector register
srs1     is the address of the variable in memory
Description
The vldw instruction loads the vector register vrd from the memory address given in scalar
register srs1.
Example
vldw(2, 3); //Load vreg2 from address given in sreg3
vldwn
Instruction Format
vldwn format
| 0000010  | vrd      | srs1     | 0        |
21         15         10         5          0
Syntax
vldwn(vrd, srs1)
where:
vrd      is the destination vector register
srs1     is the address of the variable in memory
Description
The vldwn instruction loads vector register vrd downward from the memory address given
in scalar register srs1.
Example
vldwn(2, addr); //Load vreg2 downwards from address given
                  in sreg3
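The difference between the upward (vldw) and downward (vldwn) loads can be illustrated with a small behavioural model in C. This is only a sketch under stated assumptions: the appendix does not define the element width or the exact address sequence, so the model assumes 16-bit elements, an element-indexed memory array, and that "downward" means successive elements are read from decreasing addresses. The names VLMAX_MODEL, model_vldw and model_vldwn are hypothetical and do not appear in the thesis.

```c
#include <assert.h>
#include <stdint.h>

#define VLMAX_MODEL 16 /* assumed maximum vector length, in elements */

/* Assumed vldw behaviour: element i is read from mem[base + i]. */
static void model_vldw(int16_t vreg[VLMAX_MODEL], const int16_t *mem,
                       int base, int vlen)
{
    assert(vlen >= 0 && vlen <= VLMAX_MODEL);
    for (int i = 0; i < vlen; i++)
        vreg[i] = mem[base + i];
}

/* Assumed vldwn behaviour: element i is read from mem[base - i]. */
static void model_vldwn(int16_t vreg[VLMAX_MODEL], const int16_t *mem,
                        int base, int vlen)
{
    assert(vlen >= 0 && vlen <= VLMAX_MODEL && base - (vlen - 1) >= 0);
    for (int i = 0; i < vlen; i++)
        vreg[i] = mem[base - i];
}
```

Under these assumptions a downward load delivers the same data as an upward load in reversed element order, which is convenient for time-reversed filter kernels.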
vstw
Instruction Format
vstw format
| 0000011  | 0        | vrs2     | srs1     |
21         15         10         5          0
Syntax
vstw(vrs2, srs1)
where:
vrs2     is the source vector register
srs1     is the memory address
Description
The vstw instruction stores the vector register vrs2 to the memory address given in
scalar register srs1.
Example
vstw(3, 1); //Store vreg3 to memory address
vstwn
Instruction Format
vstwn format
| 0000100  | 0        | vrs2     | srs1     |
21         15         10         5          0
Syntax
vstwn(vrs2, srs1)
where:
vrs2     is the source vector register
srs1     is the memory address
Description
The vstwn instruction stores a vector register downward to the memory address given in
srs1.
Example
vstwn(3, addr); //Store vreg3 downwards to addr
vldaccw
Instruction Format
vldaccw format
| 0101000  | vaccd    | srs1     | 0        |
21         15         10         5          0
Syntax
vldaccw(vaccd, srs1)
where:
vaccd    is the destination vector accumulator
srs1     is the address of the variable in memory
Description
The vldaccw instruction loads a 32-bit word into the vector accumulator from memory.
Example
vldaccw(0, addr); //Load vacc0 from addr
vstacc
Instruction Format
vstacc format
| 0010100  | vacc     | vaccelem | srs1     |
21         15         10         5          0
Syntax
vstacc(vacc, velem, srs1)
where:
vacc     is the source vector accumulator
velem    is the element of the source vector accumulator
srs1     is the memory address
Description
The vstacc instruction stores a vector accumulator element (32-bit) to memory.
Example
vstacc(1,0,addr); //Store element 0 of vacc1 to addr
vaccclr
Instruction Format
vaccclr format
| 0010000  | vacc     | 0        |
21         15         10         0
Syntax
vaccclr(vacc)
where:
vacc     is the vector accumulator
Description
The vaccclr instruction sets the value in the vector accumulator vacc to zero (clear).
Example
vaccclr(1); //Set vacc1 to zero
vsplatacci
Instruction Format
vsplatacci format
| 0010001  | vaccd    | srs1     | 0        |
21         15         10         5          0
Syntax
vsplatacci(vaccd, srs1)
where:
vaccd    is the destination vector accumulator
srs1     is the value (32-bit) that is splatted into the vector accumulator
Description
The vsplatacci instruction splats the 32-bit scalar value into the vector
accumulator.
Example
vsplatacci(0, 3); //Splat vacc0 with the value of the
                    scalar register 3
vldacceli
Instruction Format
vldacceli format
| 0010010  | vaccd    | vaccelem | imm      |
21         15         10         5          0
Syntax
vldacceli(vaccd, velem, imm)
where:
vaccd    is the destination vector accumulator
velem    is the destination element of the vector accumulator
imm      is the immediate to be loaded
Description
The vldacceli instruction loads an immediate value into a vector accumulator element.
Example
vldacceli(1,0,16); //Load immediate 16 into element 0 of
                     vacc1
vsplat_h_r
Instruction Format
vsplat_h_r format
| 0100010  | vrd      | srs1     | 0        |
21         15         10         5          0
Syntax
vsplat_h_r(vrd, srs1)
where:
vrd      is the destination vector register
srs1     is the scalar register value
Description
The vsplat_h_r instruction splats a 16-bit word of scalar register srs1 to all the
elements of vector register vrd.
Example
vsplat_h_r(1, 3); //Splat 16-bit value of sreg3 to vreg1
vmvacctre
Instruction Format
vmvacctre format
| 0011000  | vrd      | vacc1    | amount   |
21         15         10         5          0
Syntax
vmvacctre(vrd, vacc1, amount)
where:
vrd      is the destination vector register
vacc1    is the vector accumulator
amount   is the shift amount
Description
The vmvacctre instruction extracts the high (amount=16) or low (amount=0) part of the
even elements of the vector accumulator and loads them into the even elements of the
vector register vrd.
Example
vmvacctre(2,1,16); //Extracts high the even elements of
                     vacc1 and loads them to vreg2
vmvacctro
Instruction Format
vmvacctro format
| 0011001  | vrd      | vacc1    | amount   |
21         15         10         5          0
Syntax
vmvacctro(vrd, vacc1, amount)
where:
vrd      is the destination vector register
vacc1    is the vector accumulator
amount   is the shift amount
Description
The vmvacctro instruction extracts the high (amount=16) or low (amount=0) part of the
odd elements of the vector accumulator and loads them into the even elements of the
vector register vrd.
Example
vmvacctro(3,0,0); //Extracts low the odd elements of
                    vacc0 and loads them to vreg3
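The role of the shift amount in the accumulator-to-register moves can be sketched in C as follows. The sketch assumes that each accumulator element is 32 bits wide, that the extracted part is a 16-bit half, and that amount=16 selects the upper half while amount=0 selects the lower half, as the two examples above suggest. The helper name extract_half is hypothetical, and the mapping of accumulator elements onto vector register elements is deliberately left out because the appendix does not define it.

```c
#include <assert.h>
#include <stdint.h>

/* Return the 16-bit half of a 32-bit accumulator element selected by the
 * shift amount (assumed: 16 = high half, 0 = low half). */
static int16_t extract_half(int32_t acc_elem, int amount)
{
    assert(amount == 0 || amount == 16);
    uint32_t bits = ((uint32_t)acc_elem >> amount) & 0xFFFFu;
    /* Reinterpret the 16 bits as a signed value without relying on
     * implementation-defined narrowing conversions. */
    if (bits >= 0x8000u)
        return (int16_t)((int32_t)bits - 0x10000);
    return (int16_t)bits;
}
```

For example, under these assumptions extract_half(0x12345678, 16) yields 0x1234 and extract_half(0x12345678, 0) yields 0x5678.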
vmvrtacce
Instruction Format
vmvrtacce format
| 0100110  | vaccd    | vrs1     | amount   |
21         15         10         5          0
Syntax
vmvrtacce(vaccd, vrs1, amount)
where:
vaccd    is the destination vector accumulator
vrs1     is the source vector register
amount   is the shift amount
Description
The vmvrtacce instruction deposits high (amount=16) or low (amount=0) the even
elements of the vector register to the vector accumulator.
Example
vmvrtacce(0,3,16); //Deposits high the even elements of
                     vreg3 to vacc0
vmvrtacco
Instruction Format
vmvrtacco format
| 0100111  | vaccd    | vrs1     | amount   |
21         15         10         5          0
Syntax
vmvrtacco(vaccd, vrs1, amount)
where:
vaccd    is the destination vector accumulator
vrs1     is the source vector register
amount   is the shift amount
Description
The vmvrtacco instruction deposits high (amount=16) or low (amount=0) the odd
elements of the vector register to the vector accumulator.
Example
vmvrtacco(1,3,0); //Deposits low the odd elements of
                    vreg3 to vacc1
mvgpr2vr
Instruction Format
mvgpr2vr format
| 1001101  | vrd      | velem    | grs1     |
21         15         10         5          0
Syntax
mvgpr2vr(vrd, velem, grs1)
where:
vrd      is the destination vector register
velem    is the destination element of the vector register
grs1     is the source general purpose register (Leon)
Description
The mvgpr2vr instruction moves the scalar contents (32-bit) of the general purpose
register to the vector register element.
Example
mvgpr2vr(1,2,5); //Move the contents of the general
                   purpose register 5 to the 2nd element
                   of vreg1
mvvr2gpr
Instruction Format
mvvr2gpr format
| 1001110  | grd      | velem    | vrs1     |
21         15         10         5          0
Syntax
mvvr2gpr(grd, velem, vrs1)
where:
grd      is the destination general purpose register (Leon)
velem    is the vector register element
vrs1     is the source vector register
Description
The mvvr2gpr instruction moves the contents of the vector register element to the
general purpose register (Leon).
Example
mvvr2gpr(2,3,5); //Move the contents of the 3rd element
                   of vreg5 to the general purpose
                   register 2
vaddh
Instruction Format
vaddh format
| 0011010  | vrd      | vrs1     | vrs2     |
21         15         10         5          0
Syntax
vaddh(vrd, vrs1, vrs2)
where:
vrd      is the destination vector register
vrs1     is the first source vector register (operand 1)
vrs2     is the second source vector register (operand 2)
Description
The vaddh instruction performs short addition (16-bit) of source vector registers vrs1
and vrs2 and places the result to the destination vector register vrd.
Example
vaddh(5,2,3); //vreg5=vreg2+vreg3 (16-bits)
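The three-operand instruction format used by vaddh and most other entries in this appendix can be illustrated by a small encoder written in C. This is a sketch rather than the actual assembler or intrinsic implementation: it assumes that the bit positions printed under each format diagram delimit a 7-bit opcode in bits 21..15 followed by three 5-bit fields in bits 14..10, 9..5 and 4..0, and it ignores any upper bits of the coprocessor instruction word. The function name encode_3op is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a 7-bit opcode and three 5-bit operand fields into the low 22 bits
 * of an instruction word (assumed field boundaries: 21..15, 14..10, 9..5,
 * 4..0). */
static uint32_t encode_3op(uint32_t opcode, uint32_t f1, uint32_t f2,
                           uint32_t f3)
{
    assert(opcode < 128u && f1 < 32u && f2 < 32u && f3 < 32u);
    return (opcode << 15) | (f1 << 10) | (f2 << 5) | f3;
}
```

With these assumptions, vaddh(5,2,3) with opcode 0011010 would be packed as encode_3op(0x1A, 5, 2, 3), which evaluates to 0xD1443.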
vitu_sub_r
Instruction Format
vitu_sub_r format
| 0011011  | vrd      | vrs1     | vrs2     |
21         15         10         5          0
Syntax
vitu_sub_r(vrd, vrs1, vrs2)
where:
vrd      is the destination vector register
vrs1     is the first source vector register (operand 1)
vrs2     is the second source vector register (operand 2)
Description
The vitu_sub_r instruction performs short subtraction (16-bit) of source vector
registers vrs1 and vrs2 and places the result to the destination vector register vrd.
Example
vitu_sub_r(5,2,3); //vreg5=vreg2-vreg3 (16-bits)
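The instruction name suggests that vitu_sub_r mirrors the ITU-T sub basic operator, which saturates the 16-bit result instead of wrapping. The appendix does not state this explicitly, so the following C model should be read as an illustration of that assumption only. The helper names sat16 and model_vitu_sub are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 32-bit intermediate to the signed 16-bit range. */
static int16_t sat16(int32_t x)
{
    if (x > 32767)
        return 32767;
    if (x < -32768)
        return -32768;
    return (int16_t)x;
}

/* Element-wise 16-bit subtraction with saturation (assumed semantics). */
static void model_vitu_sub(int16_t *vrd, const int16_t *vrs1,
                           const int16_t *vrs2, int vlen)
{
    assert(vlen >= 0);
    for (int i = 0; i < vlen; i++)
        vrd[i] = sat16((int32_t)vrs1[i] - (int32_t)vrs2[i]);
}
```

If the hardware wraps instead of saturating, only sat16 would need to change; the element-wise structure of the loop stays the same.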
vaddacc
Instruction Format
vaddacc format
| 0010111  | vaccd    | vacc1    | vacc2    |
21         15         10         5          0
Syntax
vaddacc(vaccd, vacc1, vacc2)
where:
vaccd    is the destination vector accumulator
vacc1    is the first source vector accumulator (operand 1)
vacc2    is the second source vector accumulator (operand 2)
Description
The vaddacc instruction performs long addition (32-bit) of source vector accumulators
vacc1 and vacc2 and places the result to the destination vector accumulator vaccd.
Example
vaddacc(0,0,1); //vacc0=vacc0+vacc1 (32-bits)
vsubacc
Instruction Format
vsubacc format
| 0100101  | vaccd    | vacc1    | vacc2    |
21         15         10         5          0
Syntax
vsubacc(vaccd, vacc1, vacc2)
where:
vaccd    is the destination vector accumulator
vacc1    is the first source vector accumulator (operand 1)
vacc2    is the second source vector accumulator (operand 2)
Description
The vsubacc instruction performs long subtraction (32-bit) of source vector
accumulators vacc1 and vacc2 and places the result to the destination vector
accumulator vaccd.
Example
vsubacc(0,0,1); //vacc0=vacc0-vacc1 (32-bits)
vaccaddreduce
Instruction Format
vaccaddreduce format
| 0010011  | vacc     | 0        |
21         15         10         0
Syntax
vaccaddreduce(vacc)
where:
vacc     is the vector accumulator
Description
The vaccaddreduce instruction add-reduces all the elements of the vector
accumulator vacc to a 32-bit value that is placed to its zero element.
Example
vaccaddreduce(1); //Add-reduce vector accumulator 1
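The add-reduce operation can be modelled in C as a sum over all accumulator elements written back to element zero. The sketch assumes that the remaining elements are left unchanged and, because the appendix does not specify overflow behaviour, it simply asserts that the sum fits in 32 bits. The function name model_vaccaddreduce is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sum all n accumulator elements and place the result in element 0. */
static void model_vaccaddreduce(int32_t *vacc, int n)
{
    assert(n >= 1);
    int64_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += vacc[i];
    /* Overflow handling is not defined in the appendix; assumed absent. */
    assert(sum >= INT32_MIN && sum <= INT32_MAX);
    vacc[0] = (int32_t)sum;
}
```

This is the step that would typically close a dot-product loop: the multiply-accumulate instructions build per-element partial sums and a single add-reduce collapses them into one scalar result.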
vitu_mult_e_r
Instruction Format
vitu_mult_e_r format
| 0011110  | vrd      | vrs1     | vrs2     |
21         15         10         5          0
Syntax
vitu_mult_e_r(vrd, vrs1, vrs2)
where:
vrd      is the destination vector register
vrs1     is the first source vector register (operand 1)
vrs2     is the second source vector register (operand 2)
Description
The vitu_mult_e_r instruction performs signed short multiplication (16-bit) to the
even elements of the source vector registers vrs1 and vrs2 and places the result to the
even elements of the destination vector register vrd.
Example
vitu_mult_e_r(3,1,2); //vreg3=vreg1*vreg2 (even elements)
vitu_mult_o_r
Instruction Format
vitu_mult_o_r format
| 0011111  | vrd      | vrs1     | vrs2     |
21         15         10         5          0
Syntax
vitu_mult_o_r(vrd, vrs1, vrs2)
where:
vrd      is the destination vector register
vrs1     is the first source vector register (operand 1)
vrs2     is the second source vector register (operand 2)
Description
The vitu_mult_o_r instruction performs signed short multiplication (16-bit) to the
odd elements of the source vector registers vrs1 and vrs2 and places the result to the
odd elements of the destination vector register vrd.
Example
vitu_mult_o_r(3,1,2); //vreg3=vreg1*vreg2 (odd elements)
vitu_mult_r_e_r
Instruction Format
vitu_mult_r_e_r format
| 0011100  | vrd      | vrs1     | vrs2     |
21         15         10         5          0
Syntax
vitu_mult_r_e_r(vrd, vrs1, vrs2)
where:
vrd      is the destination vector register
vrs1     is the first source vector register (operand 1)
vrs2     is the second source vector register (operand 2)
Description
The vitu_mult_r_e_r instruction performs signed short multiplication (16-bit) with
rounding to the even elements of the source vector registers vrs1 and vrs2 and places
the result to the even elements of the destination vector register vrd.
Example
vitu_mult_r_e_r(3,1,2); //vreg3=vreg1*vreg2 (with
                          rounding - even elements)
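A plausible reading of "multiplication with rounding" is the ITU-T mult_r basic operator, in which the 32-bit product is rounded by adding 0x4000, shifted right by 15 and saturated to 16 bits. The appendix does not confirm that the instruction follows this definition exactly, so the C model below is a sketch under that assumption. The helper names model_mult_r and model_vitu_mult_r_e are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Q15 multiply with rounding and saturation (assumed mult_r semantics). */
static int16_t model_mult_r(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b + 0x4000; /* at most 2^30 + 2^14 */
    /* Floor division by 2^15 without shifting a negative value. */
    int32_t r = (p >= 0) ? (p >> 15) : -(((-p) + 32767) >> 15);
    if (r > 32767)
        r = 32767;
    if (r < -32768)
        r = -32768;
    return (int16_t)r;
}

/* Apply the rounded multiply to the even elements only; odd elements of
 * the destination are left untouched in this model. */
static void model_vitu_mult_r_e(int16_t *vrd, const int16_t *vrs1,
                                const int16_t *vrs2, int vlen)
{
    assert(vlen >= 0);
    for (int i = 0; i < vlen; i += 2)
        vrd[i] = model_mult_r(vrs1[i], vrs2[i]);
}
```

The odd-element variant would differ only in the starting index of the loop (i = 1 instead of i = 0).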
vitu_mult_r_o_r
Instruction Format
vitu_mult_r_o_r format
| 0011101  | vrd      | vrs1     | vrs2     |
21         15         10         5          0
Syntax
vitu_mult_r_o_r(vrd, vrs1, vrs2)
where:
vrd      is the destination vector register
vrs1     is the first source vector register (operand 1)
vrs2     is the second source vector register (operand 2)
Description
The vitu_mult_r_o_r instruction performs signed short multiplication (16-bit) with
rounding to the odd elements of the source vector registers vrs1 and vrs2 and places
the result to the odd elements of the destination vector register vrd.
Example
vitu_mult_r_o_r(3,1,2); //vreg3=vreg1*vreg2 (with
                          rounding - odd elements)
vitu_i_mult_e_r
Instruction Format
vitu_i_mult_e_r format
| 0101001  | vrd      | vrs1     | vrs2     |
21         15         10         5          0
Syntax
vitu_i_mult_e_r(vrd, vrs1, vrs2)
where:
vrd      is the destination vector register
vrs1     is the first source vector register (operand 1)
vrs2     is the second source vector register (operand 2)
Description
The vitu_i_mult_e_r instruction performs integer short multiplication (16-bit) to
the even elements of the source vector registers vrs1 and vrs2 and places the result to
the even elements of the destination vector register vrd.
Example
vitu_i_mult_e_r(3,1,2); //vreg3=vreg1*vreg2 (integer -
                          even elements)
vitu_i_mult_o_r
Instruction Format
vitu_i_mult_o_r format
| 0101010  | vrd      | vrs1     | vrs2     |
21         15         10         5          0
Syntax
vitu_i_mult_o_r(vrd, vrs1, vrs2)
where:
vrd      is the destination vector register
vrs1     is the first source vector register (operand 1)
vrs2     is the second source vector register (operand 2)
Description
The vitu_i_mult_o_r instruction performs integer short multiplication (16-bit) to
the odd elements of the source vector registers vrs1 and vrs2 and places the result to
the odd elements of the destination vector register vrd.
Example
vitu_i_mult_o_r(3,1,2); //vreg3=vreg1*vreg2 (integer -
                          odd elements)
vmace
Instruction Format
vmace format
| 0001100  | vaccd    | vrs1     | vrs2     |
21         15         10         5          0
Syntax
vmace(vaccd, vrs1, vrs2)
where:
vaccd    is the destination vector accumulator
vrs1     is the first source vector register (operand 1)
vrs2     is the second source vector register (operand 2)
Description
The vmace instruction performs long multiplication (32-bit) to the even elements of the
source vector registers vrs1 and vrs2 and adds the product to the even elements of the
destination vector accumulator vaccd.
Example
vmace(0,1,2); //Perform mac to even elements of vacc0,
                vreg1 and vreg2
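The even-element multiply-accumulate can be illustrated with the following C model. It is a sketch under assumptions that the appendix leaves open: the source elements are treated as signed 16-bit values, accumulator element i is paired with vector register element i, and overflow of the 32-bit accumulator is assumed not to occur (the model asserts this rather than saturating). The function name model_vmace is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* For every even index i: vaccd[i] += vrs1[i] * vrs2[i] (32-bit product). */
static void model_vmace(int32_t *vaccd, const int16_t *vrs1,
                        const int16_t *vrs2, int vlen)
{
    assert(vlen >= 0);
    for (int i = 0; i < vlen; i += 2) {
        int64_t s = (int64_t)vaccd[i] + (int64_t)vrs1[i] * (int64_t)vrs2[i];
        /* Accumulator overflow behaviour is not specified; assumed absent. */
        assert(s >= INT32_MIN && s <= INT32_MAX);
        vaccd[i] = (int32_t)s;
    }
}
```

The odd-element counterpart would start the loop at index 1, and the multiply-subtract variants would subtract the product instead of adding it.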
vmaco
Instruction Format
vmaco format
| 0001101  | vaccd    | vrs1     | vrs2     |
21         15         10         5          0
Syntax
vmaco(vaccd, vrs1, vrs2)
where:
vaccd    is the destination vector accumulator
vrs1     is the first source vector register (operand 1)
vrs2     is the second source vector register (operand 2)
Description
The vmaco instruction performs long multiplication (32-bit) to the odd elements of the
source vector registers vrs1 and vrs2 and adds the product to the odd elements of the
destination vector accumulator vaccd.
Example
vmaco(0,1,2); //Perform mac to odd elements of vacc0,
                vreg1 and vreg2
vmsue
Instruction Format
vmsue format
| 0001110  | vaccd    | vrs1     | vrs2     |
21         15         10         5          0
Syntax
vmsue(vaccd, vrs1, vrs2)
where:
vaccd    is the destination vector accumulator
vrs1     is the first source vector register (operand 1)
vrs2     is the second source vector register (operand 2)
Description
The vmsue instruction performs long multiplication (32-bit) to the even elements of the
source vector registers vrs1 and vrs2 and subtracts the product from the even elements of
the destination vector accumulator vaccd.
Example
vmsue(0,1,2); //Perform multiply-subtract to even elements
                of vacc0, vreg1 and vreg2
vmsuo
Instruction Format
vmsuo format
| 0001111  | vaccd    | vrs1     | vrs2     |
21         15         10         5          0
Syntax
vmsuo(vaccd, vrs1, vrs2)
where:
vaccd    is the destination vector accumulator
vrs1     is the first source vector register (operand 1)
vrs2     is the second source vector register (operand 2)
Description
The vmsuo instruction performs long multiplication (32-bit) to the odd elements of the
source vector registers vrs1 and vrs2 and subtracts the product from the odd elements of
the destination vector accumulator vaccd.
Example
vmsuo(0,1,2); //Perform multiply-subtract to odd elements
                of vacc0, vreg1 and vreg2
vshli
Instruction Format
vshli format
| 0001010  | vrd      | vrs1     | amount   |
21         15         10         5          0
Syntax
vshli(vrd, vrs1, amount)
where:
vrd      is the destination vector register
vrs1     is the vector register (operand 1) to be shifted
amount   is the shift amount (immediate)
Description
The vshli instruction performs short shift left (16-bit) to the vector register vrs1 by
immediate (amount).
Example
vshli(3,1,4); //Shift left vreg1 by 4 and put result to
                vreg3
vshri
Instruction Format
vshri format
| 0001011  | vrd      | vrs1     | amount   |
21         15         10         5          0
Syntax
vshri(vrd, vrs1, amount)
where:
vrd      is the destination vector register
vrs1     is the vector register (operand 1) to be shifted
amount   is the shift amount (immediate)
Description
The vshri instruction performs short shift right (16-bit) to the vector register vrs1 by
immediate (amount).
Example
vshri(3,1,4); //Shift right vreg1 by 4 and put result to
                vreg3
vshlr
Instruction Format
vshlr format
| 0100001  | vrd      | vrs1     | vrs2     |
21         15         10         5          0
Syntax
vshlr(vrd, vrs1, vrs2)
where:
vrd      is the destination vector register
vrs1     is the vector register (operand 1) to be shifted
vrs2     is the vector register (operand 2) that contains the shift amount
Description
The vshlr instruction performs short shift left (16-bit) to the vector register vrs1 by
the amount of the vector register vrs2.
Example
vshlr(5,1,3); //Shift left vreg1 by amount that is in
                vreg3 and put result to vreg5
vshrr
Instruction Format
vshrr format
| 0100000  | vrd      | vrs1     | vrs2     |
21         15         10         5          0
Syntax
vshrr(vrd, vrs1, vrs2)
where:
vrd      is the destination vector register
vrs1     is the vector register (operand 1) to be shifted
vrs2     is the vector register (operand 2) that contains the shift amount
Description
The vshrr instruction performs short shift right (16-bit) to the vector register vrs1 by
the amount of the vector register vrs2 and places the result to the destination vector
register vrd.
Example
vshrr(5,1,3); //Shift right vreg1 by amount that is in
                vreg3 and put result to vreg5
vlshlacc
Instruction Format
vlshlacc format
| 0010110  | 0        | vacc     | amount   |
21         15         10         5          0
Syntax
vlshlacc(vacc, amount)
where:
vacc     is the vector accumulator
amount   is the shift amount
Description
The vlshlacc instruction performs long (32-bit) shift left to the vector accumulator
vacc by amount (immediate).
Example
vlshlacc(1,3); //Long shift left vacc1 by 3 and put result
                 to vacc1
vlshracc
Instruction Format
vlshracc format
| 0010101  | 0        | vacc     | amount   |
21         15         10         5          0
Syntax
vlshracc(vacc, amount)
where:
vacc     is the vector accumulator
amount   is the shift amount
Description
The vlshracc instruction performs long (32-bit) shift right to the vector accumulator
vacc by amount (immediate).
Example
vlshracc(1,3); //Long shift right vacc1 by 3 and put
                 result to vacc1
vlshlaccr
Instruction Format
vlshlaccr format
| 0100100  | 0        | vacc     | vrs2     |
21         15         10         5          0
Syntax
vlshlaccr(vacc, vrs2)
where:
vacc     is the vector accumulator
vrs2     is the vector register with the shift amount
Description
The vlshlaccr instruction performs long (32-bit) shift left to the vector accumulator
vacc by the amount of the vector register vrs2.
Example
vlshlaccr(1,2); //Long shift left vacc1 by amount that is
                  in vreg2 and put result to vacc1
vlshraccr
Instruction Format
vlshraccr format
| 0100011  | 0        | vacc     | vrs2     |
21         15         10         5          0
Syntax
vlshraccr(vacc, vrs2)
where:
vacc     is the vector accumulator
vrs2     is the vector register with the shift amount
Description
The vlshraccr instruction performs long (32-bit) shift right to the vector accumulator
vacc by the amount of the vector register vrs2.
Example
vlshraccr(1,2); //Long shift right vacc1 by amount that is
                  in vreg2 and put result to vacc1
vcmp
Instruction Format
vcmp format
| 0000101  | 0        | vacc1    | vacc2    |
21         15         10         5          0
Syntax
vcmp(vacc1, vacc2)
where:
vacc1    is the first source vector accumulator (operand 1)
vacc2    is the second source vector accumulator (operand 2)
Description
The vcmp instruction compares two vector accumulators (vacc1, vacc2). If vacc1
is greater than vacc2 then the predication flag becomes 1 (true) else 0 (false).
Example
vcmp(0,1); //Compares vacc0 with vacc1
vrcmp
Instruction Format
vrcmp format
| 0000110  | 0        | vrs1     | vrs2     |
21         15         10         5          0
Syntax
vrcmp(vrs1, vrs2)
where:
vrs1     is the first source vector register (operand 1)
vrs2     is the second source vector register (operand 2)
Description
The vrcmp instruction compares two vector registers (vrs1, vrs2). If vrs1 is
greater than vrs2 then the predication flag becomes 1 (true) else 0 (false).
Example
vrcmp(1,2); //Compares vreg1 with vreg2
vcmp_h_ge
Instruction Format
vcmp_h_ge format
| 0000111  | 0        | vrs1     | 0        |
21         15         10         5          0
Syntax
vcmp_h_ge(vrs1)
where:
vrs1     is the vector register to be compared
Description
The vcmp_h_ge instruction checks if vector register vrs1 is greater than or equal to
zero and if it is true sets the predication flag to 1 (true) else 0 (false).
Example
vcmp_h_ge(2); //Compares vreg2 with zero
vmerge_t_h_r
Instruction Format
vmerge_t_h_r format
| 0001000  | vrd      | vrs1     | vrs2     |
21         15         10         5          0
Syntax
vmerge_t_h_r(vrd, vrs1, vrs2)
where:
vrd      is the destination vector register
vrs1     is the first source vector register (operand 1)
vrs2     is the second source vector register (operand 2)
Description
The vmerge_t_h_r instruction merges two vector registers (vrs1, vrs2)
according to the predication flag value. If pred is 1 then vrd=vrs1 else vrd=vrs2.
Example
vmerge_t_h_r(3,4,2); //if pred=1 vreg3=vreg4 else vreg3=vreg2
vmerge
Instruction Format
vmerge format
| 0001001  | vaccd    | vacc1    | vacc2    |
21         15         10         5          0
Syntax
vmerge(vaccd, vacc1, vacc2)
where:
vaccd    is the destination vector accumulator
vacc1    is the first source vector accumulator (operand 1)
vacc2    is the second source vector accumulator (operand 2)
Description
The vmerge instruction merges two vector accumulators (vacc1, vacc2) according
to the predication flag value. If pred is 1 then vaccd=vacc1 else vaccd=vacc2.
Example
vmerge(0,0,1); //if pred=1 vacc0=vacc0 else vacc0=vacc1
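The compare and merge instructions are meant to be used as a pair, which the following C model illustrates by computing an element-wise maximum. The appendix does not say whether the predication flag is a single bit or one bit per element; the sketch assumes one flag per element, which is what makes the pair useful for removing branches from vectorised loops. The names model_vcmp and model_vmerge are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Set pred[i] to 1 where a[i] > b[i], else 0 (assumed per-element flags). */
static void model_vcmp(uint8_t *pred, const int32_t *a, const int32_t *b,
                       int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        pred[i] = (uint8_t)(a[i] > b[i]);
}

/* Select a[i] where pred[i] is set, otherwise b[i]. */
static void model_vmerge(int32_t *d, const int32_t *a, const int32_t *b,
                         const uint8_t *pred, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        d[i] = pred[i] ? a[i] : b[i];
}
```

Calling model_vcmp followed by model_vmerge with the same two operands leaves the larger of each pair of elements in the destination, which corresponds to the sequence vcmp(0,1); vmerge(0,0,1); in the notation of this appendix.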
m2sld16
Instruction Format
m2sld16 format
| 0101011  | srd      | srs1     | 0        |
21         15         10         5          0
Syntax
m2sld16(srd, srs1)
where:
srd      is the destination scalar register
srs1     is the address of the variable in memory
Description
The m2sld16 instruction loads a 16-bit value to scalar register srd from the memory
address given in scalar register srs1.
Example
m2sld16(2, 3); //Load (16-bit) to sreg2 from address that
                 is in sreg3
m2sld32
Instruction Format
m2sld32 format
| 0101100  | srd      | srs1     | 0        |
21         15         10         5          0
Syntax
m2sld32(srd, srs1)
where:
srd      is the destination scalar register
srs1     is the address of the variable in memory
Description
The m2sld32 instruction loads a 32-bit value to scalar register srd from the memory
address given in scalar register srs1.
Example
m2sld32(4, 3); //Load (32-bit) to sreg4 from address that
                 is in sreg3
m2sst16
Instruction Format
m2sst16 format
| 0101101  | 0        | srs2     | srs1     |
21         15         10         5          0
Syntax
m2sst16(srs2, srs1)
where:
srs2     is the source scalar register
srs1     is the memory address
Description
The m2sst16 instruction stores a 16-bit value of scalar register srs2 to the memory
address given in scalar register srs1.
Example
m2sst16(4, 3); //Store (16-bit) sreg4 to memory address
                 that is in sreg3
m2sst32
Instruction Format
m2sst32 format
| 0101110  | 0        | srs2     | srs1     |
21         15         10         5          0
Syntax
m2sst32(srs2, srs1)
where:
srs2     is the source scalar register
srs1     is the memory address
Description
The m2sst32 instruction stores a 32-bit value of scalar register srs2 to the memory
address given in scalar register srs1.
Example
m2sst32(2, 1); //Store (32-bit) sreg2 to memory address
                 that is in sreg1
mvgpr2sr
Instruction Format
mvgpr2sr format
| 1001011  | srd      | 0        | grs1     |
21         15         10         5          0
Syntax
mvgpr2sr(srd, grs1)
where:
srd      is the destination scalar register
grs1     is the source general purpose register (Leon)
Description
The mvgpr2sr instruction moves the scalar contents (32-bit) of the general purpose
register grs1 to the scalar register srd.
Example
mvgpr2sr(1,2); //Move the contents of greg2 to sreg1
mvsr2gpr
Instruction Format
mvsr2gpr format
| 1001100  | grd      | 0        | srs1     |
21         15         10         5          0
Syntax
mvsr2gpr(grd, srs1)
where:
grd      is the destination general purpose register (Leon)
srs1     is the source scalar register
Description
The mvsr2gpr instruction moves the contents of the scalar register srs1 to the general
purpose register (Leon) grd.
Example
mvsr2gpr(2,3); //Move the contents of sreg3 to greg2
m2sladd
Instruction Format
m2sladd format
| 0101111  | srd      | srs1     | srs2     |
21         15         10         5          0
Syntax
m2sladd(srd, srs1, srs2)
where:
srd      is the destination scalar register
srs1     is the first source scalar register (operand 1)
srs2     is the second source scalar register (operand 2)
Description
The m2sladd instruction performs long addition (32-bit) of source scalar registers srs1
and srs2 and places the result to the destination scalar register srd.
Example
m2sladd(4,2,3); //sreg4=sreg2+sreg3 (32-bit)
m2slsub
Instruction Format
m2slsub format
| 0110001  | srd      | srs1     | srs2     |
21         15         10         5          0
Syntax
m2slsub(srd, srs1, srs2)
where:
srd      is the destination scalar register
srs1     is the first source scalar register (operand 1)
srs2     is the second source scalar register (operand 2)
Description
The m2slsub instruction performs long subtraction (32-bit) of source scalar registers
srs1 and srs2 and places the result to the destination scalar register srd.
Example
m2slsub(4,2,3); //sreg4=sreg2-sreg3 (32-bit)
m2sadd
Instruction Format
m2sadd format
| 0110000  | srd      | srs1     | srs2     |
21         15         10         5          0
Syntax
m2sadd(srd, srs1, srs2)
where:
srd      is the destination scalar register
srs1     is the first source scalar register (operand 1)
srs2     is the second source scalar register (operand 2)
Description
The m2sadd instruction performs short addition (16-bit) of source scalar registers srs1
and srs2 and places the result to the destination scalar register srd.
Example
m2sadd(4,2,3); //sreg4=sreg2+sreg3 (16-bit)
m2ssub
Instruction Format
m2ssub format
| 1000000  | srd      | srs1     | srs2     |
21         15         10         5          0
Syntax
m2ssub(srd, srs1, srs2)
where:
srd      is the destination scalar register
srs1     is the first source scalar register (operand 1)
srs2     is the second source scalar register (operand 2)
Description
The m2ssub instruction performs short subtraction (16-bit) of source scalar registers
srs1 and srs2 and places the result to the destination scalar register srd.
Example
m2ssub(4,2,3); //sreg4=sreg2-sreg3 (16-bit)
m2slmac
Instruction Format
m2slmac format
| 0111011  | srd      | srs1     | srs2     |
21         15         10         5          0
Syntax
m2slmac(srd, srs1, srs2)
where:
srd      is the destination scalar register
srs1     is the first source scalar register (operand 1)
srs2     is the second source scalar register (operand 2)
Description
The m2slmac instruction performs long multiplication (32-bit) to source scalar registers
srs1 and srs2 and adds the product to the destination scalar register srd.
Example
m2slmac(4,2,3); //sreg4=sreg4+(sreg2*sreg3)
m2slmsu
Instruction Format
m2slmsu format
| 0111100  | srd      | srs1     | srs2     |
21         15         10         5          0
Syntax
m2slmsu(srd, srs1, srs2)
where:
srd      is the destination scalar register
srs1     is the first source scalar register (operand 1)
srs2     is the second source scalar register (operand 2)
Description
The m2slmsu instruction performs long multiplication (32-bit) to source scalar registers
srs1 and srs2 and subtracts the product from the destination scalar register srd.
Example
m2slmsu(4,2,3); //sreg4=sreg4-(sreg2*sreg3)
m2slmult
Instruction Format
m2slmult format
| 0111111  | srd      | srs1     | srs2     |
21         15         10         5          0
Syntax
m2slmult(srd, srs1, srs2)
where:
srd      is the destination scalar register
srs1     is the first source scalar register (operand 1)
srs2     is the second source scalar register (operand 2)
Description
The m2slmult instruction performs long multiplication (32-bit) to source scalar
registers srs1 and srs2 and places the result to the destination scalar register srd.
Example
m2slmult(4,2,3); //sreg4=sreg2*sreg3 (32-bit)
m2smult
Instruction Format
m2smult format
| 1000101  | srd      | srs1     | srs2     |
21         15         10         5          0
Syntax
m2smult(srd, srs1, srs2)
where:
srd      is the destination scalar register
srs1     is the first source scalar register (operand 1)
srs2     is the second source scalar register (operand 2)
Description
The m2smult instruction performs short multiplication (16-bit) to source scalar registers
srs1 and srs2 and places the result to the destination scalar register srd.
Example
m2smult(4,2,3); //sreg4=sreg2*sreg3 (16-bit)
m2smult_r
Instruction Format
m2smult_r format
| 0111110  | srd      | srs1     | srs2     |
21         15         10         5          0
Syntax
m2smult_r(srd, srs1, srs2)
where:
srd      is the destination scalar register
srs1     is the first source scalar register (operand 1)
srs2     is the second source scalar register (operand 2)
Description
The m2smult_r instruction performs short multiplication (16-bit) with rounding to
source scalar registers srs1 and srs2 and places the result to the destination scalar
register srd.
Example
m2smult_r(4,2,3); //sreg4=sreg2*sreg3 (with rounding)
m2simult
Instruction Format
m2simult format
| 1001010  | srd      | srs1     | srs2     |
21         15         10         5          0
Syntax
m2simult(srd, srs1, srs2)
where:
srd      is the destination scalar register
srs1     is the first source scalar register (operand 1)
srs2     is the second source scalar register (operand 2)
Description
The m2simult instruction performs short integer multiplication (16-bit) to source scalar
registers srs1 and srs2 and places the result to the destination scalar register srd.
Example
m2simult(4,2,3); //sreg4=sreg2*sreg3 (integer)
m2slshl
Instruction Format
m2slshl format
| 0110101  | srd      | srs1     | amount   |
21         15         10         5          0
Syntax
m2slshl(srd, srs1, amount)
where:
srd      is the destination scalar register
srs1     is the scalar register (operand 1) to be shifted
amount   is the shift amount (immediate)
Description
The m2slshl instruction performs long shift left (32-bit) to the scalar register srs1 by
immediate (amount) and places the result to the destination scalar register srd.
Example
m2slshl(3,1,4); //Long shift left sreg1 by 4 and put
                  result to sreg3
m2slshr
Instruction Format
m2slshr format
| 0110111 | srd | srs1 | amount |
21        15    10     5        0
Syntax
m2slshr(srd, srs1, amount)
where:
srd     is the destination scalar register
srs1    is the scalar register (operand 1) to be shifted
amount  is the shift amount (immediate)
Description
The m2slshr instruction performs a long (32-bit) right shift of scalar register srs1
by the immediate amount and places the result in the destination scalar register srd.
Example
m2slshr(3,1,4); //Long shift right sreg1 by 4 and put result in sreg3
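The right shift can be sketched in C as an arithmetic (sign-preserving) shift, following the ITU-T `L_shr` basic operator. Whether m2slshr is arithmetic rather than logical is not stated in the appendix, so that choice and the name `l_shr_sketch` are assumptions.

```c
#include <stdint.h>

/* Hypothetical reference model (assumed semantics: ITU-T L_shr). */
int32_t l_shr_sketch(int32_t x, unsigned amount)
{
    if (amount >= 31)                  /* every value bit is shifted out */
        return (x < 0) ? -1 : 0;
    if (x < 0)                         /* portable arithmetic shift */
        return ~((~x) >> amount);
    return x >> amount;
}
```

With sreg1 = -64, the example `m2slshr(3,1,4)` would leave -4 in sreg3 under this model.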
m2slshl_rg
Instruction Format
m2slshl_rg format
| 0110110 | srd | srs1 | srs2 |
21        15    10     5      0
Syntax
m2slshl_rg(srd, srs1, srs2)
where:
srd     is the destination scalar register
srs1    is the scalar register (operand 1) to be shifted
srs2    is the scalar register (operand 2) with the shift amount
Description
The m2slshl_rg instruction performs a long (32-bit) left shift of scalar register
srs1 by the amount held in scalar register srs2 and places the result in the destination
scalar register srd.
Example
m2slshl_rg(3,1,4); //Long shift left sreg1 by amount that is in sreg4 and put result in sreg3
m2slshr_rg
Instruction Format
m2slshr_rg format
| 0111000 | srd | srs1 | srs2 |
21        15    10     5      0
Syntax
m2slshr_rg(srd, srs1, srs2)
where:
srd     is the destination scalar register
srs1    is the scalar register (operand 1) to be shifted
srs2    is the scalar register (operand 2) with the shift amount
Description
The m2slshr_rg instruction performs a long (32-bit) right shift of scalar register
srs1 by the amount held in scalar register srs2 and places the result in the destination
scalar register srd.
Example
m2slshr_rg(3,1,4); //Long shift right sreg1 by amount that is in sreg4 and put result in sreg3
m2sshl
Instruction Format
m2sshl format
| 1000010 | srd | srs1 | amount |
21        15    10     5        0
Syntax
m2sshl(srd, srs1, amount)
where:
srd     is the destination scalar register
srs1    is the scalar register (operand 1) to be shifted
amount  is the shift amount (immediate)
Description
The m2sshl instruction performs a short (16-bit) left shift of scalar register srs1 by the
immediate amount and places the result in the destination scalar register srd.
Example
m2sshl(3,1,4); //Short shift left sreg1 by 4 and put result in sreg3
m2sshr
Instruction Format
m2sshr format
| 1000011 | srd | srs1 | amount |
21        15    10     5        0
Syntax
m2sshr(srd, srs1, amount)
where:
srd     is the destination scalar register
srs1    is the scalar register (operand 1) to be shifted
amount  is the shift amount (immediate)
Description
The m2sshr instruction performs a short (16-bit) right shift of scalar register srs1 by the
immediate amount and places the result in the destination scalar register srd.
Example
m2sshr(3,1,4); //Short shift right sreg1 by 4 and put result in sreg3
m2sshl_rg
Instruction Format
m2sshl_rg format
| 0111101 | srd | srs1 | srs2 |
21        15    10     5      0
Syntax
m2sshl_rg(srd, srs1, srs2)
where:
srd     is the destination scalar register
srs1    is the scalar register (operand 1) to be shifted
srs2    is the scalar register (operand 2) with the shift amount
Description
The m2sshl_rg instruction performs a short (16-bit) left shift of scalar register srs1
by the amount held in scalar register srs2 and places the result in the destination scalar
register srd.
Example
m2sshl_rg(3,1,4); //Short shift left sreg1 by amount that is in sreg4 and put result in sreg3
m2sshr_rg
Instruction Format
m2sshr_rg format
| 1000100 | srd | srs1 | srs2 |
21        15    10     5      0
Syntax
m2sshr_rg(srd, srs1, srs2)
where:
srd     is the destination scalar register
srs1    is the scalar register (operand 1) to be shifted
srs2    is the scalar register (operand 2) with the shift amount
Description
The m2sshr_rg instruction performs a short (16-bit) right shift of scalar register
srs1 by the amount held in scalar register srs2 and places the result in the destination
scalar register srd.
Example
m2sshr_rg(3,1,4); //Short shift right sreg1 by amount that is in sreg4 and put result in sreg3
m2slnegate
Instruction Format
m2slnegate format
| 0110010 | srd | srs1 | 0 |
21        15    10     5   0
Syntax
m2slnegate(srd, srs1)
where:
srd     is the destination scalar register
srs1    is the source scalar register (operand 1)
Description
The m2slnegate instruction negates the 32-bit value in scalar register srs1 with
saturation and stores the result in the destination scalar register srd.
Example
m2slnegate(3,1); //Negate (32-bit) value of sreg1 and put result in sreg3
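Saturation matters for exactly one input: the most negative 32-bit value has no positive counterpart. The C sketch below shows this, following the ITU-T `L_negate` basic operator; the name `l_negate_sketch` and the correspondence to `L_negate` are assumptions.

```c
#include <stdint.h>

/* Hypothetical reference model of a saturating 32-bit negation. */
int32_t l_negate_sketch(int32_t x)
{
    /* -INT32_MIN is not representable, so clamp it to INT32_MAX */
    return (x == INT32_MIN) ? INT32_MAX : -x;
}
```

Without the clamp, negating 0x80000000 in two's complement would return the same negative value.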
m2slabs
Instruction Format
m2slabs format
| 0110011 | srd | srs1 | 0 |
21        15    10     5   0
Syntax
m2slabs(srd, srs1)
where:
srd     is the destination scalar register
srs1    is the source scalar register (operand 1)
Description
The m2slabs instruction produces the absolute value of the 32-bit value in scalar
register srs1 and places the result in the destination scalar register srd.
Example
m2slabs(3,1); //Absolute (32-bit) value of sreg1 and put result in sreg3
m2snorm_l
Instruction Format
m2snorm_l format
| 0110100 | srd | srs1 | 0 |
21        15    10     5   0
Syntax
m2snorm_l(srd, srs1)
where:
srd     is the destination scalar register
srs1    is the source scalar register (operand 1)
Description
The m2snorm_l instruction produces the number of left shifts needed to normalise the
32-bit value in scalar register srs1 and places the result in the destination scalar register
srd.
Example
m2snorm_l(3,1); //Normalise value (32-bit) of sreg1 and put result in sreg3
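The shift count can be sketched in C after the ITU-T `norm_l` basic operator, which counts the left shifts that bring a positive value into the range 0x40000000..0x7FFFFFFF (or a negative value into 0x80000000..0xC0000000). That correspondence, the handling of 0 and -1, and the name `norm_l_sketch` are assumptions taken from the ITU-T operator rather than from the appendix.

```c
#include <stdint.h>

/* Hypothetical reference model (assumed semantics: ITU-T norm_l). */
int norm_l_sketch(int32_t x)
{
    int shifts = 0;
    if (x == 0)
        return 0;                      /* nothing to normalise */
    if (x == -1)
        return 31;                     /* all ones: 31 redundant sign bits */
    if (x < 0)
        x = ~x;                        /* count redundant sign bits as zeros */
    while (x < 0x40000000) {           /* until bit 30 is set */
        x <<= 1;
        shifts++;
    }
    return shifts;
}
```

With sreg1 = 0x00010000, the example `m2snorm_l(3,1)` would leave 14 in sreg3 under this model.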
m2sldeposit_l
Instruction Format
m2sldeposit_l format
| 0111001 | srd | srs1 | 0 |
21        15    10     5   0
Syntax
m2sldeposit_l(srd, srs1)
where:
srd     is the destination scalar register
srs1    is the source scalar register (operand 1)
Description
The m2sldeposit_l instruction deposits the 16 LSB of scalar register srs1 into the
16 LSB of the 32-bit destination scalar register srd. The 16 MSB of srd are sign extended.
Example
m2sldeposit_l(3,1); //Deposit 16 LSB of sreg1 into 16 LSB of sreg3
m2sldeposit_h
Instruction Format
m2sldeposit_h format
| 0111010 | srd | srs1 | 0 |
21        15    10     5   0
Syntax
m2sldeposit_h(srd, srs1)
where:
srd     is the destination scalar register
srs1    is the source scalar register (operand 1)
Description
The m2sldeposit_h instruction deposits the 16 LSB of scalar register srs1 into the
16 MSB of the 32-bit destination scalar register srd. The 16 LSB of srd are set to zero.
Example
m2sldeposit_h(3,1); //Deposit 16 LSB of sreg1 into 16 MSB of sreg3
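The two deposit operations differ only in which half of the 32-bit result receives the 16-bit input. The C sketch below follows the descriptions above and the ITU-T `L_deposit_l` and `L_deposit_h` basic operators; the function names are assumptions, and the 16 LSB of srs1 are passed in as an `int16_t` for brevity.

```c
#include <stdint.h>

/* Hypothetical models: place a 16-bit value in the low or high half. */
int32_t l_deposit_l_sketch(int16_t v)
{
    return (int32_t)v;                 /* low half, upper 16 bits sign extended */
}

int32_t l_deposit_h_sketch(int16_t v)
{
    return (int32_t)v * 65536;         /* high half, lower 16 bits zero */
}
```

Multiplying by 65536 is used instead of a left shift so that negative inputs stay within defined C behaviour.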
m2snegate
Instruction Format
m2snegate format
| 1000110 | srd | srs1 | 0 |
21        15    10     5   0
Syntax
m2snegate(srd, srs1)
where:
srd     is the destination scalar register
srs1    is the source scalar register (operand 1)
Description
The m2snegate instruction negates the 16-bit value in scalar register srs1 and places
the result in the destination scalar register srd.
Example
m2snegate(4,2); //Negate value (16-bit) of sreg2 and put result in sreg4
m2sabs_s
Instruction Format
m2sabs_s format
| 1000001 | srd | srs1 | 0 |
21        15    10     5   0
Syntax
m2sabs_s(srd, srs1)
where:
srd     is the destination scalar register
srs1    is the source scalar register (operand 1)
Description
The m2sabs_s instruction produces the absolute value of the 16-bit value in scalar
register srs1 and places the result in the destination scalar register srd.
Example
m2sabs_s(4,2); //Absolute value (16-bit) of sreg2 and put result in sreg4
m2sextract_h
Instruction Format
m2sextract_h format
| 1000111 | srd | srs1 | 0 |
21        15    10     5   0
Syntax
m2sextract_h(srd, srs1)
where:
srd     is the destination scalar register
srs1    is the source scalar register (operand 1)
Description
The m2sextract_h instruction extracts the 16 MSB of the 32-bit value of scalar
register srs1 and places them into the 16 LSB of the destination scalar register srd.
The 16 MSB of srd are zero extended.
Example
m2sextract_h(4,2); //Extract 16 MSB of sreg2 and put them in sreg4
m2sextract_l
Instruction Format
m2sextract_l format
| 1001000 | srd | srs1 | 0 |
21        15    10     5   0
Syntax
m2sextract_l(srd, srs1)
where:
srd     is the destination scalar register
srs1    is the source scalar register (operand 1)
Description
The m2sextract_l instruction extracts the 16 LSB of the 32-bit value of scalar
register srs1 and places them into the 16 LSB of the destination scalar register srd.
The 16 MSB of srd are zero extended.
Example
m2sextract_l(4,2); //Extract 16 LSB of sreg2 and put them in sreg4
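Both extract operations reduce to a mask and, for the high half, a shift. The C sketch below works on raw 32-bit register contents and follows the descriptions above, including the zero extension of the upper 16 bits of srd; the function names are assumptions.

```c
#include <stdint.h>

/* Hypothetical register-level models of the two extract operations. */
uint32_t extract_h_sketch(uint32_t srs1)
{
    return (srs1 >> 16) & 0xFFFFu;     /* upper half moved to bits 15..0 */
}

uint32_t extract_l_sketch(uint32_t srs1)
{
    return srs1 & 0xFFFFu;             /* lower half kept, upper half cleared */
}
```

With sreg2 = 0x12345678, `m2sextract_h(4,2)` would leave 0x00001234 in sreg4 and `m2sextract_l(4,2)` would leave 0x00005678.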
m2sround
Instruction Format
m2sround format
| 1001001 | srd | srs1 | 0 |
21        15    10     5   0
Syntax
m2sround(srd, srs1)
where:
srd     is the destination scalar register
srs1    is the source scalar register (operand 1)
Description
The m2sround instruction rounds the 16 LSB of the 32-bit value of scalar register
srs1 into its most significant 16 bits with saturation. The result is shifted right by 16 and
placed in the destination scalar register srd.
Example
m2sround(4,2); //Round 32-bit value of sreg2 and put the result in sreg4
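The rounding step can be sketched in C as a saturating addition of 0x8000 followed by extraction of the upper 16 bits, which is how the ITU-T `round` basic operator is defined. That correspondence and the name `round_sketch` are assumptions; the appendix states only the rounding, the saturation and the shift by 16.

```c
#include <stdint.h>

/* Hypothetical reference model (assumed semantics: ITU-T round). */
int16_t round_sketch(int32_t x)
{
    int32_t sum;
    if (x > INT32_MAX - 0x8000)        /* the addition would overflow */
        return INT16_MAX;              /* saturate */
    sum = x + 0x8000;                  /* add half of the lower word's range */
    if (sum < 0)                       /* portable arithmetic shift by 16 */
        return (int16_t)~((~sum) >> 16);
    return (int16_t)(sum >> 16);
}
```

With sreg2 = 0x00018000, the example `m2sround(4,2)` would leave 2 in sreg4 under this model, while 0x00017FFF would round down to 1.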
APPENDIX B SIGNAL DESCRIPTION 
Signals for Vector Datapath
Signal | Type | Width | Brief Description
SIMD_r | Control | 1 bit | Selects two 16-bit operations (when='0') or one 32-bit operation (when='1')
sel_sub_r | Control | 1 bit | Selects addition (when='0') or subtraction (when='1')
sel_sfctn_r | Control | 2 bits | Selects function for vadd unit
sel_round_r | Control | 1 bit | Selects round operation (when='1')
sel_cmp_r | Control | 1 bit | Selects compare operation (when='1')
sel_mult_r | Control | 2 bits | Selects multiplication type for vmult unit
sel_mult_r_r | Control | 1 bit | Selects mult (when='0') or mult_r (when='1')
cmd_shift_r | Control | cmd_shift_type | Selects shift operation
sel_misc_r | Control | 4 bits | Selects miscellaneous operation
sel_vu_r | Control | sel_vu_type | Selects vector unit for operation
vrs1_rdaddr_r | Control | Log2(VREGS) | Source 1 vector register address
vrs1_rden_r | Control | VLMAX*2 | Source 1 vector register read-enable
vrs2_rdaddr_r | Control | Log2(VREGS) | Source 2 vector register address
vrs2_rden_r | Control | VLMAX*2 | Source 2 vector register read-enable
vrd_addr | Control | Log2(VREGS) | Destination vector register address
vrd_wen | Control | VLMAX*2 | Destination vector register write-enable
srs1_rdaddr_r | Control | Log2(SREGS) | Source 1 scalar register address
srs1_rden_r | Control | 4 bits | Source 1 scalar register read-enable
srs2_rdaddr_r | Control | Log2(SREGS) | Source 2 scalar register address
srs2_rden_r | Control | 4 bits | Source 2 scalar register read-enable
srs3_rdaddr_r | Control | Log2(SREGS) | Source 3 scalar register address
srs3_rden_r | Control | 4 bits | Source 3 scalar register read-enable
srd_waddr_r | Control | Log2(SREGS) | Destination scalar register address
srd_wen_r | Control | 4 bits | Destination scalar register write-enable
vacc1_rdaddr_r | Control | Log2(ACC_NUMBER) | Source 1 vector accumulator address
vacc1_rden_r | Control | ACC_WIDTH/32 | Source 1 vector accumulator read-enable
vacc2_rdaddr_r | Control | Log2(ACC_NUMBER) | Source 2 vector accumulator address
vacc2_rden_r | Control | ACC_WIDTH/32 | Source 2 vector accumulator read-enable
vacc_waddr_r | Control | Log2(ACC_NUMBER) | Destination vector accumulator address
vacc_wen_r | Control | ACC_WIDTH/32 | Destination vector accumulator write-enable
vlen_wen_r | Control | 1 bit | Write enable for the vlen register
ovf_wen | Control | 1 bit | Write enable for the overflow register
pred_wen | Control | 1 bit | Write enable for the predicate register
vlen_nvalue | Data | Log2(VLMAX) | New value for the vlen register
lst_neg | Control | 1 bit | Selects load/store negative stride (when='1')
opc_valid | Control | 1 bit | Valid signal for the register output
addr_valid | Control | 1 bit | Signal to indicate the address is valid
read | Control | 1 bit | Selects load ('1') or store ('0') instruction
sel_vs | Control | 1 bit | Selects vector ('1') or scalar ('0') instruction
Signal | Type | Width | Brief Description
sel_width | Control | 1 bit | Selects 16-bit ('1') or 32-bit ('0') data width
sel_evod | Control | even_add_type | Selects even, odd or normal operation
sel_opr1 | Control | opr_type | Selects the type of the first operand
sel_opr2 | Control | opr_type | Selects the type of the second operand
stg2_vadd | Control | 1 bit | Stage 2 vadd unit enable
sel_vaccred | Control | 1 bit | Vaccreduce unit enable
gpdata | Data | 32 bits | Data from Leon register file
sel_st | Control | 1 bit | Selects store instruction (when='1')
sel_mask | Control | 1 bit | Selects to mask the result (when='1')

Control signal for Vadd Unit
sel_sfctn | Instruction
00 | add/sub/vrcmp
01 | vcmp_h_ge
10 | vcmp
11 | round

Control signal for Vmult Unit
sel_mult: 00, 01, 10, 11
Instruction: L_mult, mult, mult_r
sel_mult_r | Instruction
0 | mult
1 | mult_r

Control signal for Vmisc Unit
sel_misc | Instruction
0000 | L_negate
0001 | negate
0010 | norm_l
0011 | L_abs
0100 | abs_s
0101 | extract_l
0110 | extract_h
0111 | l_deposit_l
1000 | l_deposit_h
1001 | merge
1010 | merge t h
APPENDIX C G.729A AND G.723.1 FUNCTION RESULTS
This section presents the results from the G.729A speech codec, showing the improvement
made from a function perspective.
[Figure: Acelp_Code_A (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Copy (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Corr_xy2 (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: G_pitch (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Gain_predict (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Get_wegt (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Int_qlpc (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Lag_window (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Lsp_get_quant (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Lsp_get_tdist (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Lsp_prev_compose (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Lsp_prev_extract (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Lsp_select_1 (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Lsp_select_2 (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Pitch_fr3_fast (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Relspwed (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Set_zero (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Decod_ACELP (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Post_Filter (Full Optimization); x-axis: Vector Length (VLMAX)]
This section presents the results from the G.723.1 speech codec, showing the
improvement made from a function perspective.
[Figure: Acelp_Lbc (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Calc_Exc_Rand (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Coder (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Cod_Cng (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Comp_Vad (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Cor_h_x (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Find_' (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Filt_Pw (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: G_Code (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Init_Cod_Cng (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Init_Coder (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Init_Dec_Cng (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Init_Decod (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Init_Vad (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Lsp_Int (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: LPCDiff (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Lsp_int (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Mem_Shift (Full Optimization); x-axis: Vector Length (VLMAX)]
[Figure: Wght_Lpc (Full Optimization); x-axis: Vector Length (VLMAX)]
AUTHOR'S PUBLICATIONS 
The following are the publications that have resulted from the work in this thesis. 
[1] V. A. Chouliaras, J. L. Nunez, S. R. Parr, K. Koutsomyti, D. J. Mulvaney and S.
Datta, "Development of custom vector accelerator for high-performance speech
coding", IEE Electronics Letters, vol. 40, November 2004, pp. 1559-1561.
[2] K. Koutsomyti, S. R. Parr, V. A. Chouliaras, J. L. Nunez, D. J. Mulvaney and S.
Datta, "Scalar and Parametric Vector Accelerators for the G.729A Speech Coding
Standard", in Proceedings of IEE/ACM SoC Design, Test and Technology
Postgraduate Seminar, Loughborough University, September 2004, pp. 53-57.
[3] K. Koutsomyti, S. R. Parr, V. A. Chouliaras, J. L. Nunez, D. J. Mulvaney and S.
Datta, "Configurable Scalar and Vector Accelerators for the G.729A and G.723.1
Speech Coding Standards", in Proceedings of Postgraduate Research Conference
in Electronics, Photonics, Communications and Networks, and Computing Science
(PREP2005), Lancaster University, March 2005, pp. 62-63.
[4] S. R. Parr, K. Koutsomyti, V. A. Chouliaras, J. L. Nunez and D. J. Mulvaney,
"Configurable scalar and Vector Coprocessors for accelerating the G.723.1 and
G.729.A speech coders", in Proceedings of the International Conference on
Signal and Image Processing, Novosibirsk, Russia, June 2005, pp. 340-344.
[5] K. Koutsomyti, S. R. Parr, V. A. Chouliaras and J. L. Nunez, "Applying Data-
Parallel and Scalar Optimizations for the efficient implementation of the G.729A
and G.723.1 Speech Coding Standards", in Proceedings of the 7th IASTED
International Conference on Signal and Image Processing (SIP 2005), Honolulu,
USA, August 2005, pp. 40-45.
[6] V. A. Chouliaras, K. Koutsomyti, T. R. Jacobs, S. R. Parr, D. J. Mulvaney and
R. Thomson, "SystemC-defined SIMD instructions for high performance SoC
architectures", in 13th IEEE International Conference on Electronics, Circuits and
Systems, Nice, France, December 2006, pp. 822-825.
[7] V. A. Chouliaras, K. Koutsomyti, T. R. Jacobs, S. R. Parr, D. J. Mulvaney and R.
Thomson, "SystemC-defined SIMD instructions for a MP/SMT ASIC platform",
in Proceedings of the 24th IEEE Norchip conference in ASIC design, Linkoping,
Sweden, November 2006, pp. 285-288.
[8] K. Koutsomyti, V. A. Chouliaras, S. R. Parr, J. L. Nunez and S. Datta,
"Accelerating speech coding standards through System synthesized SIMD and
Scalar accelerators", in Proceedings of the IEEE International Conference on
Consumer Electronics (ICCE06), Las Vegas, USA, pp. 279-280.
[9] S. R. Parr, K. Koutsomyti, V. A. Chouliaras, "A High Bandwidth
Configurable Load/Store Unit for an Embedded Vector Processor", in
Postgraduate Workshop on Microelectronics and Embedded Systems,
Birmingham, UK, October 2006.


