Application-specific instruction set processor for speech recognition. by Cheung, Man Ting. & Chinese University of Hong Kong Graduate School. Division of Electronic Engineering.
Application-specific Instruction Set Processor 
for Speech Recognition 
CHEUNG Man Ting 
A Thesis Submitted in Partial Fulfillment 
of the Requirements for the Degree of 
Master of Philosophy 
in 
Electronic Engineering 
© The Chinese University of Hong Kong
September 2005 
The Chinese University of Hong Kong holds the copyright of this thesis. Any 
person(s) intending to use a part or whole of the materials in the thesis in a 
proposed publication must seek copyright release from the Dean of the Graduate 
School. 
Acknowledgements 
I would like to thank my supervisor, Professor Choy Chiu-Sing. He gives 
me the vision, guidance and support that contribute to the formation of this 
thesis. He also provides me with excellent research and study environment in 
a well organized and established VLSI team. His intelligence, preciseness and 
patience make his students work conscientiously and always try to strive for 
excellence in the research. I would like to thank Professor Chan Cheong-Fat 
and Professor Pun Kong-Pang for their constructive comments on the work. 
Thanks also to Mr Yeung Wing-Yee, who maintains our laboratory equipment 
and design tools with great effort. Special thanks are given to Qin Chao for his 
continuous support and discussion on the speech algorithms and training toolkit. 
I am very grateful to the graduate students in the VLSI laboratory, especially
Yu Chun-Pong, Xin Ling, Xu Ke, Han Wei, So Pui-Tak, Chan Chi-Hong, Leung
Pak-Keung and Andy Kwok, for their kind assistance throughout my research
work. Lastly, I deeply appreciate my parents for encouraging and supporting
me during my two-year MPhil study.
Abstract of thesis entitled: 
Application-specific Instruction Set Processor for Speech 
Recognition 
Submitted by CHEUNG Man Ting
for the degree of Master of Philosophy 
in Electronic Engineering 
at The Chinese University of Hong Kong in 
September 2005. 
This research investigates the implementation of automatic speech
recognition (ASR) using the Application Specific Instruction Set Processor
(ASIP) methodology. The ASIP approach bridges the gap between traditional
purely software and purely hardware designs. By combining optimized hardware
datapaths with application-specific instructions, it can speed up the
operation significantly.
Our research implements a double-mixture HMM-based isolated word
speech recognizer. Specialized instructions have been developed for the
computationally intensive calculation of the output probability in the
Viterbi search algorithm.
The ASIP is fabricated in a 0.35 µm CMOS technology. One recognition takes
about 1 second at a working frequency of 5 MHz. This is at least one order of
magnitude faster than a comparable design. Further enhancement can halve this
recognition time, doubling the performance. The recognition is fast enough
for real-time speech applications. In addition, the proposed speech
recognizer running on the ASIP platform attains an accuracy of 93.2 %, which
is approximately the same as that of the software recognizer. It is obvious
that our design can meet the stringent








This research implements a double-mixture Hidden Markov Model (HMM)
isolated word speech recognizer. We developed specialized instructions to
execute the computationally intensive Viterbi search algorithm for
calculating the output probabilities.
The application-specific instruction set processor uses 0.35 µm
Complementary Metal-Oxide Semiconductor (CMOS) technology










1 Introduction
1.1 The Emergence of ASIP
1.1.1 Related Work
1.2 Motivation
1.3 ASIP Design Methodologies
1.4 Fundamentals of Speech Recognition
1.5 Thesis outline
2 Automatic Speech Recognition
2.1 Overview of ASR system
2.2 Theory of Front-end Feature Extraction
2.3 Theory of HMM-based Speech Recognition
2.3.1 Hidden Markov Model (HMM)
2.3.2 The Typical Structure of the HMM
2.3.3 Discrete HMMs and Continuous HMMs
2.3.4 The Three Basic Problems for HMMs
2.3.5 Probability Evaluation
2.4 The Viterbi Search Engine
2.5 Isolated Word Recognition (IWR)
3 Design of ASIP Platform
3.1 Instruction Fetch
3.2 Instruction Decode
3.3 Datapath
3.4 Register File Systems
3.4.1 Memory Hierarchy
3.4.2 Register File Organization
3.4.3 Special Registers
3.4.4 Address Generation
3.4.5 Load and Store
4 Implementation of Speech Recognition on ASIP
4.1 Hardware Architecture Exploration
4.1.1 Floating Point and Fixed Point
4.1.2 Multiplication and Accumulation
4.1.3 Pipelining
4.1.4 Memory Architecture
4.1.5 Saturation Logic
4.1.6 Specialized Addressing Modes
4.1.7 Repetitive Operation
4.2 Software Algorithm Implementation
4.2.1 Implementation Using Base Instruction Set
4.2.2 Implementation Using Refined Instruction Set
5 Simulation Results
6 Conclusions and Future Work
Appendices
A Base Instruction Set
B Special Registers
C Chip Microphotograph of ASIP
D The Testing Board of ASIP
Bibliography
List of Tables 
1.1 Trade-off Comparison among GPP, ASIP and ASIC
3.1 The Architectural Parameters of the ASIP
3.2 The Processor Usage of Different Functional Units
3.3 Exploiting Data Parallelism by Observing Data Access Pattern
4.1 The Booth Encoding Table
4.2 The Pointer Update Algorithm of Circular Addressing Mode
4.3 The Speech Parameters for Recognition
5.1 The Specification of Fabricated Chip
5.2 The Simulation Results of Recognition Accuracy
A.1 The Data Processing Instructions
A.2 The Bit Manipulation Instructions
A.3 The Flow Control Instructions
A.4 The Boolean Operation Instructions
A.5 The Configuration Instructions
A.6 The Memory Manipulation Instructions
B.1 The Organization of Special Purpose Registers
List of Figures 
1.1 ASIP Fill the Gap Between Performance and Flexibility [1]
1.2 Overview of ASIP Design Flow
1.3 The Short-time Stationary Characteristic of Speech Signal
2.1 Automatic Speech Recognition System
2.2 The Algorithm for Front-end MFCC Computation
2.3 The Simplified View of Hidden Markov Model
2.4 The HMM with Single-mixture Distribution
2.5 The HMM with Double-mixture Distribution
2.6 The Lattice Structure of Simplified Viterbi Search
2.7 The Training Process for Generating HMM Reference Models
2.8 Using HMMs for Isolated Word Recognition
3.1 The Organization of the ASIP Architecture
3.2 The Structure of Instruction Fetch Unit
3.3 The Structure of Instruction Decoder
3.4 The Structure of the Base Datapath
3.5 The Structure of the Memory Hierarchy
3.6 The Structure of Register File for Data Parallelism
3.7 The Large-Scale View of Register File for Data Parallelism
3.8 The Datapath of Address Generation Unit
4.1 The Datapath of MAC unit
4.2 The Encoding of Booth Using Multiplier
4.3 The Booth Encoding Logic
4.4 The Diagram of 4:2 Compressor
4.5 The CSA with 4-bit CLA
4.6 The Whole Process of Fast Multiplication
4.7 The Pipeline Organization of the Platform
4.8 The Datapath of Saturation Logic
4.9 The Output of FFT Algorithm
4.10 The Structure of Loop Controller
4.11 The Content of Stack
4.12 Examples of Instruction Condensation and Distillation
4.13 Examples of Instruction Condensation and Distillation
5.1 The Program Skeleton of Viterbi Search




Chapter 1
Introduction
1.1 The Emergence of ASIP
Application Specific Integrated Circuits (ASICs) were once the best solution
for many applications. An ASIC implementation has fully customized datapaths
and control logic. Though its performance can be optimized in terms of size,
speed and power consumption, an ASIC is not flexible because it is
implemented entirely in hardware. Any design error found in the chip leads to
additional manufacturing delays and costs.
In contrast, programmable implementations allow a high degree of
reusability, since the devices can be reprogrammed to perform various tasks.
The key benefits of relying on software are lower development costs, shorter
time-to-market cycles and easier adaptation to changing market requirements.
However, programmable devices such as general-purpose processors (GPPs) reach
their limits when running the critical tasks of real-time applications.
Compared with ASICs, general-purpose processors dissipate more power and show
lower performance. The need to balance the trade-offs among performance,
flexibility and other design constraints, as shown in Table 1.1, has led to
the development of Application Specific Instruction Set Processors (ASIPs).
An ASIP is a programmable device whose architecture and instruction set are
optimized for a specific application area. It speeds up the application by
shortening the critical path through tuning the instruction set and
introducing special hardware accelerators. Defining an optimal instruction
set and partitioning the functions between hardware and software are
therefore the key tasks in ASIP design. ASIPs are currently developed to
bridge the gap between the performance and flexibility of pure hardware and
pure software solutions, as shown in Figure 1.1.
                    GPP         ASIP      ASIC
Performance         Low         High      Very High
Flexibility         Excellent   Good      Poor
HW Design Effort    Nil         Large     Very Large
SW Design Effort    Small       Large     Nil
Power               Large       Medium    Small
Engineering Cost    Low         Medium    High
Time-to-market      Short       Medium    Long
Market Volume       Low         Medium    High
Table 1.1: Trade-off Comparison among GPP, ASIP and ASIC
[Figure: ASIC, ASIP and GPP plotted as performance against flexibility, with
the ASIP lying between the ASIC and the GPP]
Figure 1.1: ASIP Fill the Gap Between Performance and Flexibility [1]
1.1.1 Related Work 
Currently existing ASIP design environments can be divided into two streams. 
Some design environments are just based on pre-defined processor platforms 
and provide limited options for user customization. Other environments provide 
architectural processor description language for the designers to describe their 
target architectures by inserting some user-defined structures in the base one. 
R.E.A.L of Philips [2], Xtensa of Tensilica [3], the Jazz DSP processor of
Improv Systems [4] and ARCtangent-A5 of ARC [5] are commercial design
environments that use the pre-defined processor platform approach.
Xtensa of Tensilica is a configurable, extensible and synthesizable RISC
(reduced instruction set computer) processor with a load-store architecture.
Its base architecture has a compact 16- and 24-bit instruction set comprising
80 instructions. The configurable parameters include the choice of 32 or 64
general-purpose 32-bit registers, the cache size, the write buffer size and
the availability of a designer-defined instruction execution unit. Designers
can define the mnemonic, the encoding and the semantics of single-cycle
instructions using the TIE language. In addition, the development environment
includes an ANSI C/C++ compiler, linker, assembler, debugger, code profiler
and instruction set simulator.
The R.E.A.L of Philips is a customizable DSP whose base architecture has two
independent 16x16-bit multipliers, four parallel 16-bit ALUs which can be
combined into two 40-bit ALUs (including eight overflow bits each), and a
number of parallel shifters and saturators. Besides a standard 16- and 32-bit
instruction set, there are additional Application Specific Instructions
(ASIs), which allow the full parallelism of the DSP to be exploited. The ASI
concept allows up to 256 VLIW instructions in a 96-bit-wide look-up table
inside the R.E.A.L. DSP. These are triggered by a special class of 16-bit
instructions, stored in the normal program memory. The ASI look-up table can
be a RAM (for prototype chips), a ROM, a synthesized netlist, or a
combination of these. If the ASI table is implemented in RAM, its contents
can be modified using the JTAG port, or under DSP program control by writing
to dedicated registers within the DSP.
The ARCtangent-A5 of ARC is a four-stage 32-bit RISC processor that can be
configured and extended to match the application requirements. Designers
can customize the processor in two ways: configuration and extension.
Configuration is the ability to change existing features of the processor,
such as the main-memory and auxiliary-bus widths; the size and organization
of the instruction and data caches; or the size of local memory and DSP XY
memory. Extension is the ability to add entirely new features to the
processor, such as a 32x32-multiply instruction, a USB peripheral and
user-defined application-specific extensions. The resulting core is generated
as HDL code together with synthesis scripts, simulation make-files,
documentation and an automated test environment.
The Jazz DSP processor of Improv Systems is a configurable VLIW pro-
cessor for their proprietary Programmable System Architecture (PSA). Improv 
employs this architecture that can scale from a single, uniquely configured Jazz 
DSP processor core, to a system level platform implementation that consists 
of many of these uniquely configured Jazz processors in an interconnect struc-
ture defined by shared memory maps between the processors. Each processor 
instance can be customized by custom RTL blocks and instructions to create 
a designer-defined DSP core. The Jazz PSA Composer Tool Suite provides de-
signers with automatically generated synthesizable HDL code and a full set of 
software design tools including the debugger, simulator and profiler. 
Other design environments using architecture description languages include
the design environment of Retarget Compiler Technologies [6], the LISA
Processor Design Platform [7] [8], MetaCore [9] and PEAS-III [10].
The design environment of Retarget Compiler Technologies is based on the
processor modelling language nML. nML offers designers an abstraction level
for describing a processor architecture and instruction set (ISA), which
serves as an input to the various tools. nML captures the specification of
the processor's instruction set, together with sufficient structural
information to enable efficient compilation. Processor designers can describe
alternative instruction-set architectures in nML, and the support tools for
the corresponding architecture are automatically available. Once the
architecture has been optimized in nML, the processor description can be
translated automatically into an HDL model. This HDL description can be
synthesized with commercially available synthesis tools, for ASIC or FPGA
implementation.
The LISA Processor Design Platform (LPDP) tool-suite is based on the ma-
chine description LISA. Starting from architecture descriptions in the LISA lan-
guage, software development tools can be generated including HLL C-compiler, 
assembler, linker, simulator, debugger frontend. LISA is a language which aims 
at the formal description of programmable architectures, their peripherals, and 
external interfaces. The language elements of LISA enable the description of 
different aspects of processor architectures like behaviour, instruction set cod-
ing and syntax. The language LISA and its generic machine model are able to 
produce bit- and cycle/phase-accurate models of systems that consist of pro-
grammable architectures and peripheral hardware components. Moreover, syn-
thesizable HDL (VHDL, Verilog, SystemC) code of the target processor can be 
generated which can be processed by the standard synthesis tools. 
MetaCore is a DSP-oriented ASIP development system that can generate 
efficient ASIP using benchmark-driven design methodology. The heart of the 
MetaCore system is a predefined microarchitecture. The design style of the 
predefined microarchitecture is parameterized and pipelined. The architectural 
parameters include register file size, bus width, address space of each memory, 
and bit width of functional blocks. The specification of the target ASIP in the 
MetaCore system is described using the structural specification language MSL 
and behavioural specification language MBL. MSL is used to specify the data 
path structure of the target microarchitecture, while MBL is used to specify 
the architectural parameters and the behaviour of instructions for the target 
ASIP. The MSL description consists of declarations of hardware resources such 
as busses, latches, multiplexer, functional units, and interconnections among 
the hardware resources. A synthesis tool called SMART is used to translate
the given processor specification into the corresponding HDL code of the
target ASIP equipped with the user-defined application-specific instructions.
PEAS-III is an architectural level processor design environment based on
a micro-operation description of instructions. In the environment, designers
model the target processor with the following items:
1. Architecture parameters such as the number of pipeline stages and the
number of delayed branch slots
2. Declarations of resources to be included in the processor (e.g. ALUs,
registers)
3. Instruction format definitions which include interrupt conditions and the
number of execution cycles of interrupt
4. Micro-operation descriptions of instructions and interrupts
PEAS-III synthesizes the datapath and the control logic of the processor, and
generates a simulation model and synthesizable VHDL descriptions of the
processor.
1.2 Motivation 
We set out to design a speaker-independent isolated word speech recognizer
using the ASIP methodology. There are many ways of implementing a speech
recognition application, ranging from entirely hardware ASICs to completely
software implementations. The ASIC approach cannot cope with late design
changes, while the software approach largely ignores potential enhancements
to the hardware structure. The related ASIP work mentioned in the previous
section mainly emphasizes hardware architectures that are common to generic
DSP applications rather than to a particular application domain. It also
overlooks the possibility of optimizing the application from the software
developer's point of view. It is equally important to consider the design
from the software side, in the hope of converting a complex algorithm into a
simpler mathematical form, extracting application-specific instructions for
repetitive operations and fully integrating the inherent advantages of
software and hardware co-design. We try to find any optimizations that are
feasible in the ASIP from both the software and the hardware designers'
perspectives.
1.3 ASIP Design Methodologies 
The philosophy of ASIP is the exploitation of an optimized user-defined
instruction set and datapath. An ASIP is an instruction-level programmable
processor with an architecture and instruction set tuned to a specific
application. Designing an ASIP requires a good balance between flexibility
and performance to arrive at the best solution. It also covers
multi-disciplinary areas such as computer architecture and logic design, DSP
algorithm analysis, software programming and integration of the hardware
platform. As a consequence, designing an ASIP is a hardware/software
co-design task. An overview of the entire ASIP design flow is depicted in
Figure 1.2.
[Figure: flowchart. Step 1: application with design constraints (time, area
and performance); Step 2: application analysis and architecture design space
exploration (hardware); Step 3: instruction set architecture definition
(SW/HW co-design); Step 4: hardware (label partly illegible); Step 5:
verification; result: completed ASIP]
Figure 1.2: Overview of ASIP Design Flow
The ASIP design flow consists of five main steps.
1. The design flow starts with the consideration of the targeted
application, the specification and design constraints such as timing, area
and performance.
2. The application is analyzed and partitioned between software and
hardware. The algorithms are written in a high-level language such as C to
gain a thorough understanding of them. Possible architectures are also
studied to find one that meets the needs of the specific application at the
minimum hardware cost.
3. A new instruction set is defined to act as an interface between the
software implementation and the hardware platform.
4. Programs are written using the defined instructions, while a Hardware
Description Language (HDL) such as VHDL or Verilog is used for hardware
generation and logic synthesis.
5. The final design must be verified to see whether it still meets the
specification requirements and works under the different constraints.
1.4 Fundamentals of Speech Recognition 
Classification of ASR systems 
Speaker-dependent versus independent System 
Speaker-dependent recognition systems are trained to recognize a particular
speaker's voice. This is accomplished through a training session in which the
speakers record their voices beforehand. The system is tailored to a specific
speaker's pronunciation, inflections and accent. The recognition accuracy is
high since the speaker's voice is pre-stored in the database.
In contrast, speaker-independent systems allow any individual to speak
directly to the computer without first training the system on his or her own
voice. In other words, a speaker who talks into the microphone need not have
his or her voice pre-recorded in the database. Speaker-independent
approaches are the only ones that make sense where a speech training process
is impossible. It is often unrealistic to expect every user to go through the
trouble of training the recognition system before using an application. Such
a system may have lower recognition accuracy than a speaker-dependent
system, but it retains flexibility.
Small versus Large Vocabulary Size 
The real issue is how big a vocabulary is required by the application and how 
much of the vocabulary can be made active at one time. For example, an 
office dictation application might require a vocabulary of 30,000 words while 
an industrial inspection task might require only 300 words. The maximum
number of words that are active at one time can depend on memory
availability, recognition accuracy, and the response time required by
different applications. Clearly, the smaller the vocabulary, the less memory
is required, the higher the accuracy and the faster the response time.
Portable versus Non-Portable Hardware System 
Some speech recognition applications like manufacturing inspection or environ-
mental surveillance require portable hardware. Size, power, and memory loca-
tions are the major limitations of the design that must be considered. Other 
systems such as mainframe-based or stationary desktop computers tend to use 
more powerful processors and memory-intensive algorithms to achieve better 
speech recognition performance. 
The Properties of Speech 
Speech utterances are unpredictable, time-varying and random in nature. There
is a wide range of ways to represent a speech signal, including energy,
pitch, tone and other related parameters. Probably the most effective way of
modeling the speech signal is short-time stationary segmentation. By dividing
the speech into overlapping regions of equal duration, many separate speech
frames are formed. As the duration of each frame is very short (10 to 20 ms),
each segmented speech frame can be assumed to be stationary. Figure 1.3
illustrates the short-time stationary characteristic of the speech signal.
[Figure: overlapping speech frames N, N+1 and N+2, each converted into a
parameter vector of a given vector size, forming the sequence of speech
frames / vectors]
Figure 1.3: The Short-time Stationary Characteristic of Speech Signal
The final extracted vectors are the feature vectors that represent the special 
characteristics of the speech. It is these feature vectors that act as the input 
for the later process of speech recognition. 
1.5 Thesis outline 
The remainder of the thesis is organized as follows: 
Chapter 2 describes the complete procedure of automatic speech recognition
in theory, including the front-end feature extraction and the back-end
HMM-based Viterbi search.
Chapter 3 briefly introduces the design of the ASIP platform. Its focus is on
the hardware architecture, the functional description of each module and the
exploration of the datapath design space.
Chapter 4 presents the practical implementation of speech recognition on the
ASIP platform. It discusses the different optimization techniques, their
working mechanisms and the design considerations.
Chapter 5 demonstrates the effectiveness of implementing speech recognition
on the ASIP by reporting the performance in terms of speed and recognition
accuracy.
Chapter 6 summarizes the overall research work and gives suggestions for
future work in this area.
Chapter 2 
Automatic Speech Recognition 
2.1 Overview of ASR system 
This section describes an HMM-based Isolated Word Recognition (IWR) system,
which can be divided into two parts: front-end and back-end processing. The
front-end includes data acquisition and feature extraction. Taking its input
from a microphone and a codec that generates digitized speech data, it
produces the acoustic features needed for speech recognition. The back-end is
the HMM-based Viterbi search, in which Gaussian mixture computation and
memory accesses take place frequently. It finally decides which word was
spoken. Figure 2.1 shows the block diagram of the speech recognition system.
[Figure: block diagram. Acoustic speech enters the front-end (data
acquisition, feature extraction), followed by the back-end (pattern
comparison, decision making), which outputs the recognized speech; training
data produce the reference models used by the back-end]
Figure 2.1: Automatic Speech Recognition System
2.2 Theory of Front-end Feature Extraction 
The acoustic features generated by the signal processing front-end are
Mel-Frequency Cepstral Coefficients (MFCC), which are calculated using the
real cepstrum, defined as the inverse Fourier transform of the log spectrum:
\[ C_s(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log |S(\omega)| \, e^{j\omega n} \, d\omega \]
where $S(\omega)$ is the spectrum of the speech signal. The acoustic features
consist of 12 cepstral coefficients together with energy. The features are
energy normalized and cepstral mean normalized based on each short-time
segment. It is known that the performance of a speech recognition system can
be greatly enhanced by adding dynamic time derivatives to the basic static
parameters. However, only the static coefficients and the first-order dynamic
time derivative coefficients are included in the feature vectors, in
consideration of the tight memory and the extensive computation time.
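As a rough illustration of the definition above, the following sketch
approximates the real cepstrum of a short discrete frame by replacing the
integral with a DFT followed by an inverse DFT. The function names `dft` and
`real_cepstrum` and the small magnitude floor are assumptions made for this
example; they are not taken from the thesis implementation.

```python
import cmath
import math


def dft(x):
    """Direct O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_cepstrum(x, floor=1e-12):
    """Inverse DFT of the log magnitude spectrum of the frame x.

    `floor` (an assumption of this sketch) avoids log(0) for empty bins.
    """
    n = len(x)
    log_mag = [math.log(max(abs(v), floor)) for v in dft(x)]
    # For a real input the log magnitude is symmetric, so the result is real.
    return [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)).real / n
        for q in range(n)
    ]


frame = [1.0, 0.5, 0.25, 0.125]
cepstrum = real_cepstrum(frame)
louder = real_cepstrum([2.0 * v for v in frame])
```

Scaling the frame by a gain only shifts the zeroth cepstral coefficient by the
logarithm of that gain, which is one reason the log spectrum is a convenient
domain for normalization.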
[Figure: block diagram. Speech samples (8 kHz, 16-bit), pre-emphasis, Hamming
window, DFT computation, mel-frequency filter bank, logarithm of filter
energies]
Figure 2.2: The Algorithm for Front-end MFCC Computation
In practice, the mel-frequency cepstral coefficients can be computed using 
the algorithm shown in Fig 2.2. 
The digitized speech signal $s(n)$ derived from the data acquisition step is
16-bit linear sampled at 8 kHz. It is common practice to pre-emphasize the
signal by applying the first-order difference equation
\[ \tilde{s}(n) = s(n) - a \cdot s(n-1) \]
where $a$ is the pre-emphasis coefficient, which should be in the range
$0 < a < 1$. Then the pre-emphasized speech signal $\tilde{s}(n)$ is
segmented into frames of 20 ms. With an overlap of 10 ms between frames, each
frame is multiplied by a 160-point Hamming window $w(n)$:
\[ x(n) = w(n) \cdot \tilde{s}(n) \]
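The pre-emphasis, framing and windowing steps described here can be sketched
as follows. This is a minimal illustration, not the thesis code: the
coefficient value `a=0.97`, the treatment of the first sample and the
0.54/0.46 Hamming coefficients (the usual textbook definition) are
assumptions, while the 160-sample frame and 80-sample shift follow from 20 ms
frames with 10 ms overlap at 8 kHz.

```python
import math

FRAME_LEN = 160    # 20 ms at 8 kHz
FRAME_SHIFT = 80   # 10 ms overlap between consecutive 20 ms frames


def pre_emphasize(s, a=0.97):
    """First-order difference s(n) - a*s(n-1); s(-1) is taken as 0 (assumption)."""
    return [s[0]] + [s[n] - a * s[n - 1] for n in range(1, len(s))]


def hamming(n_points):
    """Standard Hamming window (coefficients assumed, not given in the text)."""
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
        for n in range(n_points)
    ]


def windowed_frames(s, length=FRAME_LEN, shift=FRAME_SHIFT):
    """Split s into overlapping frames and multiply each by the window."""
    w = hamming(length)
    return [
        [w[i] * s[start + i] for i in range(length)]
        for start in range(0, len(s) - length + 1, shift)
    ]


signal = [math.sin(2.0 * math.pi * 440.0 * n / 8000.0) for n in range(400)]
frames = windowed_frames(pre_emphasize(signal))
```

With 400 input samples this yields four frames starting at samples 0, 80, 160
and 240; trailing samples that do not fill a whole frame are simply dropped in
this sketch.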
Next the spectral magnitude of the windowed signal is computed by the
Discrete Fourier Transform (DFT). The magnitude is then processed by a series
of overlapping triangular filters, $H_m(k)$, which are centered at equally
spaced frequencies on the mel scale, to obtain an estimate of the
mel-spectrum. The logarithm is taken to produce a weighted log energy
$Y(m)$. This results in the computation of the total energy in the $m$th
band,
\[ Y(m) = \log_{10} \left[ \sum_{k=0}^{N-1} |X(k)| \, H_m(k) \right], \quad 0 < m < M \qquad (2.1) \]
where $X(k)$ is the DFT of the windowed speech signal, $H_m(k)$ are the
filter-bank coefficients, $N$ is the length of a frame and $M$ is the number
of filters.
The weighted log energy is real and even, so the inverse Fourier Transform
can be implemented as a Discrete Cosine Transform (DCT). This transformation
decorrelates the features so that diagonal covariance matrices can be used
instead of full covariance matrices. Cepstral coefficients have rather
different dynamics: the higher coefficients show the smaller variance. It is
desirable to have a constant dynamic range across coefficients for modeling
purposes. One way to reduce these differences is to apply liftering windows
which weight the cepstral coefficients $C(k)$ differently,
\[ \hat{C}(k) = C(k) \cdot \left\{ 1 + \frac{M}{2} \sin\left( \frac{\pi k}{M} \right) \right\}, \quad 0 \le k < M \qquad (2.2) \]
where $M$ is the number of filters. Finally, the first-order time derivatives
of the feature vectors are estimated to represent the dynamic characteristics
of speech signals.
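The filter-bank and DCT stages of this section can be sketched as below. The
sketch assumes the filter weights are already available as lists (the toy
rectangular filters are hypothetical, not mel-spaced triangles), uses a small
floor to avoid taking the logarithm of zero, and uses an unnormalized DCT-II,
which is one common choice and is not confirmed by the text.

```python
import math


def log_filterbank_energies(magnitude, filters, floor=1e-12):
    """Weighted log energy per band: log10 of sum_k |X(k)| * H_m(k)."""
    energies = []
    for weights in filters:
        total = sum(h * x for h, x in zip(weights, magnitude))
        energies.append(math.log10(max(total, floor)))
    return energies


def dct(log_energies, num_coeffs):
    """Unnormalized DCT-II of the log energies (assumed form of the DCT)."""
    m_total = len(log_energies)
    return [
        sum(
            y * math.cos(math.pi * k * (m + 0.5) / m_total)
            for m, y in enumerate(log_energies)
        )
        for k in range(num_coeffs)
    ]


# Hypothetical 4-bin magnitude spectrum and two toy filters.
magnitude = [1.0, 2.0, 3.0, 4.0]
toy_filters = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
log_energies = log_filterbank_energies(magnitude, toy_filters)
cepstral = dct(log_energies, 2)
```

A flat set of log energies maps to a single non-zero coefficient at index 0,
which illustrates how the DCT concentrates and decorrelates the filter-bank
outputs.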
2.3 Theory of HMM-based Speech Recognition 
2.3.1 Hidden Markov Model (HMM) 
Though the speech signal is well known for its variability, the spectral
properties of the frames of a pattern can be characterized by a Hidden Markov
Model (HMM), a well-recognized and widely used statistical method. The
underlying assumption of the HMM is that the speech signal can be well
modeled as a parametric random process.
2.3.2 The Typical Structure of the HMM
An HMM is characterized by the number of states $N$, the state transition
probability matrix $A$, the observation symbol probability distribution $B$
and the initial state distribution $\pi$. Given an HMM
$\lambda = (A, B, \pi)$ and an observation sequence $O$, we wish to calculate
the probability of the observation sequence, $P(O|\lambda)$. These
probabilities are a measure of how well the data match each state in the
model. With left-to-right topology, the formal model for the HMM is shown in
Figure 2.3.
[Figure: five-state left-to-right HMM with self-transition probabilities
a11 to a55 and output probabilities b_j(o_t) linking the states to the
observation sequence o1 to o7]
Figure 2.3: The Simplified View of Hidden Markov Model
The state transition probability matrix $A = \{a_{ij}\}$ gives the
probability of a transition from state $i$ to state $j$:
\[ A = \{a_{ij}\} = \begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} & a_{15} \\
a_{21} & a_{22} & a_{23} & a_{24} & a_{25} \\
a_{31} & a_{32} & a_{33} & a_{34} & a_{35} \\
a_{41} & a_{42} & a_{43} & a_{44} & a_{45} \\
a_{51} & a_{52} & a_{53} & a_{54} & a_{55}
\end{bmatrix} \]
Note that the sum of the probabilities of all transitions with the same
current state must be equal to one:
\[ \sum_{j=1}^{N} a_{ij} = 1 \]
The observation sequence is
$O = \{\vec{o}_t\} = (\vec{o}_1, \vec{o}_2, \ldots, \vec{o}_T)$, where
$1 \le t \le T$ and $T$ is the number of observations in the sequence. The
observation symbol probability $B = \{b_j(\vec{o}_t)\}$ is the probability
density function for each state, and its argument $\vec{o}_t$ is an acoustic
feature vector.
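The row-sum constraint on the transition matrix is easy to check in code. The
sketch below uses a hypothetical five-state left-to-right matrix; the
probability values are invented for illustration and do not come from the
trained models in the thesis.

```python
def is_row_stochastic(a, tol=1e-9):
    """True if every row is non-negative and sums to one."""
    return all(
        all(p >= 0.0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in a
    )


def is_left_to_right(a):
    """True if no transition goes back to an earlier state (a[i][j] == 0 for j < i)."""
    return all(a[i][j] == 0.0 for i in range(len(a)) for j in range(i))


# Hypothetical transition probabilities for a 5-state left-to-right HMM.
A = [
    [0.6, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.2],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
```

In a left-to-right topology most entries are zero, so only the self-loop and
forward transition of each state need to be stored, which matters when memory
is tight.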
2.3.3 Discrete HMMs and Continuous HMMs
There are two forms of HMM: the discrete observation HMM and the continuous
observation HMM.
The discrete observation HMM is restricted to the production of a
finite set of discrete observations. Vector quantization (VQ) is used to associate
each continuous feature vector with a discrete value. Because VQ maps a
continuous signal onto a finite set of values, it results in a serious loss of information.
In most applications the observations are representations of continuous signals,
so VQ of these signals can degrade the performance
significantly.
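The quantization step described above can be sketched as a nearest-neighbour search over a set of reference vectors. The following Python fragment is only a minimal illustration under stated assumptions, not the quantizer of any particular recognizer: the function name `quantize`, the reference vectors (called a codebook here) and the distance measure are all chosen for the example.

```python
def quantize(vector, codebook):
    """Return the index of the codeword closest to `vector` (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: dist2(vector, codebook[k]))

# Hypothetical 2-dimensional codebook with three codewords.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]

# The first two feature vectors differ, yet they collapse onto the same
# discrete symbol; this is the information loss referred to in the text.
symbols = [quantize(v, codebook) for v in [(0.9, 1.2), (1.3, 0.8), (3.0, 3.5)]]
```

After quantization the recognizer only sees the symbol indices, so the difference between the first two vectors can no longer influence the result.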
For the continuous HMM, the observations are continuous values (or vectors). It
is beneficial to model continuous speech signals directly with continuous
observation densities. The most general representation of a continuous observation
density is the Gaussian probability density function (pdf). Figure
2.4 shows the single-mixture distribution, which means that there is only one
pdf at each state in the HMM.
Figure 2.4: The HMM with Single-mixture Distribution 
A single pdf of this kind is not sufficiently flexible to accurately
model the variation that occurs between the different acoustic vectors that correspond
to a state. This is particularly true if the models are used to characterize
speech from a number of speakers. Thus, Gaussian double mixtures are typically
used to model broad sources of variability. Figure 2.5 shows the double-mixture
distribution, which means that there are two pdfs at each state in the HMM to
accurately model the highly varied speech.
Figure 2.5: The HMM with Double-mixture Distribution 
The most popular form of the output probability density function (pdf) is the
Gaussian density, which is defined as
\[
b_j(o_t) = \mathcal{N}(o_t; \mu_j, U_j)
= \frac{1}{\sqrt{(2\pi)^D \det(U_j)}}
\exp\left( -\frac{1}{2} (o_t - \mu_j)^{\top} U_j^{-1} (o_t - \mu_j) \right)
\qquad (2.3)
\]
where $o_t$ is the observation vector of dimensionality $D$ at time $t$, and
$\mathcal{N}$ is a Gaussian pdf with mean vector $\mu_j$ and covariance
matrix $U_j$, whose determinant is $\det(U_j)$, in state $j$.
In practice, a single Gaussian distribution is not sufficient to model
the speech parameters appropriately. Thus a Gaussian mixture density, which is a
weighted sum of Gaussian densities, is often used:
\[
b_j(o_t) = \sum_{k=1}^{M} c_{jk}\, \mathcal{N}(o_t; \mu_{jk}, U_{jk})
\qquad (2.4)
\]
where $c_{jk}$ is the mixture coefficient for the $k$th mixture in state $j$, $M$ is the
number of mixtures per state and $\mathcal{N}$ is a multivariate Gaussian with mean
vector $\mu_{jk}$ and covariance matrix $U_{jk}$ for the $k$th mixture
in state $j$. The mixture coefficients satisfy
\[
\sum_{k=1}^{M} c_{jk} = 1, \qquad c_{jk} \ge 0, \quad 1 \le k \le M
\]
Mixture densities are flexible enough to approximate arbitrarily shaped
densities sufficiently well if an appropriate number of components is used.
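Equations 2.3 and 2.4 can be evaluated directly for small examples. The sketch below is a hedged illustration: it assumes a diagonal covariance matrix (so that the determinant is the product of the variances and the inverse is element-wise), which is a simplification not stated in the text, and the names `gaussian_pdf` and `mixture_pdf` as well as the numerical values are hypothetical.

```python
import math

def gaussian_pdf(o, mean, var):
    """Equation 2.3 for a diagonal covariance matrix with diagonal `var` (assumption)."""
    d = len(o)
    det = math.prod(var)  # determinant of a diagonal matrix
    quad = sum((x - m) ** 2 / v for x, m, v in zip(o, mean, var))
    return math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** d * det)

def mixture_pdf(o, weights, means, variances):
    """Equation 2.4: weighted sum of M Gaussian densities for one state."""
    return sum(c * gaussian_pdf(o, m, v)
               for c, m, v in zip(weights, means, variances))

# Hypothetical state with M = 2 mixtures and D = 1.
weights = [0.5, 0.5]           # mixture coefficients, sum to one
means = [[0.0], [2.0]]
variances = [[1.0], [1.0]]
density = mixture_pdf([0.0], weights, means, variances)
```

With a full covariance matrix the quadratic form would need a matrix inverse; the diagonal case is shown only because it keeps the example short.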
2.3.4 The Three Basic Problems for HMMs
Given an HMM $\lambda = (A, B, \pi)$ and the observation sequence
$O = (o_1, o_2, \ldots, o_T)$, there are three basic problems that must be solved before
the model can be used in real-world applications. These problems are listed below:
1. How do we efficiently compute the probability of the observation sequence,
$P(O|\lambda)$, given the observation sequence $O$ and the model $\lambda$?
2. How do we choose the corresponding state sequence $q = (q_1, q_2, \ldots, q_T)$
that best describes the observations, given the observation sequence $O$ and
the model $\lambda$?
3. How do we adjust the model $\lambda$ to maximize $P(O|\lambda)$?
Problem 1 is the recognition problem: given an output sequence and a model,
what is the probability that the model could have created the sequence? The
problem can also be viewed as calculating a score that measures how well a given
model matches a given observation sequence. The recognition problem can be solved
by comparing new data with the models of known signals. If there are $V$ words to
be recognized, then there will be $V$ distinct HMMs, one modeling each word
separately. The recognition result for the unknown word is based on the final scores
of the word models against the given observation sequence (the input feature
vectors). The word whose model score is the highest is selected as the
recognition output.
Problem 2 is the sequence problem: given an output sequence and a model,
what is the most likely sequence of states that could have created
the output sequence? Segmenting the training sequence of each word into
states addresses the sequence problem, because it refines the
model and improves its capability of modeling the spoken word sequences.
Problem 3 is the training problem: given the output sequence and the topology,
how can the parameters of a model be adjusted to maximize the probability
that the model creates the output sequence? The training problem can be solved
by finding the optimal model parameters for the known data, that is, by training
on the reference data to create the best models.
2.3.5 Probability Evaluation 
To implement speech recognition, we need to calculate the probability of the
observation sequence $O = (o_1, o_2, \ldots, o_T)$ given the HMM $\lambda$, i.e., $P(O|\lambda)$.
The most straightforward way of calculating it is to enumerate every
possible state sequence of length $T$.
Consider one fixed state sequence $q = (q_1, q_2, \ldots, q_T)$, where $q_1$ is the
initial state. The probability of the observation sequence is obtained by summing
the probability over all possible state sequences $q$:
\[
P(O|\lambda) = \sum_{q_1, q_2, \ldots, q_T}
\pi_{q_1} b_{q_1}(o_1)\, a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T)
\qquad (2.5)
\]
From Equation 2.5, the probability computation can be interpreted
as follows. Initially (at time $t = 1$) we are in state $q_1$ with probability
$\pi_{q_1}$, and generate the symbol $o_1$ in this state with probability $b_{q_1}(o_1)$. The
time changes from $t$ to $t+1$ (time 2) and we make a transition from state
$q_1$ to state $q_2$ with probability $a_{q_1 q_2}$ and generate symbol $o_2$ with probability
$b_{q_2}(o_2)$. This process continues until we make the last transition (at time $T$)
from state $q_{T-1}$ to state $q_T$ with probability $a_{q_{T-1} q_T}$, and generate symbol $o_T$
with probability $b_{q_T}(o_T)$ [11].
The calculation of $P(O|\lambda)$ involves on the order of $2T \cdot N^T$ calculations, because
there are $N$ possible states that can be reached at every $t = 1, 2, \ldots, T$ (i.e.,
there are $N^T$ possible state sequences), and for each such state sequence about
$2T$ calculations are required for each term in the summation of Equation 2.5.
Computing the probability in this way is infeasible even for small
values of $N$ and $T$. For example, for an HMM with $N = 2$ states and a speech signal of
$T = 200$ frames, there are $2 \cdot 200 \cdot 2^{200} \approx 6.4 \times 10^{62}$ computations. Fortunately, an
efficient algorithm called the Viterbi algorithm can be used to solve the problem. The main
difference is that, instead of summing the probabilities of transitions from all
states as in Equation 2.5, only the most probable
transition is considered and the rest are discarded. One more step is then needed
to trace back from the most probable final state, which reveals the most probable
state sequence.
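The exhaustive evaluation of Equation 2.5 can be written down directly for a toy model, which also makes its cost visible. This is a sketch under assumptions: a discrete-observation HMM is used for brevity (each row of `B` is the symbol probability table of one state), and the model values and the function name `brute_force_likelihood` are hypothetical.

```python
from itertools import product

def brute_force_likelihood(pi, A, B, obs):
    """Sum Equation 2.5 over all N**T state sequences of a discrete HMM."""
    n_states = len(pi)
    total = 0.0
    for q in product(range(n_states), repeat=len(obs)):
        p = pi[q[0]] * B[q[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[q[t - 1]][q[t]] * B[q[t]][obs[t]]
        total += p
    return total

# Hypothetical 2-state model with two discrete symbols (0 and 1).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]

likelihood = brute_force_likelihood(pi, A, B, [0, 1])  # only 2**2 = 4 sequences

# Operation count quoted in the text for N = 2 and T = 200.
operations = 2 * 200 * 2 ** 200
```

For two observations the loop visits four state sequences; for 200 observations it would have to visit $2^{200}$, which is why the enumeration is only usable as a reference for very short sequences.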
2.4 The Viterbi Search Engine 
The Viterbi algorithm aims at finding the best state sequence, $q = (q_1, q_2, \ldots, q_T)$,
for the given observation sequence $O = (o_1, o_2, \ldots, o_T)$. The
highest score $\delta_t(i)$ is defined as
\[
\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}}
P[q_1 q_2 \ldots q_{t-1},\; q_t = i,\; o_1 o_2 \ldots o_t \mid \lambda]
\]
$\delta_t(i)$ is the best score, i.e. the highest probability along a single path at time $t$
that accounts for the first $t$ observations and ends in state $i$. The array $\psi_t(j)$
is used to keep track of the argument that maximized the probability for each
$t$ and $j$. The complete procedure for implementing the Viterbi algorithm [11]
can be stated in three steps:
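1. Initialization
\[ \delta_1(i) = \pi_i b_i(o_1), \quad 1 \le i \le N \]
\[ \psi_1(i) = 0 \]
2. Recursion
\[ \delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]\; b_j(o_t), \quad 1 \le j \le N, \; 2 \le t \le T \]
\[ \psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}] \]
3. Termination
\[ P^* = \max_{1 \le i \le N} [\delta_T(i)] \]
\[ q_T^* = \arg\max_{1 \le i \le N} [\delta_T(i)] \]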
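The three steps above, together with the trace-back through the stored arguments, can be sketched in a few lines of Python. The fragment is an illustration only: it assumes a discrete-observation HMM so that the emission probabilities can be read from a table `B`, and the model values and the function name `viterbi` are hypothetical.

```python
def viterbi(pi, A, B, obs):
    """Return (best path probability, best state sequence) for a discrete HMM."""
    n = len(pi)
    # 1. Initialization
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    psi = [[0] * n]
    # 2. Recursion: keep only the most probable predecessor of every state
    for t in range(1, len(obs)):
        new_delta, back = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            back.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][obs[t]])
        delta = new_delta
        psi.append(back)
    # 3. Termination, followed by the trace-back through psi
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return delta[last], path

# Hypothetical 2-state model with two discrete symbols (0 and 1).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
best_prob, best_path = viterbi(pi, A, B, [0, 1, 1])
```

Each time step costs $N^2$ multiplications and comparisons, so the total work grows linearly with $T$ instead of exponentially.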
It should be clear that a lattice structure efficiently implements the computation
of the Viterbi procedure, as shown in Figure 2.6. The recursion forms the basis of
the Viterbi algorithm. The algorithm can be visualised as finding the
best path through a matrix where the vertical dimension represents the states
of the HMM and the horizontal dimension represents the frames of speech (i.e.
time). Each large dot in the picture represents the probability of observing
that frame at that time and each arc between dots corresponds to a transition
probability. The path always proceeds in the direction with the higher probability. For
example, the starting point is at time 1 and state A (i.e. 1A). From there the path
can go in two directions: 1A → 2B or 1A → 2A.
Suppose the route 1A → 2B has a higher probability than the route
1A → 2A; then the path goes upwards at 45 degrees to
reach point 2B. The paths are grown from left to right, column by column.
The comparison process continues until it reaches both the final state and
the last speech frame (i.e. 5D).
[Figure: lattice with the HMM states on the vertical axis and speech frames 1 to 5 (time) on the horizontal axis]
Figure 2.6: The Lattice Structure of Simplified Viterbi Search 
Alternative Viterbi Implementation 
The Viterbi algorithm of the previous section requires frequent multiplications,
which is unfavorable for hardware implementation. By taking
logarithms of the model parameters, every multiplication is converted into an
addition, so no multiplication operations are needed.
The main procedures of the modified Viterbi algorithm [11] are shown below:
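1. Preprocessing
\[ \tilde{\pi}_i = \log(\pi_i), \quad 1 \le i \le N \]
\[ \tilde{b}_i(o_t) = \log[b_i(o_t)], \quad 1 \le i \le N, \; 1 \le t \le T \]
\[ \tilde{a}_{ij} = \log(a_{ij}), \quad 1 \le i, j \le N \]
2. Initialization
\[ \tilde{\delta}_1(i) = \log(\delta_1(i)) = \tilde{\pi}_i + \tilde{b}_i(o_1), \quad 1 \le i \le N \]
\[ \psi_1(i) = 0 \]
3. Recursion
\[ \tilde{\delta}_t(j) = \log(\delta_t(j)) = \max_{1 \le i \le N} [\tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}] + \tilde{b}_j(o_t) \]
\[ \psi_t(j) = \arg\max_{1 \le i \le N} [\tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}], \quad 1 \le j \le N, \; 2 \le t \le T \]
4. Termination
\[ \tilde{P}^* = \max_{1 \le i \le N} [\tilde{\delta}_T(i)] \]
\[ q_T^* = \arg\max_{1 \le i \le N} [\tilde{\delta}_T(i)] \]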
The calculation required for this alternative implementation is on the order of
only $N^2 T$ additions plus the calculation for preprocessing. The preprocessing
cost is negligible for most systems because it is performed only once and the values
are saved.
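The four steps of the logarithmic formulation can be sketched as follows. As before, this is a hedged example rather than the implementation used on the processor: a discrete-observation HMM and floating-point logarithms are assumed, zero probabilities are mapped to minus infinity, and the model values and the name `log_viterbi` are hypothetical.

```python
import math

def safe_log(p):
    """log(p), with log(0) represented as minus infinity."""
    return math.log(p) if p > 0 else float("-inf")

def log_viterbi(pi, A, B, obs):
    """Return (best log score, best state sequence) using additions only."""
    n = len(pi)
    # 1. Preprocessing: performed once, the values would then be stored
    lpi = [safe_log(p) for p in pi]
    lA = [[safe_log(a) for a in row] for row in A]
    lB = [[safe_log(b) for b in row] for row in B]
    # 2. Initialization
    delta = [lpi[i] + lB[i][obs[0]] for i in range(n)]
    psi = [[0] * n]
    # 3. Recursion: additions and comparisons, no multiplications
    for t in range(1, len(obs)):
        back = [max(range(n), key=lambda i: delta[i] + lA[i][j]) for j in range(n)]
        delta = [delta[back[j]] + lA[back[j]][j] + lB[j][obs[t]] for j in range(n)]
        psi.append(back)
    # 4. Termination and trace-back
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return delta[last], path[::-1]

# Hypothetical 2-state model with two discrete symbols (0 and 1).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
log_score, log_path = log_viterbi(pi, A, B, [0, 1, 1])
```

Because the logarithm is monotonic, the comparisons select the same predecessors as in the probability domain, and the log scores do not underflow for long observation sequences.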
2.5 Isolated Word Recognition (IWR) 
To build a speaker-independent HMM-based isolated word recognizer, assume
that there is a vocabulary of $V$ words to be recognized and that each word is modeled by
a distinct HMM. Two parts are crucial for the implementation
of the isolated word recognizer:
1. Offline Training 
2. Real-time Isolated Word Recognition 
Offline Training 
Each word in the vocabulary has a training set of $K$ utterances of the word.
Given a set of training utterances corresponding to a particular word model, the
parameters of that model $(A, B, \pi)$ can be determined automatically by the
HTK toolkit, a robust and efficient toolkit that is primarily designed for building HMM-based
speech processing tools, in particular recognizers. Provided that a sufficient
number of representative utterances of each word can be collected, an HMM $\lambda_v$
can be constructed which implicitly models all sources of variability inherent in
real speech. The training procedure is shown in Figure 2.7.
[Figure: K training utterances for each of Word 1 to Word V are used to estimate the models λ_1, λ_2, ..., λ_V]
Figure 2.7: The Training Process for Generating HMM Reference Models 
Real-time Isolated Word Recognition 
For each unknown word to be recognized, the processing shown in Figure 2.8
must be carried out. The front-end processing of the real-time speech extracts
the useful features and represents them in the form of an observation
sequence $O$. The likelihood of each model generating the observation sequence
$O$ of the unknown word is calculated, and the most likely model identifies the word.
In other words, the model likelihoods $P(O|\lambda_v)$, $1 \le v \le V$, must be calculated for all
possible models. The recognition result is found by
selecting the word whose model likelihood is the highest.
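The selection rule can be sketched independently of how the likelihoods are computed. In the fragment below, `recognize` simply picks the vocabulary entry with the highest score; `toy_score` is a purely illustrative stand-in for $P(O|\lambda_v)$, and the vocabulary words, templates and function names are all hypothetical.

```python
def recognize(observations, word_models, score):
    """Return the vocabulary word whose model gives the highest score."""
    return max(word_models, key=lambda word: score(observations, word_models[word]))

def toy_score(observations, template):
    """Stand-in for a model likelihood: negative squared distance to a template."""
    return -sum((o - template) ** 2 for o in observations)

# Hypothetical vocabulary of V = 3 words, each "model" reduced to one number.
word_models = {"yes": 1.0, "no": -1.0, "stop": 4.0}
result = recognize([0.8, 1.1, 0.9], word_models, toy_score)
```

In a real recognizer the score function would be the Viterbi score of the observation sequence against each word HMM, so the cost of recognition grows linearly with the vocabulary size $V$.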
[Figure: the real-time speech passes through feature extraction to give the unknown observation sequence O, which is scored against the reference HMMs for Word 1 to Word V]
Figure 2.8: Using HMMs for Isolated Word Recognition 
Chapter 3 
Design of ASIP Platform 
Our ASIP processor [12] mainly focuses on digital signal processing applications
such as speech, audio and video. In order to meet the real-time requirements
of various applications, the ASIP processor has been specially designed for
applications which are computationally intensive and repetitive.
The design goal of the processor is to maximize the degree of reusability by
re-programming it with an efficient application-specific instruction set, while
minimizing the impact on timing and power consumption when the architectural
parameters are changed.
The proposed architecture of the processor is divided into four parts, namely
instruction fetch, instruction decode, datapath and register file system, as shown
in Figure 3.1. The selected architectural parameters are listed in Table 3.1.
Architectural Parameter    Value
Instruction Addressing     16 bits
Instruction Width          24 bits
Data Addressing            2 x 16 bits
Data Width                 16 bits
Register File              2 x 64 x 16 bits
Table 3.1: The Architectural Parameters of the ASIP 
Figure 3.1: The Organization of the ASIP Architecture 
3.1 Instruction Fetch 
The Instruction Fetch Unit (IFU) is responsible for reading instructions from 
program memory, passing them to the instruction decoder and updating the 
program counter. 
The structure of the instruction fetch unit is shown in Figure 3.2. It con-
sists of five major modules, including program counter, address selector, branch 
controller, loop controller and subroutine controller. 
[Figure: the branch, loop and subroutine controllers, the address selector, the program counter, the instruction memory and the instruction decoder]
Figure 3.2: The Structure of Instruction Fetch Unit
The IFU begins to operate autonomously as soon as reset is released. The program
counter is a register that stores the current position of the program. This
stored value is used as the instruction address for fetching instructions. It
is also passed to the instruction decoder as
a reference for branching and other program flow control activities. When a
branch is executed, the fetch unit must stop fetching instructions from the current
stream and change the program counter to the new value. When a loop
or subroutine is called, the value of the program counter is stored on the stack;
it is retrieved later when the loop completes or the subroutine
returns. The program counter updates its content autonomously with the
new address provided by the address selector. The address selector controls the
content of the program counter. It calculates the new address based on the
addresses from the instruction decoder or other addresses, together with the control signals
of the branch, loop and subroutine operations. Depending on the status of the
processor and the requests from those modules, the address selector supplies
the appropriate address to update the program counter. In this way, branches,
loops and subroutine calls can be realised.
3.2 Instruction Decode 
The instruction decoder is responsible for converting the fetched instruction into 
useful information for different parts of the processor. It has three
tasks:
1. Identify the fetched instruction 
2. Interpret the encoded part of the instruction and generate the correspond-
ing control signals and opcodes 
3. Dispatch the decoded information to the corresponding modules 
In ASIP design, different application-specific instructions are introduced to
accommodate the needs of different applications. Inevitably, the instruction decoder
needs to be redesigned frequently, which is unfavourable for design reuse. To
meet our design goal, the changing part must be isolated in order to minimize
the modification effort. Therefore, a highly modularized instruction decoder is
designed. The structure of the instruction decoder is shown in Figure 3.3. The
decoding of parallel instructions and complex instructions is separated from the
decoding of base instructions. Two internal decoder modules are dedicated to
these application-specific instructions. The contents of the parallel and complex
instruction decoders can be changed without altering the others.
[Figure: a stage 1 instruction decoder followed by the stage 2 parallel, complex and base instruction decoders, each producing opcodes]
Figure 3.3: The Structure of Instruction Decoder 
Secondly, the whole decoder is divided into two levels in order to match the
pipeline organization. Naturally, the decoding of instructions that involve the
datapath is assigned to the second level, which is closer to the datapath. This
partitioning has two advantages:
1. The number of pipeline registers can be reduced, because the data
passed along the pipeline stages is the encoded part of the instruction
instead of the many control signals and opcodes.
2. The unused modules can be turned off efficiently.
Table 3.2 lists the modules that are activated in the execution of different 
classes of instructions. It shows that the second level can be disabled when 
the processor is doing flow control, configuration and memory manipulation. 
Moreover, the first level can activate one of the modules in second level only, 
as the base instructions, parallel instructions and complex instructions are in-
dependent. 
Instruction Classification | Fetch Unit | Processor Control Unit | Address Generation Unit | General Register File | Special Register File | Datapath
Data Processing \/ \/ V 




Flow Control < ^ 
Configuration >/ 
Memory ^ \J 
Manipulation 
Table 3.2: The Processor Usage of Different Functional Units 
Referring to Figure 3.3 and Table 3.2, the ASIP can be partitioned into
different units for easier design and maintenance. The Instruction Fetch Unit
(IFU) and Processor Control Unit (PCU) are responsible for the flow control
operations, namely break, jump, loop, and subroutine call and return. The
Address Generation Unit (AGU) calculates the new address based on the various
addressing modes. The Datapath executes the corresponding operation via the
General Register File. All data processing, bitwise manipulation and boolean
operations require the input values to be stored in registers before they can
be processed. The Special Registers allow the programmer to initialize
parameters at the beginning of a code segment. The programmer can
predefine parameters such as the memory/register start address, size and step.
To make good use of this partition, the class of the fetched instruction has to 
be identified as soon as possible. Hence, the instruction encoding is first based 
on the classification of the instructions then the number of bits needed for the 
arguments, so that the fetched instruction can be classified within the first few 
bits. 
3.3 Datapath 
The base datapath is designed for general DSP application. Similar to other 
general digital signal processors, the number representation is two's complement 
and the heart of this datapath is a multiplier-and-accumulator (MAC). The 
structure of the base datapath is shown in Figure 3.4. In the centre is a 16x16 
40-bit MAC. It is made of 3:2 compressors in Wallace tree configuration, an 
adder and a 40-bit register for accumulation. The most significant eight bits 
are guard bits for avoiding overflow. The adder in the MAC is also responsible 
for addition and subtraction operations. 
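The role of the eight guard bits can be checked with a small bit-accurate model of the accumulation. The sketch below is an assumption-laden illustration of two's complement wrap-around in Python, not a description of the hardware: the helper names `wrap` and `mac` are hypothetical.

```python
ACC_BITS = 40  # 32-bit product plus eight guard bits

def wrap(value, bits):
    """Interpret `value` modulo 2**bits as a signed two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def mac(acc, x, y):
    """acc + x * y for signed 16-bit operands, wrapped to the 40-bit accumulator."""
    assert -32768 <= x <= 32767 and -32768 <= y <= 32767
    return wrap(acc + x * y, ACC_BITS)

# Accumulate 256 products of the largest possible magnitude, 2**30 each.
acc = 0
for _ in range(256):
    acc = mac(acc, -32768, -32768)

# Without guard bits, two such products already overflow a 32-bit register.
acc32 = wrap(2 * (-32768 * -32768), 32)
```

The 40-bit accumulator still holds the exact sum after 256 worst-case products, whereas a 32-bit accumulator wraps to a negative value after the second one.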
For shift operations, a barrel shifter is used in the datapath. It can shift
the accumulator by at most thirty-two bits to the left or to the right, in either an
arithmetic or a logical manner. The shift distance can be defined by the immediate value
from the instruction or by a value stored in the register file. In addition, this
shifter is also used to implement the normalization operation. After the exponent
instruction, the exponent of the current accumulated value is stored in a
special register. This stored value is used as the shift distance
when the normalization instruction is executed.
There are also other modifiers for the accumulator : 1) logical unit for 
bitwise logic manipulation including AND, OR, XOR and NOT; 2) absolute 
unit for working out the absolute value of the accumulator; 3) negation unit for 
converting the accumulator to its opposite sign. The modified value is stored 
back to the accumulator. 
[Figure: operand selector, comparison unit (CMP), 16x16 multiplier, shifter, 40-bit adder/subtractor, EXP, ABS, logic and NEG units, and the accumulator]
Figure 3.4: The Structure of the Base Datapath
Besides data processing, there is a comparison unit for comparing two
values in the register file, or comparing the accumulator with one value stored
in the register file. The comparison unit can report six conditions: 1) equal; 2)
not equal; 3) greater than; 4) less than; 5) greater than or equal; 6) less than
or equal. The result is stored as conditional flags in a special register.
A complete instruction set description is tabulated in Appendix A. 
3.4 Register File Systems 
3.4.1 Memory Hierarchy 
In common with other current DSPs, the platform uses a dual Harvard architec-
ture where one program memory and two separate data memories (labelled X 
and Y) are used. This avoids conflicts between program and data fetches, and 
many DSP operations map naturally onto dual memory space. For example, 
the data for convolution or cross correlation can be stored separately in X and 
Y memories. 
[Figure: the X and Y data memories (64K x 16 bit) connect to the X and Y register banks (64 x 16 bit)]
Figure 3.5: The Structure of the Memory Hierarchy 
In order to allow high degree of data reuse, a large register file is highly 
recommended. The memory hierarchy is shown in Figure 3.5. The register file 
is partitioned into X and Y banks for matching the organization of the data 
memory. To interface with the data memories, a load store unit is used. It is 
designated for reading from and writing to the data memories in bulk. 
3.4.2 Register File Organization 
To supply enough operands to the parallel datapath without introducing any
conflict, the register file is designed to be multi-ported. However, as the
number of ports grows, the performance of the register file deteriorates in terms
of delay, area and power consumption. Previous research [13] presented the
impact on the performance of a register file of increasing the number of arithmetic
units. It showed that for $N$ arithmetic units, the area of the register file
grows as $N^3$, the delay as $N^{3/2}$, and the power consumption as $N^3$. The
main reason is that more arithmetic units need more ports for parallel execution,
which implies a rapid growth in the complexity of the address
decoder and of the interconnection between the arithmetic units and the register
file. Inevitably, this has a great impact on the platform when scaling up
the parallel datapath. Partitioning the register file into multiple banks was reported
to be an effective way of slowing down the performance deterioration
[13][14][15][16]. The design philosophy of multi-banked register files is to distribute
the ports to different register banks, so that the number of ports per bank can
be reduced. This method reduces the complexity of the interconnection,
but the drawback is that each port is confined to accessing its corresponding bank
only. The primary challenge of this scheme is to ensure that the number of
simultaneous accesses to any bank does not exceed the ports available on that bank.
Our design is based on the observation that, for many DSP algorithms, the data
which need to be accessed simultaneously can be uniformly assigned to different banks.
In other words, data conflicts can be avoided if the data are carefully
assigned to suitable banks.
It is good practice to study the data access pattern of DSP algorithms. Our
focus is on the convolution and correlation algorithms, which are fundamental
and commonly implemented in most DSP applications. They also exhibit a large
amount of parallelism, which makes them well suited to the analysis of parallel
data access patterns.
The general mathematical form of both the convolution and correlation
algorithms can be written as
\[ y(n) = \sum_{i=0}^{N-1} P(i)\, Q(n \pm i) \]
where $P(n)$ and $Q(n)$ are the two digitized input signals, $N$ is the length of
$P(n)$ or $Q(n)$, and $y(n)$ is the output signal. It is assumed that $N = 8$ and there are
four functional units (FU) in the parallel datapath. In Table 3.3, the bold items
are the arithmetic operations. The operation mul represents a multiplication,
mac a multiplication-and-accumulation, and add an
addition. Each functional unit has its own accumulator for temporary
storage, denoted by acc, and $t$ indicates a particular time instant.
Observing the data dependency of the access pattern in Table 3.3, it is
possible to further partition each register bank into two blocks. The first block
contains the data with index 2n and the second one contains the data with index 2n+1.
In this arrangement, each functional unit possesses a block of the X bank and a
block of the Y bank. The functional units only need to access data from their local
blocks, and there is no need to access data across other blocks. As a result,
each block is only required to provide one read port for the functional unit.
t    FU0                      FU1
0    mul P0 Q(n)              mul P1 Q(n±1)
1    mac P2 Q(n±2) Acc0       mac P3 Q(n±3) Acc1
2    mac P4 Q(n±4) Acc0       mac P5 Q(n±5) Acc1
3    mac P6 Q(n±6) Acc0       mac P7 Q(n±7) Acc1
4    add Acc0 Acc1
Table 3.3: Exploiting Data Parallelism by Observing Data Access Pattern 
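The access pattern of Table 3.3 and the even/odd block partitioning described above can be checked with a short simulation. The sketch is a software illustration under assumptions that are not taken from the text: the "+" form of the sum is used, the index of Q wraps around circularly, and the function names are hypothetical.

```python
def split_blocks(v):
    """Block 0 holds the even-indexed data (2n), block 1 the odd-indexed data (2n+1)."""
    return [v[0::2], v[1::2]]

def correlate_direct(P, Q, n):
    """Reference result: y(n) = sum of P(i) * Q(n + i), with circular indexing (assumption)."""
    N = len(P)
    return sum(P[i] * Q[(n + i) % N] for i in range(N))

def correlate_banked(P, Q, n):
    """Two functional units, each reading one X block and one Y block only."""
    N = len(P)                      # N must be even
    xb, yb = split_blocks(P), split_blocks(Q)
    acc = [0, 0]
    for t in range(N // 2):         # t = 0 is the mul step, later steps are mac
        for fu in (0, 1):
            j = (n + 2 * t + fu) % N
            # FU `fu` always reads the Y block with the same parity
            assert j % 2 == (n + fu) % 2
            acc[fu] += xb[fu][t] * yb[j % 2][j // 2]
    return acc[0] + acc[1]          # final add of the two accumulators

P = [1, 2, 3, 4, 5, 6, 7, 8]
Q = [3, 1, 4, 1, 5, 9, 2, 6]
results = [correlate_banked(P, Q, n) for n in range(len(P))]
```

The inner assertion documents the point of the partitioning: within one output sample, a functional unit never needs data from more than one X block and one Y block, so a single read port per block is sufficient.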
The corresponding structure of the register file is shown in Figure 3.6. $P(n)$ and
$Q(n)$ are supposed to be stored in the X and Y banks respectively. This structure
illustrates a way to assemble a register file with four read ports from four individual
register blocks, each with one local read port. It is easy to scale up the architecture to
exploit further data parallelism, as shown in Figure 3.7. For $J$ functional units and
data length $N$, the indices of the X and Y blocks run from $\frac{N}{J}n$, $\frac{N}{J}n + 1$,
$\frac{N}{J}n + 2$, ... to the last index $\frac{N}{J}n + (\frac{N}{J} - 1)$, where $N$ is a power of 2
and $J$ is an even number. This structure
is particularly suitable for computationally intensive DSP algorithms because
many calculations can be done simultaneously, at the cost of more hardware
resources.
[Figure: the address generation unit drives an X bank split into X(2n) and X(2n+1) blocks and a Y bank split into Y(2n) and Y(2n+1) blocks, supplying Operand 0 to Operand 3]
Figure 3.6: The Structure of Register File for Data Parallelism
[Figure: the address generation unit drives X and Y banks split into blocks X(Nn/J), X(Nn/J+1), ..., X(Nn/J+(N/J-1)) and Y(Nn/J), Y(Nn/J+1), ..., Y(Nn/J+(N/J-1)); selectors supply the operands]
Figure 3.7: The Large-Scale View of Register File for Data Parallelism 
3.4.3 Special Registers 
The special registers serve several purposes. Firstly, they hold the status
flags indicating the occurrence of overflow, carry, equal/not equal, less
than/greater than, and less than or equal/greater than or equal for instructions
like CMP, ADD and SUB. Secondly, they allow the programmer to initialize
the parameters used to access memory and registers through the Load Store Unit
(LSU). For memory/register access and instructions like LOAD and STORE,
parameters such as the memory/register start address, size and step can be initialized
at the beginning of the code segment. It is also feasible to set up a specialized
addressing mode (e.g. circular or bit-reversed) through the special registers.
The organization of the special registers provides a high degree of flexibility. It is
applicable to base instructions as well as parallel instructions. The details of the
special registers are shown in Appendix B.
3.4.4 Address Generation 
The function of the address generation unit is to provide the addresses of the
operands required to carry out the DSP operations. Since many instructions,
such as the multiply instruction, require more than one operand for their
execution, the address generation unit should work fast enough to provide the
addresses within the time constraints imposed by the instruction execution
requirements.
Many address calculations keep the address generation unit
busy. The variety of special addressing modes, including indirect addressing,
circular (modulo) addressing and bit-reversed addressing, would burden the
main ALU with a heavy workload. In order to compute those addresses
efficiently, a separate arithmetic unit is required for address computation in the DSP
implementation.
[Figure: address datapath with offset, modulo buffer length and an add/subtract unit]
Figure 3.8: The Datapath of Address Generation Unit 
The diagram of the address generation unit is shown in Figure 3.8. It typically
involves the following operations:
1. Getting the immediate address from a register, memory location or
instruction operand.
2. Incrementing or decrementing the current address by 1 for the program counter to
fetch the instruction from memory.
3. Incrementing or decrementing the current address by an offset for jump and
loop operations and subroutine calls.
4. Generating a new address by applying the circular addressing algorithm to
handle a continuous stream of incoming data samples.
5. Generating a new address by applying the bit-reversed addressing mode to
implement some DSP algorithms like the Fast Fourier Transform (FFT).
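The last two operations in the list above can be sketched in software. The fragment below shows one common way of computing circular and bit-reversed addresses; it is an illustration only, and the function names, the buffer layout (a base address plus a length) and the example values are assumptions rather than the hardware algorithm of the address generation unit.

```python
def circular_next(addr, step, base, length):
    """Advance `addr` by `step` inside the circular buffer [base, base + length)."""
    return base + (addr - base + step) % length

def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`, as used for FFT data ordering."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

# A hypothetical 4-word circular buffer starting at address 4.
addr = 4
visited = []
for _ in range(6):
    visited.append(addr)
    addr = circular_next(addr, 1, 4, 4)

# Access order of an 8-point FFT with bit-reversed addressing.
fft_order = [bit_reverse(i, 3) for i in range(8)]
```

With circular addressing the pointer wraps back to the start of the buffer without any explicit bounds check in the program, which is what makes it convenient for a continuous stream of samples.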
3.4.5 Load and Store 
Load and store are the two operations for accessing the data memory, and they
are the only bridge between the data memory and the register file. A dedicated
hardware unit, the load store unit, is responsible for transferring data from the
data memory to the register file and vice versa. The duties of the load store unit are
providing addresses, generating control signals and dispatching the fetched data.
To accommodate the X-Y bank organization of the memory, the load store unit
has independent address datapaths for the two banks. For each bank, there is
one address datapath for memory, one for reading from the register file and one for
writing to the register file. The address datapaths are the same as those in the address
generation unit, and the special registers attached to the address datapaths
are also organized as in Appendix B.
Chapter 4 
Implementation of Speech 
Recognition on ASIP 
4.1 Hardware Architecture Exploration 
An Application Specific Instruction Set Processor (ASIP) differs from a general
DSP processor because an ASIP targets a specific class of applications.
Designing an ASIP involves a thorough analysis of the algorithm in order to
optimize both the structure of the datapath and the application-oriented
instruction set. The following sections discuss the techniques applied to a
speaker-independent isolated word recognition system.
4.1.1 Floating Point and Fixed Point 
DSP algorithm implementations deal with signals and coefficients. To use a
fixed-point DSP device efficiently, one must represent the feature coefficients
in fixed-point 2's complement representation. Typically, the coefficients are
fractional numbers. Floating-point numbers provide a wide dynamic range by
normalizing all numbers into the same format with sign bit, exponent and
mantissa. Though floating-point computation offers good precision, it suffers
from speed degradation, since it requires more execution time and more complex
hardware than fixed-point computation. For example, floating-point addition
requires the exponents to be aligned before the mantissas are added, while
floating-point multiplication requires the addition of the exponents besides
the multiplication of the mantissas.
Our ASIP is a 16-bit fixed-point processor. The only format allowed for each
number is a 16-bit integer, which ranges from 0 to 65535 for an unsigned number
or from -32768 to +32767 for a signed number.
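The mapping of fractional coefficients onto 16-bit signed integers can be
sketched as follows. The Q15 scaling (1 sign bit, 15 fractional bits) and the
function names are assumptions made for illustration; the thesis only states
that each number is a 16-bit integer.

```python
# Minimal sketch: fractional coefficients stored as 16-bit signed integers.
# Assumption: Q15 scaling (value = integer / 2**15); not stated in the thesis.

INT16_MIN, INT16_MAX = -32768, 32767


def to_q15(x):
    """Quantize a fraction in [-1, 1) to a signed 16-bit integer."""
    n = int(round(x * 32768))
    # Clamp so that values at or beyond the range still fit in 16 bits.
    return max(INT16_MIN, min(INT16_MAX, n))


def from_q15(n):
    """Convert a signed 16-bit integer back to the fraction it represents."""
    return n / 32768.0
```

Under this assumed scaling the quantization error of an in-range coefficient is
at most half of one least significant bit, i.e. 2^-16.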
4.1.2 Multiplication and Accumulation 
Multiply-accumulate (MAC) operations are useful for matrix operations, such as
convolution for filtering, dot products and even polynomial evaluation. They
implement mathematical functions of the form A + BC. The MAC is important for
most DSP applications, which require the accumulation of the products of a
series of successive multiplications. In our design, the MAC unit consists of a
16 × 16 multiplier followed by a 40-bit adder/subtracter unit and an additional
register called the accumulator. In general, the product of a 16 × 16
multiplication is 32 bits. The extra most significant 8 bits in the
adder/subtracter unit and accumulator are guard bits. When repetitive MAC
operations are performed, the accumulated sum grows with each MAC operation
if the inputs of the multiplier are not normalized. This increases the number
of bits required to represent the result without loss of accuracy. One way of
handling this growth is to provide extra bits in the accumulator. These extra
bits, the so-called guard bits, allow for the growth of the accumulated sum as
more and more product terms are added up. The datapath of the MAC unit is
shown in Figure 4.1.
Figure 4.1: The Datapath of MAC unit (16-bit inputs A and B, multiplier with
32-bit product, 40-bit adder/subtracter, and 40-bit accumulator made up of 8
guard bits and 32 bits)
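The effect of the guard bits can be illustrated with a small behavioural
model. This is a sketch only: the function names are hypothetical and the
model ignores timing and the subtract path of the adder/subtracter.

```python
# Behavioural sketch of a MAC with a configurable accumulator width.
# Names (wrap_signed, mac, dot) are illustrative, not taken from the thesis.


def wrap_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc, a, b, acc_bits=40):
    """One step acc + a*b with 16-bit signed inputs and an acc_bits register."""
    assert -32768 <= a <= 32767 and -32768 <= b <= 32767
    return wrap_signed(acc + a * b, acc_bits)


def dot(xs, ys, acc_bits=40):
    """Accumulate the products of two equally long sequences."""
    acc = 0
    for a, b in zip(xs, ys):
        acc = mac(acc, a, b, acc_bits)
    return acc
```

With 8 guard bits, 256 products of the largest positive inputs still
accumulate exactly, whereas a 32-bit accumulator would wrap around.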
The most critical part of the MAC unit is the multiplier. An efficient
multiplication algorithm [17] can enhance the computational speed remarkably.
The multiplier is divided into three parts, namely the Booth encoding, the
Wallace tree and the final addition. All partial products are first generated
using Booth encoding, then reduced by blocks of 4:2 Wallace tree compressors,
each of which compresses 4 inputs into 2 outputs (sum and carry), and finally
summed by the 32-bit addition.
Booth Encoding 
Booth encoding takes both the multiplicand (A) and the multiplier (B) as its
inputs. The generation of each partial product (pp) depends on the encoding of
the multiplier, as shown below:
Figure 4.2: The Encoding of Booth Using Multiplier
1. A zero is appended below the least significant bit of multiplier B (i.e.
below bit 0).
2. Two sign-extension bits are appended above the most significant bit of
multiplier B (i.e. above bit 15).
3. Every three consecutive bits form one "word" and act as the input of a
Booth encoder. Adjacent words overlap by one bit, which gives nine words in
total.
4. The partial product (pp) is the output of the Booth encoder, which depends
on the multiplicand A. The Booth encoding table is shown as follows:
Figure 4.3: The Booth Encoding Logic (a multiplexer controlled by in2, in1 and
in0 selects one of -2A, 2A, -A, A and 0 as the partial product)

in2   in1   in0   output (pp)
0     0     0     0
0     0     1     A
0     1     0     A
0     1     1     2A
1     0     0     -2A
1     0     1     -A
1     1     0     -A
1     1     1     0
Table 4.1: The Booth Encoding Table
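The encoding steps and Table 4.1 can be checked with a short software model.
This is a sketch under the assumption that the encoder is a standard radix-4
(modified) Booth encoder; the function names are illustrative.

```python
# Software sketch of radix-4 Booth encoding for a 16-bit signed multiplier.
# Assumption: standard modified Booth encoding, as suggested by Table 4.1.

# (in2, in1, in0) -> multiple of the multiplicand A
BOOTH_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_digits(b, width=16):
    """Return the Booth digits of signed multiplier b, least significant first."""
    u = b & ((1 << width) - 1)            # two's-complement bit pattern
    sign = (u >> (width - 1)) & 1
    # bits[0] is the appended zero, bits[k + 1] is b_k, then two sign bits.
    bits = [0] + [(u >> k) & 1 for k in range(width)] + [sign, sign]
    digits = []
    for i in range(0, width + 1, 2):      # nine overlapping 3-bit words
        in0, in1, in2 = bits[i], bits[i + 1], bits[i + 2]
        digits.append(BOOTH_TABLE[(in2, in1, in0)])
    return digits


def booth_multiply(a, b, width=16):
    """Sum the partial products digit * A, each weighted by 4**position."""
    return sum(d * a * 4 ** j for j, d in enumerate(booth_digits(b, width)))
```

Each digit selects one partial product, so a 16-bit multiplier yields nine
partial products instead of sixteen.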
Wallace Tree Compressor 
In this multiplier, both the Booth algorithm and the Wallace tree array block
[18] are used to speed up the multiplication process by enhancing parallelism.
The Wallace array block is made of a number of 4:2 compressors, each of which
has four inputs and two outputs, as shown in Figure 4.4. The inputs are
partial products (X1, X2, X3, X4) and the outputs are the sum (S) and carry
(C). The 4:2 compressors sum up the partial products concurrently. In other
words, each compressor compresses four partial products into two new partial
products (S, C) simultaneously.
Figure 4.4: The Diagram of 4:2 Compressor
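The compression step can be sketched at word level. The sketch assumes that a
4:2 compressor is built from two cascaded carry-save (3:2) stages, which is a
common construction but is not confirmed by the text; operands are taken as
non-negative integers for simplicity.

```python
# Word-level sketch of 4:2 compression using two carry-save (3:2) stages.
# Assumption: the compressor is equivalent to two cascaded full-adder rows.


def csa_3to2(x, y, z):
    """Reduce three operands to a sum word and a carry word (no propagation)."""
    s = x ^ y ^ z
    c = ((x & y) | (y & z) | (x & z)) << 1
    return s, c


def compress_4to2(x1, x2, x3, x4):
    """Reduce four partial products to two words whose sum is unchanged."""
    s1, c1 = csa_3to2(x1, x2, x3)
    return csa_3to2(s1, c1, x4)
```

No carry ripples through a word during compression; only the final adder has
to propagate carries.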
The final 32-bit Addition 
The final 32-bit adder is a carry select adder (CSA) constructed from eight
4-bit carry lookahead adders (CLA) to propagate the carry at high speed. It
pre-computes two possible outputs (sums), one with carry = 0 and one with
carry = 1, and selects the appropriate output according to the actual carry.
The structure of the CSA is shown in Figure 4.5.
Figure 4.5: The CSA with 4-bit CLA 
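The carry select principle can be modelled in a few lines. The sketch below is
behavioural only: each 4-bit block is added with ordinary integer addition
rather than lookahead logic, and the function name is hypothetical.

```python
# Behavioural sketch of a 32-bit carry select adder made of eight 4-bit blocks.
# Each block prepares both candidate sums; the incoming carry selects one.


def carry_select_add32(a, b):
    """Return (32-bit sum, carry out) of two unsigned 32-bit operands."""
    carry = 0
    result = 0
    for blk in range(8):
        xa = (a >> (4 * blk)) & 0xF
        xb = (b >> (4 * blk)) & 0xF
        sum0 = xa + xb          # candidate assuming carry-in = 0
        sum1 = xa + xb + 1      # candidate assuming carry-in = 1
        chosen = sum1 if carry else sum0
        result |= (chosen & 0xF) << (4 * blk)
        carry = chosen >> 4     # carry into the next block
    return result, carry
```

In hardware both candidates of every block are computed in parallel, so only
the selection has to wait for the carry from the lower block.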
Combining the three preceding parts, namely the Booth encoder, the Wallace
array block and the final 32-bit addition, gives the fast multiplication
process illustrated in Figure 4.6.
Figure 4.6: The Whole Process of Fast Multiplication (16-bit multiplier B and
multiplicand A, Booth encoder and partial product generator, 4:2 compressors
of the Wallace block, 32-bit adder (CLA & CSA), 32-bit product)
4.1.3 Pipelining
Pipelining is a technique that exploits the parallelism among the instructions
in a sequential instruction stream. The platform is a typical pipelined
processor with five stages, i.e. the execution of an instruction is divided
into five stages, and at most five instructions are in execution during any
single clock cycle.
The organization of the pipeline is illustrated in Figure 4.7. The first stage
is instruction fetch (IF). In this stage, the instruction fetch unit provides
the instruction address to the instruction memory and fetches the
corresponding instruction into the processor. Then the processor moves to the
decode stage (DEC). The fetched instruction is decoded into commands and
operands. Meanwhile, the address generation unit calculates the operand
addresses. The third stage is the read stage (RD). Its major task is to read
the operands from the register file. Similarly, the load store unit accesses
the register file to prepare a store operation. Some instruction decoding work
that is related to the datapath is also completed in this stage. The fourth
stage is the execution stage (EX). All data processing and Boolean
manipulation tasks are performed there. The load store unit accesses the data
memory in this stage. The last stage is writeback (WB). The processed data is
written back to the register file, and the load store unit puts the loaded
data into the register file.
Figure 4.7: The Pipeline Organization of the Platform
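The overlap of instructions in the five stages can be visualised with a small
scheduling sketch. It assumes an ideal pipeline without stalls or hazards,
which is a simplification; the function name is illustrative.

```python
# Sketch of stage occupancy in an ideal five-stage pipeline (no stalls assumed).

STAGES = ["IF", "DEC", "RD", "EX", "WB"]


def pipeline_schedule(n_instructions):
    """Return {cycle: {stage: instruction index}} for back-to-back issue."""
    schedule = {}
    for instr in range(n_instructions):
        for s, stage in enumerate(STAGES):
            # Instruction `instr` enters IF in cycle `instr` and advances
            # by one stage per cycle.
            schedule.setdefault(instr + s, {})[stage] = instr
    return schedule
```

Under this assumption n instructions finish in n + 4 cycles, and no cycle ever
holds more than five instructions.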
4.1.4 Memory Architecture 
The simplest processor memory structure is a single bank of memory with a
single set of address and data lines. Both program instructions and data are
stored in this single memory. This is called the Von Neumann architecture,
which is common in most non-DSP processors.
Such a memory architecture, however, cannot supply the large amount of data
needed by DSP applications at a sufficient access speed. Thus, our design
adopts the Harvard architecture, which holds program instructions and data
separately in three distinct memories: the instruction ROM, the X data memory
and the Y data memory. The instructions are stored in the ROM, while the
feature vectors of the real-time incoming speech and the speech reference
parameters are stored in the X and Y data memories respectively.
4.1.5 Saturation Logic 
Many DSP applications involve summing up a series of values using the MAC
instruction. As the values are accumulated, the magnitude of the sum grows.
Eventually, the sum may exceed the maximum value that can be represented by
the accumulator or register, and overflow occurs. On the other hand, if the
sum falls below the minimum value that can be represented, underflow is said
to occur.
There are two solutions to the overflow and underflow problems. Firstly,
saturation arithmetic can be used to represent numbers that are out of range.
The overflow value is replaced by the largest value that can be represented by
the processor, while the underflow value is replaced by the smallest value.
Secondly, modular arithmetic can be seen as an alternative solution. When the
values of data lie outside the range of the largest and smallest representable
numbers, these values are wrapped around into the range using modular
arithmetic relative to the smallest representable number. Modular arithmetic
is sometimes referred to as 'clock arithmetic' for integers, where numbers
'wrap around' after they reach a certain value (the modulus). For example,
while 8 + 6 equals 14 in conventional arithmetic, the answer is 2 in modulo 12
arithmetic, because 2 is the remainder after dividing 14 by the modulus 12. A
wrapped result can therefore be far from the true value, so it is better to
use saturation arithmetic instead of modular arithmetic to avoid this
wrap-around error. The datapath of the saturation logic is shown in
Figure 4.8. It is particularly useful in the parallel instruction, where it
enhances the accuracy by eliminating the arithmetic right shift operation that
scales down the values significantly.
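The difference between the two solutions can be seen in a short numerical
sketch for 16-bit signed values. The function names are illustrative and the
sketch is not a model of the actual saturation datapath.

```python
# Sketch comparing saturation and wrap-around for 16-bit signed results.

INT16_MIN, INT16_MAX = -32768, 32767


def saturate16(value):
    """Clamp an out-of-range result to the nearest representable value."""
    return max(INT16_MIN, min(INT16_MAX, value))


def wrap16(value):
    """Wrap an out-of-range result around, relative to the smallest value."""
    return (value - INT16_MIN) % 65536 + INT16_MIN
```

For the sum 30000 + 10000 = 40000, saturation returns 32767, an error of 7233,
whereas wrap-around returns -25536, which even has the wrong sign.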
Figure 4.8: The Datapath of Saturation Logic (Y is compared with the maximum
value 32767 and a multiplexer outputs the saturated Y)
4.1.6 Specialized Addressing Modes
Specialized addressing modes, namely the circular addressing mode and the
bit-reversed addressing mode, can be employed to speed up real-time DSP
implementations. They are used in DSP algorithms such as filtering and the
Fast Fourier Transform (FFT), which handle a large amount of data and exhibit
a fixed address index pattern. The following sections discuss these
specialized addressing modes in detail.
Circular Addressing Mode 
The circular addressing mode is usually used to handle a continuous stream of
incoming data samples (coefficients). A buffer acts as the storage place for
those samples. With a general (linear) buffer, the length of the buffer
depends on the number of samples: the more samples, the larger the buffer.
This wastes a lot of memory on coefficients that are no longer needed and are
never re-used. To reserve the tight memory for other purposes, it is better to
keep the data in a circular buffer instead of a general buffer. In a circular
buffer, successive data samples are stored in sequential buffer locations
until the end of the buffer is reached. The next incoming sample is then saved
at the beginning of the buffer. The operation of the circular addressing mode
can be represented in the mathematical form
\[ \text{next address} = (\text{current address} \pm \text{step}) \bmod \text{size} \]
where step is the distance moved by the pointer (PTR), which holds the current
address, and size is the buffer length. The PTR can be incremented or
decremented. Two additional registers mark the start address and the end
address of the circular buffer. They are called the start address register
(SAR) and the end address register (EAR). There are four cases in total for
calculating the updated PTR of the circular buffer:
Condition 1   Condition 2                      Updated PTR
SAR < EAR     (current address + step) > size  (current address + step) - size
SAR < EAR     (current address + step) < size  (current address + step) + size
SAR > EAR     (current address - step) > size  (current address - step) - size
SAR > EAR     (current address - step) < size  (current address - step) + size
Table 4.2: The Pointer Update Algorithm of Circular Addressing Mode
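The modulo form of the pointer update can be sketched as follows. The sketch
assumes a buffer that occupies the inclusive address range from SAR to EAR
with SAR <= EAR; it illustrates the formula above and is not a transcription
of the four hardware cases in Table 4.2.

```python
# Sketch of a circular pointer update.
# Assumption: the buffer occupies addresses sar..ear inclusive, sar <= ear.


def next_circular(ptr, step, sar, ear):
    """Advance ptr by step (negative to decrement), wrapping inside the buffer."""
    size = ear - sar + 1
    # Work with the offset from the start address so that the modulo
    # operation wraps at the buffer boundaries.
    return sar + (ptr - sar + step) % size
```

For an 8-entry buffer at addresses 0x100 to 0x107, incrementing the pointer
from 0x107 by 1 gives 0x100, and decrementing it from 0x100 by 1 gives 0x107.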
Bit-reversed Addressing Mode 
Special data access capability is important in the implementation of the Fast
Fourier Transform (FFT) algorithm. The FFT is a fast algorithm that transforms
a time-domain signal into its frequency-domain representation. However, the
FFT operation has a drawback: it takes the input in natural order but produces
the output in an irregular order, as shown in Figure 4.9. Note that the
bit-reversed pattern is just a mirror reflection of the original input
pattern.
Input (natural order)   Output (bit-reversed)
0 0 0                   0 0 0
0 0 1                   1 0 0
0 1 0                   0 1 0
0 1 1                   1 1 0
1 0 0                   0 0 1
1 0 1                   1 0 1
1 1 0                   0 1 1
1 1 1                   1 1 1
Figure 4.9: The Output of FFT Algorithm
Fortunately, there is a traceable rule for generating the bit-reversed
pattern. In Figure 4.9, the length of the FFT is 8. The operation of the
bit-reversed addressing mode can be represented in the mathematical form
\[ \text{next address} = \text{current address} + \tfrac{1}{2}(\text{FFT length}) \]
In the example, the start address is 0 and half of the FFT length is 8/2 = 4,
thus the next address is equal to current address 000 + half FFT length
100 = 100, with no carry generated. The next address then becomes the current
address. However, current address 100 + half FFT length 100 does produce a
carry. Note that the carry is propagated in the reverse direction (i.e. from
the most significant bit towards the least significant bit), as shown below.
                   No carry      Carry propagates in reverse direction
Current address      0 0 0         1 0 0
Half FFT length    + 1 0 0       + 1 0 0
Next address         1 0 0         0 1 0
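The reverse-carry addition can be sketched bit by bit. The function names are
illustrative; the sketch reproduces the 8-point example above and is not a
description of the actual adder circuit.

```python
# Sketch of bit-reversed address generation by reverse-carry addition.


def reverse_carry_add(addr, inc, bits):
    """Add inc to addr with the carry moving from the MSB towards the LSB."""
    result = 0
    carry = 0
    for k in range(bits - 1, -1, -1):     # start at the most significant bit
        total = ((addr >> k) & 1) + ((inc >> k) & 1) + carry
        result |= (total & 1) << k
        carry = total >> 1                # passed on to the next lower bit
    return result


def bit_reversed_sequence(fft_length):
    """Generate all addresses by repeatedly adding half of the FFT length."""
    bits = fft_length.bit_length() - 1    # fft_length is assumed a power of 2
    addr, seq = 0, []
    for _ in range(fft_length):
        seq.append(addr)
        addr = reverse_carry_add(addr, fft_length // 2, bits)
    return seq
```

For an FFT length of 8, the generated sequence is 0, 4, 2, 6, 1, 5, 3, 7,
which matches the bit-reversed column of Figure 4.9.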
4.1.7 Repetitive Operation 
Loops are complicated tasks for instruction fetching. A dedicated controller
is used to maintain the current status of a loop and to handle the address
calculation. The structure of the controller is shown in Figure 4.10. The
internal control logic responds to requests from the instruction decoder and
controls the operation of the stack and the loop counter. The status of the
loop is temporarily stored in the stack. The content of the stack is shown in
Figure 4.11. The first field, which is one bit wide, indicates whether the
loop operation is currently running. The second field indicates the number of
instruction lines covered by the loop. The third field indicates the start
position of the loop. The fourth field states the number of iterations left.
When a static loop is set up, the current status and the setup data of the
loop (start address and size) are pushed into the stack and the loop tag is
set to one. The number of iterations is stored in the loop counter.
Figure 4.10: The Structure of Loop Controller (blocks: Control Logic, Loop
Counter, Stack and Selector; inputs: PC, loop control from decoder, loop setup
data; outputs: start address, end address)
loop tag | loop size | start address | iteration
Figure 4.11: The Content of Stack
Based on the setup data, the control logic can work out the end position of
the loop and the current relative position of the program in the loop. At the
end of each iteration, the loop counter is decreased by one, and the program
counter is updated with the start address of the loop. Once the loop counter
reaches zero, the stored loop status is popped and the previous status is
restored.
Sometimes, the application may be so complicated that multiple nested levels
of static loops are required. The maximum number of levels depends on the
number of entries of the stack. Hence the size of the stack should be
considered in the application analysis in order to match the behaviour of the
target application.
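The interaction of stack, loop counter and program counter can be sketched
with a simplified software model. The class and method names are hypothetical,
the stack depth of 4 is an arbitrary assumption, and the model keeps the
running loop in a separate record rather than reproducing the exact stack
layout of Figure 4.11.

```python
# Simplified sketch of a zero-overhead loop controller with a status stack.
# Assumptions: hypothetical names, stack depth 4, one instruction per address.


class LoopController:
    def __init__(self, depth=4):
        self.depth = depth    # stack entries = supported outer nesting levels
        self.stack = []       # saved status of suspended outer loops
        self.active = None    # [start address, size, iterations left]

    def setup(self, start, size, count):
        """Set up a static loop; the status of a running loop is pushed."""
        if self.active is not None:
            if len(self.stack) >= self.depth:
                raise OverflowError("loop nesting exceeds stack depth")
            self.stack.append(self.active)
        self.active = [start, size, count]

    def next_pc(self, pc):
        """Return the address of the instruction that follows pc."""
        while self.active is not None:
            start, size, remaining = self.active
            if pc != start + size - 1:    # not at the end of the loop body
                break
            if remaining > 1:             # more iterations: jump back
                self.active[2] = remaining - 1
                return start
            # Loop finished: restore the status of the enclosing loop, if any.
            self.active = self.stack.pop() if self.stack else None
        return pc + 1
```

A loop of two instructions starting at address 1 with three iterations yields
the fetch sequence 1, 2, 1, 2, 1, 2 and then continues at address 3.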
4.2 Software Algorithm Implementation 
4.2.1 Implementation Using Base Instruction Set 
The front-end, which extracts useful feature vectors from the incoming speech,
is processed in advance and its output is saved in the X memory. All of the
word models are trained offline with the HTK toolkit and pre-stored in the Y
memory for the back-end Viterbi search operation. Both the feature vectors and
the model parameters are stored as 16-bit fixed-point numbers.
The front-end processing is straightforward and less computationally
intensive than the back-end Viterbi search. The procedure for feature
extraction is standardized and its processing time is negligible compared with
the time-consuming Viterbi search. Thus, we mainly focus on the back-end
Viterbi search in our project.
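Recall from Sections 2.3.3 and 2.4 that the simplest implementation is the
single-mixture Viterbi search. The recognition result is found by calculating
the output probability density function recursively
\[
b_j(o_t) = \mathcal{N}(o_t; \mu_j, U_j)
         = \frac{1}{\sqrt{(2\pi)^D \det(U_j)}}
           \, e^{-\frac{1}{2}(o_t - \mu_j)^T U_j^{-1} (o_t - \mu_j)},
\qquad 1 \le j \le N,\ 2 \le t \le T \tag{4.1}
\]
\[
\delta_t(j) = \max\left[\delta_{t-1}(i)\, a_{ii},\ \delta_{t-1}(i)\, a_{ij}\right] \cdot b_j(o_t)
            = \max\left[\delta_{t-1}(i)\, a_{ii},\ \delta_{t-1}(i)\, a_{ij}\right]
              \frac{1}{\sqrt{(2\pi)^D \det(U_j)}}
              \, e^{-\frac{1}{2}(o_t - \mu_j)^T U_j^{-1} (o_t - \mu_j)} \tag{4.2}
\]
where $o_t$ is the observation vector of dimensionality $D$ at time $t$, and
$\mathcal{N}$ is a Gaussian pdf with mean vector $\mu_j$ and covariance matrix
$U_j$ in state $j$.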
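The probability value is small, and further multiplications lead to an
intermediate result which is so small that the computer cannot represent it
accurately. Underflow is expected in this situation. To solve the problem, the
(natural) logarithm of the probability is taken. Equation 4.2 can be rewritten
as
\[
\begin{aligned}
\tilde{\delta}_t(j) = \ln[\delta_t(j)]
  &= \ln\left[\max\left[\delta_{t-1}(i)\, a_{ii},\ \delta_{t-1}(i)\, a_{ij}\right] \cdot b_j(o_t)\right] \\
  &= \max\left[\tilde{\delta}_{t-1}(i) + \tilde{a}_{ii},\ \tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}\right] + \ln b_j(o_t) \\
  &= \max\left[\tilde{\delta}_{t-1}(i) + \tilde{a}_{ii},\ \tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}\right] + \ln \mathcal{N}(o_t; \mu_j, U_j) \\
  &= \max\left[\tilde{\delta}_{t-1}(i) + \tilde{a}_{ii},\ \tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}\right]
     + \ln\left[\frac{1}{\sqrt{(2\pi)^D \det(U_j)}}\right]
     - \frac{1}{2}(o_t - \mu_j)^T U_j^{-1} (o_t - \mu_j)
\end{aligned} \tag{4.3}
\]
where a tilde denotes the natural logarithm of the corresponding quantity.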
From Equation 4.3, the first term
$\max[\tilde{\delta}_{t-1}(i) + \tilde{a}_{ii},\ \tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}]$
is the selection of the higher probability from two possible search paths
(stay in the same state or advance to the next state).
$\tilde{\delta}_{t-1}(i)$ has been calculated previously because the search is
a recursive process. Both $\tilde{a}_{ii}$ and $\tilde{a}_{ij}$ are model
parameters that can be pre-stored in the memory. The second term is simply a
constant, since the vector size $D$ and the covariance $U_j$ are known before
the implementation. Regarding the computational efficiency of different
arithmetic operations, addition and multiplication are faster than division on
most hardware platforms. To avoid division, we need to pre-compute the
inverted data. We can pre-calculate and pre-store the constant
$1/\sqrt{(2\pi)^D |U_j|}$ instead of $|U_j|$ for the Gaussian probability
computation. The quadratic form in the third term,
$\frac{1}{2}(o_t - \mu_j)^T U_j^{-1} (o_t - \mu_j)$, is composed of the
1 × 26 vectors $o_t$, $\mu_j$ and $U_j^{-1}$, as there are 26 elements
(D = 26) in each parameter vector. Thus, the third term of Equation 4.3 can be
analyzed in the following way:
\[
\frac{1}{2}(o_t - \mu_j)^T U_j^{-1} (o_t - \mu_j)
= \frac{1}{2}
  \begin{bmatrix} o_{t1} - \mu_{j1} \\ o_{t2} - \mu_{j2} \\ \vdots \\ o_{t26} - \mu_{j26} \end{bmatrix}^T
  U_j^{-1}
  \begin{bmatrix} o_{t1} - \mu_{j1} \\ o_{t2} - \mu_{j2} \\ \vdots \\ o_{t26} - \mu_{j26} \end{bmatrix}
= \frac{1}{2} \sum_{i=1}^{26} (o_{ti} - \mu_{ji})^2 \, U_{ji}^{-1}
\]
where $U_{ji}^{-1}$ denotes the $i$-th element of the 1 × 26 vector
$U_j^{-1}$.
The complicated matrix form of the third term can thus be reduced to simple
subtractions, multiplications and additions. Since the expression is an
accumulated sum, the MAC instruction can be used to speed up the operations.
Recall that all arithmetic operations take their inputs from registers. The
efficiency of the operation can be further enhanced if specific registers are
assigned to store the input values. The register addresses can then be
generated (incremented by 1) automatically by the address generation unit,
assuming that the registers are allocated in sequential order. This bypasses
the datapath of the ALU, because no ALU addition is needed to find the next
address. To reduce the number of registers used, the circular addressing mode
can be utilized to overwrite the previously loaded input data with the next
incoming ones. This is feasible because the vector size is fixed (26
parameters) for each speech frame.
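The reduction of the quadratic form to a MAC loop can be sketched in floating
point. The sketch assumes a diagonal covariance stored as a vector of
variances, as suggested by the 1 × 26 parameter vectors; the function names
are illustrative, and the fixed-point scaling and right shifts of the actual
design are not modelled.

```python
# Floating-point sketch of the log Gaussian with pre-stored inverse variances.
# Assumption: diagonal covariance, one variance per feature element.
import math


def quad_term(o, mu, inv_var):
    """J = -1/2 * sum((o_i - mu_i)^2 * inv_var_i), written as a MAC loop."""
    acc = 0.0
    for o_i, mu_i, iv_i in zip(o, mu, inv_var):
        d = o_i - mu_i            # subtraction
        acc += (d * d) * iv_i     # two multiplications, then accumulation
    return -0.5 * acc


def log_gaussian(o, mu, var):
    """ln N(o; mu, diag(var)) using only the pre-computed constants at run time."""
    inv_var = [1.0 / v for v in var]   # computed offline: no run-time division
    log_const = -0.5 * (len(o) * math.log(2.0 * math.pi)
                        + sum(math.log(v) for v in var))
    return log_const + quad_term(o, mu, inv_var)
```

Both `inv_var` and `log_const` depend only on the trained model, so they would
be computed once and stored alongside the other model parameters.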
To model the wide variability of speech appropriately, we can consider
implementing the speech recognition process with a double-mixture Viterbi
search. Recall from Section 2.3.3 that the output probability density function
with multiple mixtures is defined as
\[
b_j(o_t) = \sum_{k=1}^{M} c_{jk}\, \mathcal{N}(o_t; \mu_{jk}, U_{jk})
         = \sum_{k=1}^{M} c_{jk} \frac{1}{\sqrt{(2\pi)^D \det(U_{jk})}}
           \, e^{-\frac{1}{2}(o_t - \mu_{jk})^T U_{jk}^{-1} (o_t - \mu_{jk})} \tag{4.4}
\]
where $c_{jk}$ is the mixture coefficient for the $k$-th mixture in state $j$,
$M$ is the number of mixtures per state and $\mathcal{N}$ is a multivariate
Gaussian with mean vector $\mu_{jk}$ and covariance matrix $U_{jk}$ for the
$k$-th mixture component in state $j$. To have a simple expression, rewrite
Equation (4.1) as
\[ \mathcal{N} = B e^{J} \tag{4.5} \]
with $B$ the normalization constant and $J$ the exponent. Thus Equation (4.4)
can be expressed as
\[
b_j(o_t) = \sum_{k=1}^{M} c_k B_k e^{J_k} = \sum_{k=1}^{M} G_k e^{J_k}
\]
\[
\ln[b_j(o_t)] = \ln\left(\sum_{k=1}^{M} G_k e^{J_k}\right) \tag{4.6}
\]
where $G_k = c_k B_k$ is constant for the $k$-th mixture of each state and can
be pre-calculated and pre-stored in ROM. A simple and accurate method can be
developed for the log-add operation ($\ln \sum_k G_k e^{J_k}$) in
Equation (4.6). As we use a double-mixture Gaussian model (M = 2) throughout
our implementation, select the larger term $G_{k\max} e^{J_{k\max}}$ in
Equation (4.6) and divide each operand of the add operation by it:
\[
\ln[b_j(o_t)] = \ln G_{k\max} + J_{k\max}
  + \ln\left\{ 1 + \sum_{k=1,\, k \ne k\max}^{M}
      \frac{G_k}{G_{k\max}}\, e^{J_k - J_{k\max}} \right\}
\]
Since $\frac{G_{k\min}}{G_{k\max}} e^{J_{k\min} - J_{k\max}}$ is always less
than one,
$\left(1 + \sum_{k=1,\, k \ne k\max}^{M} \frac{G_k}{G_{k\max}} e^{J_k - J_{k\max}}\right)$
is less than $M$. With M = 2, this expression lies in a finite region that is
always greater than one but less than two, so its logarithm lies between 0 and
ln 2. It is recommended to ignore this complicated term for easier
implementation, since its value is small and its removal does not degrade the
recognition accuracy significantly. The complete equation that implements the
Viterbi search [19] is simplified as
\[
\begin{aligned}
\tilde{\delta}_t(j) = \ln[\delta_t(j)]
  &= \max\left[\tilde{\delta}_{t-1}(i) + \tilde{a}_{ii},\ \tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}\right] + \ln b_j(o_t) \\
  &\approx \max\left[\tilde{\delta}_{t-1}(i) + \tilde{a}_{ii},\ \tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}\right] + \ln G_{k\max} + J_{k\max} \\
  &= \max\left[\tilde{\delta}_{t-1}(i) + \tilde{a}_{ii},\ \tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}\right]
     + \ln c_{k\max} + \ln\left[\frac{1}{\sqrt{(2\pi)^D \det(U_j)}}\right]
     - \frac{1}{2} \sum_{i=1}^{26} (o_{ti} - \mu_{ji})^2 \, U_{ji}^{-1}
\end{aligned} \tag{4.7}
\]
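The size of the neglected term can be checked numerically. The sketch below
compares the exact log-add with the dominant-mixture approximation; the
function names and the sample values are illustrative assumptions.

```python
# Sketch comparing the exact log-add with the dominant-mixture approximation.
# log_g[k] stands for ln G_k and j[k] for J_k; all values are illustrative.
import math


def log_add_exact(log_g, j):
    """ln(sum_k G_k * exp(J_k)), evaluated by factoring out the largest term."""
    terms = [lg + jk for lg, jk in zip(log_g, j)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


def log_add_max(log_g, j):
    """Approximation: keep only ln G_kmax + J_kmax of the dominant mixture."""
    return max(lg + jk for lg, jk in zip(log_g, j))
```

For M mixtures the approximation underestimates the exact value by at most
ln M, i.e. by at most ln 2 (about 0.69) for the double-mixture model.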
To give a systematic view of the implementation of speech recognition, the
whole program flow can be illustrated as follows:
1. Initialize the registers that store the addresses of the look-up tables of
the various model parameters, including mean, variance, transition
probability, etc.
2. Compute $J_{k\max}$, i.e.,
$-\frac{1}{2} \sum_{i=1}^{26} (o_{ti} - \mu_{ji})^2 \, U_{ji}^{-1}$.
Subtraction is performed first, followed by the multiplication and an
arithmetic right shift. The right shift prevents the overflow of the
intermediate values. The MAC instruction is finally executed for the
multiplication and accumulated sum.
3. Compute $\ln[G_{k\max}]$, i.e.,
$\ln[c_{k\max}] + \ln\left[\frac{1}{\sqrt{(2\pi)^D \det(U_j)}}\right]$. Only
one addition is required since both $c_{k\max}$ and $U_j$ are constants, which
means the logarithms of those two terms can be pre-calculated and pre-stored
in the memory.
4. Calculate
$\max\{\tilde{\delta}_{t-1}(i) + \tilde{a}_{ii},\ \tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}\}$.
The term $\tilde{\delta}_{t-1}(i)$ is obtained from the previous step, while
$\tilde{a}_{ii}$ and $\tilde{a}_{ij}$ are constants. In total there are two
additions plus one selection of the path with the higher output probability.
5. Repeat procedures 2 to 4 until all speech frames are processed and all word
models are compared with the feature vectors of the incoming speech. The
LOOP instruction can be used for the repetitive tasks.
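The recursion in steps 2 to 5 can be sketched for a single word model. The
sketch assumes a left-to-right HMM in which each state is reached either from
itself or from its predecessor, a start in the first state and an end in the
last state; these conventions and all names are assumptions for illustration,
and the output log-probabilities are taken as given.

```python
# Sketch of a log-domain Viterbi search for one left-to-right word model.
# Assumptions: start in state 0, end in the last state, transitions only
# "stay" (j -> j) or "advance" (j-1 -> j). All names are illustrative.

NEG_INF = float("-inf")


def viterbi_log(log_b, log_stay, log_next):
    """log_b[t][j]: ln b_j(o_t); log_stay[j]: ln a_jj; log_next[j]: ln a_j,j+1."""
    n_frames, n_states = len(log_b), len(log_b[0])
    delta = [log_b[0][0]] + [NEG_INF] * (n_states - 1)
    for t in range(1, n_frames):
        new = []
        for j in range(n_states):
            stay = delta[j] + log_stay[j]                       # first addition
            adv = delta[j - 1] + log_next[j - 1] if j > 0 else NEG_INF
            new.append(max(stay, adv) + log_b[t][j])            # select, then add
        delta = new
    return delta[-1]
```

Running this function once per word model and keeping the model with the
largest score corresponds to the outer repetition in step 5.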
To enhance the recognition speed, the multiple-stage pipeline discussed in
Section 4.1.3 is applied to the speech recognizer. Given the speech parameters
shown in Table 4.3, the design takes 6 cycles to calculate one output
probability δ: 4 cycles are spent on two consecutive multiplications, and the
other operations, such as reading constants from memories, subtraction, shift,
addition, comparison and selection, are all executed within those 6 cycles.
Thus it requires 6 × 11 × 190 × 8 × 2 × 26 ≈ 5 × 10^6 clock cycles to
recognize one word. It implies that it is possible to recognize one word in
1 second if the operating frequency is 5 MHz.
Speech Parameters      Value
Number of Words        11
Number of Frames       190
Number of States       8
Number of Mixtures     2
Feature Vector Size    26
Table 4.3: The Speech Parameters for Recognition
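The cycle estimate above can be reproduced with a few lines of arithmetic. The
dictionary keys and function names are illustrative; the numbers are those of
Table 4.3 and the 6-cycle figure quoted in the text.

```python
# Arithmetic check of the cycle estimate, using the values of Table 4.3.

CYCLES_PER_TERM = 6
PARAMS = {"words": 11, "frames": 190, "states": 8,
          "mixtures": 2, "vector_size": 26}


def total_cycles(params, cycles_per_term=CYCLES_PER_TERM):
    """Multiply the per-term cycle count by every dimension of the search."""
    n = cycles_per_term
    for value in params.values():
        n *= value
    return n


def recognition_time(cycles, clock_hz):
    """Recognition time in seconds at the given clock frequency."""
    return cycles / clock_hz
```

The product is 5,216,640 cycles, which corresponds to about 1.04 seconds at
5 MHz.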
4.2.2 Implementation Using Refined Instruction Set 
The refinement of the software implementation is based on instruction-level
design enhancement. Two methods of optimization are derived from the
pre-defined base instruction set, namely distillation and condensation.
Distillation is the process of eliminating infrequently used pre-defined
instructions. The special hardware resources that implement such an
instruction are saved by replacing it with a sequence of other instructions.
Condensation, on the contrary, is the process of replacing a frequently used
instruction sequence by a newly defined application-specific instruction.
Extra hardware is required so that the architecture matches the refined
instruction set. Figure 4.12 shows an example of instruction distillation and
instruction condensation [9].
Figure 4.12: Examples of Instruction Condensation and Distillation (panels:
Original Instruction Sequence, Instruction Condensation, Instruction
Distillation)
In our project, pre-defined instructions such as ABS, NORM and EXP are seldom
used. Those instructions are transformed into new instruction sequences by
distillation, e.g., the instruction ABS (absolute value) is substituted by a
three-instruction sequence. The sequence takes more clock cycles to run but
lowers the hardware cost. Conversely, a repetitive and frequent operation
occurs in the add/compare selection of the Viterbi search process, in which
only the path with the highest output probability is considered. An additional
hardware functional unit called the Add, Compare, Select Unit (ACSU) [20] is
designed specifically for the optimized implementation of the instruction MAX.
The structure of the ACSU is shown in Figure 4.13.
Figure 4.13: The Structure of ACSU (the inputs δ_{t-1}(i), a_ii, δ_{t-1}(i-1)
and a_ij feed two adders, followed by a comparator whose decision drives a
MUX that outputs Max)
It can be seen that an instruction sequence can be replaced by the new
instruction MAX through condensation. Without the MAX instruction, an
inefficient branch is required to reach different code segments after the
comparison of two register values: the register that holds the larger value
is selected and the corresponding code segment is executed. The use of the
MAX instruction speeds up the Viterbi search procedure, which takes fewer
clock cycles because the overhead of the branch is eliminated.
It is possible to recognize one word in 0.5 second if the operating frequency is 5 
MHz. Compared with the base instruction set, the recognition time is reduced 





A speaker-independent speech recognizer with double-mixture HMM is devised.
Its focus is on Cantonese isolated words. The design specification is written
first, followed by behavioral modeling in Verilog HDL. The functional
verification of the design is performed in the simulation environment
SimVision. Once no error is found, the design is synthesized with Synopsys
Design Compiler. The physical design, including floor planning and automatic
placement and routing, is done in Silicon Ensemble with the AMS 0.35-micron
4-metal 2-poly CMOS technology. The specification of the fabricated chip is
given in Table 5.1, while the chip microphotograph is shown in Appendix C.
Specification          Value
CMOS Technology        0.35 μm
Area (NAND2 equ.)      132K
Area (μm × μm)         2600 × 2600
Number of Pads         120
Operating Voltage      3.3 V
Table 5.1: The Specification of Fabricated Chip
The whole speech recognizer is implemented by writing programs in the base
instruction set, which is defined by the platform. Figure 5.1 outlines the
simplified program skeleton. The main theme of the Viterbi algorithm is to
find the word model with the maximum likelihood of matching the incoming
speech. The details of the program flow are also discussed in Section 4.2.1.
Max = 0
for v = 1 to V                  // no. of words
    for t = 1 to T              // no. of frames
        for i = 1 to S          // no. of states
            P(t) = 0
            for j = 1 to S
                P(t) = max (P(t-1)) + log aij
                ...
    if Max < P(v), then Max = P(v)
end for
Figure 5.1: The Program Skeleton of Viterbi Search
By first implementing the algorithm in a high-level language (HLL), it is easy
to convert it to the ASIP platform using the complete base instruction set in
Appendix A.
To verify the ASIP chip, a PCB was built (see Appendix D). Figure 5.2 is a
simplified diagram of the PCB. The program is preloaded in the ROM. The X RAM
and Y RAM store the extracted feature vectors and the trained parameters of
the word models respectively. When the chip is powered on, the start signal
indicates the beginning of the recognition process. The code segments
describing the Viterbi search algorithm are executed sequentially. The LED
displays the recognition result.
Figure 5.2: The Simplified Diagram of PCB (blocks: ASIP, X RAM, Y RAM,
Program ROM (1), Program ROM (2))
Simulation results using software (HTK) [21] and our hardware platform are
shown in Table 5.2. The accuracy of the hardware simulation is only 0.7
percentage points lower than that of the software.
               Software   Hardware
Accuracy (%)   93.9       93.2
Table 5.2: The Simulation Results of Recognition Accuracy
In our speech recognition system, the sampling rate of the speech is 8 kHz,
the frame length is 20 ms and the frame shift is 10 ms. The vocabulary in our
experiment contains 11 Cantonese words. Each word has two syllables. Training
data include 2200 utterances from 5 male and 5 female native speakers. Each
word is modeled by an HMM, and the models are trained offline with the HTK
toolkit. For performance evaluation, 734 utterances from another 10 male and
10 female native speakers are used. Both training and testing utterances are
recorded through a microphone channel under similar acoustic conditions, with
a signal-to-noise ratio (SNR) of 10 dB. One recognition takes about 1 second
at the working frequency of 5 MHz. The recognition time can be reduced to
0.5 second if the refined instructions are used. The timing profile can be
derived from the calculation in Section 4.2.1. According to Static Timing
Analysis (STA) in PrimeTime, the maximum operating frequency can reach 86 MHz,
which allows the chip to recognize one word in only 47 ms. This is fast enough
for real-time implementation.
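The relation between the two quoted timings can be checked with a short
calculation. The sketch below assumes that the number of clock cycles per
recognition does not change with the clock frequency (for example, that the
memories keep pace with the core). The variable and function names are
hypothetical, and only the 47 ms, 86 MHz and 5 MHz figures come from the text.

```python
def cycles(time_s, freq_hz):
    """Clock cycles consumed in time_s seconds at freq_hz."""
    return time_s * freq_hz


def time_at(cycle_count, freq_hz):
    """Seconds needed to run cycle_count cycles at freq_hz."""
    return cycle_count / freq_hz


F_MAX_HZ = 86e6    # maximum frequency reported by STA
F_WORK_HZ = 5e6    # working frequency used in the experiment
T_FAST_S = 0.047   # recognition time quoted at 86 MHz

CYCLES_PER_WORD = cycles(T_FAST_S, F_MAX_HZ)       # about 4.04 million
T_WORK_S = time_at(CYCLES_PER_WORD, F_WORK_HZ)     # about 0.81 s
```

Under this assumption, the 47 ms figure corresponds to roughly 0.81 second at
5 MHz, which is consistent with the "about 1 second" stated above.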
In the past research work [22], the chip takes 60 seconds to recognize a word
at a working frequency of 17 MHz. Its vocabulary is moderately sized, with
around 60 monosyllable words, and the chip is fabricated in CMOS technology.
Our design can extend the vocabulary size by occupying more memory. The
recognition time is then approximately 6 seconds for 60 words, which is at
least one order of magnitude faster than the previous work. Further refinement
halves this time. Moreover, the previous work operates its chip at a higher
working frequency, which implies that more power is expected to be consumed.
Our proposed platform shows great improvement in terms of execution time and
power consumption. The achievement is particularly important in real-time
applications.
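The 60-word estimate can be reproduced with a rough calculation. The sketch
below assumes that the recognition time grows linearly with the number of word
models, since the Viterbi search runs once per word, and that the 5 MHz working
frequency is unchanged. The function name is hypothetical, and the 1 second,
11 word and 60 second figures are taken from the text.

```python
def scaled_time(base_time_s, base_words, target_words):
    """Recognition time for target_words, assuming linear growth per word."""
    return base_time_s * target_words / base_words


T_11_WORDS_S = 1.0     # about 1 second for the 11-word vocabulary
T_PREVIOUS_S = 60.0    # time reported for the 60-word chip in [22]

T_60_WORDS_S = scaled_time(T_11_WORDS_S, 11, 60)   # about 5.5 s
SPEEDUP = T_PREVIOUS_S / T_60_WORDS_S              # about 11 times
```

With these assumptions the estimate is about 5.5 seconds, in line with the
approximately 6 seconds quoted, and the ratio to the previous work is about 11.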
Chapter 6 
Conclusions and Future Work 
Conclusions 
This work aims at converting the conventional speech recognition algorithm
from a research engine into a practical real-time system on an ASIP platform.
It targets isolated word recognition on an arbitrary moderately sized
vocabulary. The proposed ASIP design methodology proves to be effective in
practical implementation. One word recognition takes about 1 second at the
working frequency of 5 MHz, and this time can be halved if the refined
instructions are used. As our platform can reach a maximum operating frequency
of 86 MHz, the chip can recognize one word in only 47 ms, which is fast enough
for real-time speech applications. Moreover, the proposed speech recognizer
running on the ASIP platform attains approximately the same recognition
accuracy as the software recognizer. Our design can therefore meet the
stringent requirements of both time-critical and highly accurate speech
applications. Most importantly, our research demonstrates the beauty of the
ASIP methodology through software/hardware co-design. The speech algorithm is
thoroughly analyzed in order to convert the complicated algorithm into a simple
mathematical form. A special hardware Add-Compare-Select Unit (ACSU) is
deployed for the frequently repeated Viterbi search. Finally, an
application-specific instruction set is exploited to bridge the gap between the
software and hardware parts. The ASIP approach requires multi-disciplinary
expertise in digital signal processing, software engineering practice, logic
and arithmetic design as well as computer architecture.
Future Work 
The ASIP design provides a framework for implementing DSP algorithms
efficiently. This research mainly focuses on the double-mixture HMM-based
speech recognizer. The recognition accuracy can be enhanced further if
higher-order mixtures are considered. However, the computation will increase
accordingly. This raises the issue of scaling up the base platform to meet
more stringent requirements. Scaling up the design is not a difficult task
because the original platform can be extended to a more complicated one. More
powerful instruction sets can be developed in parallel and complex forms. By
selecting the useful instructions from the base set, the mixed use of the
originally selected base instructions and the newly defined parallel and
complex instructions can show tremendous improvement in demanding applications.
In addition to higher-order mixtures, many other speech algorithms can be
executed, including speaker identification, speech verification, speech
synthesis for text-to-speech, etc. Since our ASIP platform has different
instruction sets, it is feasible for developers to shift from one application
to another with the programmable instruction set. They can also define another
set of instructions specific to their application. In this way, the ASIP
design is flexible and powerful enough to serve multiple applications.
Appendix A 
Base Instruction Set 
Mnemonic  Input -> Output              Description
MAC       (Reg, Reg, Acc) -> Acc       Multiply two values of registers and accumulate.
MPY       (Reg, Reg) -> Acc            Multiply two values of registers.
ADD       (Reg, Reg) -> Acc            Add two values of registers together.
SUB       (Reg, Reg) -> Acc            Subtract one value of registers from another one.
ADDC      (Reg, Reg, Flag) -> Acc      Add two values of registers together with carry.
SUBB      (Reg, Reg, Flag) -> Acc      Subtract one value of registers from another one
                                       with borrow.
ADDA      (Reg, Offset, Acc) -> Acc    Add an offset-able value of register to accumulator.
SUBA      (Reg, Offset, Acc) -> Acc    Subtract an offset-able value of register from
                                       accumulator.
NEG       Acc -> Acc                   Invert the sign of the accumulator.
ABS       Acc -> Acc                   Take the absolute value of the accumulator.
EXP       Acc -> SReg                  Determine the exponent of the accumulator.
NORM      (Acc, SReg) -> Acc           Normalize the accumulator to the exponent stored
                                       in the special register.
SH        (Acc, Reg) -> Acc            Shift the accumulator with the signed value in
                                       a register (+ left, - right).
SHK       (Acc, Value) -> Acc          Shift the accumulator with the immediate signed
                                       value (+ left, - right).
Table A.1: The Data Processing Instructions
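The signed shift convention of SH and SHK (positive shifts left, negative
shifts right) can be illustrated with a small Python model. This is only a
sketch: the 40-bit accumulator width is an assumption suggested by the ov40
flag in Table B.1, and the wrap-around on overflow and the arithmetic right
shift are also assumptions, not documented behaviour of the chip.

```python
ACC_BITS = 40  # assumed accumulator width


def wrap(value, bits=ACC_BITS):
    """Interpret value as a signed two's-complement number of `bits` bits."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def sh(acc, amount):
    """Shift acc left for a positive amount and right for a negative one."""
    if amount >= 0:
        return wrap(acc << amount)
    return wrap(acc >> -amount)  # Python's >> keeps the sign (arithmetic shift)
```

For example, sh(3, 2) models a left shift by two bits, while sh(-8, -2) models
a right shift by two bits that preserves the sign.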
Mnemonic  Input -> Output              Description
NOT       Acc -> Acc                   Bitwise NOT of the accumulator.
OR        (Reg, Offset, Acc) -> Acc    Bitwise OR of the accumulator with an
                                       offset-able value from register.
AND       (Reg, Offset, Acc) -> Acc    Bitwise AND of the accumulator with an
                                       offset-able value from register.
XOR       (Reg, Offset, Acc) -> Acc    Bitwise XOR of the accumulator with an
                                       offset-able value from register.
Table A.2: The Bit Manipulation Instructions
Mnemonic  Input -> Output                Description
BEQ       Flag -> PC                     Branch if equal to flag is asserted.
BNE       Flag -> PC                     Branch if not equal to flag is asserted.
BLT       Flag -> PC                     Branch if less than flag is asserted.
BGT       Flag -> PC                     Branch if greater than flag is asserted.
BLE       Flag -> PC                     Branch if less or equal to flag is asserted.
BGE       Flag -> PC                     Branch if greater or equal to flag is asserted.
SR        Value -> (PC, stack)           Subroutine call.
JP        Value -> PC                    Unconditional jump.
LOOP      (size, cycle) -> (PC, stack)   Static looping.
RET       stack -> PC                    Return from subroutine call or break a static loop.
NOP       NA                             No operation.
Table A.3: The Flow Control Instructions
Mnemonic  Input -> Output               Description
CMP       (Reg, Reg) -> Flag            Compare two values from registers and assert the
                                        condition flag.
CMPACC    (Reg, Offset, Acc) -> Flag    Compare the accumulator with a value from
                                        register and assert the condition flag.
Table A.4: The Boolean Operation Instructions
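The interplay between a compare instruction and a conditional branch can be
modelled in a few lines of Python. The sketch below is an illustration only:
the flag names follow the condition flags listed in Table B.1, while the
function names and the choice of a signed comparison of the first operand
against the second are assumptions.

```python
def cmp_flags(a, b):
    """Return the six condition flags for comparing a with b."""
    return {
        "EQ": a == b, "NE": a != b,
        "LT": a < b, "GT": a > b,
        "LE": a <= b, "GE": a >= b,
    }


def update_max(current_max, candidate):
    """Model 'if Max < P(v), then Max = P(v)' as a compare plus a branch."""
    flags = cmp_flags(current_max, candidate)  # CMP-like step sets the flags
    if flags["LT"]:                            # BLT-like step tests one flag
        return candidate
    return current_max
```

The update_max helper corresponds to the final comparison in the Viterbi
program skeleton, where the best word score so far is replaced whenever a
larger score is found.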
Mnemonic  Input -> Output           Description
CONF4     (Value, Pos) -> SReg      Write a nibble to a special register without altering
                                    other bits.
CONF8     (Value, Pos) -> SReg      Write a byte to a special register without altering
                                    other bits.
CONF16    Value -> SReg             Write a word to a special register.
Table A.5: The Configuration Instructions
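Writing a nibble or a byte "without altering other bits" amounts to a
mask-and-merge operation on the 16-bit special register. The Python sketch
below illustrates the idea. The function names are hypothetical, and counting
the position from the least significant nibble or byte is an assumption, since
the exact encoding of Pos is not given here.

```python
WORD_MASK = 0xFFFF  # special registers are treated as 16 bits wide


def conf4(sreg, value, pos):
    """Replace nibble number pos (0 = least significant) of sreg with value."""
    shift = 4 * pos
    cleared = sreg & ~(0xF << shift)
    return (cleared | ((value & 0xF) << shift)) & WORD_MASK


def conf8(sreg, value, pos):
    """Replace byte number pos (0 = least significant) of sreg with value."""
    shift = 8 * pos
    cleared = sreg & ~(0xFF << shift)
    return (cleared | ((value & 0xFF) << shift)) & WORD_MASK


def conf16(value):
    """Overwrite the whole 16-bit register with value."""
    return value & WORD_MASK
```

For instance, conf4(0xABCD, 0x5, 1) changes only the second nibble of the
register and leaves the other twelve bits as they were.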
Mnemonic  Input -> Output                  Description
MOV       Reg -> Reg                       Move a register content to another register.
LOAD      Mem -> Reg                       Load a value from data memory to register.
STORE     Reg -> Mem                       Store a register content to data memory.
LDACC     LOP/ROP (bypass DP) -> Acc       Load an immediate value to accumulator.
                                           LOP/ROP[15:0] -> ACC
STACC     Acc -> Reg                       Store the accumulator to register.
Table A.6: The Memory Manipulation Instructions
Reg - register content 
Acc - accumulator content 
Flag - status and conditional flags 
Offset - shift the value to the left by 16 bits 
Offset-able - a value can be set to be offset 
SReg - special register content 
PC - programme counter 
Stack - programme stack 
Value - immediate value 
Size - the number of instructions within a static loop 
Cycle - the number of iterations of a static loop 
NA - not available 
Pos - a position of a nibble or a byte in a 16-bit value (0: low word, 1: high word)




Table B.1: The Organization of Special Purpose Registers
No.  Bits 15..8                     Bits 7..0
0    INT (RESERVED)
1    ov40 ov32 EQ NE LT GT LE GE    0 0 EXP
2    LSU XDATA address
3    LSU XDATA size
4    LSU XDATA step
5    LSU YDATA address
6    LSU YDATA size
7    LSU YDATA step
8    0 0 LSU XREG LD address        0 0 LSU XREG LD size
9    0 0 LSU YREG LD address        0 0 LSU YREG LD size
10   0 0 LSU XREG LD step           0 0 LSU YREG LD step
11   0 0 LSU XREG ST address        0 0 LSU XREG ST size
12   0 0 LSU YREG ST address        0 0 LSU YREG ST size
13   0 0 LSU XREG ST step           0 0 LSU YREG ST step
14   LSU Configuration
15   0 X/Y SFU LOP address          0 0 SFU LOP size
16   0 X/Y SFU ROP address          0 0 SFU ROP size
17   0 0 SFU LOP step               0 0 SFU ROP step
18   0 X/Y SFU WB address           0 0 SFU WB size
19   0 0 0 0 SFU Conf               0 0 SFU WB step
20   PFU0 ~ PFU3 Configuration
21   0 X/Y PFU0 LOP address         0 0 PFU0 LOP size
22   0 X/Y PFU0 ROP address         0 0 PFU0 ROP size
23   0 0 PFU0 LOP step              0 0 PFU0 ROP step
24   0 X/Y PFU0 WB address          0 0 PFU0 WB size
25   0 0 0 0 PFU0 interval          0 0 PFU0 WB step
26   0 X/Y PFU1 LOP address         0 0 PFU1 LOP size
27   0 X/Y PFU1 ROP address         0 0 PFU1 ROP size
28   0 0 PFU1 LOP step              0 0 PFU1 ROP step
29   0 X/Y PFU1 WB address          0 0 PFU1 WB size
30   0 0 0 0 PFU1 interval          0 0 PFU1 WB size
31   0 X/Y PFU2 LOP address         0 0 PFU2 LOP size
32   0 X/Y PFU2 ROP address         0 0 PFU2 ROP size
33   0 0 PFU2 LOP step              0 0 PFU2 ROP step
34   0 X/Y PFU2 WB address          0 0 PFU2 WB size
35   0 0 0 0 PFU2 interval          0 0 PFU2 WB size
36   0 X/Y PFU3 LOP address         0 0 PFU3 LOP size
37   0 X/Y PFU3 ROP address         0 0 PFU3 ROP size
38   0 0 PFU3 LOP step              0 0 PFU3 ROP step
39   0 X/Y PFU3 WB address          0 0 PFU3 WB size
40   0 0 0 0 PFU3 interval          0 0 PFU3 WB size
     PFUn-3 ~ PFUn Configuration
     0 X/Y PFUn LOP address         0 0 PFUn LOP size
     0 X/Y PFUn ROP address         0 0 PFUn ROP size
     0 0 PFUn LOP step              0 0 PFUn ROP step
     0 X/Y PFUn WB address          0 0 PFUn WB size
     0 0 0 0 PFUn interval          0 0 PFUn WB size
Appendix C 
Chip Microphotograph of ASIP 




Bibliography 
[1] J. G. Cousin, O. Sentieys, and D. Chillet, "Multi-algorithm asip synthesis
and power estimation for dsp applications," in IEEE International Symposium
on Circuits and Systems, pp. 621 - 624, May 2000.
[2] P. Kievits, E. Lambers, C. Moerman, and R. Woudsma, "R.e.a.l. dsp
technology for telecom baseband processing," in Proceedings of ICSPAT, 1998.
[3] R. E. Gonzalez, "Xtensa: a configurable and extensible processor," IEEE
Micro, vol. 20, issue. 2, pp. 60 - 70, Mar-Apr 2000.
[4] "Improv systems inc., jazz psa/jazz dsp." http://www.improvsys.com.
[5] "Arc cores ltd., arctangent processor." http://www.arccores.com.
[6] "Target compiler technologies, chess/checkers is a retargetable tool-suite."
http://www.retarget.com/products-more.html.
[7] A. Hoffmann, T. Kogel, A. Nohl, G. Braun, O. Schliebusch, O. Wahlen,
A. Wieferink, and H. Meyr, "A novel methodology for the design of
application-specific instruction-set processors (asips) using a machine
description language," IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 20, issue. 11, pp. 1338 - 1354, Nov
2001.
[8] "Institute of integrated signal processing systems, aachen university of
technology, germany, lisa processor design platform."
http://www.iss.rwth-aachen.de/lisa/lpdp.html.
[9] J. H. Yang, B. W. Kim, S. J. Nam, Y. S. Kwon, D. H. Lee, J. Y. Lee,
C. S. Hwang, Y. H. Lee, S. H. Hwang, I. C. Park, and C. M. Kyung,
"Metacore: An application-specific programmable dsp development system,"
IEEE Journal of Solid-State Circuits, vol. 8, issue. 2, pp. 173 - 183,
Apr. 2000.
[10] M. Itoh, S. Higaki, J. Sato, A. Shiomi, Y. Takeuchi, A. Kitajima, and
M. Imai, "Peas-iii: an asip design environment," in Proceedings of
International Conference on Computer Design, pp. 430 - 436, Sept. 2000.
[11] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition.
Englewood Cliffs, NJ: Prentice Hall, 1993.
[12] Y. L. Kwok, "Design of application-specific instruction set processors with
asynchronous methodology for embedded digital signal processing
applications," M.Phil. thesis, The Chinese University of Hong Kong, The
Department of Electronic Engineering, Nov. 2004.
[13] S. Rixner, W. J. Dally, B. Khailany, P. Mattson, U. J. Kapasi, and J. D.
Owens, "Register organization for media processing," in Proceedings of
International Symposium on HPCA-6, pp. 375 - 386, Jan. 2000.
[14] J. H. Tseng and K. Asanovic, "Banked multiported register files for
high-frequency superscalar microprocessors," in Proceedings of 30th ISCA,
pp. 62 - 71, June 2003.
[15] I. Park, M. D. Powell, and T. N. Vijaykumar, "Reducing register ports for
higher speed and lower energy," in Proceedings of MICRO-35, Nov. 2002.
[16] V. Zyuban and P. Kogge, "The energy complexity of register files," in
Proceedings of 1998 International Symposium on Low Power Electronics
and Design, pp. 305 - 310, Aug. 1998.
[17] J. Mori, M. Nagamatsu, M. Hirano, S. Tanaka, M. Noda, Y. Toyoshima,
K. Hashimoto, H. Hayashida, and K. Maeguchi, "A 10 ns 54x54-b parallel
structured full array multiplier with 0.5-µm cmos technology," IEEE
Journal of Solid-State Circuits, vol. 26, issue. 4, pp. 600 - 606, Apr. 1991.
[18] M. Nagamatsu, S. Tanaka, J. Mori, K. Hirano, T. Noguchi, and
K. Hatanaka, "A 15-ns 32x32-b cmos multiplier with an improved parallel
structure," IEEE Journal of Solid-State Circuits, vol. 25, issue. 2, pp. 494
- 497, Apr. 1990.
[19] W. Han, K. W. Hon, C. F. Chan, T. Lee, C. S. Choy, K. P. Pun, and P. C.
Ching, "A real-time Chinese speech recognition ic with double mixtures,"
in Proceedings of 5th International Conference, pp. 926 - 929, Oct. 2003.
[20] G. Fettweis and H. Meyr, "High-speed parallel viterbi decoding: algorithm
and vlsi-architecture," IEEE Communications Magazine, vol. 29, issue. 5,
pp. 46 - 55, May 1991.
[21] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason,
V. Valtchev, and P. Woodland, The HTK Book (for HTK Version 3.1).
Cambridge University Engineering Department, 2001.
[22] K. Nakamura, Q. Zhu, S. Maruoka, T. Horiyama, S. Kimura, and
K. Watanabe, "Speech recognition chip for monosyllables," in Design Au-


































































































































































