An HMM-based speech recognition IC. by Han, Wei. & Chinese University of Hong Kong Graduate School. Division of Electronic Engineering.
An HMM-Based 
Speech Recognition IC 
Han Wei 
A Thesis Submitted in Partial Fulfillment 
of the Requirements for the Degree of 




Prof. C. F. Chan 
© The Chinese University of Hong Kong 
June 2003 
The Chinese University of Hong Kong holds the copyright of this 
thesis. Any person(s) intending to use a part or whole of the materials 
in the thesis in a proposed publication must seek copyright release 
from the Dean of the Graduate School. 
Abstract 
Automatic speech recognition has received a great deal of attention in the past
decade, and a wide variety of isolated word recognition systems have been used
in many applications. Speech recognizers based on hidden Markov models
(HMMs) require less computation than other speech recognizers. Thus
HMM technology is the most widely used in speech recognition.
The models of a speech recognition system can be trained with one mixture or
with multiple mixtures. More mixtures in the HMMs result in higher computation
requirements and a more complicated design, but the recognition accuracy is
better. In this thesis a double-mixture hidden Markov model based isolated word
recognizer IC is presented.
Using a table look-up approach, the new design is smaller and more accurate
than existing designs. An added advantage of this design is that the
architecture can be extended to a higher-order mixture HMM based speech
recognizer with minor modifications.
The test chip is fabricated in a 0.35 µm CMOS technology. The chip can operate
at 20 MHz at 3.3 V, and at this frequency the recognition time is 0.5 s for a
50-word speech library. Tested with 353 speech data from the AURORA 2 database,
the chip's recognition accuracy is 93.8%, which is as accurate as a software






















I gratefully acknowledge the valuable guidance and encouragement given by my
supervisor, Prof. C. F. Chan. He has worked with me and provided
continuous comments, patience, supervision, and encouragement throughout this
lengthy and demanding project. I would also like to express my gratitude to Prof.
Lee Tan for his insightful suggestions and assistance during my research work.
In addition, special thanks go to the research assistant, Mr.
Hon Kwok Wai, for his important assistance in my research. Without their
willing support this work would not have been possible. I would also like to
thank Prof. C.S. Choy and Prof. K.R Pun for their kind assistance.
Thanks also to my colleagues, Mr. Cheng Wan Chi, Mr. Chan Wing Kin, Mr.
Leung Pak Keung, Miss Yeung Wing Ki and Mr. Yu Chun Pong, and the








List of Figures
List of Tables
Chapter 1 Introduction
1.1. Speech Recognition
1.2. ASIC Design with HDLs
Chapter 2 Theory of HMM-Based Speech Recognition
2.1. Speaker-Dependent and Speaker-Independent
2.2. Frame and Feature Vector
2.3. Hidden Markov Model
2.3.1. Markov Model
2.3.2. Hidden Markov Model
2.3.3. Elements of an HMM
2.3.4. Types of HMMs
2.3.5. Continuous Observation Densities in HMMs
2.3.6. Three Basic Problems for HMMs
2.4. Probability Evaluation
2.4.1. The Viterbi Algorithm
2.4.2. Alternative Viterbi Implementation
Chapter 3 HMM-Based Isolated Word Recognizer Design Methodology
3.1. Speech Recognition Based On Single Mixture
3.2. Speech Recognition Based On Double Mixtures
Chapter 4 VLSI Implementation of the Speech Recognizer
4.1. The System Requirements
4.2. Implementation of a Speech Recognizer with a Single-Mixture HMM
4.3. Implementation of a Speech Recognizer with a Double-Mixture HMM
4.4. Extend Usage in High Order Mixtures HMM
4.5. Pipelining and the System Timing
Chapter 5 Simulation and IC Testing
5.1. Simulation Result
5.2. Testing
Chapter 6 Discussion and Conclusion
Reference
Appendix I Verilog Code of the Double-Mixture HMM Based Speech
Register for X
Subtracter and Comparator
Shifter
Look-Up Table
Register for Constants
Register for Scores
Final Score Register
Controller
Top
Appendix II Chip Microphotograph
Appendix III Pin Assignment of the Speech Recognition IC
Appendix IV The Testing Board of the IC
List of Figures
Figure 1-1. A Block Diagram of a Pattern Recognition Speech Recognizer
Figure 1-2. Design Flow of Verilog HDL-Based ASICs
Figure 2-1. Examples of Frames of a Speech Signal
Figure 2-2. A Three-State Markov Model
Figure 2-3. A Three-State Hidden Markov Model
Figure 2-4. (a) A Two-State Markov Model (b) An Equivalent One-State Hidden Markov Model
Figure 2-5. A Fully Connected HMM
Figure 2-6. A Left-Right HMM
Figure 2-7. (a) Training Speeches From Boys and Girls (b)(c) Recognition Result
Figure 2-8. (a) (b) Searching in the Lattice Structure
Figure 3-1. HMM Based Recognition Process
Figure 3-2. The Lattice Structure of the Speech Recognizer
Figure 4-1. The Structure of a Single-Mixture HMM Based Speech Recognizer
Figure 4-2. Part of the Lattice Structure
Figure 4-3. Searching Process in the Lattice Structure
Figure 4-4. An Example of the Modified Booth Multiplication
Figure 4-5. A Block Diagram of the Double-Mixture HMM Based Speech Recognizer
Figure 4-6. The Structure of the Recognizer Without a Look-Up Table
Figure 4-7. A Block Diagram of Speech Recognizer with a High-Order Mixture HMM
Figure 5-1. Design Flow of This Project
Figure 5-2. A Brief Diagram of the PCB
Figure 5-3. Block Diagram of the Real-Time Speech Recognition Testing System
List of Tables
Table 1. The Truth Table of the Modified Booth Encoder
Table 2. Simulation Results
Table 3. Specifications of the New Speech IC
Chapter 1 Introduction 
1.1. Speech Recognition 
Automatic speech recognition by machine has been a goal of research for more
than four decades. Broadly speaking, there are three approaches to speech
recognition: the acoustic-phonetic approach, the pattern recognition approach,
and the artificial intelligence approach. The pattern recognition approach is
commonly used for speech recognition for three reasons:
simplicity of use; robustness and invariance to different speech vocabularies,
users, feature sets, pattern comparison algorithms, and decision rules; and proven
performance. A popular pattern recognition method is the HMM (Hidden
Markov Model) approach, which uses statistical information inherent in the
speech data, e.g., mean and covariance, in recognition.
In most pattern recognition systems, there are four main steps: feature extraction,
in which a sequence of measurements is made on the input signal to define the
test pattern; pattern training, in which one or more test patterns of the same class
are used to create a pattern representative of the features of that class; pattern
comparison, in which the unknown test pattern is compared with each class
reference pattern and a measure of similarity between the test pattern and each
reference pattern is computed; and decision logic, in which the reference pattern
similarity scores are used to decide which reference pattern best matches the
unknown test pattern. A block diagram of a pattern-recognition speech
recognizer is shown in Figure 1-1 [1].
[Figure 1-1 diagram; block labels: Speech, Feature Extraction, Feature Vector O, Pattern Training, Pattern Comparison, Decision Logic, Recognized Speech]
Figure 1-1. A Block Diagram of a Pattern Recognition Speech Recognizer 
In a pattern recognition system, the performance is sensitive to the amount of
training data available, the speaking environment, and the transmission
characteristics of the medium used to create the speech.
In Figure 1-1, the training part can be viewed as a separate system and is normally
implemented in software on a PC. The remaining three parts make up a
complete speech recognition system. Such a speech recognizer can be realized
in software on a PC or in hardware using a DSP board. Our goal, however, is to
develop a new platform with the smallest area and the lowest price so that
it can be used in various applications, e.g., intelligent electronic pets or smart
houses. A system on chip fulfills these requirements. It can be carried anywhere
and embedded inside any product. Moreover, the cost per system is very low
compared to DSP and PC systems.
We divide the speech recognizer into two parts: the feature extraction block and the
pattern comparison and decision logic blocks. In this thesis we design an IC
which realizes the comparison and decision blocks first.
1.2. ASIC Design with HDLs 
Generally speaking, there are two design flows in ASIC (Application-Specific
Integrated Circuit) design. One goes from schematic to layout; the other is coding
in a hardware description language (HDL) followed by synthesis and automatic layout.
The second design flow is usually employed in digital circuit design because a
design typically contains thousands of gates, so that it is impractical to draw
schematics gate by gate.
By using hardware description languages, designers can easily manage the
complexity of large designs containing several million gates, and modify and
re-use designs to keep pace with improvements in technology. The most
significant gain that results from the use of an HDL is that a working
circuit can be synthesized automatically from a language-based description,
bypassing laborious steps that characterize manual design methods (e.g., logic
minimization with Karnaugh maps). HDL-based designs are becoming an
industry design standard [2].
Verilog and VHDL are two of the most popular hardware description languages;
both are Institute of Electrical and Electronics Engineers (IEEE) standards. This
project is designed with Verilog HDL. A typical design flow of Verilog






[Figure 1-2 diagram; recoverable block labels: RTL Description (Verilog HDL), Functional ..., Layout Verification, Implementation]
Figure 1-2. Design Flow of Verilog HDL-Based ASICs 
The design specifications are written first. They describe abstractly the
functionality, interface, and overall architecture of the digital circuit. Then a
behavioral description is created to analyze the design in terms of functionality,
performance, and other high-level issues. After the behavioral description is
converted to an RTL description, logic synthesis tools convert the RTL
description to a gate-level netlist. The gate-level netlist is input to an automatic
place-and-route tool, which creates a layout. After verification, the layout
can be sent for fabrication. Most digital design activity is concentrated on
manually optimizing the RTL description of the circuit. After the RTL design is
finished, CAD tools are used to assist designers in the further processes [3] [4].
Chapter 2 Theory of HMM-Based Speech Recognition
2.1. Speaker-Dependent and Speaker-Independent
A speech recognition system can be trained as speaker-dependent (SD) or
speaker-independent (SI). The difference is that in an SD system each model is
trained with speech from one person, while in an SI system speech from
different speakers is included in the training data of each model. For a given
speech recognition task, an SD system normally performs better than an SI system,
provided that a sufficient amount of data is available to adequately train the
speaker-dependent templates, or models. However, when the amount of
speaker-specific training data is limited, this is not guaranteed because the
calculated reference parameters are not reliable [1].
2.2. Frame and Feature Vector 
Speech is a time-varying signal. However, over a very short period of time (10 ms
to 20 ms), speech is fairly stationary. So a speech signal can be divided into
several segments, each of which is called a frame.
[Figure 2-1 diagram: speech waveform with frame n and frame n+1 marked]
Figure 2-1. Examples of Frames of a Speech Signal
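The division of a signal into frames can be illustrated with a minimal Python sketch. The 8 kHz sample rate, the 20 ms frame length, the absence of overlap between frames, and the function name `split_into_frames` are all assumptions made for illustration; the text only states that a frame lasts 10 ms to 20 ms.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20):
    """Split a list of samples into consecutive, non-overlapping frames.

    sample_rate_hz and frame_ms are illustrative defaults (assumptions).
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000   # samples per frame
    n_frames = len(samples) // frame_len
    return [samples[k * frame_len:(k + 1) * frame_len] for k in range(n_frames)]


# 0.1 s of dummy samples at the assumed 8 kHz rate
signal = list(range(800))
frames = split_into_frames(signal)
```

With these assumed values each frame holds 160 samples, and frame n+1 starts where frame n ends.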
In a speech-recognition system, the signal-processing front end converts a frame
to some type of parametric representation for further analysis and processing.
These representations include the short-time energy, zero-crossing rates,
level-crossing rates, and other related parameters, and they generally have a
considerably lower information rate than the signal itself. In the front-end
processing of this project, one frame is converted to a 26-element vector, which
contains 12 MFCCs (Mel-Frequency Cepstrum Coefficients), 12 first-order
derivatives of these coefficients, and the energies of these two sets of
coefficients, respectively.
2.3. Hidden Markov Model 
One well-known and widely used speech recognition algorithm is the hidden
Markov model (HMM) approach [5]. It uses the statistical information inherent in
the speech signals and provides a natural and highly reliable way of recognizing
speech for a wide range of applications.
2.3.1. Markov Model 
Consider a system that can be described at any time as being in one of a set of N
distinct states. At regularly spaced, discrete times, the system undergoes a




$$\sum_{j=1}^{N} a_{ij} = 1$$
This system is called an observable Markov model because at each instant of 




Figure 2-2. A Three-State Markov Model 
2.3.2. Hidden Markov Model 
In an observable Markov model each state corresponds to a deterministically 
observable event. The output of such a system in any given state is not random.
This model is too restrictive to be applied to many real problems, so the
observable Markov model is extended to include the case in which the 
observation is a probabilistic function of the state. The resulting model is called 
a hidden Markov model. 
[Figure 2-3 diagram: three states, each labelled with observation probabilities such as P(A) and P(B)]
Figure 2-3. A Three-State Hidden Markov Model 
Figure 2-3 shows a three-state hidden Markov model. Within each state there
are three possible observation symbols, each one corresponding to a possible
output of this system. So even when the system is known to be in a particular state,
the output of the system still has three possibilities.
A hidden Markov model can be derived from a Markov model. Figure 2-4(a) is a
two-state Markov model, and Figure 2-4(b) illustrates an equivalent one-state hidden
Markov model. Inside the only state of this hidden Markov model,
there are two possible observations corresponding to the two states in the
Markov model in Figure 2-4(a).
[Figure 2-4 diagram: (a) State 1 and State 2; (b) a single state, State 1, with its observation probabilities]
Figure 2-4. (a) A Two-State Markov Model (b) An Equivalent One-State Hidden Markov Model 
2.3.3. Elements of an HMM
An HMM for discrete symbol observations is characterized by the following
elements:
1. N, the number of states in the model. In the coin-tossing experiment, each
state corresponds to a distinct biased coin.
2. M, the number of distinct observation symbols per state. The observation
symbols correspond to the physical output of the system being modeled.
For the coin-tossing experiment the observation symbols are heads or
tails.
3. The state-transition probability distribution $A = \{a_{ij}\}$, where $a_{ij} \ge 0$ for all i, j.
4. The observation symbol probability distribution $B = \{b_i(k)\}$, in which $b_i(k)$
defines the symbol distribution in state i. In the coin-tossing experiment, it
is the probability of Head or Tail in the state Coin 1 or Coin 2.
5. The initial state distribution $\pi = \{\pi_i\}$, in which $\pi_i$ is the probability that
the initial state is state i.
A complete specification of an HMM requires specification of the two model
parameters N and M, specification of the observation symbols, and specification of
the three sets of probabilities A, B and $\pi$.
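As a minimal sketch of what such a specification looks like in practice, the following Python fragment stores A, B and π for a two-coin model and checks that every distribution is non-negative and sums to one. The function names and all probability values are hypothetical and chosen only for illustration.

```python
def is_distribution(probs, tol=1e-9):
    """True if all entries are non-negative and sum to 1 within tol."""
    return all(p >= 0.0 for p in probs) and abs(sum(probs) - 1.0) <= tol


def is_valid_hmm(A, B, pi):
    """Check the shapes and the stochastic constraints of a discrete HMM.

    A is N x N (state transitions), B is N x M (symbol probabilities per
    state), and pi has N entries (initial state distribution).
    """
    n_states = len(pi)
    if len(A) != n_states or len(B) != n_states:
        return False
    if any(len(row) != n_states for row in A):
        return False
    if any(len(row) != len(B[0]) for row in B):
        return False
    return (all(is_distribution(row) for row in A)
            and all(is_distribution(row) for row in B)
            and is_distribution(pi))


# Hypothetical two-coin model: N = 2 states, M = 2 symbols (Head, Tail)
A_COINS = [[0.5, 0.5], [0.5, 0.5]]
B_COINS = [[0.7, 0.3], [0.4, 0.6]]
PI_COINS = [0.5, 0.5]
ok = is_valid_hmm(A_COINS, B_COINS, PI_COINS)
```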
2.3.4. Types of HMMs 
HMMs can be classified by the structure of the transition matrix A of the
Markov chain. One special case is the fully connected HMM, or ergodic
model. In this kind of model every state can be reached in a single step from
every other state (Figure 2-5), where
$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$$
For some applications, especially in processing speech signals whose properties
change over time in a successive manner, a left-right HMM can model the
observed properties of the signals better than the standard ergodic model. In
such a left-right model, as time increases, the state index increases or stays the
same (Figure 2-6). The fundamental property of this model is that the
state-transition coefficients satisfy
$$a_{ij} = 0, \quad j < i$$
and the state sequence must begin in state 1 and end in state N:
$$\pi_i = \begin{cases} 0, & i \ne 1 \\ 1, & i = 1 \end{cases}$$
One additional constraint is placed on the state-transition coefficients of the
models used in this project:
$$a_{ij} = 0, \quad j > i + 1$$
That is, a state can only be reached from its previous state or from itself. The
state-transition matrix has the following form:
$$A = \begin{bmatrix} a_{11} & a_{12} & 0 \\ 0 & a_{22} & a_{23} \\ 0 & 0 & a_{33} \end{bmatrix}$$
Figure 2-5. A Fully Connected HMM 
Figure 2-6. A Left-Right HMM 
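The constrained left-right structure can be sketched as follows. The helper `left_right_transitions` and the probability values are hypothetical; the sketch additionally assumes that the last state is absorbing, which follows from the requirement that each row of A sums to one.

```python
def left_right_transitions(stay_probs):
    """Build the transition matrix of a left-right HMM with no state skips.

    stay_probs[i] is the self-transition probability a_ii of every state
    except the last; the remainder 1 - a_ii goes to the next state. The last
    state is absorbing, since its row must sum to 1.
    """
    n_states = len(stay_probs) + 1
    A = [[0.0] * n_states for _ in range(n_states)]
    for i, stay in enumerate(stay_probs):
        A[i][i] = stay             # stay in state i
        A[i][i + 1] = 1.0 - stay   # move on to state i + 1
    A[n_states - 1][n_states - 1] = 1.0
    return A


# Three-state example with made-up self-transition probabilities
A3 = left_right_transitions([0.5, 0.75])
```

Every entry with j < i or j > i + 1 is zero, which is the banded form shown in the matrix above.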
2.3.5. Continuous Observation Densities in HMMs
In many real-world applications the observations are continuous signals.
Although it is possible to convert such continuous signal representations into a
sequence of discrete symbols (that is, use an observable Markov model to model
these continuous signals), it is advantageous to use HMMs with
continuous observation densities to model continuous signal representations
directly.
To use a continuous observation density, the model probability density function
(pdf) takes the following finite mixture form:
$$b_j(o) = \sum_{k=1}^{M} c_{jk}\, N(o, \mu_{jk}, U_{jk}), \quad 1 \le j \le N$$
where o is the observation vector being modeled, $c_{jk}$ is the mixture coefficient
for the kth mixture in state j, and $N(\cdot)$ is any log-concave or elliptically symmetric
density (e.g., Gaussian, which is used in this project) with mean vector $\mu_{jk}$ and
covariance matrix $U_{jk}$ for the kth mixture component in state j. The mixture
coefficients $c_{jk}$ satisfy the following constraints:
$$\sum_{k=1}^{M} c_{jk} = 1, \quad 1 \le j \le N$$
$$c_{jk} \ge 0, \quad 1 \le j \le N,\ 1 \le k \le M$$
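The mixture form can be evaluated with a short Python sketch. It assumes Gaussian components with diagonal covariance matrices (each covariance is passed as a list of variances); the function names and the numbers in the example are hypothetical.

```python
import math


def gaussian_diag(o, mean, var):
    """Gaussian density with a diagonal covariance matrix (an assumption)."""
    n = len(o)
    log_det = sum(math.log(v) for v in var)
    quad = sum((x - m) ** 2 / v for x, m, v in zip(o, mean, var))
    return math.exp(-0.5 * (n * math.log(2.0 * math.pi) + log_det + quad))


def mixture_density(o, weights, means, variances):
    """b_j(o): weighted sum of the component densities of one state."""
    return sum(c * gaussian_diag(o, m, v)
               for c, m, v in zip(weights, means, variances))


# One-dimensional toy data with two clusters around -2 and +2
observation = [2.0]
single = mixture_density(observation, [1.0], [[0.0]], [[5.0]])
double = mixture_density(observation, [0.5, 0.5], [[-2.0], [2.0]], [[1.0], [1.0]])
```

In this toy case the two-component mixture assigns a higher density to an observation near one cluster than a single broad Gaussian covering both clusters does.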
Assume we have speech from both boys and girls (Figure 2-7(a)) for the same
word; each vector in the figure is one speech datum from a boy or a girl. Using
these data we train a single-mixture model and a double-mixture model for
this particular word separately, and then use these two models to recognize an
input speech, which is one of the training data from the boys. Here s1 and s2 are
the scores to be compared in the decision logic part, and a larger score means the
input data is more likely to be the word which the model stands for.
With a two-mixture model the speech recognizer gives
better recognition accuracy than with a single-mixture model
(Figure 2-7 (b)(c)). In this project a double-mixture HMM is used to
model the speech signals.
[Figure 2-7 diagram: (a) speech data vectors from boys and girls; (b)(c) the input speech shown against the speech data]
Figure 2-7. (a) Training Speeches From Boys and Girls (b)(c) Recognition Result 
2.3.6. Three Basic Problems for HMMs
Given an HMM in the form of $\lambda = (A, B, \pi)$, there are three problems faced when
this model is used in real-world applications:
1. How to efficiently compute $P(O \mid \lambda)$, the probability of the observation
sequence $O = (o_1 o_2 \ldots o_T)$, where $o_t$ is the input feature vector at time t in
speech recognition systems;
2. How to choose a corresponding state sequence $q = (q_1 q_2 \ldots q_T)$ that best explains
the observation sequence $O = (o_1 o_2 \ldots o_T)$;
3. How to adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(O \mid \lambda)$.
To design an isolated-word speech recognizer, first we have to design a separate
N-state HMM for each word of a V-word vocabulary, where V is the number of words
in the vocabulary. This task is done by using the solution to the third problem.
Then, by using the solution to the second problem, we can segment each of the
word training sequences into states and refine the model to
improve its capability of modeling the spoken word sequences. Finally,
recognition of an unknown word is performed using the solution to the first
problem to score each word model based on the given test observation sequence,
and to select the word whose model score is highest.
2.4. Probability Evaluation
To do speech recognition, we wish to calculate the probability of the
observation sequence $O = (o_1 o_2 \ldots o_T)$, given the model $\lambda$, and then
compare the probabilities obtained from this calculation to make the recognition
decision. The most straightforward way of doing this is to enumerate every
possible state sequence of length T, compute the probability of the observation
sequence O given a fixed state sequence $q = (q_1 q_2 q_3 \ldots q_T)$, and then sum
these probabilities over all possible state sequences q. That is,
$$P(O \mid \lambda) = \sum_{q_1, q_2, \ldots, q_T} \pi_{q_1} b_{q_1}(o_1)\, a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T) \qquad (2.1)$$
The direct calculation of the above equation involves on the order of $2T \cdot N^T$
calculations (N is the number of states in the model), since there are $N^T$
possible state sequences (at every t = 1, 2, 3, ..., T, there are N possible states
that can be reached) and for each such state sequence about 2T calculations are
required for each term in the sum of equation (2.1). This is computationally
infeasible even for a small value of N when the speech input is composed of
100 frames (T = 100).
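To get a feeling for how quickly $2T \cdot N^T$ grows, the count can be evaluated exactly with Python's arbitrary-precision integers. The function name and the choice N = 3 are illustrative assumptions, not values taken from the text.

```python
def direct_evaluation_cost(n_states, n_frames):
    """Approximate operation count 2T * N^T of evaluating equation (2.1) directly."""
    return 2 * n_frames * n_states ** n_frames


# Illustrative values only: N = 3 states, T = 100 frames
cost = direct_evaluation_cost(3, 100)
magnitude = len(str(cost)) - 1   # power of ten of the leading digit
```

For these illustrative values the count exceeds $10^{50}$, which makes the need for a more efficient procedure plain.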
An alternative to equation (2.1) is to approximate the probability by
considering only the most likely state sequence, that is,
$$P(O \mid \lambda) \approx \max_{q_1, q_2, \ldots, q_T} \left( \pi_{q_1} b_{q_1}(o_1)\, a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T) \right) \qquad (2.2)$$
The above equation also requires numerous computations when calculated
directly. Fortunately, there is a simple recursive procedure which
allows the equation to be calculated very efficiently. It is called the Viterbi
algorithm [6].
2.4.1. The Viterbi Algorithm
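The Viterbi algorithm is used to find the single best state sequence. It is useful
in both model training and speech recognition. To find the single best state
sequence $q = (q_1 q_2 \ldots q_T)$ for the given observation sequence
$O = (o_1 o_2 \ldots o_T)$, we need to define the highest probability along a single path
at time t:
$$\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1 q_2 \ldots q_{t-1},\ q_t = i,\ o_1 o_2 \ldots o_t \mid \lambda)$$
$\delta_t(i)$ accounts for the first t observations and ends in state i. The complete
procedure of the Viterbi algorithm can be stated as follows:
1. Initialization
$$\delta_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N$$
2. Recursion
$$\delta_t(j) = \max_{1 \le i \le N} \left[ \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t), \quad 2 \le t \le T,\ 1 \le j \le N$$
3. Termination
$$P^* = \max_{1 \le i \le N} \delta_T(i)$$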
The recursion step is the heart of the Viterbi algorithm. It should be clear
from the above procedure that a lattice (or trellis) structure efficiently
implements the computation of the Viterbi algorithm.
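The following Python sketch shows one way the lattice computation can be organized, and checks it against a brute-force search over all state sequences as in equation (2.2). It is an illustration in the probability domain only; the function names, the two-state model and all numbers are hypothetical, and `b[t][i]` is assumed to hold the already evaluated value of $b_i(o_t)$ for frame t (indexed from 0).

```python
from itertools import product


def viterbi_score(pi, A, b):
    """Probability of the single best state sequence, computed on the lattice."""
    n_states = len(pi)
    # Initialization: first frame
    delta = [pi[i] * b[0][i] for i in range(n_states)]
    # Recursion: keep only the best incoming path for every state
    for t in range(1, len(b)):
        delta = [
            max(delta[i] * A[i][j] for i in range(n_states)) * b[t][j]
            for j in range(n_states)
        ]
    # Termination: best score over the final states
    return max(delta)


def brute_force_score(pi, A, b):
    """Same quantity obtained by enumerating every state sequence."""
    n_states, n_frames = len(pi), len(b)
    best = 0.0
    for q in product(range(n_states), repeat=n_frames):
        p = pi[q[0]] * b[0][q[0]]
        for t in range(1, n_frames):
            p *= A[q[t - 1]][q[t]] * b[t][q[t]]
        best = max(best, p)
    return best


PI = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B_OBS = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]   # three frames, two states
lattice = viterbi_score(PI, A, B_OBS)
exhaustive = brute_force_score(PI, A, B_OBS)
```

Both functions return the same value, but the lattice version visits each (frame, state) point once instead of enumerating all $N^T$ paths.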
[Figure 2-8 diagram: (a) lattice of states against frames 1 to 4; (b) lattice of states a to e against frames 1 to 9, with the selected path marked]
Figure 2-8. (a) (b) Searching in the Lattice Structure 
In Figure 2-8(a), the initialization sets state a as the starting point. That is, the
first frame corresponds to state a. The recursion step determines the maximum
probability of a transition path. As illustrated in Figure 2-8(a), each state has
two possible incoming paths: one from its preceding state and the other from itself.
The algorithm calculates the probabilities of these two paths and keeps only the
path with the higher probability. In this example, point b3 has two possible
paths, from point a2 and from point b2 respectively. Assume the path a2→b3 has a
higher probability than the path b2→b3. Then, after calculating and comparing
the probabilities of these two paths, the algorithm replaces the temporary
probability by the larger one. The search continues until it reaches the
very last state. The termination step determines the final probability of the
search, as illustrated in Figure 2-8(b).
2.4.2. Alternative Viterbi Implementation 
The Viterbi algorithm in the preceding section needs multiplications, which are
not well suited to hardware implementation. By taking logarithms of the
model parameters, the algorithm can be implemented without any multiplication.
The main procedures of the modified Viterbi algorithm then become:
1. Preprocessing
$$\tilde{\pi}_i = \ln(\pi_i), \quad 1 \le i \le N$$




$2 \le t \le T$





The calculation required for this alternative implementation is on the order of
$N^2 T$ additions. Because the preprocessing can be performed once and saved, its
cost is negligible for most systems.
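A log-domain version can be sketched in Python as follows. This is an illustrative sketch rather than the thesis procedure: zero probabilities are mapped to negative infinity (an assumption made here so that impossible paths never win the comparison), and the names and the two-state left-right example are hypothetical.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """ln(p), with ln(0) represented as -infinity (an assumption of this sketch)."""
    return math.log(p) if p > 0.0 else NEG_INF


def log_viterbi_score(log_pi, log_A, log_b):
    """Log of the best-path probability using only additions and comparisons."""
    n_states = len(log_pi)
    delta = [log_pi[i] + log_b[0][i] for i in range(n_states)]
    for t in range(1, len(log_b)):
        delta = [
            max(delta[i] + log_A[i][j] for i in range(n_states)) + log_b[t][j]
            for j in range(n_states)
        ]
    return max(delta)


# Preprocessing is done once: take logarithms of all model parameters
PI = [1.0, 0.0]
A = [[0.5, 0.5], [0.0, 1.0]]
B_OBS = [[0.5, 0.1], [0.2, 0.4]]   # b_i(o_t) for two frames
log_pi = [safe_log(p) for p in PI]
log_A = [[safe_log(p) for p in row] for row in A]
log_b = [[safe_log(p) for p in row] for row in B_OBS]
score = log_viterbi_score(log_pi, log_A, log_b)
```

In this example the best path moves from state 1 to state 2, so the score equals ln(0.5 · 0.5 · 0.5 · 0.4) = ln(0.1).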
Chapter 3 HMM-Based Isolated Word Recognizer Design Methodology
To build a speaker-independent isolated word recognizer, assume we have a
vocabulary of V words and each word is modeled by a distinct HMM. To do
isolated word recognition, we must perform the following:
1. For each word v in the vocabulary, we need to build an HMM $\lambda_v$. For each
word in the vocabulary there is a training set of K utterances. With these
utterances, which appropriately represent the characteristics of the word, we
can estimate the model parameters $(A, B, \pi)$ that optimize the likelihood
of the training set observation vectors for the vth word.
2. For each unknown word to be recognized, the processing is shown in
Figure 3-1. First, MFCC feature analysis converts the speech signal
into an observation sequence O; then the model likelihoods are calculated for all
possible models; finally, the system selects the word whose model likelihood is
highest as the result.
This project focuses on the probability computation and decision-making blocks.
The probability computation step is generally performed using the Viterbi
algorithm.
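The final decision step amounts to picking the word whose model gives the highest score. A minimal sketch, assuming the per-word scores (here log-likelihoods) have already been computed; the function name, the words and the score values are made up for illustration.

```python
def recognize(word_scores):
    """Return the word whose model score is highest.

    word_scores maps each vocabulary word to the score of its HMM for the
    current observation sequence.
    """
    return max(word_scores, key=word_scores.get)


# Hypothetical log-likelihoods for a three-word vocabulary
scores = {"one": -310.2, "two": -295.7, "three": -402.1}
best_word = recognize(scores)
```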
[Figure 3-1 diagram: the speech observation O is scored against the HMM for each word, up to word V (probability computation); a select-maximum block outputs the index of the recognized word]
Figure 3-1. HMM Based Recognition Process 
An 8-state left-right HMM is trained in advance for each word in the vocabulary.
The state-transition matrix for this model is
$$A = \begin{bmatrix}
a_{11} & a_{12} & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & a_{22} & a_{23} & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & a_{33} & a_{34} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & a_{44} & a_{45} & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & a_{55} & a_{56} & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & a_{66} & a_{67} & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & a_{77} & a_{78} \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}$$
A state in this model can only be reached from its previous state or from itself. The
state sequence ends at state 8, and the initial state distribution is
$\pi = \{1, 0, 0, 0, 0, 0, 0, 0\}$. This type of HMM can properly model speech signals
whose properties change over time in a continuous manner.
[Figure 3-2 diagram: lattice of state index (vertical axis) against frame number 1 to 100 (horizontal axis)]
Figure 3-2. The Lattice Structure of the Speech Recognizer 
Figure 3-2 shows the lattice structure of the searching engine of this speech
recognizer. The horizontal coordinate is the frame number of the input speech
which is to be recognized. The vertical coordinate is the state index of the model of
a particular word in the vocabulary. Each point in this lattice structure
corresponds to $b_i(o_t)$, that is, the probability of the input feature vector $o_t$ given
the state i. Assume that at time t the feature vector $o_t$ corresponds to state i. As time
changes from t to t+1 and the input feature vector changes from $o_t$ to $o_{t+1}$, the
next state is i or i+1. In the figure, the lines from one point to another
represent these possible state transitions. After comparison, the searching
engine keeps the path which has the higher probability. The probability
calculation ends only when the searching engine reaches both the last input
feature vector and the last state of the model.
For a speech input whose length is T frames, the probability of the most
probable state sequence that ends at state 8 is $\delta_T(8)$, where
$$\delta_t(i) = \begin{cases}
\delta_{t-1}(1)\, a_{11}\, b_1(o_t), & i = 1,\ 2 \le t \le T \\
\max\{\delta_{t-1}(i)\, a_{ii},\ \delta_{t-1}(i-1)\, a_{i-1,i}\}\, b_i(o_t), & 2 \le i \le 8,\ 2 \le t \le T \\
b_1(o_1), & i = 1,\ t = 1 \\
0, & \text{otherwise}
\end{cases} \qquad (3.1)$$
It is to be discussed in the following sections.
3.1. Speech Recognition Based On Single Mixture
When only one probability density is used within each state of the
word model, the model is trained as a single-mixture HMM. Given a speech
observation $O = (o_1 o_2 o_3 \ldots o_T)$, the probability density function of each point in
the lattice structure of this system is
$$b_i(o_t) = N(o_t, \mu_i, U_i), \quad 1 \le i \le N$$
where $o_t$ is the input feature vector at time t, with a dimensionality of n (in this
project n = 26), and $N(\cdot)$ is a multivariate Gaussian with mean vector $\mu_i$ and covariance
matrix $U_i$ in state i:
$$N(o_t, \mu_i, U_i) = \frac{1}{\sqrt{(2\pi)^n |U_i|}}\, e^{-\frac{1}{2}(o_t - \mu_i)' U_i^{-1} (o_t - \mu_i)} \qquad (3.2)$$
Hence equation (3.1) can be written as
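$$\delta_t(i) = \begin{cases}
\delta_{t-1}(1)\, a_{11}\, \dfrac{1}{\sqrt{(2\pi)^n |U_1|}}\, e^{-\frac{1}{2}(o_t - \mu_1)' U_1^{-1} (o_t - \mu_1)}, & i = 1,\ 2 \le t \le T \\
\max\{\delta_{t-1}(i)\, a_{ii},\ \delta_{t-1}(i-1)\, a_{i-1,i}\}\, \dfrac{1}{\sqrt{(2\pi)^n |U_i|}}\, e^{-\frac{1}{2}(o_t - \mu_i)' U_i^{-1} (o_t - \mu_i)}, & 2 \le i \le 8,\ 2 \le t \le T \\
\dfrac{1}{\sqrt{(2\pi)^n |U_1|}}\, e^{-\frac{1}{2}(o_1 - \mu_1)' U_1^{-1} (o_1 - \mu_1)}, & i = 1,\ t = 1 \\
0, & \text{otherwise}
\end{cases}$$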
As stated in Chapter 2.4.2, to reduce hardware complexity and prevent 
underflow, the probability is often implemented in logarithmic domain. Take 





For a given state i in the HMM, $|U_i|$ is constant. Thus $\ln(a_{ii})$, $\ln(a_{i-1,i})$ and
$\ln\left(\frac{1}{\sqrt{(2\pi)^n |U_i|}}\right)$ are constant, and $\ln \delta_{t-1}(i)$ and $\ln \delta_{t-1}(i-1)$ are scores obtained
from the previous step. So the core step of calculating $\ln \delta_t(i)$ is to compute the
third part of the above polynomial.
At time t the input feature vector is denoted as $o_t = [x_{t1}\ x_{t2}\ x_{t3} \ldots x_{t26}]$, and $x_{tj}$ is
one of the 26 coefficients as stated before. For the Gaussian of state i,
correspondingly there is a mean vector $\mu_i = [\bar{x}_{i1}\ \bar{x}_{i2}\ \bar{x}_{i3} \ldots \bar{x}_{i26}]$ and a covariance
matrix $U_i = \mathrm{diag}[u_{i1}\ u_{i2}\ u_{i3} \ldots u_{i26}]$, both composed of 26 elements too. Then the
third term of equation (3.2) would be
$$-\frac{1}{2}(o_t - \mu_i)' U_i^{-1} (o_t - \mu_i)
= -\frac{1}{2}
\begin{bmatrix} x_{t1} - \bar{x}_{i1} & \cdots & x_{t26} - \bar{x}_{i26} \end{bmatrix}
\begin{bmatrix} \frac{1}{u_{i1}} & & \\ & \ddots & \\ & & \frac{1}{u_{i26}} \end{bmatrix}
\begin{bmatrix} x_{t1} - \bar{x}_{i1} \\ \vdots \\ x_{t26} - \bar{x}_{i26} \end{bmatrix}
= -\sum_{j=1}^{26} \frac{(x_{tj} - \bar{x}_{ij})^2}{2 u_{ij}}$$
As $u_{ij}$ is constant for a given state i, the factor $\frac{1}{2 u_{ij}}$ can be calculated in
advance and viewed as a constant too. Thus the third term of the polynomial can
be implemented simply with some multipliers and adders. So by using a
single-mixture model, the probability of an input speech observation given the
most probable state sequence can be implemented in hardware with just some
multipliers, adders and comparators, and there are standard
implementations for all of these blocks in digital circuit design.
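The multiply-and-add structure of this computation can be sketched in Python. The sketch assumes a diagonal covariance matrix given as a list of variances, precomputes the per-state constants once, and then evaluates the log of the Gaussian with one subtraction, two multiplications and one accumulation per coefficient. The function names and the two-dimensional example are hypothetical; the design described here uses 26 coefficients.

```python
import math


def precompute_state(var):
    """Per-state constants: the log normalization term and the 1/(2*u_ij) factors."""
    n = len(var)
    log_norm = -0.5 * (n * math.log(2.0 * math.pi) + sum(math.log(u) for u in var))
    scale = [1.0 / (2.0 * u) for u in var]
    return log_norm, scale


def log_gaussian(o, mean, log_norm, scale):
    """ln N(o, mean, U) using only subtractions, multiplications and additions."""
    acc = log_norm
    for x, m, s in zip(o, mean, scale):
        d = x - m          # subtracter
        acc -= d * d * s   # multipliers and accumulator
    return acc


# Two-dimensional toy example
MEAN = [0.0, 0.0]
VAR = [1.0, 4.0]
LOG_NORM, SCALE = precompute_state(VAR)
value = log_gaussian([1.0, 2.0], MEAN, LOG_NORM, SCALE)
```

Here the quadratic term is 1/2 + 4/8 = 1, so the result is the precomputed log constant minus 1.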
3.2. Speech Recognition Based On Double Mixtures
As discussed in Chapter 2.3.5, the speech observations are continuous signals
and it is advantageous to use HMMs with continuous observation densities to
model the continuous characteristics directly. Some restrictions must therefore be
placed on the form of the model probability density function (pdf) to ensure that
the parameters of the pdf can be re-estimated in a consistent way when training
the model. For a double-mixture HMM isolated word recognizer, all the
searching algorithms are the same as the single-mixture recognizer's except that
a double-mixture Gaussian is used in the formula of the pdf.
For the t-th frame, the corresponding pdf is a two-component mixture.
Replacing $N(\cdot)$ by the representation in equation (3.2), the pdf becomes
$$b_i(o_t) = c_{i1}\, N(o_t, \mu_{i1}, U_{i1}) + c_{i2}\, N(o_t, \mu_{i2}, U_{i2})$$
$$= c_{i1} \frac{1}{\sqrt{(2\pi)^n |U_{i1}|}}\, e^{-\frac{1}{2}(o_t - \mu_{i1})' U_{i1}^{-1} (o_t - \mu_{i1})} + c_{i2} \frac{1}{\sqrt{(2\pi)^n |U_{i2}|}}\, e^{-\frac{1}{2}(o_t - \mu_{i2})' U_{i2}^{-1} (o_t - \mu_{i2})} \qquad (3.4)$$
Similarly, taking the logarithm of the pdf to reduce the hardware complexity,
equation (3.4) becomes
$$\ln b_i(o_t) = \ln\left( c_{i1} \frac{1}{\sqrt{(2\pi)^n |U_{i1}|}}\, e^{-\frac{1}{2}(o_t - \mu_{i1})' U_{i1}^{-1} (o_t - \mu_{i1})} + c_{i2} \frac{1}{\sqrt{(2\pi)^n |U_{i2}|}}\, e^{-\frac{1}{2}(o_t - \mu_{i2})' U_{i2}^{-1} (o_t - \mu_{i2})} \right) \qquad (3.5)$$
Equation (3.5) requires an add-log operation ($\ln(\sum\exp)$). Normally the equation will be implemented by taking the larger factor out of the logarithm operation as follows:
$$\ln(C_1e^{X_1}+C_2e^{X_2}) = \max\{\ln C_1+X_1,\ \ln C_2+X_2\} + \ln\left(1+\frac{C_{\min}}{C_{\max}}e^{X_{\min}-X_{\max}}\right)$$
Here $C_{\max}$ and $X_{\max}$ are the factors in $\max\{\ln C_1+X_1,\ \ln C_2+X_2\}$, and $C_{\min}$ and $X_{\min}$ are the factors in $\min\{\ln C_1+X_1,\ \ln C_2+X_2\}$. There are two possible solutions to implement the function $\ln\left(1+\frac{C_{\min}}{C_{\max}}e^{X_{\min}-X_{\max}}\right)$.
1. Ignore the effect of $\ln\left(1+\frac{C_{\min}}{C_{\max}}e^{X_{\min}-X_{\max}}\right)$ [7]. The reason for replacing $\ln(C_1e^{X_1}+C_2e^{X_2})$ by $\max\{\ln C_1+X_1,\ \ln C_2+X_2\}$ is that $\frac{C_{\min}}{C_{\max}}e^{X_{\min}-X_{\max}} \le 1$, thus $\ln\left(1+\frac{C_{\min}}{C_{\max}}e^{X_{\min}-X_{\max}}\right) \le \ln 2 < 0.7$. If the word to be recognized is distinguishable, that is, one of the probabilities $P(O|\lambda)$ is much larger than the others, this method works well. But if, for an input observation, there are two words in the vocabulary whose models produce almost the same probabilities during calculation, then the ignored term $\ln\left(1+\frac{C_{\min}}{C_{\max}}e^{X_{\min}-X_{\max}}\right)$ will be decisive in recognition. Thus this implementation method will introduce large errors.
2. Use a polynomial $y = k(X_{\max}-X_{\min}) + m$ to approximate $\ln\left(1+\frac{C_{\min}}{C_{\max}}e^{X_{\min}-X_{\max}}\right)$. First, we have to find the interval in which the log-function cannot be approximated by 0. Then $y - \ln\left(1+\frac{C_{\min}}{C_{\max}}e^{X_{\min}-X_{\max}}\right)$ is evaluated at points within this interval, in steps of one LSB, and minimized in the LMS (Least Mean Squares) sense to find $k$ and $m$ [8]. This implementation is more accurate than the previous one, but it needs some special circuits. Moreover, when a higher-mixture model is employed in recognition, the add-log operation is more complex, so a higher-order polynomial is needed to approximate the add-log equation. This needs more calculation in choosing $k_1, k_2, \ldots, k_n$ and $m$. Therefore the hardware implementation will be more complex too.
These are the two main existing implementations of the add-log operation; each has its own advantages and disadvantages. In this project we introduce a new method to implement this add-log operation: a table look-up method. This approach simplifies the design while introducing an acceptable computation error.
Because $0 < \frac{C_{\min}}{C_{\max}}e^{X_{\min}-X_{\max}} \le 1$, we have $\ln 1 < \ln\left(1+\frac{C_{\min}}{C_{\max}}e^{X_{\min}-X_{\max}}\right) \le \ln 2$. For a double-mixture model, the value of the add-log factor is therefore kept between 0 and 0.7, and is determined by $\frac{C_{\min}}{C_{\max}}e^{X_{\min}-X_{\max}}$. So $\frac{C_{\min}}{C_{\max}}e^{X_{\min}-X_{\max}}$ can be set as the index of the look-up table while the contents of the table are numbers between 0 and 0.7. The size of the look-up table is determined by the system requirement, which will be discussed in the next chapter. This table look-up approach has substantially reduced the complexity of the design and improved speed and accuracy. Moreover, the table look-up approach can implement a higher-order multi-mixture Gaussian pdf architecture based on a single-mixture model [9][10].
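The following sketch compares the exact add-log operation with a table look-up version. It is only an illustration under stated assumptions: the table size of 64, the midpoint rule used to fill the table, and the function names are all hypothetical, and floating-point numbers stand in for the chip's fixed-point data.

```python
import math

def log_add_exact(a, b):
    """ln(e^a + e^b), computed by taking the larger term out of the logarithm."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))

def build_table(size):
    """Entry k covers ratios in [k/size, (k+1)/size) and stores ln(1 + r)
    at the midpoint of that range, so every entry lies between 0 and ln 2."""
    return [math.log1p((k + 0.5) / size) for k in range(size)]

def log_add_lut(a, b, table):
    """Approximate ln(e^a + e^b) as max(a, b) plus a table entry."""
    hi, lo = max(a, b), min(a, b)
    ratio = math.exp(lo - hi)                        # 0 < ratio <= 1
    k = min(int(ratio * len(table)), len(table) - 1)
    return hi + table[k]
```

Here `a` and `b` play the roles of $\ln C_1+X_1$ and $\ln C_2+X_2$. Dropping the correction entirely corresponds to the first existing method, whose error can approach $\ln 2$ when the two terms are close; the table keeps the error within the quantization step of its index.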
Chapter 4 VLSI Implementation of the Speech Recognizer
As discussed in Chapter 3, a speaker-independent speech recognizer with a two-mixture HMM can be implemented based on a single-mixture HMM speech recognizer, using the table look-up method. This implementation method can be extended to high-order-mixture HMM systems with few and very simple modifications. Thus this chapter starts from the design of a speech recognizer based on a single-mixture HMM, then presents the implementation of a two-mixture HMM based speech recognizer, and finally discusses how to apply the table look-up method to a high-order-mixture search engine.
4.1. The System Requirements 
The speech recognizer in this project is designed for isolated word recognition. The vocabulary is composed of up to 64 words, and we assume the number of feature vectors of each isolated word to be recognized is not more than 256, which is enough for practical applications. All the feature vectors are composed of 26 elements and the HMM used to model each word in the vocabulary is a left-right 8-state model.
All the feature vectors and the model parameters are pre-converted into 16-bit fixed-point binary data with an accuracy equivalent to two places after the decimal point. For example, a floating-point number 32.163 will be converted into a 16-bit fixed-point number equal to 3216 (110010010000). Because no truncation is considered during the computation in this system, after the multiplication the factor $(x_{tj}-\bar{x}_{ij})^{2}\left(-\frac{1}{2u_{ij}}\right)$ will be a 48-bit data and the addition involved in the summation $\sum_{j=1}^{26}(x_{tj}-\bar{x}_{ij})^{2}\left(-\frac{1}{2u_{ij}}\right)$ is a 48-bit operation. Thus the pre-calculated constants $\ln\frac{1}{\sqrt{(2\pi)^{26}|U_i|}}$ and $\ln a_{ij}$ need to be converted into 48-bit data before being added to $\sum_{j=1}^{26}(x_{tj}-\bar{x}_{ij})^{2}\left(-\frac{1}{2u_{ij}}\right)$. But as the model parameters are stored together with the constants and they are 16-bit data, every constant has to be separated into three 16-bit data and stored in three corresponding units in the external memory.
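The conversion and the three-word storage scheme can be sketched as below. The scale factor of 100 follows from the "two places after the decimal point" rule in the text; the word order (most significant word first), the two's complement handling and the helper names are assumptions made only for this illustration.

```python
SCALE = 100                 # two places after the decimal point
MASK48 = (1 << 48) - 1

def to_fixed(value):
    """Convert a decimal number to the integer used as fixed-point data."""
    return int(round(value * SCALE))

def split48(value):
    """Split a signed 48-bit value into three 16-bit words (assumed order:
    most significant word first)."""
    u = value & MASK48      # two's complement wrap for negative values
    return [(u >> 32) & 0xFFFF, (u >> 16) & 0xFFFF, u & 0xFFFF]

def join48(words):
    """Recombine three 16-bit words into a signed 48-bit value."""
    u = (words[0] << 32) | (words[1] << 16) | words[2]
    return u - (1 << 48) if u & (1 << 47) else u
```

For the number quoted in the text, `to_fixed(32.163)` gives 3216, whose binary form is 110010010000.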
4.2. Implementation of a Speech 
Recognizer with a Single-Mixture 
HMM 
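The speech recognizer based on a single-mixture HMM mainly realizes the following algorithm:
$$P_{final} = \max_{i}\,[\delta_T(i)]$$
$$\delta_t(i) = \max\{\delta_{t-1}(i)+\ln a_{ii},\ \delta_{t-1}(i-1)+\ln a_{(i-1)i}\} + \ln\frac{1}{\sqrt{(2\pi)^{26}|U_i|}} + \sum_{j=1}^{26}(x_{tj}-\bar{x}_{ij})^{2}\left(-\frac{1}{2u_{ij}}\right),\qquad 2\le t\le T,\ 2\le i\le 8$$
$$\delta_1(i) = \begin{cases}\ln\frac{1}{\sqrt{(2\pi)^{26}|U_i|}} + \sum_{j=1}^{26}(x_{1j}-\bar{x}_{ij})^{2}\left(-\frac{1}{2u_{ij}}\right), & i = 1\\ 0, & i \ne 1\end{cases}$$
Here $T$ is the total number of the input feature vectors and the Viterbi algorithm is implemented in the logarithmic domain.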
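The recursion can be sketched in software as a log-domain Viterbi pass over a left-right model. This is a minimal illustration, not the chip's control flow: the function name and argument layout are hypothetical, and states that cannot yet be reached are initialised to minus infinity, which is a common software convention and an assumption of this sketch rather than something taken from the design.

```python
import math

NEG_INF = float("-inf")

def viterbi_left_right(emission, log_self, log_next):
    """Log-domain Viterbi scores for a left-right HMM.

    emission[t][i] : log pdf of frame t in state i
    log_self[i]    : ln a_ii
    log_next[i]    : ln a_i(i+1) (unused for the last state)
    Returns the list of scores of all states at the last frame.
    """
    n = len(log_self)
    score = [NEG_INF] * n
    score[0] = emission[0][0]            # the search starts in the first state
    for t in range(1, len(emission)):
        prev = score
        score = [NEG_INF] * n
        for i in range(n):
            stay = prev[i] + log_self[i]                      # path from the same state
            move = prev[i - 1] + log_next[i - 1] if i > 0 else NEG_INF
            score[i] = max(stay, move) + emission[t][i]       # keep the larger path
    return score
```

Only additions and comparisons are needed once the emission terms are available, which is why the adder and the score registers carry most of the search.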
Figure 4-1 shows the structure of a speech recognizer which employs a single-mixture HMM when training the vocabulary models.
[Figure 4-1: block diagram. Labels: Feature Vector, Model Parameter, Constant, Start, Address, Word_Index; blocks: Subtracter, Multiplier, Core_Adder (connected in that order), Register for Constants, Register, Final Score Register, Controller.]
Figure 4-1. The Structure of a Single-Mixture HMM Based Speech Recognizer
This speech recognizer is composed of seven blocks, in which Controller is the 
master to control other blocks to realize the logarithmic Viterbi algorithm. These 
blocks are listed as below. 
1. Registers. There are three registers in this recognizer.
1). Register for Constants. It is used to store the constants $\ln\frac{1}{\sqrt{(2\pi)^{26}|U_i|}}$ and $\ln a_{ii}$ or $\ln a_{i(i+1)}-\ln a_{ii}$ (the reason to store $\ln a_{i(i+1)}-\ln a_{ii}$ instead of $\ln a_{i(i+1)}$ will be explained later). These two constants are 48-bit data but stored as three 16-bit data in the external memory, so the three 16-bit data have to be combined into one 48-bit data when the system fetches them. This register is therefore composed of two 48-bit storage elements, each of which stores one of the two constants while the system calculates the score of one point in the lattice structure; their contents are replaced by a new set of constants when the system comes to calculate the next point.
2). Register for Scores. This register is used to store the scores of the points in the lattice structure. Together with Core_Adder, these two blocks complete most of the searching work in the Viterbi algorithm. The scores stored in the register are results from the adder, which are 48-bit data, and they will also be used by Core_Adder as one of the operands of the addition. In the searching process, every point can be reached by two paths, one from the previous state and the other from the same state, as illustrated in Figure 4-2. The scores of these two paths have to be compared and the larger one is the right score of this point. Then, as shown in Figure 4-3, we need eight 48-bit storage elements to store the temporary scores of each point, which are the computation results of the path from the same state, and one more 48-bit element, called "Register Temp", to store the other scores resulting from the path through the previous state. The score in "Register Temp" will be compared with the score in one corresponding storage element at a proper time and then the larger one will be stored in this storage element as the final score up to this point. The comparison and replacement steps are shown in Figure 4-3, as illustrated by 1, 2, 3.... When the searching process ends, the final score held in the register is passed to Final Score Register. This register for scores is composed of nine 48-bit storage elements with some embedded comparison logic.
[Figure 4-2: lattice fragment showing State i and State i+1 at Frame t and Frame t+1, with the paths that reach a point from the same state and from the previous state.]
Figure 4-2. Part of the Lattice Structure
[Figure 4-3: lattice for States 1 to 8 between Frame t and Frame t+1, with Registers 1 to 8 and Register Temp; numbered labels (1. Register 1, 2. Temp, 3. Register 2, 4. compare, ...) mark the order of the store and compare steps.]
Figure 4-3. Searching Process in the Lattice Structure
3). Final Score Register. This 48-bit register is used to store the largest final probability of the input observation. Given one word model in the vocabulary, once the search process finishes, the final probability held in the register for scores is passed to the final score register. This register then compares the new probability with the one that is already stored inside it (the previous final score). If the new one is larger than the old one, the new score replaces the old one and the register also sets the recognized word index to the index of the word whose model has the larger probability. This word index is the system output. If the new score is smaller than the old one, the new score is discarded.
2. Subtracter. This block is used to implement $x_{tj}-\bar{x}_{ij}$. For a given input $x_{tj}$, which is the $j$-th element of the input observation $o_t$ at time $t$, Subtracter finds the corresponding $\bar{x}_{ij}$, which is the $j$-th component of the Gaussian's mean vector $\mu_i$ of the $i$-th state, and calculates the difference between them. Because $\bar{x}_{ij}$ is constant for a given word model in the vocabulary, we pre-store $-\bar{x}_{ij}$ instead of $\bar{x}_{ij}$ in the external memory to avoid the on-chip subtraction. Thus $x_{tj}-\bar{x}_{ij}$ becomes $x_{tj}+(-\bar{x}_{ij})$, which actually is an addition operation. Here we use a 16-bit carry-look-ahead adder [11] to implement this addition, for both the inputs and the output are 16-bit data. This adder has a very small propagation delay. The algorithm is illustrated below.
$$P = A \oplus B,\qquad G = A \,\&\, B,$$
$$C_0 = 0,\qquad C_i = G_{i-1} \mid (P_{i-1} \,\&\, C_{i-1}),\ i>0,$$
$$S = A \oplus B \oplus C$$
Here $A$ and $B$ are the two operands of the addition, $P$ and $G$ are the propagate and generate terms, and $C$ and $S$ are the carry and the sum. All operators involved here are bit operators. The result of Subtracter is the input to the multiplier that follows it. Subtracter also acts as a passageway: when the system fetches the $j$-th covariance $u_{ij}$ in the $i$-th state Gaussian's covariance matrix $U_i$, Subtracter does nothing except passing it to Multiplier.
3. Multiplier. It is used to implement $(x_{tj}-\bar{x}_{ij})^{2}\left(-\frac{1}{2u_{ij}}\right)$. Although there are a square operation and a multiplication in this factor, we use one 48-bit modified Booth multiplier to realize both the square and the multiplication to save chip area [12]. The truth table of the modified Booth encoder is shown in Table 1. The multiplicand to be encoded is a 16-bit data in both multiplications. Before encoding, '0' is appended to the right of the multiplicand. By doing so, the number of partial products of the multiplication is reduced to half of the original one and these partial products can be easily calculated by means of bit shifting or negation of the other multiplicand. Also, before these partial products are summed up by the full adders, a sign bit must be extended in the partial products, and a constant must be added to the sum of the partial products, starting from position $n$, where $n$ is the bit number of the other multiplicand. The extended sign bit is '1' for a positive partial product and '0' for a negative one. The form of the constant is (1010101...010111), where there are $(m/2)-1$ zeros and $m$ is the bit number of the multiplicand. In this project $m=16$. Figure 4-4 gives a simple example of the modified Booth multiplication.
$$X = \sum_{i=0}^{m/2-1} d_i\,2^{2i},\qquad X = x_{m-1}, x_{m-2}, \ldots, x_0\ \text{(two's complement)}$$
x_{2i+1}   x_{2i}   x_{2i-1}   d_i
   0         0         0        0
   0         0         1        1
   0         1         0        1
   0         1         1        2
   1         0         0       -2
   1         0         1       -1
   1         1         0       -1
   1         1         1        0
Table 1. The Truth Table of the Modified Booth Encoder
[Figure 4-4: worked example with X = 001010 = 10 (m = 6) and Y = 111101 = -3 (n = 6). A zero is appended to X before encoding; the encoded digits select partial products that are multiples of Y (such as -2Y), each with an extended bit; these are summed together with the constant, starting at position 6, to give 111111100010 = -30.]
Figure 4-4. An Example of the Modified Booth Multiplication
This modified Booth multiplier first calculates $(x_{tj}-\bar{x}_{ij})^{2}$, in which $x_{tj}-\bar{x}_{ij}$ is the computation result of Subtracter. Then Multiplier fetches the result of $(x_{tj}-\bar{x}_{ij})^{2}$ back and multiplies it with $-\frac{1}{2u_{ij}}$, which is also an output of Subtracter, and produces the factor $(x_{tj}-\bar{x}_{ij})^{2}\left(-\frac{1}{2u_{ij}}\right)$. Here we can see the advantage of using Subtracter as a passageway when fetching the factor $-\frac{1}{2u_{ij}}$. With this method, the inputs of Multiplier are the output of Subtracter and its own computation result instead of the system's input, so it is much easier to trace the data being processed and to control how the system fetches input data. The final output of Multiplier, $(x_{tj}-\bar{x}_{ij})^{2}\left(-\frac{1}{2u_{ij}}\right)$, is one of the Core_Adder inputs.
4. Core_Adder. It is one of the main blocks in the speech recognizer. We use a 48-bit carry-look-ahead adder here, for there is no truncation employed in the system. This adder sums up the outputs of Multiplier, $(x_{tj}-\bar{x}_{ij})^{2}\left(-\frac{1}{2u_{ij}}\right)$, and then adds the constant $\ln\frac{1}{\sqrt{(2\pi)^{26}|U_i|}}$ and the previous score of the $(t-1)$-th frame of this state, $\max\{\delta_{t-1}(i)+\ln a_{ii},\ \delta_{t-1}(i-1)+\ln a_{(i-1)i}\}$, which is stored in "Register for Scores", to the result of the above summation. After that, as illustrated in Figure 4-3, Core_Adder adds the transition coefficient $\ln a_{ii}$ or $\ln a_{i(i+1)}$ to the result of the above operation and these two addition results are stored in "Register for Scores" according to the judgment method stated above. Here, after adding $\ln a_{ii}$ and getting the first addition result, we add the pre-calculated $\ln a_{i(i+1)}-\ln a_{ii}$ to that result, instead of adding $\ln a_{i(i+1)}$ to the original value. This avoids keeping the original value in Core_Adder, so fewer control signals are needed.
5. Controller. It is the heart of the system. Under the control of this block, the system is able to read feature vectors and model parameters from the external memories. Inside it, several counters count the system clock cycles and control the operations of the other blocks. The frame number of the input observations and the word number of the vocabulary are also stored in Controller.
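The two arithmetic schemes named in the block descriptions above, the propagate/generate carry recurrence of the adder and the modified Booth encoding of Table 1, can be checked with a short software sketch. It is an illustration only: the function names are hypothetical, the carry recurrence is evaluated sequentially here (hardware look-ahead logic unrolls it), and the sign-extension constant of the hardware multiplier is not modelled.

```python
# Modified Booth digit for each bit triple (x_{2i+1}, x_{2i}, x_{2i-1}).
BOOTH_TABLE = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}

def cla_add(a, b, width=16):
    """Add two width-bit words using P = A xor B, G = A and B and
    C_{i+1} = G_i | (P_i & C_i); the result wraps modulo 2**width."""
    p, g = a ^ b, a & b
    carry, carries = 0, 0                 # C_0 = 0
    for i in range(width):
        carries |= carry << i             # record C_i
        carry = ((g >> i) & 1) | (((p >> i) & 1) & carry)
    return (p ^ carries) & ((1 << width) - 1)   # S = A xor B xor C

def booth_digits(x, m):
    """Digits d_i of the m-bit two's complement number x (m even)."""
    bits = (x & ((1 << m) - 1)) << 1      # append '0' on the right
    return [BOOTH_TABLE[(bits >> (2 * i)) & 0b111] for i in range(m // 2)]

def booth_multiply(x, y, m):
    """Sum of the m/2 partial products d_i * y * 4**i."""
    return sum(d * y * 4 ** i for i, d in enumerate(booth_digits(x, m)))
```

With the operands of Figure 4-4, `booth_digits(10, 6)` yields the digits -2, -1 and 1 (least significant first), and `booth_multiply(10, -3, 6)` reproduces the product -30 using three partial products instead of six.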
The speech recognizer works as follows: 
The feature vectors of the word to be recognized and the models of the system vocabulary are pre-processed and pre-stored in two external memories respectively, in the form of 16-bit data. Before the system starts to work, the word number of the vocabulary is read into the system and stored in a register. The system then waits for a start signal; once it is detected, the system reads the frame number of the observation into a register and begins calculation. First $x_{tj}$ and $-\bar{x}_{ij}$ are fetched into the system and Subtracter calculates the difference between $x_{tj}$ and $\bar{x}_{ij}$, and then passes the result and $-\frac{1}{2u_{ij}}$ to Multiplier. Multiplier computes the factor $(x_{t1}-\bar{x}_{i1})^{2}\left(-\frac{1}{2u_{i1}}\right)$ and this result together with the other $(x_{tj}-\bar{x}_{ij})^{2}\left(-\frac{1}{2u_{ij}}\right)$ are summed up in Core_Adder. Here $2 \le j \le 26$, where $j$ is the index of the 26 components. After $\sum_{j=1}^{26}(x_{tj}-\bar{x}_{ij})^{2}\left(-\frac{1}{2u_{ij}}\right)$ is computed, Core_Adder adds the constant $\ln\frac{1}{\sqrt{(2\pi)^{26}|U_i|}}$ and the previous score at this state to $\sum_{j=1}^{26}(x_{tj}-\bar{x}_{ij})^{2}\left(-\frac{1}{2u_{ij}}\right)$. After that, the logarithmic state-transition constant $\ln a_{ii}$ and then $\ln a_{i(i+1)}-\ln a_{ii}$ are added one by one. The two partial probabilities are compared and stored in Register for Scores. The constants $\ln\frac{1}{\sqrt{(2\pi)^{26}|U_i|}}$, $\ln a_{ii}$ and $\ln a_{i(i+1)}-\ln a_{ii}$ are read into Register for Constants during some vacant clock cycles. The above calculation and searching along one word model continues until the system reaches the last frame of the input observation, and then the final probability is stored in Final Score Register. This process is repeated for each word model in the vocabulary. At the end of every computation the final probability is compared with the previous score in Final Score Register and the larger one is recorded, while the corresponding word index of the model in the vocabulary is kept by the system. Thus after all the calculations have been finished, the system outputs a complete signal and a word index which indicates the recognition result. To prevent the system from continuing meaningless computation, Controller stops the whole system and sets an overflow signal to high immediately when Subtracter, Multiplier or Core_Adder overflows.
4.3. Implementation of a Speech 
Recognizer with a Double-Mixture 
HMM 
A speech recognizer based on a double-mixture HMM is designed to realize the following algorithm, which is similar to the single-mixture system except that the pdf computation has to consider two Gaussians, one for each mixture.
$$P_{final} = \max_{i}\,[\delta_T(i)]$$
$$\delta_t(i) = \max\{\delta_{t-1}(i)+\ln a_{ii},\ \delta_{t-1}(i-1)+\ln a_{(i-1)i}\} + \ln C_{i\max} + X_{i\max} + \ln\left(1+\frac{C_{i\min}}{C_{i\max}}e^{X_{i\min}-X_{i\max}}\right),\qquad 2\le t\le T,\ 2\le i\le 8$$
$$\delta_1(i) = \begin{cases}\ln C_{i\max} + X_{i\max} + \ln\left(1+\frac{C_{i\min}}{C_{i\max}}e^{X_{i\min}-X_{i\max}}\right), & i = 1\\ 0, & i \ne 1\end{cases}$$
Here $C_i$ and $X_i$ stand for $\frac{c_i}{\sqrt{(2\pi)^{26}|U_i|}}$ and $\sum_{j=1}^{26}(x_{tj}-\bar{x}_{ij})^{2}\left(-\frac{1}{2u_{ij}}\right)$. As discussed in Chapter 3, the pdf in the above equations can be implemented by a look-up table.
First of all, the larger $C_i e^{X_i}$ should be selected. Assume that $C_{i\max}e^{X_{i\max}} > C_{i\min}e^{X_{i\min}}$. Taking the logarithm of this inequality, we get
$$\ln C_{i\max} + X_{i\max} > \ln C_{i\min} + X_{i\min}$$
Thus in a double-mixture HMM speech recognition system, to select the larger $C_i e^{X_i}$ when computing the pdf, we only need to compute $\ln\frac{c_i}{\sqrt{(2\pi)^{26}|U_i|}} + \sum_{j=1}^{26}(x_{tj}-\bar{x}_{ij})^{2}\left(-\frac{1}{2u_{ij}}\right)$ for both mixtures, as in the single-mixture system, and then compare these two scores to get the larger one. After that, the factor $\ln\left(1+\frac{C_{i\min}}{C_{i\max}}e^{X_{i\min}-X_{i\max}}\right)$ is realized by a look-up table. The look-up table's index is decided by $\frac{C_{i\min}}{C_{i\max}}e^{X_{i\min}-X_{i\max}}$, which can be implemented in the logarithmic domain as follows:
$$\ln\left(\frac{C_{i\min}}{C_{i\max}}e^{X_{i\min}-X_{i\max}}\right) = (\ln C_{i\min} + X_{i\min}) - (\ln C_{i\max} + X_{i\max})$$
Both operands of this index-decision subtraction are results of the previous calculation, so only one more subtraction step is needed to obtain the index of the look-up table.
A block diagram of this double-mixture HMM based speech recognizer is shown in Figure 4-5.
[Figure 4-5: block diagram. Labels: Feature Vector, Model Parameter, Start, Word_Index; blocks: Subtracter, Multiplier, Core_Adder (connected in that order), Register for X, Subtracter, Shifter, Look-Up, Register, Final Score Register, Controller.]
Figure 4-5. A Block Diagram of the Double-Mixture HMM Based Speech Recognizer
There are four more blocks in the speech recognizer based on a double-mixture HMM compared with the single-mixture based one. They are: Register for X, Subtracter and Comparator, Shifter, and Look-Up Table. The functions of these blocks are listed below.
1. Register for X. This register is used to store $\ln\frac{c_i}{\sqrt{(2\pi)^{26}|U_i|}} + \sum_{j=1}^{26}(x_{tj}-\bar{x}_{ij})^{2}\left(-\frac{1}{2u_{ij}}\right)$ of the two mixtures, which is the calculation result from Core_Adder. Only after Core_Adder finishes the addition for both mixtures can we perform the comparison to select the larger one and obtain the index of the look-up table. So for the $t$-th frame, given the $i$-th state in the word model, after Core_Adder finishes computing $\ln C_{i1}+X_{i1}$ of the first mixture, that is, $\ln\frac{c_{i1}}{\sqrt{(2\pi)^{26}|U_{i1}|}} + \sum_{j=1}^{26}(x_{tj}-\bar{x}_{i1j})^{2}\left(-\frac{1}{2u_{i1j}}\right)$, the final value is stored in Register for X and then Core_Adder continues to compute $\ln C_{i2}+X_{i2}$ of the second mixture. Register for X is composed of two 48-bit storage elements, each storing one value of $\ln C+X$ for a given time $t$ and state $i$.
2. Subtracter and Comparator. In this block, we compare $\ln C_{i1}+X_{i1}$ and $\ln C_{i2}+X_{i2}$ to select the larger one and also compute the index of the look-up table. As discussed before, the index of the look-up table is implemented in the logarithmic domain as $(\ln C_{i\min}+X_{i\min})-(\ln C_{i\max}+X_{i\max})$. After the two operands are calculated by Core_Adder and stored in Register for X, Subtracter and Comparator accesses Register for X to get these two numbers and then calculates $(\ln C_{i1}+X_{i1})-(\ln C_{i2}+X_{i2})$. If the result is larger than or equal to 0, Subtracter and Comparator tells Core_Adder to add the logarithm factor to $\ln C_{i1}+X_{i1}$; if not, the logarithm factor is added to $\ln C_{i2}+X_{i2}$. At the same time the absolute value of the subtraction result is passed to the next two stages to be converted into the actual index of the look-up table.
3. Shifter. This block is used to shift the output of Subtracter and Comparator to a fixed scale. Because different models employ different scaling factors in pre-processing when they are converted into 16-bit data, the result of $|(\ln C_{i1}+X_{i1})-(\ln C_{i2}+X_{i2})|$ has a different scale too and cannot be used directly as the index of the look-up table, which must be on a fixed scale. Thus we apply a fixed scale to the index of the look-up table and use the Shifter block to shift the value of $|(\ln C_{i1}+X_{i1})-(\ln C_{i2}+X_{i2})|$ to this fixed scale before it is used to generate the actual index of the look-up table. The number of bits to shift is given by a shift number which is decided when pre-processing the feature vectors and model parameters.
4. Look-Up Table. The look-up table is stored in an external memory together with the models of the system vocabulary, and the contents of the look-up table are 48-bit data but stored as three 16-bit data. The block Look-Up Table in the block diagram actually is an address-conversion machine, whose input is the shifted absolute value of $(\ln C_{i1}+X_{i1})-(\ln C_{i2}+X_{i2})$ and whose outputs are the actual addresses in the look-up table, which correspond to the locations of the three parts of $\ln\left(1+\frac{C_{i\min}}{C_{i\max}}e^{X_{i\min}-X_{i\max}}\right)$. Then these three 16-bit data are read into Register for Constants, where they are combined into a 48-bit data ready for addition in Core_Adder. The contents of the look-up table are decided as follows. As $0 < \frac{C_{i\min}}{C_{i\max}}e^{X_{i\min}-X_{i\max}} \le 1$, we have $\ln 1 < \ln\left(1+\frac{C_{i\min}}{C_{i\max}}e^{X_{i\min}-X_{i\max}}\right) \le \ln 2$, thus $0 < \ln\left(1+\frac{C_{i\min}}{C_{i\max}}e^{X_{i\min}-X_{i\max}}\right) < 0.7$. Besides, this system uses a fixed-point binary number representation, and two digits after the decimal point are kept when converting the decimal feature vectors and model parameters into binary ones. Applying the same criterion to the contents of the look-up table, there should be altogether 70 values starting from 0 and ending at 0.69. Because no truncation is employed in this system, the addition between $\ln C_{i\max}+X_{i\max}$ and $\ln\left(1+\frac{C_{i\min}}{C_{i\max}}e^{X_{i\min}-X_{i\max}}\right)$ is a 48-bit operation, so the values stored in this look-up table should be scaled into 48-bit binary numbers. Accordingly we divide the shifted value of $|(\ln C_{i1}+X_{i1})-(\ln C_{i2}+X_{i2})|$ into 70 ranges and each of the ranges corresponds to one 48-bit binary number in the look-up table.
Other blocks, which are the same as those in the speech recognizer based on a single-mixture model, have similar functions as stated before.
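One way to divide the difference into 70 ranges is sketched below. The text does not say how the range boundaries are chosen or whether the stored values are rounded or truncated, so this sketch assumes truncation to two decimal places; the names `build_thresholds`, `lookup_index` and `correction` are hypothetical, and floating-point numbers are used in place of the shifted fixed-point difference.

```python
import math

STEP = 0.01      # two places after the decimal point
ENTRIES = 70     # table values 0.00, 0.01, ..., 0.69

def build_thresholds():
    """thresholds[k-1] is the largest |d| for which ln(1 + e^-|d|) >= k * STEP,
    for k = 1 .. 69. The list is strictly decreasing."""
    return [-math.log(math.expm1(k * STEP)) for k in range(1, ENTRIES)]

def lookup_index(abs_diff, thresholds):
    """Table index for |(ln C1 + X1) - (ln C2 + X2)| = abs_diff."""
    k = 0
    for thr in thresholds:
        if abs_diff <= thr:
            k += 1
        else:
            break
    return k

def correction(abs_diff, thresholds):
    """Table content selected by the index: a value between 0.00 and 0.69."""
    return lookup_index(abs_diff, thresholds) * STEP
```

With this choice, a difference of 0 selects the last entry (0.69, just below $\ln 2$), and any difference larger than about 4.6 selects the first entry (0), so the table only needs to distinguish differences in a narrow interval.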
The working mechanism of the double-mixture HMM based speech recognizer with a look-up table is as follows.
The feature vectors of the word to be recognized and the models of the system vocabulary, together with the contents of the look-up table, are pre-processed and pre-stored in two external memories respectively in the form of 16-bit data, as in the single-mixture HMM recognizer. Before the system starts to work, the word number of the vocabulary and the shift number are read into the system. The shift number is used by Shifter. After the initialization, when the system detects a start signal, it reads the frame number $T$ of the observation and begins calculation. For the first frame, given the first mixture in the first state of the first word model, $\sum_{j=1}^{26}(x_{1j}-\bar{x}_{11j})^{2}\left(-\frac{1}{2u_{11j}}\right) + \ln\frac{c_{11}}{\sqrt{(2\pi)^{26}|U_{11}|}}$ is computed as in the single-mixture HMM based speech recognizer. After that, the result is stored in Register for X and the system repeats the above procedure to calculate $\sum_{j=1}^{26}(x_{1j}-\bar{x}_{12j})^{2}\left(-\frac{1}{2u_{12j}}\right) + \ln\frac{c_{12}}{\sqrt{(2\pi)^{26}|U_{12}|}}$, given the second mixture of the same state. After these two scores are both stored in Register for X, Subtracter and Comparator compares $\ln C_{i1}+X_{i1}$ and $\ln C_{i2}+X_{i2}$ and gives the difference between them. Then Shifter shifts $|(\ln C_{i1}+X_{i1})-(\ln C_{i2}+X_{i2})|$ by the shift number that is stored in the system. The shifted absolute value is passed to the next block, Look-Up Table, and converted into three actual addresses in the external memory. The system then fetches the factor $\ln\left(1+\frac{C_{i\min}}{C_{i\max}}e^{X_{i\min}-X_{i\max}}\right)$ and adds it to the larger term $\ln C_{i\max}+X_{i\max}$, which is determined by Subtracter and Comparator. At this point one score, corresponding to the first frame given the first state in the double-mixture model of the first word in the vocabulary, has been obtained, and the remaining computations and searching procedures are the same as in the single-mixture HMM based speech recognizer.
For comparison, the structure of a hardware recognizer without the look-up table is shown in Figure 4-6. The term $\ln\left(1+\frac{C_{i\min}}{C_{i\max}}e^{X_{i\min}-X_{i\max}}\right)$ is ignored in this recognizer. To calculate a pdf, first $\ln C_{i1}+X_{i1}$ and $\ln C_{i2}+X_{i2}$ are computed separately for both mixtures as in the above procedures. Then we just select the larger term $\ln C_{i\max}+X_{i\max}$ as the value of the pdf and continue the calculation with it. From Figure 4-6 we can see that the complexity of this recognizer is not reduced much compared with the double-mixture HMM based speech recognizer with a look-up table. The recognition accuracies of these two hardware recognizers will be compared in the next chapter.
[Figure 4-6: block diagram. Labels: Feature Vector, Model Parameter, Start, Address, Word_Index; blocks: Subtracter, Multiplier, Core_Adder (connected in that order), Register for Xmax, Register for Constants, Final Score Register, Controller.]
Figure 4-6. The Structure of the Recognizer Without a Look-Up Table
4.4. Extend Usage in High Order
Mixtures HMM 
Even in isolated word recognition, choice of a mixture number that is larger 
than 1 will provide a better recognition performance. In a high order Gaussian 
system, the pdf will be in the following form: 
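$$b_i(o_t) = \sum_{k=1}^{M} c_{ik}\,N(o_t,\mu_{ik},U_{ik})$$
where $M$ is the number of mixtures. Expressing the above equation in a different form, and taking the logarithm to simplify the calculation and avoid overflow, the equation becomes:
$$\ln b_i(o_t) = \ln\sum_{k=1}^{M} C_k e^{X_k}$$
Taking the largest term of the addition, $C_{\max}e^{X_{\max}}$, out of the summation, the equation becomes:
$$\ln b_i(o_t) = \ln C_{\max} + X_{\max} + \ln\left(1+\sum_{\substack{k=1\\ k\ne\max}}^{M}\frac{C_k e^{X_k}}{C_{\max}e^{X_{\max}}}\right) \qquad (4.1)$$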
As stated in Chapter 3, the above equation can be implemented by three methods:
1. Ignore the effect of $\sum_{k\ne\max}\frac{C_k e^{X_k}}{C_{\max}e^{X_{\max}}}$. This is the simplest implementation. But as more mixtures are employed when training the models, this method will have lower recognition accuracy than the other two implementations.
2. Use a polynomial to approximate $\ln\left(1+\sum_{k\ne\max}\frac{C_k e^{X_k}}{C_{\max}e^{X_{\max}}}\right)$. This is an implementation method with relatively high recognition accuracy. But the number of variables in the polynomial is equal to the number of mixtures $M$, and as $M$ becomes larger, it is more and more difficult to find the coefficients of the polynomial $y = a_1x_1 + a_2x_2 + \ldots + a_Mx_M$. It is also not convenient to realize the above polynomial in hardware.
3. Table look-up approach. This method is simple to realize while giving an acceptable recognition accuracy. The logarithmic summation has a limited range of values because every term in the summation is less than or equal to 1:
$$\frac{C_k e^{X_k}}{C_{\max}e^{X_{\max}}} \le 1$$
$$1+\sum_{\substack{k=1\\ k\ne\max}}^{M}\frac{C_k e^{X_k}}{C_{\max}e^{X_{\max}}} \le M$$
$$\ln\left(1+\sum_{\substack{k=1\\ k\ne\max}}^{M}\frac{C_k e^{X_k}}{C_{\max}e^{X_{\max}}}\right) \le \ln M$$
Thus the logarithmic equation can be implemented by a look-up table whose contents are values between 0 and $\ln M$. The index of this look-up table can be decided by the method that has been used in the design of the speech recognizer based on a double-mixture HMM:
$$\ln\left(\sum_{\substack{k=1\\ k\ne\max}}^{M}\frac{C_k e^{X_k}}{C_{\max}e^{X_{\max}}}\right) = \ln\left(\sum_{\substack{k=1\\ k\ne\max}}^{M} e^{(\ln C_k + X_k)-(\ln C_{\max}+X_{\max})}\right) \qquad (4.2)$$
It is also an add-log operation, but the error introduced here has little effect on equation (4.1). Then, to simplify the design, the look-up table's index can be decided by the maximum term in the summation in equation (4.2), $\max\{(\ln C_k + X_k)-(\ln C_{\max}+X_{\max}) \mid 1\le k\le M,\ k\ne\max\}$. A block diagram of a speech recognizer with a high-order mixture HMM based on this table look-up method can be:
Chapter 4 VLSI Implementation of the Speech Recognizer 
Word. Index 
Final Score  
Register 
Feature Vector 
— 一 Subtracter ~ • Multiplier ~ • Core Adder Register 
[ _ _ _ _ _ _ — ——• for Scores 
i L iL 
M(xlel Parameter T  
Register for 
'nC^ ax+^ max & 
丨 nCmin+Xmin 
Constant R e n t e r Look-Up I 
• for ~ T ‘ 
^ , , Table _J 
Constants 一 Subtracter 
i I 
Address 
" ^ Controller  
Slaff 
Figure 4-7. A Block Diagram of Speech Recognizer with a High-Order Mixture HMM 
This speech recognizer works in almost the same way as the one based on a double-mixture HMM. For one point in the lattice structure, the recognizer first computes the terms $\ln C_1+X_1$ and $\ln C_2+X_2$ given the first two mixtures in the HMM and stores them in the register for $\ln C_{\max}+X_{\max}$ and $\ln C_{\min}+X_{\min}$. Then, after the third mixture's $\ln C_3+X_3$ is computed, the contents of the register are renewed with the maximum and the minimum values. After all mixtures have been gone through and $\ln C_{\max}+X_{\max}$ and $\ln C_{\min}+X_{\min}$ are obtained, the next block, Subtracter, computes their difference and passes it to the block Look-Up Table, as in the double-mixture HMM speech recognition system. Then the largest term $\ln C_{\max}+X_{\max}$ is added to a logarithmic factor from the look-up table and the calculation of the score at this point is finished.
From the above procedure we can see that, with the table look-up method, a
single-mixture HMM speech recognizer can easily be extended to a double-mixture
HMM recognizer, and a high-order mixture HMM speech recognizer can then be
implemented with minor modifications. The only difference between these
high-order mixture HMM systems is the size of the look-up table, which depends
on the number of mixtures M.
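The table look-up procedure described in this section can be sketched in software. The following Python model is only an illustrative sketch under stated assumptions, not the RTL of the chip: the function names, the index step of 0.01 and the table length of 700 entries are hypothetical, and the table is assumed to hold ln(1 + e^(-d)) for a quantized difference d, which is the exact correction for two mixtures. How the table entries are chosen for more than two mixtures is not modelled here.

```python
import math

STEP = 0.01      # assumed quantization step of the table index
ENTRIES = 700    # assumed table length; larger differences get a zero correction

def build_lut(step=STEP, entries=ENTRIES):
    """Table of ln(1 + e^(-d)) for d = 0, step, 2*step, ... (two-mixture case)."""
    return [math.log1p(math.exp(-i * step)) for i in range(entries)]

def score_point(factors, lut, step=STEP):
    """Approximate ln(sum_k C_k e^(X_k)) from the factors ln C_k + X_k."""
    largest = smallest = factors[0]
    for f in factors[1:]:              # running maximum and minimum register
        largest = max(largest, f)
        smallest = min(smallest, f)
    index = int((largest - smallest) / step)   # Subtracter output used as index
    correction = lut[index] if index < len(lut) else 0.0
    return largest + correction

lut = build_lut()
factors = [-3.2, -4.0]                 # hypothetical values of ln C_k + X_k
exact = math.log(sum(math.exp(f) for f in factors))
approx = score_point(factors, lut)
print(exact, approx)
```

With an all-zero table, `score_point` returns only the largest factor, so the same function also models a recognizer that omits the correction term.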
4.5. Pipelining and the System Timing 
We have designed the double-mixture HMM based speech recognizer as a
synchronous system. In the block diagram, every block except the Multiplier
takes one clock cycle, and the Multiplier takes two cycles to perform one
multiplication. One further clock cycle is needed to read data from the
external memories. Pipelining is therefore employed in this system to improve
system performance.
Pipelining refers to the partitioning of a process into successive,
synchronized stages such that multiple processes, each in a different stage,
can be executed in parallel. Pipelining techniques are aimed at improving the
system throughput. Pipelining shortens the clock cycle, but the latency of a
single instruction or operation increases because extra delays are added to the
basic clock cycle by the latching of intermediate results [1].
With pipelining, the number of clock cycles that the system uses to compute
one partial probability is reduced from more than 10 to only 4, which are
consumed by the two successive multiplications in
$(x_i - \bar{x}_i)^2 \cdot \left(-\frac{1}{2\sigma_i^2}\right)$.
All other work is completed within these 4 cycles, including accessing the
external memories, subtraction, addition and comparison. Thus, given that a
typical frame number is 128 for an isolated word and the system vocabulary is
composed of the 50 most frequently used words, the number of clock cycles
required to recognize one word is about $10^7$, as the following equation
shows.
number of cycles × number of mixtures × number of elements in one feature vector
× number of states in the HMM × (number of frames − 7) × number of words in the
vocabulary = 4 × 2 × 26 × 8 × (128 − 7) × 50 ≈ 10^7
In the above equation 7 is subtracted from the number of frames because, in the
first 7 and the last 7 frames, scores need to be calculated for only one to
seven points. The number of frames that are effectively computed through all 8
states is therefore 7 less than the original number.
If the system works at an operating frequency of 20 MHz, the time required to
recognize one isolated word under the above conditions is about 0.5 second.
This system speed is acceptable for real-time applications.
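The cycle estimate and the resulting recognition time can be checked with a few lines of Python. This is a sketch of the arithmetic only; the function and parameter names are introduced here for illustration, and the default values are the ones quoted in the text.

```python
def recognition_cycles(cycles_per_term=4, mixtures=2, vector_len=26,
                       states=8, frames=128, words=50):
    """Clock cycles needed to score one utterance against the whole vocabulary."""
    return cycles_per_term * mixtures * vector_len * states * (frames - 7) * words

def recognition_time(cycles, clock_hz=20e6):
    """Recognition time in seconds at the given clock frequency."""
    return cycles / clock_hz

cycles = recognition_cycles()
print(cycles)                    # 10067200, i.e. about 10^7
print(recognition_time(cycles))  # about 0.5 s at 20 MHz
```

Changing `mixtures` or `words` shows how the recognition time scales with the model order and the vocabulary size under the same assumptions.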
We have designed a double-mixture test chip to verify our design. The test chip
was described in Verilog HDL. The design was synthesized with Synopsys and the
layout was generated with the Cadence place-and-route tool Silicon Ensemble. A
complete Verilog description of the test chip is given in Appendix I "Verilog
Code of the Double-Mixture HMM Based Speech Recognition IC (RTL Level)". The
test chip was fabricated in a 0.35-micron CMOS technology.
Chapter 5 Simulation and IC Testing
5.1. Simulation Result 
In this project a speaker-independent isolated-word speech recognizer based
on a double-mixture hidden Markov model was designed. The design flow is
shown in Figure 5-1.
[Figure 5-1: flow chart. Recoverable steps, in order: Design Specification; Behavioral Description (Verilog HDL); RTL Description; Functional Verification (Verilog Simulation); Logical Verification (Verilog Simulation); Floor Planning; Automatic Place & Route (Silicon Ensemble); Layout Verification (Verilog Simulation); Fabrication (AMS 0.35-micron CMOS technology); Testing.]
Figure 5-1. Design Flow of This Project
We have simulated the new design against two references. One is a software
recognition system using the same algorithm [13], and the other is a similar
hardware recognizer without the look-up table. In the software recognition
system there is almost no approximation in the pdf calculation and no need to
convert the input data into 16-bit fixed-point binary numbers, so its
recognition result can be regarded as the theoretical one. In the hardware
recognizer without the look-up table, only $\ln C_{\max} + X_{\max}$ is
considered in the pdf calculation and the term
$\ln\left(1 + \frac{C_{\min}}{C_{\max}} e^{X_{\min} - X_{\max}}\right)$ is
ignored. This is equivalent to using an all-zero look-up table in the proposed
architecture (Figure 4-5).
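To give a feel for the size of the ignored term, the short Python sketch below evaluates it from the two factors ln C_max + X_max and ln C_min + X_min. The function name and the example values are hypothetical. The term is largest, ln 2 (about 0.693), when the two factors are equal, and it approaches 0 as they move apart.

```python
import math

def ignored_term(larger_factor, smaller_factor):
    """ln(1 + (C_min / C_max) * e^(X_min - X_max)), computed from the two
    factors ln C_max + X_max and ln C_min + X_min."""
    return math.log1p(math.exp(smaller_factor - larger_factor))

print(ignored_term(-5.0, -5.0))    # equal factors: ln 2
print(ignored_term(-5.0, -6.0))    # factors one unit apart
print(ignored_term(-5.0, -25.0))   # widely separated factors: nearly 0
```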
The test speech data are 353 utterances from the AURORA 2 database [14][15].
These utterances were first imported into a software feature extraction program
and converted into feature vectors. The feature vectors were used as inputs for
all three recognizers. After simulation we checked whether the recognizers gave
the correct recognition results. The simulation results are tabulated in
Table 2. We can see that if we simply approximate the pdf by
$\ln C_{\max} + X_{\max}$, the recognition accuracy drops by 1.4 percentage
points compared with the theoretical result. As discussed in Chapter 4, we can
increase the recognition accuracy with a look-up table. The test results
indicate that the recognition accuracy increases by 0.9 percentage points with
the look-up table design.
                     Software    Hardware       Hardware
                                 without LUT    with LUT (proposed)
Word Accuracy (%)    94.3        92.9           93.8
Table 2. Simulation Results
5.2. Testing 
A double-mixture HMM based speech recognizer test chip with a look-up table
was fabricated in AMS 0.35-micron CMOS technology. Its specifications are
shown in Table 3 (Appendix II).
Specification        Value
CMOS Technology      0.35 µm
Operating Voltage    3.3 V
Total gate count     30000
Die area             12.25 sq. mm
Package              PGA100
Table 3. Specifications of the New Speech IC
The 100-pin package includes data and address lines for two external memories,
the system's inputs and outputs (e.g. reset, start signal, recognized word
index, done signal), power pins, and some pins for monitoring internal data in
case an error occurs (Appendix III).
[Figure 5-2: block diagram. Recoverable labels: Speech Recognition IC (PGA100); RAM (1) and RAM (2), each with data and address lines; LED outputs word index, result_ack and overflow; Dip Switch inputs reset, fv_ack and word_num; VDD and GND.]
Figure 5-2. A Brief Diagram of the PCB
A PCB has been made to test the new speech IC (Appendix IV). Figure 5-2 is a
brief block diagram of the PCB. On this PCB, RAM (1) stores the HMMs of the
system vocabulary and RAM (2) stores the feature vectors. These data are
obtained from a software program and are written into the two RAMs during
power-up. The address and data buses of the RAMs are 16 bits wide. A 6-bit DIP
switch sets the number of words in the vocabulary, which can be varied from 1
to 63. The fv_ack signal starts the recognition process: every time the system
detects an fv_ack signal, it restarts the calculation from the beginning. The
outputs are connected to 8 LEDs: 6 for the recognition result Word_Index, one
for Result_Ack and one for Overflow.
We started with some simple functional tests to make sure that the IC was
working. For example, we set the number of words in the vocabulary to 10 and
set the parameters of the different word models in RAM(1) to all 1, all 2, ...,
all 10, respectively. In RAM(2), we wrote 10 to all the storage units. The
recognition result was 10, which was the same as the theoretical result. This
simple functional test was performed several times and the correct result was
obtained every time.
We then moved on to the more complicated speech tests. This time we verified
the new chip with the same set of AURORA 2 test data used in the simulation.
With the same word models and feature vectors, we obtained exactly the same
recognition results as in the simulation. We also wrote all zeros into the
look-up table to verify the recognition accuracy of a double-mixture speech
recognizer without the look-up table. The recognition results were identical to
those of the simulation. These functional tests have demonstrated that the IC
can recognize real-world speech. We measured a maximum operating frequency of
around 62.5 MHz for the speech IC and an average power consumption of 56.7 mW
at 20 MHz.
Finally we connected our test chip to a DSP board to perform real-time speech
recognition. The function of the DSP board is to generate the feature vectors
of the speech to be recognized. Figure 5-3 is a block diagram of this real-time
speech recognition testing system. The word models were pre-trained by a
software program and written into RAM(1) during power-up. A person then spoke a
word into the microphone. The DSP board generated the feature vectors of the
input speech and wrote them into RAM(2). After that, the DSP board gave the
speech recognition IC a start signal. Within a very short time, the recognition
result was shown by the LEDs.
[Figure 5-3: block diagram. Recoverable signal path: Microphone, DSP Board (Feature Extraction), RAM(2) (Feature Vectors), Speech Recognition IC, LEDs; RAM(1) (Word Models) is also shown.]
Figure 5-3. Block Diagram of the Real-Time Speech Recognition Testing System
Chapter 6 Discussion and Conclusion
In this thesis a new architecture for a multi-mixture HMM based speech
recognizer has been presented. Using a table look-up method, a higher-order
mixture HMM speech recognizer can be implemented with an accuracy that matches
a software recognizer.
Since the add-log operation is unavoidable in the pdf calculation and has a
significant effect on the final recognition result, finding a simple, universal
and accurate implementation of this operation is important for the design of a
speech recognizer. The table look-up method is an implementation that fulfills
these requirements. From Table 2 it can be seen that, in a double-mixture HMM
system, the proposed speech recognition IC has approximately the same
recognition accuracy as the software recognizer, and its recognition results
are better than those of the hardware recognizer without the look-up table.
Moreover, this new technique can be applied to different high-order mixture
systems with minor modifications, as described in Chapter 4. In those designs
the only difference between the high-order mixture systems is the size of the
look-up table. For example, given an accuracy equivalent to two decimal places
of a floating-point number, there are only 70 and 140 values in the look-up
table for a double-mixture HMM system and a four-mixture HMM system,
respectively.
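One way to arrive at table sizes of this order, offered here as a plausible reading rather than as the derivation actually used, is to count the distinct values between 0 and ln M when they are rounded to two decimal places. The function name below is hypothetical.

```python
import math

def lut_value_count(mixtures, decimals=2):
    """Number of distinct values between 0 and ln(M) at the given precision."""
    scale = 10 ** decimals
    return round(math.log(mixtures) * scale) + 1

print(lut_value_count(2))  # 70
print(lut_value_count(4))  # 140
```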
However, truncation was not considered in the design of this double-mixture
HMM speech recognition IC. All internal buses are 48 bits wide, which costs
more operation time, more area and more power. Truncation would reduce these
costs, but as a trade-off it introduces approximation error into the
calculation of the pdf. If too many bits are truncated, the recognition
accuracy will be so low that the system is not suitable for real-world
applications. The effect should be carefully considered





[1]. Fundamentals of Speech Recognition
Lawrence Rabiner, Biing-Hwang Juang
Prentice Hall
[2]. ASIC System Design with VHDL: A Paradigm
Steven S. Leung, Michael A. Shanblatt
Kluwer Academic Publishers
[3]. Verilog HDL: A Guide to Digital Design and Synthesis
Samir Palnitkar
[4]. Advanced Digital Design with the Verilog HDL
Michael D. Ciletti
Prentice Hall
[5]. "A Tutorial on Hidden Markov Models and Selected Applications in
Speech Recognition"
L. R. Rabiner
Proceedings of the IEEE, Volume: 77 Issue: 2, pp. 257-286 Feb. 1989
[6]. "Efficient Viterbi Scoring Architecture For HMM-Based Speech
Recognition System"
Y. S. Cho, J. Y. Kim and H. S. Lee
IEEE Electronics Letters Vol. 28, No. 25, pp. 2338-2340, Dec. 1992
[7]. "An Efficient VLSI Architecture for HMM-Based Speech Recognition"
J. M. Jou, Y. H. Shiau and C. J. Huang
Electronics, Circuits and Systems, 2001. ICECS 2001.
The 8th IEEE International Conference, Vol.1, pp. 469-472, 2001
[8]. "A VLSI Implementation of Pdf Computations in HMM Based Speech
Recognition"
J. Pihl, T. Svendsen and M. H. Johnsen
TENCON '96. Proceedings., 1996 IEEE TENCON.
Digital Signal Processing Applications, Vol.1, pp. 241-246, 1996
[9]. "A VLSI Wordprocessing Subsystem for a Real Time Large Vocabulary
Continuous Speech Recognition System"
Stolzle, S. Narayanaswamy, K. Kornegay, J. Rabaey and R. W. Brodersen
Custom Integrated Circuits Conference, 1989., Proceedings of the IEEE
1989, pp. 20.7/1-20.7/5, 1989
[10]. "Low Power VLSI Architecture of Viterbi Scorer for HMM-Based Isolated
Word Recognition"
G. Park, K. S. Cho and J. D. Cho
Quality Electronic Design, 2002. Proceedings. International Symposium on,
pp. 235-239, 2002




[13]. "The HTK Book (for HTK Version 2.2)"
S. J. Young, P. C. Woodland and W. J. Byrne,
Entropic Ltd., Jan. 1999
[14]. http://www.elda.fr/proj/aurora2.htm
[15]. "The AURORA experimental framework for the performance evaluation of
speech recognition systems under noisy conditions"
H.-G. Hirsch and D. Pearce
Proceedings of ISCA ITRW ASR 2000, Paris, France, September 2000
[16]. "An HMM-Based Speech Recognition IC"
Wei Han; Kwok-Wai Hon; Cheong-Fat Chan; Tan Lee; Chiu-Sing Choy;
Kong-Pang Pun; Ching, P. C.;
Circuits and Systems, 2003. ISCAS '03. Proceedings of the 2003
International Symposium on, Volume: 2, May 25-28, 2003
Page(s): 744-747
Appendix I Verilog Code of the Double-Mixture HMM Based Speech Recognition IC (RTL Level)
❖ Subtractor
module cla16_dm(sumout, overflow, result_ack, in1, in2, out_sel, clk, reset);
/* -- carry look ahead adder -- */
input [15:0] in1;
input [15:0] in2;




output [15:0] sumout;
output overflow;
wire [15:0] in1_reg, in2_reg;
reg [15:0] sum;
reg overf;
wire [15:0] p, g;
wire [15:0] carout;
assign sumout = sum;
assign overflow = overf;
/* -- when result_ack == 1, all the inputs are set to 0 to save power -- */
assign in1_reg = (result_ack == 0) ? in1 : 16'b0;
assign in2_reg = (result_ack == 0) ? in2 : 16'b0;
assign p = in1_reg | in2_reg;
assign g = in1_reg & in2_reg;
assign carout[0] = g[0];
assign carout[15:1] = g[15:1] | p[15:1] & carout[14:0];
always @(posedge clk or negedge reset)
begin
if (reset == 0)
begin
sum <= 0;




if (out_sel == 0)
sum[15:0] <= in1_reg[15:0] ^ in2_reg[15:0] ^ {carout[14:0], 1'b0};
else
sum[15:0] <= in2_reg[15:0];
if (out_sel == 0)
if (carout[15] ^ carout[14] == 1)







module booth_dm(out1, in1, in2);
parameter zee = 33'bz;
input [2:0] in1;
input [31:0] in2;
output [32:0] out1;
assign out1 = (in1 == 3'b000) ? 33'b0 : zee;
assign out1 = (in1 == 3'b001) ? {in2[31], in2} : zee;
assign out1 = (in1 == 3'b010) ? {in2[31], in2} : zee;
assign out1 = (in1 == 3'b011) ? {in2[31:0], 1'b0} : zee;
assign out1 = (in1 == 3'b100) ? ~{in2[31:0], 1'b0} + 1'b1 : zee;
assign out1 = (in1 == 3'b101) ? (~{in2[31], in2}) + 1'b1 : zee;
assign out1 = (in1 == 3'b110) ? (~{in2[31], in2}) + 1'b1 : zee;
assign out1 = (in1 == 3'b111) ? 33'b0 : zee;
endmodule
module fulladder_dm(cout, sumout, in1, in2, in3);
input [47:0] in1, in2, in3;
output [47:0] cout, sumout;
assign sumout = (in1^in2)^in3;
assign cout = ((in1^in2)&in3)|(in1&in2);
endmodule
module cla48_dm(sumout, in1, in2, counter_cla48, clk, reset);
/* -- carry look ahead adder -- */
input [47:0] in1;




output [47:0] sumout;
reg [47:0] sum;
wire [47:0] p, g;
wire [47:0] carout;
assign sumout = sum;
assign p = in1 | in2;
assign g = in1 & in2;
assign carout[0] = g[0];
assign carout[46:1] = g[46:1] | p[46:1] & carout[45:0];
always @(posedge clk or negedge reset)
begin
if (reset == 0)
sum <= 0;
else
begin
if (counter_cla48 == 0)






module multiplier_dm(mulout, in_sel, in1, counter_cla48, clk, reset);
/* -- booth multiplier -- */
input [15:0] in1;
input in_sel, clk, reset;
input counter_cla48;
output [47:0] mulout;
wire [32:0] boothout1, boothout2, boothout3, boothout4, boothout5, boothout6, boothout7,
boothout8;
wire [47:0] cout1, cout2, cout3, cout4, cout5, cout6, cout7;
wire [47:0] mulout1, mulout2, mulout3, mulout4, mulout5, mulout6, mulout7, mulout_wire;
wire [31:0] xmean_sq;
assign xmean_sq = (in_sel == 1'b0) ? {{16{in1[15]}}, in1} : mulout_wire;
assign mulout = mulout_wire;
reg [47:0] cout7_reg, mulout7_reg;
booth_dm booth1_dm(boothout1, {in1[1:0], 1'b0}, xmean_sq);
booth_dm booth2_dm(boothout2, in1[3:1], xmean_sq);
booth_dm booth3_dm(boothout3, in1[5:3], xmean_sq);
booth_dm booth4_dm(boothout4, in1[7:5], xmean_sq);
booth_dm booth5_dm(boothout5, in1[9:7], xmean_sq);
booth_dm booth6_dm(boothout6, in1[11:9], xmean_sq);
booth_dm booth7_dm(boothout7, in1[13:11], xmean_sq);
booth_dm booth8_dm(boothout8, in1[15:13], xmean_sq);
fulladder_dm fulladder1_dm(cout1, mulout1, {14'b0, ~boothout1[32], boothout1},
{12'b0, ~boothout2[32], boothout2, 2'b0},
{10'b0, ~boothout3[32], boothout3, 4'b0});
fulladder_dm fulladder2_dm(cout2, mulout2, {8'b0, ~boothout4[32], boothout4, 6'b0},
{6'b0, ~boothout5[32], boothout5, 8'b0},
{4'b0, ~boothout6[32], boothout6, 10'b0});
fulladder_dm fulladder3_dm(cout3, mulout3, {2'b0, ~boothout7[32], boothout7, 12'b0},
{~boothout8[32], boothout8, 14'b0},
{15'b010101010101011, 33'b0});
fulladder_dm fulladder4_dm(cout4, mulout4, {cout1[46:0], 1'b0}, mulout1, {cout2[46:0],
1'b0});
fulladder_dm fulladder5_dm(cout5, mulout5, mulout2, {cout3[46:0], 1'b0}, mulout3);
fulladder_dm fulladder6_dm(cout6, mulout6, {cout4[46:0], 1'b0}, mulout4, {cout5[46:0],
1'b0});
fulladder_dm fulladder7_dm(cout7, mulout7, {cout6[46:0], 1'b0}, mulout6, mulout5);
cla48_dm cla48_dm(mulout_wire, {cout7_reg[46:0], 1'b0}, mulout7_reg, counter_cla48, clk,
reset);
always @(posedge clk or negedge reset)
begin
if (reset == 0)
begin
cout7_reg <= 0;




/* -- adding registers between fulladder and cla48 -- */
mulout7_reg <= mulout7;





module core_cla48_dm(sumout, overflow, in_sel, in1, reg48_in, reg16_in, regx_in, clk, reset);
input [47:0] reg48_in;
input [47:0] in1;
input [47:0] regx_in;
input [47:0] reg16_in;
input [1:0] in_sel;
input clk;
input reset;
output [47:0] sumout;
output overflow;
reg [47:0] sum;
reg carry;
reg overf;
wire [47:0] p, g;
wire [47:0] carout;
wire [47:0] intmp1;
/* -- select the input between the multiplier output and the reg 16 output -- */
assign intmp1 = (in_sel == 2'b00) ? in1 : 48'bz;
assign intmp1 = (in_sel == 2'b10) ? reg48_in : 48'bz;
assign intmp1 = (in_sel == 2'b11) ? reg16_in : 48'bz;
assign sumout = sum;
assign overflow = overf;
assign p = intmp1 | regx_in;
assign g = intmp1 & regx_in;
assign carout[0] = g[0];
assign carout[47:1] = g[47:1] | p[47:1] & carout[46:0];
always @(posedge clk or negedge reset)
begin
if (reset == 0)
begin
sum <= 0;
carry <= 0;




carry <= carout[47];
sum[47:1] <= intmp1[47:1] ^ regx_in[47:1] ^ carout[46:0];
sum[0] <= intmp1[0] ^ regx_in[0];
if (carout[47] ^ carout[46] == 1)
begin 







❖ Register for X
module regx_dm(dout, dout_x1, dout_x2,
adder_in, start, load_adder, store, load_x2, sel_x2, wr_en_regx, fv_ack,
clk, reset);
output [47:0] dout, dout_x1, dout_x2;
input [47:0] adder_in;
input start;




input clk, reset;
reg [47:0] dout;
reg [47:0] reg_x1, reg_x2;
reg flag;
assign dout_x1 = reg_x1;
assign dout_x2 = reg_x2;
always @(posedge clk or negedge reset)
begin
if (reset == 0)
begin
dout <= 0;
reg_x1 <= 0;
reg_x2 <= 0;




if (fv_ack == 1)
begin
dout <= 0;
reg_x1 <= 0;
reg_x2 <= 0;




if (start == 1) 
begin 
dout <= 0;
reg_x2 <= adder_in;
flag <= 1; 
end 
else 




reg_x1 <= reg_x2;
dout <= 0;
flag <= 1;
end
else
if (store == 1)
begin
reg_x2 <= adder_in;
dout <= reg_x1;





reg_x1 <= adder_in;
dout <= reg_x2;
flag <= 1;
end
else
if (load_adder == 1)
begin
dout <= 0;
reg_x1 <= adder_in;
flag <= 1;
end
else
if (flag == 1)
begin
dout <= dout;
flag <= 0;
end
else
if (wr_en_regx == 1)
dout <= adder_in;
else 





❖ Subtracter and Comparator
module subcomp_dm(out, sel_x2, in1, in2, en, clk, reset);
input [47:0] in1, in2;
input en;
input clk, reset;
output [47:0] out;
output sel_x2;




wire [47:0] im2; 
wire [47:0] p, g; 
wire [47:0] c; 
wire [47:0] sum, sum1;
assign im2 = -in2;
assign p = in1 | im2;
assign g = in1 & im2;
assign c[0] = g[0];
assign c[47:1] = g[47:1] | p[47:1] & c[46:0];
assign sum[47:1] = in1[47:1] ^ im2[47:1] ^ c[46:0];
assign sum[0] = in1[0] ^ im2[0];
assign sum1 = -sum;
always @(posedge clk or negedge reset)
begin 
if (reset == 0) 
begin 
out <= 0; 




if (en == 1) 
begin 
if(sum[47] == 0) 
begin 
out <= sum1;




out <= sum; 




out <= out; 
if (sel_x2 == 1) 





module shift_dm(shift_num, datain, dataout, overflow, en, clk, reset);
input [47:0] datain;
input [5:0] shift_num;
input en;
input clk, reset;




reg [47:0] dataout; 
reg overflow; 
always @(posedge clk or negedge reset)
begin 
if(reset == 0) 
begin 
dataout <= 0; 






6'd0: dataout <= datain;
6'd1: begin dataout[47:1] <= datain[46:0]; dataout[0] <= 0;
overflow <= ~((datain[47:46]==2'b11)&&(datain[45:0] != 0)); end
6'd2: begin dataout[47:2] <= datain[45:0]; dataout[1:0] <= 0;
overflow <= ~((datain[47:45]==3'b111)&(datain[44:0] != 0)); end
6'd3: begin dataout[47:3] <= datain[44:0]; dataout[2:0] <= 0;
overflow <= ~((datain[47:44]==4'hf)&(datain[43:0] != 0)); end
6'd4: begin dataout[47:4] <= datain[43:0]; dataout[3:0] <= 0;
overflow <= ~((datain[47:43]==5'b11111)&(datain[42:0] != 0)); end
6'd5: begin dataout[47:5] <= datain[42:0]; dataout[4:0] <= 0;
overflow <= ~((datain[47:42]==6'b111111)&(datain[41:0] != 0)); end
6'd6: begin dataout[47:6] <= datain[41:0]; dataout[5:0] <= 0;
overflow <= ~((datain[47:41]==7'b1111111)&(datain[40:0] != 0)); end
6'd7: begin dataout[47:7] <= datain[40:0]; dataout[6:0] <= 0;
overflow <= ~((datain[47:40]==8'hff)&(datain[39:0] != 0)); end
6'd8: begin dataout[47:8] <= datain[39:0]; dataout[7:0] <= 0;
overflow <= ~((datain[47:39]==9'b111111111)&(datain[38:0] != 0)); end
6'd9: begin dataout[47:9] <= datain[38:0]; dataout[8:0] <= 0;
overflow <= ~((datain[47:38]==10'b1111111111)&(datain[37:0] != 0)); end
6'd10: begin dataout[47:10] <= datain[37:0]; dataout[9:0] <= 0;
overflow <= ~((datain[47:37]==11'b11111111111)&(datain[36:0] != 0)); end
6'd11: begin dataout[47:11] <= datain[36:0]; dataout[10:0] <= 0;
overflow <= ~((datain[47:36]==12'hfff)&(datain[35:0] != 0)); end
6'd12: begin dataout[47:12] <= datain[35:0]; dataout[11:0] <= 0;
overflow <= ~((datain[47:35]==13'b1111111111111)&(datain[34:0] != 0)); end
6'd13: begin dataout[47:13] <= datain[34:0]; dataout[12:0] <= 0;
overflow <= ~((datain[47:34]==14'b11111111111111)&(datain[33:0] != 0)); end
6'd14: begin dataout[47:14] <= datain[33:0]; dataout[13:0] <= 0;
overflow <= ~((datain[47:33]==15'b111111111111111)&(datain[32:0] != 0)); end
6'd15: begin dataout[47:15] <= datain[32:0]; dataout[14:0] <= 0;
overflow <= ~((datain[47:32]==16'hffff)&(datain[31:0] != 0)); end




~((datain[47:31]==17'b11111111111111111)&(datain[30:0] != 0)); end
6'd17: begin dataout[47:17] <= datain[30:0]; dataout[16:0] <= 0;
overflow <= ~((datain[47:30]==18'b111111111111111111)&(datain[29:0] != 0)); end
6'd18: begin dataout[47:18] <= datain[29:0]; dataout[17:0] <= 0;
overflow <= ~((datain[47:29]==19'b1111111111111111111)&(datain[28:0] != 0)); end
6'd19: begin dataout[47:19] <= datain[28:0]; dataout[18:0] <= 0;
overflow <= ~((datain[47:28]==20'hfffff)&(datain[27:0] != 0)); end
6'd20: begin dataout[47:20] <= datain[27:0]; dataout[19:0] <= 0;
overflow <= ~((datain[47:27]==21'b111111111111111111111)&(datain[26:0] != 0)); end
6'd21: begin dataout[47:21] <= datain[26:0]; dataout[20:0] <= 0;
overflow <= ~((datain[47:26]==22'b1111111111111111111111)&(datain[25:0] != 0)); end
6'd22: begin dataout[47:22] <= datain[25:0]; dataout[21:0] <= 0;
overflow <= ~((datain[47:25]==23'b11111111111111111111111)&(datain[24:0] != 0)); end
6'd23: begin dataout[47:23] <= datain[24:0]; dataout[22:0] <= 0;
overflow <= ~((datain[47:24]==24'hffffff)&(datain[23:0] != 0)); end
6'd24: begin dataout[47:24] <= datain[23:0]; dataout[23:0] <= 0;
overflow <= ~((datain[47:23]==25'b1111111111111111111111111)&(datain[22:0] != 0)); end
6'd25: begin dataout[47:25] <= datain[22:0]; dataout[24:0] <= 0;
overflow <= ~((datain[47:22]==26'b11111111111111111111111111)&(datain[21:0] != 0)); end
6'd26: begin dataout[47:26] <= datain[21:0]; dataout[25:0] <= 0;
overflow <= ~((datain[47:21]==27'b111111111111111111111111111)&(datain[20:0] != 0)); end
6'd27: begin dataout[47:27] <= datain[20:0]; dataout[26:0] <= 0;
overflow <= ~((datain[47:20]==28'hfffffff)&(datain[19:0] != 0)); end
6'd28: begin dataout[47:28] <= datain[19:0]; dataout[27:0] <= 0;
overflow <= ~((datain[47:19]==29'b11111111111111111111111111111)&(datain[18:0] != 0)); end
6'd29: begin dataout[47:29] <= datain[18:0]; dataout[28:0] <= 0;
overflow <= ~((datain[47:18]==30'b111111111111111111111111111111)&(datain[17:0] != 0)); end
6'd30: begin dataout[47:30] <= datain[17:0]; dataout[29:0] <= 0;
overflow <= ~((datain[47:17]==31'b1111111111111111111111111111111)&(datain[16:0] != 0)); end
6'd31: begin dataout[47:31] <= datain[16:0]; dataout[30:0] <= 0;
overflow <= ~((datain[47:16]==32'hffffffff)&(datain[15:0] != 0)); end
6'd32: begin dataout[47:32] <= datain[15:0]; dataout[31:0] <= 0;
overflow <= ~((datain[47:15]==33'b111111111111111111111111111111111)&(datain[14:0] != 0)); end
6'd33: begin dataout[47:33] <= datain[14:0]; dataout[32:0] <= 0;
overflow <= ~((datain[47:14]==34'b1111111111111111111111111111111111)&(datain[13:0] != 0)); end
6'd34: begin dataout[47:34] <= datain[13:0]; dataout[33:0] <= 0;
overflow <= ~((datain[47:13]==35'b11111111111111111111111111111111111)&(datain[12:0] != 0)); end
6'd35: begin dataout[47:35] <= datain[12:0]; dataout[34:0] <= 0;
overflow <= ~((datain[47:12]==36'hfffffffff)&(datain[11:0] != 0)); end
6'd36: begin dataout[47:36] <= datain[11:0]; dataout[35:0] <= 0;
overflow <= ~((datain[47:11]==37'b1111111111111111111111111111111111111)&(datain[10:0] != 0)); end







❖ Look-Up Table
module lut_dm(out, in, shift_overf, clk, reset);




output [47:0] out; 
reg [47:0] out; 
always @(posedge clk or negedge reset)
begin
if (reset == 0)
out <= 0;
else 
begin 
































































































































































































































































































































































































































































❖ Register for Constants
module reg16_dm(data_out, data_in, addr, wr_en, fv_ack, clk, reset);
output [47:0] data_out;
input [15:0] data_in;





reg [47:0] data_reg; 
assign data_out = data_reg; 




if (reset == 0)
begin 




if (fv_ack ==1) 
data_reg <= 48'd0;
else
begin
if (wr_en == 1)
begin
case (addr)
2'b10 : data_reg[47:32] <= data_in;
2'b01 : data_reg[31:16] <= data_in;







❖ Register for Scores
module reg48_dm(data_out1, data_in, in_sel, out_sel,
wr_en, out_en_reg48, frame_counter, word_end_index,
word_start_index, last_frame_counter, result_ack, //out_sel,
counter_reg48, fv_ack, clk, reset);
output [47:0] data_out1;
input result_ack;
input [47:0] data_in;
input [3:0] in_sel;
input clk, reset;
input wr_en;
input out_en_reg48;
input [7:0] frame_counter;
input word_end_index;
input word_start_index;
input [2:0] last_frame_counter;
input [2:0] out_sel;
input [6:0] counter_reg48;
input fv_ack;
reg [47:0] data_reg1, data_reg2, data_reg3, data_reg4, data_reg5, data_reg6, data_reg7,
data_reg8, data_reg_tmp;
/* -- data_reg{1-8} store the cost without changing state, data_reg_tmp stores the cost
with changing state -- */
reg [47:0] data_out1_reg;
assign data_out1 = data_out1_reg;




if (reset == 0) 
begin 
data_reg1 <= 0;
data_reg2 <= 0;
data_reg3 <= 0;
data_reg4 <= 0;
data_reg5 <= 0;
data_reg6 <= 0; 
data_reg7 <= 0; 
data_reg8 <= 0; 
data_reg_tmp <= 0; 




if (fv_ack == 1) 
begin 
data_reg1 <= 0;
data_reg2 <= 0;
data_reg3 <= 0;
data_reg4 <= 0;
data_reg5 <= 0;
data_reg6 <= 0;
data_reg7 <= 0;
data_reg8 <= 0;
data_reg_tmp <= 0;




/* -- assign different register to output -- */
if (out_en_reg48 == 1)
begin
case (out_sel)
3'b000 : data_out1_reg <= data_reg1;
3'b001 : data_out1_reg <= data_reg2;
3'b010 : data_out1_reg <= data_reg3;
3'b011 : data_out1_reg <= data_reg4;
3'b100 : data_out1_reg <= data_reg5;
3'b101 : data_out1_reg <= data_reg6;
3'b110 : data_out1_reg <= data_reg7;
3'b111 : data_out1_reg <= data_reg8;
endcase
end
/* -- reset all the parameters when calculation finished, in order to save power -- */
if (last_frame_counter[0] == 1 || result_ack == 1)
data_reg1 <= 0;
if (word_start_index == 1 || result_ack == 1)
begin
data_reg2 <= 0;
data_reg3 <= 0;
data_reg4 <= 0;
data_reg5 <= 0;
data_reg6 <= 0;
data_reg7 <= 0;
//data_reg8 <= 0;
data_reg_tmp <= 0;
end
if (frame_counter == 8'd1)
data_reg8 <= 0;
/* -- storing the input when wr_en == 1 -- */
if (result_ack == 0)
begin
if (wr_en == 1)
begin
case (in_sel)
4'b0_000 : data_reg1 <= data_in;
4'b0_001 : data_reg2 <= data_in;
4'b0_010 : data_reg3 <= data_in;
4'b0_011 : data_reg4 <= data_in;
4'b0_100 : data_reg5 <= data_in;
4'b0_101 : data_reg6 <= data_in;
4'b0_110 : data_reg7 <= data_in;
4'b0_111 : data_reg8 <= data_in;
4'b1_000 : data_reg_tmp <= data_in;
4'b1_001 : data_reg_tmp <= data_in;
4'b1_010 : data_reg_tmp <= data_in;
4'b1_011 : data_reg_tmp <= data_in;
4'b1_100 : data_reg_tmp <= data_in;
4'b1_101 : data_reg_tmp <= data_in;
4'b1_110 : data_reg_tmp <= data_in;





/* -- swap the registers when the data_reg_tmp > data_reg{1-8} -- */
if (result_ack == 0)
begin 





8'd1 : data_reg2 <= data_reg_tmp;
8'd2 : data_reg3 <= data_reg_tmp;
8'd3 : data_reg4 <= data_reg_tmp;
8'd4 : data_reg5 <= data_reg_tmp;
8'd5 : data_reg6 <= data_reg_tmp;
8'd6 : data_reg7 <= data_reg_tmp;








if (out_sel == 3'd2 && wr_en == 1'b1)
begin
if ({data_reg_tmp[47], data_reg2[47]} != 2'b10)
begin
if ({data_reg_tmp[47], data_reg2[47]} == 2'b01)
data_reg2 <= data_reg_tmp;
else
if (data_reg_tmp > data_reg2)
data_reg2 <= data_reg_tmp;
end
end
if (out_sel == 3'd3 && wr_en == 1'b1)
begin
if ({data_reg_tmp[47], data_reg3[47]} != 2'b10)
begin
if ({data_reg_tmp[47], data_reg3[47]} == 2'b01)
data_reg3 <= data_reg_tmp;
else
if (data_reg_tmp > data_reg3)
data_reg3 <= data_reg_tmp;
end
end
if (out_sel == 3'd4 && wr_en == 1'b1)
begin
if ({data_reg_tmp[47], data_reg4[47]} != 2'b10)
begin
if ({data_reg_tmp[47], data_reg4[47]} == 2'b01)
data_reg4 <= data_reg_tmp;
else
if (data_reg_tmp > data_reg4)
data_reg4 <= data_reg_tmp;
end
end
if (out_sel == 3'd5 && wr_en == 1'b1)
begin
if ({data_reg_tmp[47], data_reg5[47]} != 2'b10)
begin
if ({data_reg_tmp[47], data_reg5[47]} == 2'b01)
data_reg5 <= data_reg_tmp;
else
if (data_reg_tmp > data_reg5)
data_reg5 <= data_reg_tmp;
end
end
if (out_sel == 3'd6 && wr_en == 1'b1)
begin
if ({data_reg_tmp[47], data_reg6[47]} != 2'b10)
begin
if ({data_reg_tmp[47], data_reg6[47]} == 2'b01)
data_reg6 <= data_reg_tmp;
else
if (data_reg_tmp > data_reg6)
data_reg6 <= data_reg_tmp;
end
end
if (out_sel == 3'd7 && wr_en == 1'b1)
begin
if ({data_reg_tmp[47], data_reg7[47]} != 2'b10)
begin
if ({data_reg_tmp[47], data_reg7[47]} == 2'b01)
data_reg7 <= data_reg_tmp;
else
if (data_reg_tmp > data_reg7)
data_reg7 <= data_reg_tmp;
end
end
if (out_sel == last_frame_counter && last_frame_counter != 0)
begin
if (wr_en == 1'b1)
begin
if ({data_reg_tmp[47], data_reg8[47]} != 2'b10)
begin
if ({data_reg_tmp[47], data_reg8[47]} == 2'b01)
data_reg8 <= data_reg_tmp;
else
if (data_reg_tmp > data_reg8)




if (out_sel == 3'd0 && wr_en == 1'b1 && frame_counter != 8'd7)
begin
if ({data_reg_tmp[47], data_reg8[47]} != 2'b10)
begin
if ({data_reg_tmp[47], data_reg8[47]} == 2'b01)
data_reg8 <= data_reg_tmp;
else
if (data_reg_tmp > data_reg8)









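The register updates in this listing all use the same pattern: test the two sign bits (bit 47) first, then fall back to a plain `>` comparison. The Python sketch below is a software reference model of that pattern, not part of the chip's source. It assumes the 48-bit scores are two's-complement values, which is consistent with the reset value `48'h800000000000` used by the final score register; the function names `rtl_greater`, `to_signed` and `update_best` are hypothetical.

```python
WIDTH = 48
MASK = (1 << WIDTH) - 1
SIGN = 1 << (WIDTH - 1)


def to_signed(x: int) -> int:
    """Interpret a 48-bit pattern as a two's-complement integer (assumption)."""
    x &= MASK
    return x - (1 << WIDTH) if x & SIGN else x


def rtl_greater(new: int, old: int) -> bool:
    """Mirror the listing: sign-bit test first, then an unsigned compare."""
    new &= MASK
    old &= MASK
    new_neg = bool(new & SIGN)
    old_neg = bool(old & SIGN)
    if new_neg and not old_neg:
        # {new[47], old[47]} == 2'b10: new is negative, old is not; keep old.
        return False
    if not new_neg and old_neg:
        # {new[47], old[47]} == 2'b01: new is non-negative, old is negative.
        return True
    # Same sign: the unsigned comparison gives the signed ordering.
    return new > old


def update_best(best_score: int, best_index: int, score: int, index: int):
    """Keep the larger score and its word index, as the final score register does."""
    if rtl_greater(score, best_score):
        return score & MASK, index
    return best_score, best_index
```

Under the two's-complement assumption, `rtl_greater(a, b)` agrees with `to_signed(a) > to_signed(b)` for every pair of 48-bit patterns, so the reset value `0x800000000000` (the most negative score) is replaced by any other score.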
❖ Final Score Register
module final_score_reg_dm(word_index_out,
word_end_index, final_comp_index, fv_ack, data_fscore, clk, reset);
output [5:0] word_index_out;
input word_end_index, final_comp_index, clk, reset;
input fv_ack;
input [47:0] data_fscore;
reg [47:0] data_reg;
reg [5:0] word_index_out_reg;
reg [5:0] word_index_counter;
assign word_index_out = word_index_out_reg;
always @(posedge clk or negedge reset)
begin
if (reset == 0)
begin
data_reg <= 48'h800000000000;
word_index_out_reg <= 0;




if (fv_ack == 1)
begin
/* ----- reset all the parameters when fv_ack == 1 ----- */
data_reg <= 48'h800000000000;
word_index_out_reg <= 0;




/* ----- counting the number of words that have been calculated ----- */
if (word_end_index == 1)
word_index_counter <= word_index_counter + 1;
/* ----- replace the old word index when the new global cost is higher than the old global cost ----- */
if (final_comp_index == 1 && word_index_counter != 0)
begin
if ({data_fscore[47], data_reg[47]} != 2'b10)
begin
if ({data_fscore[47], data_reg[47]} == 2'b01)
begin
data_reg <= data_fscore;
word_index_out_reg <= word_index_counter;
end
else
if (data_fscore > data_reg)
begin
data_reg <= data_fscore;

















frame_counter, address_common, address_ram, address_rom,
in_sel_mul,
out_sel_cla16,
in_sel_cla48, in_sel_reg48, out_sel_reg48, wr_en_reg48, out_en_reg48,









output [6:0] counter_reg48;
output result_ack;
output [1:0] addr_reg16;
output wr_en_reg16;
output [7:0] frame_counter;
output [4:0] address_common;
output [7:0] address_ram;
output [10:0] address_rom;
output in_sel_mul, out_sel_cla16;
output [1:0] in_sel_cla48;
output [3:0] in_sel_reg48;






output [2:0] last_frame_counter;
output [5:0] shift_num;
output shift_en;
input [5:0] word_num;
input [5:0] word_index_out;
input fv_ack;
input clk, reset;
input [15:0] in1;
input [47:0] lut_out;










reg [1:0] lut_index;
reg [6:0] counter;
reg [1:0] counter_4;
reg counter_4_start;
reg [7:0] frame_num; /* ----- frame_num = (total number of frame) - 8 ----- */
reg in_sel_mul_reg; /* ----- select multiplication between (1'b0)xx and (1'b1)xxy ----- */
reg out_sel_cla16_reg; /* ----- select the output of (1'b0)x-xmean or (1'b1)variance ----- */
reg [3:0] in_sel_reg48_reg; /* ----- select the registers in the reg48 to store input ----- */
reg [3:0] in_sel_reg48_reg_tmp; /* ----- use to generate the in_sel_reg48_reg from
out_sel_reg48_reg ----- */
reg [2:0] out_sel_reg48_reg; /* ----- select the output from the registers in the reg48 ----- */
reg wr_en_regx_reg; /* ----- write enable of the reg48_reg, active high ----- */
reg wr_en_reg48_reg;
reg [1:0] in_sel_cla48_reg; /* ----- select the input between the multiplier output and the
reg16 output ----- */
reg out_en_reg48_reg; /* ----- set the output from (1'b1) the registers inside or (1'b0) the
input ----- */
reg [7:0] frame_counter_reg; /* ----- frame number counter (counting up to total frame number
- 8) ----- */
reg [2:0] last_frame_counter_reg; /* ----- counting the last 8 frames ----- */
reg word_end_index_reg; /* ----- indicate the end of the calculation of one word ----- */
reg word_start_index_reg; /* ----- indicate the start of the calculation of one word ----- */
reg fv_ack_reg; /* ----- set to high when fv_ack_reg == 1 ----- */
reg [1:0] addr_reg16_reg; /* ----- set the address of the registers in the reg16 ----- */
reg wr_en_reg16_reg; /* ----- write enable of the reg16, active high ----- */
reg [4:0] addr_32bit; /* ----- first five bits of output address, counting from 0 to 25 ----- */
reg [7:0] frame_num_addr; /* ----- address select the feature vectors at appropriate frame
time ----- */
reg [2:0] state_counter_reg; /* ----- pointing at the state now being calculating ----- */
reg [2:0] state_counter_reg1;
reg [5:0] word_addr; /* ----- pointing at the word now being calculating ----- */
reg [5:0] word_addr1;
reg [4:0] tm_gc; /* ----- Transition Matrix and Gaussian Constant ----- */
reg result_ack_reg; /* ----- set to high when the search process finished ----- */
reg final_comp_index_reg;
reg read_const;
reg [5:0] shift_num;
reg shift_set;
reg shift_en;
assign final_comp_index = final_comp_index_reg;
assign result_ack = result_ack_reg;
assign addr_reg16 = addr_reg16_reg;
assign wr_en_reg16 = wr_en_reg16_reg;
assign word_start_index = word_start_index_reg;
assign last_frame_counter = last_frame_counter_reg;
assign word_end_index = word_end_index_reg;
assign frame_counter = frame_counter_reg;
assign in_sel_mul = in_sel_mul_reg;
assign out_sel_cla16 = out_sel_cla16_reg;
assign in_sel_reg48 = in_sel_reg48_reg;
assign out_sel_reg48 = out_sel_reg48_reg;
assign wr_en_regx = wr_en_regx_reg;
assign in_sel_cla48 = in_sel_cla48_reg;
assign out_en_reg48 = out_en_reg48_reg;
assign counter_reg48 = counter;
assign wr_en_reg48 = wr_en_reg48_reg;
assign regx_load_x2 = regx_load_x2_reg;
assign regx_load_adder = regx_load_adder_reg;
assign regx_start = regx_start_reg;
assign subcomp_en = subcomp_en_reg;
assign regx_store = regx_store_reg;
assign address_common = (lut_index==2'b00)? lut_out[4:0]: 5'bz;
assign address_common = (lut_index==2'b01)? lut_out[20:16]: 5'bz;
assign address_common = (lut_index==2'b10)? lut_out[36:32]: 5'bz;
assign address_common = (lut_index==2'b11)? addr_32bit: 5'bz;
assign address_ram = {frame_num_addr};
assign address_rom = ({result_ack, lut_index, read_const}==4'b1_11_0) ? {4'b0000,
word_index_out} : 11'bz;
assign address_rom = ({result_ack, lut_index, read_const}==4'b0_00_0) ? lut_out[15:5]: 11'bz;
assign address_rom = ({result_ack, lut_index, read_const}==4'b0_01_0) ? lut_out[31:21]:
11'bz;
assign address_rom = ({result_ack, lut_index, read_const}==4'b0_10_0) ? lut_out[47:37]:
11'bz;
assign address_rom = ({result_ack, lut_index, read_const}==4'b0_11_0) ? {db, counter[1],
word_addr, state_counter_reg} : 11'bz;
assign address_rom = ({result_ack, lut_index, read_const}==4'b0_11_1) ? {db, counter[1],
word_addr1, state_counter_reg1} : 11'bz;
/* counter[1] == 1 is mean, counter[1] == 0 is variance */
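The tri-state `address_rom` assignments above act as a multiplexer keyed on `{result_ack, lut_index, read_const}`. As a reading aid, the following Python sketch models that selection in software. It is a hedged reconstruction from the assignments as printed, with the selector values read as `1_11_0`, `0_00_0`, `0_01_0`, `0_10_0`, `0_11_0` and `0_11_1`; the function names `pack_rom_address` and `rom_address` are hypothetical, and `None` stands in for the high-impedance case where no driver is enabled.

```python
def pack_rom_address(db: int, mean_var: int, word: int, state: int) -> int:
    """Pack {db, counter[1], word address (6 bits), state counter (3 bits)} into 11 bits."""
    return ((db & 1) << 10) | ((mean_var & 1) << 9) | ((word & 0x3F) << 3) | (state & 0x7)


def rom_address(result_ack, lut_index, read_const, *, lut_out=0, word_index_out=0,
                db=0, counter=0, word_addr=0, state_counter=0,
                word_addr1=0, state_counter1=0):
    """Return the 11-bit ROM address, or None when no driver is enabled (high-Z)."""
    key = (result_ack & 1, lut_index & 3, read_const & 1)
    mean_var = (counter >> 1) & 1  # counter[1]: 1 selects mean, 0 selects variance
    if key == (1, 3, 0):
        # Search finished: the recognized word index appears on the ROM address bus.
        return word_index_out & 0x3F
    if key == (0, 0, 0):
        return (lut_out >> 5) & 0x7FF   # lut_out[15:5]
    if key == (0, 1, 0):
        return (lut_out >> 21) & 0x7FF  # lut_out[31:21]
    if key == (0, 2, 0):
        return (lut_out >> 37) & 0x7FF  # lut_out[47:37]
    if key == (0, 3, 0):
        return pack_rom_address(db, mean_var, word_addr, state_counter)
    if key == (0, 3, 1):
        return pack_rom_address(db, mean_var, word_addr1, state_counter1)
    return None
```

For example, with `db = 1`, `counter[1] = 1`, `word_addr = 5` and `state_counter = 3`, the packed address is `1_1_000101_011` in binary.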
always @(posedge clk or negedge reset)
begin
if (reset == 0)
begin
counter <= 7'b0000010;
counter_4 <= 0;
counter_4_start <= 0;
in_sel_mul_reg <= 0;
out_sel_cla16_reg <= 0;
in_sel_reg48_reg <= 0;
out_sel_reg48_reg <= 0;
wr_en_regx_reg <= 0;
in_sel_cla48_reg <= 0;
out_en_reg48_reg <= 0;
frame_counter_reg <= 0;
in_sel_reg48_reg_tmp <= 0;
last_frame_counter_reg <= 0;
word_end_index_reg <= 0;
word_start_index_reg <= 0;
fv_ack_reg <= 0;
addr_reg16_reg <= 0;
wr_en_reg16_reg <= 0;
addr_32bit <= 5'b11111;
frame_num_addr <= 0;
word_addr <= 0;
word_addr1 <= 6'b111111;
state_counter_reg <= 0;
state_counter_reg1 <= 0;
tm_gc <= 0;
result_ack_reg <= 0;
wr_en_reg48_reg <= 0;
regx_load_x2_reg <= 0;
regx_load_adder_reg <= 0;
regx_start_reg <= 0;
subcomp_en_reg <= 0;
regx_state <= 0;
regx_store_reg <= 0;
final_comp_index_reg <= 0;
db <= 1;
lut_index <= 2'b11;
frame_num <= 0;
read_const <= 0;
shift_set <= 0;
shift_num <= 0;




if (shift_set == 0)
begin
counter <= 7'd1;
db <= 1;
word_addr <= 0;
state_counter_reg <= 0;
addr_32bit <= 5'b11111;
if (counter == 7'd1)
begin
shift_num <= rom_out[5:0];





if (word_num == word_addr && counter == 7'd36)
begin
result_ack_reg <= 1;
fv_ack_reg <= 0;
end
if (fv_ack_reg == 1)
begin
if (regx_state == 1 && counter == 7'd111)
shift_en <= 1;
else
shift_en <= 0;
if (counter == 7'd11 && regx_state == 0)




if (counter == 7'd15 && regx_state == 0)
lut_index <= 2'b01;
else
if (counter == 7'd19 && regx_state == 0)
lut_index <= 2'b10;
else
lut_index <= 2'b11;
/* ----- start address calculation ----- */
/* ----- change the address to latch transition matrix and gaussian constant ----- */
if (counter == 7'd23)
begin
addr_32bit <= 5'b11101;
tm_gc <= addr_32bit;
read_const <= 1;
end
else
if (counter == 7'd27)
begin
addr_32bit <= 5'b11110;
tm_gc <= addr_32bit;
read_const <= 1;
end
else
if (counter == 7'd31)
begin
addr_32bit <= 5'b11111;
tm_gc <= addr_32bit;
read_const <= 1;
end
else
if (counter == 7'd35)
begin
addr_32bit <= 5'b11010;
tm_gc <= addr_32bit;
end
else
if (counter == 7'd39)
begin
addr_32bit <= 5'b11011;
tm_gc <= addr_32bit;
end
else
if (counter == 7'd43)
begin
addr_32bit <= 5'b11100;
tm_gc <= addr_32bit;
end
else
if (counter == 7'd36 || counter == 7'd40 || counter == 7'd44 || counter ==
7'd24




addr_32bit <= tm_gc;
read_const <= 0;
end
if (counter_4 == 2'b00)
begin
/* ----- the first 5 address bits counting from 0 to 25 (26 feature
vectors) ----- */
if (addr_32bit == 5'd25)
begin addr_32bit <= 0;
db <= ~db;
end
else
addr_32bit <= addr_32bit + 1;
/* ----- pointing at the state now being calculating ----- */
if (addr_32bit == 5'd25 && regx_state == 1)
if (state_counter_reg == frame_counter_reg)
begin
state_counter_reg <= 0;




state_counter_reg <= state_counter_reg + 1;
state_counter_reg1 <= state_counter_reg;
end
/* ----- calculating the address of the last eight frame time ----- */
if (addr_32bit == 5'd25 && regx_state == 1)
begin
if (last_frame_counter_reg == 3'd1 && state_counter_reg ==
3'b110)
begin
state_counter_reg <= 0;
state_counter_reg1 <= state_counter_reg;
frame_num_addr <= frame_num_addr + 1;
end
else
if (last_frame_counter_reg == 3'd2 && state_counter_reg ==
3'b101)
begin
state_counter_reg <= 0;
state_counter_reg1 <= state_counter_reg;
frame_num_addr <= frame_num_addr + 1;
end
else
if (last_frame_counter_reg == 3'd3 && state_counter_reg ==
3'b100)
begin
state_counter_reg <= 0;
state_counter_reg1 <= state_counter_reg;





if (last_frame_counter_reg == 3'd4 && state_counter_reg ==
3'b011)
begin
state_counter_reg <= 0;
state_counter_reg1 <= state_counter_reg;
frame_num_addr <= frame_num_addr + 1;
end
else
if (last_frame_counter_reg == 3'd5 && state_counter_reg ==
3'b010)
begin
state_counter_reg <= 0;
state_counter_reg1 <= state_counter_reg;
frame_num_addr <= frame_num_addr + 1;
end
else
if (last_frame_counter_reg == 3'd6 && state_counter_reg ==
3'b001)
begin
state_counter_reg <= 0;
state_counter_reg1 <= state_counter_reg;
frame_num_addr <= frame_num_addr + 1;
end
else
if (last_frame_counter_reg == 3'd7 && state_counter_reg ==
3'b000)
begin
state_counter_reg <= 0;
state_counter_reg1 <= state_counter_reg;
frame_num_addr <= 0;
word_addr <= word_addr + 1;
end
else
if (state_counter_reg == frame_counter_reg || state_counter_reg
== 3'b111)
frame_num_addr <= frame_num_addr + 1;
end
end
if (frame_counter_reg == 8'd0 && regx_state == 0 && counter == 7'd36)
word_addr1 <= word_addr1 + 1;
end
/* ----- end address calculation ----- */
if (counter == 7'd109 && regx_state == 1'b1)
regx_start_reg <= 1'b1;
else
regx_start_reg <= 1'b0;
if ((counter == 7'd17 || counter == 7'd21 || counter == 7'd33)
&& regx_state == 1'b0)
begin
if (frame_counter_reg == 0 && word_addr == 0)
regx_store_reg <= 1'b0;
else





regx_store_reg <= 1'b0;
if (counter == 7'd106)
in_sel_cla48_reg <= 2'b01;
else
if (counter == 7'd18 && regx_state == 1'b0)
in_sel_cla48_reg <= 2'b10;
else
if (counter == 7'd22 && regx_state == 1'b0)
in_sel_cla48_reg <= 2'b11;
else
if (counter == 7'd34 && regx_state == 1'b0)
in_sel_cla48_reg <= 2'b11;
else
in_sel_cla48_reg <= 2'b00;
if ((counter == 7'd19 || counter == 7'd23 || counter == 7'd35)
&& regx_state == 1'b0)
begin
if (frame_counter_reg == 0 && word_addr == 0)
regx_load_x2_reg <= 1'b0;
else
regx_load_x2_reg <= 1'b1;
end
else
regx_load_x2_reg <= 1'b0;
if ((counter == 7'd23 || counter == 7'd35)
&& regx_state == 1'b0 && frame_counter_reg != 0)
wr_en_reg48_reg <= 1'b1;
else
wr_en_reg48_reg <= 1'b0;
if (counter == 7'd112)
regx_state <= ~regx_state;
if (regx_state == 1 && counter == 7'd110)
subcomp_en_reg <= 1;
else
subcomp_en_reg <= 0;
if (regx_state == 0)
begin
if (counter == 7'd109)
regx_load_adder_reg <= 1;
else
regx_load_adder_reg <= 0;
end
if (counter == 7'd35 && frame_counter_reg == 0 && regx_state == 0)
final_comp_index_reg <= 1;
else
final_comp_index_reg <= 0;
if (counter_4_start == 1)
begin
if (regx_state == 1)
if (counter == 7'd103)




out_en_reg48_reg <= 1'b0;
if (counter_4 == 2'b00 || counter == 7'd107 ||
(last_frame_counter_reg == 3'd7 && counter == 7'd114))
wr_en_regx_reg <= 1;
else
wr_en_regx_reg <= 0;
end
/* ----- output the appropriate transition matrix and gaussian constant ----- */
/* ----- latching the transition matrix and gaussian constant into the reg16 module
----- */
if ((counter == 7'd12 && regx_state == 0) || (counter == 7'd16 && regx_state ==
0) ||
(counter == 7'd20 && regx_state == 0) || counter == 7'd24
|| counter == 7'd28 || counter == 7'd32 || counter == 7'd36 || counter == 7'd40
|| counter == 7'd44)
wr_en_reg16_reg <= 1;
else
wr_en_reg16_reg <= 0;
/* ----- setting the address of the reg16 module ----- */
if (counter == 7'd12 || counter == 7'd32 || counter == 7'd44)
addr_reg16_reg <= 2'b00;
else
if (counter == 7'd16 || counter == 7'd28 || counter == 7'd40)
addr_reg16_reg <= 2'b01;
else
if (counter == 7'd20 || counter == 7'd24 || counter == 7'd36)
addr_reg16_reg <= 2'b10;
/* ----- setting the fv_ack_reg to high when fv_ack == 1 ----- */
if (fv_ack == 1)
begin
counter <= 7'd1;
counter_4 <= 0;
counter_4_start <= 0;
in_sel_mul_reg <= 0;
out_sel_cla16_reg <= 0;
in_sel_reg48_reg <= 0;
out_sel_reg48_reg <= 0;
wr_en_regx_reg <= 0;
in_sel_cla48_reg <= 0;
out_en_reg48_reg <= 0;
frame_counter_reg <= 0;
in_sel_reg48_reg_tmp <= 0;
last_frame_counter_reg <= 0;
word_end_index_reg <= 0;
word_start_index_reg <= 0;
fv_ack_reg <= 1;
addr_reg16_reg <= 0;
wr_en_reg16_reg <= 0;
addr_32bit <= 5'b11111;
frame_num_addr <= 0;
word_addr <= 0;
word_addr1 <= 6'b111111;
state_counter_reg <= 0;
state_counter_reg1 <= 0;
tm_gc <= 0;
result_ack_reg <= 0;
wr_en_reg48_reg <= 0;
regx_load_x2_reg <= 0;
regx_load_adder_reg <= 0;
regx_start_reg <= 0;
subcomp_en_reg <= 0;
regx_state <= 0;
regx_store_reg <= 0;
final_comp_index_reg <= 0;
db <= 0;
lut_index <= 2'b11;
read_const <= 0;
end
if (counter == 2)
frame_num <= in1[7:0];
/* ----- generation of all the counters to be used ----- */
if (fv_ack_reg == 1 && fv_ack != 1)
begin
if (word_end_index == 1'b1)
frame_counter_reg <= 8'b11111111;
else
if (counter == 7'd114 && out_sel_reg48_reg == 0 && regx_state == 0)
frame_counter_reg <= frame_counter_reg + 1;
if (counter == 7'd114)
counter <= 7'd11;
else
counter <= counter + 1;
counter_4 <= counter_4 + 1;
if (counter == 7'd7)
counter_4_start <= 1;
end
word_start_index_reg <= word_end_index_reg;
/* ----- reset the parameter for calculating the new word ----- */
if (counter == 7'd112 && frame_counter_reg == frame_num &&
out_sel_reg48_reg == 0 &&
regx_state == 1'b1)
begin
word_end_index_reg <= 1;
last_frame_counter_reg <= 0;
end
else
word_end_index_reg <= 0;
/* ----- select the multiplication of xx or xxy ----- */
in_sel_mul_reg <= counter[1];
/* ----- out_sel_cla16_reg == 1 means the output is variance,
else out_sel_cla16_reg == 0 means the output is the




/* ----- in_sel_cla48_reg == 1 means the addition between the Gaussian constant
and transition matrix elements output of the reg48 module, else
in_sel_cla48_reg == 0 means the addition between the output of the reg48
module and the output of the multiplier ----- */
if (regx_state == 1)
if (counter == 7'd111)
begin
/* ----- select the output of the reg48 module at the last eight frame time ----- */
if (last_frame_counter_reg == 3'd0 && out_sel_reg48_reg == 3'b111 &&
frame_counter_reg == frame_num)
begin
last_frame_counter_reg <= 3'd1;
out_sel_reg48_reg <= 3'b001;
end
else
if (last_frame_counter_reg == 3'd1 && out_sel_reg48_reg == 3'b111)
begin
last_frame_counter_reg <= 3'd2;
out_sel_reg48_reg <= 3'b010;
end
else
if (last_frame_counter_reg == 3'd2 && out_sel_reg48_reg == 3'b111)
begin
last_frame_counter_reg <= 3'd3;
out_sel_reg48_reg <= 3'b011;
end
else
if (last_frame_counter_reg == 3'd3 && out_sel_reg48_reg == 3'b111)
begin
last_frame_counter_reg <= 3'd4;
out_sel_reg48_reg <= 3'b100;
end
else
if (last_frame_counter_reg == 3'd4 && out_sel_reg48_reg == 3'b111)
begin
last_frame_counter_reg <= 3'd5;
out_sel_reg48_reg <= 3'b101;
end
else
if (last_frame_counter_reg == 3'd5 && out_sel_reg48_reg == 3'b111)
begin
last_frame_counter_reg <= 3'd6;
out_sel_reg48_reg <= 3'b110;
end
else
if (last_frame_counter_reg == 3'd6 && out_sel_reg48_reg == 3'b111)
begin
last_frame_counter_reg <= 3'd7;
out_sel_reg48_reg <= 3'b111;
end
else
/* ----- select the output of the reg48 module at the first eight frame time ----- */
case ({frame_counter_reg, out_sel_reg48_reg})
11'b00000000_000 : out_sel_reg48_reg <= 0;
11'b00000001_001 : out_sel_reg48_reg <= 0;
11'b00000010_010 : out_sel_reg48_reg <= 0;
11'b00000011_011 : out_sel_reg48_reg <= 0;
11'b00000100_100 : out_sel_reg48_reg <= 0;
11'b00000101_101 : out_sel_reg48_reg <= 0;
11'b00000110_110 : out_sel_reg48_reg <= 0;
default: out_sel_reg48_reg <= out_sel_reg48_reg + 1;
endcase
end
/* ----- in_sel_reg_reg48[3] is used to control the write operation
on the odd or even entry of the reg48 ----- */
if (regx_state == 0)
if (counter == 7'd35)
in_sel_reg48_reg[3] <= 1'b1;
else
in_sel_reg48_reg[3] <= 1'b0;
/* ----- because in_sel_reg48_reg and out_sel_reg48_reg has 2 bit difference ----- */
if (counter == 7'd33)





❖ Top
module int_signal_dm(int_signal_out, mulout, sumout_cla48, reg48_out1,
reg16_out, regx_out, subcomp_out, shift_out, lut_out,
mod_sel, bit_sel, clk, reset);
input [47:0] mulout, sumout_cla48, reg48_out1, reg16_out;
input [47:0] regx_out, subcomp_out, shift_out, lut_out;
output [7:0] int_signal_out;
input reset, clk;
input [2:0] mod_sel;
input [2:0] bit_sel;
reg [7:0] int_signal_out_reg;
assign int_signal_out = int_signal_out_reg;
always @(posedge clk or negedge reset)
begin
if (reset == 0)
begin









3'd0 : int_signal_out_reg <= mulout[7:0];
3'd1 : int_signal_out_reg <= mulout[15:8];
3'd2 : int_signal_out_reg <= mulout[23:16];
3'd3 : int_signal_out_reg <= mulout[31:24];
3'd4 : int_signal_out_reg <= mulout[39:32];





if (mod_sel == 3'b001)
begin
case (bit_sel)
3'd0 : int_signal_out_reg <= sumout_cla48[7:0];
3'd1 : int_signal_out_reg <= sumout_cla48[15:8];
3'd2 : int_signal_out_reg <= sumout_cla48[23:16];
3'd3 : int_signal_out_reg <= sumout_cla48[31:24];
3'd4 : int_signal_out_reg <= sumout_cla48[39:32];





if (mod_sel == 3'b010)
begin
case (bit_sel)
3'd0 : int_signal_out_reg <= reg48_out1[7:0];
3'd1 : int_signal_out_reg <= reg48_out1[15:8];
3'd2 : int_signal_out_reg <= reg48_out1[23:16];
3'd3 : int_signal_out_reg <= reg48_out1[31:24];
3'd4 : int_signal_out_reg <= reg48_out1[39:32];





if (mod_sel == 3'b011)
begin
case (bit_sel)
3'd0 : int_signal_out_reg <= reg16_out[7:0];
3'd1 : int_signal_out_reg <= reg16_out[15:8];
3'd2 : int_signal_out_reg <= reg16_out[23:16];
3'd3 : int_signal_out_reg <= reg16_out[31:24];
3'd4 : int_signal_out_reg <= reg16_out[39:32];










3'd0 : int_signal_out_reg <= regx_out[7:0];
3'd1 : int_signal_out_reg <= regx_out[15:8];
3'd2 : int_signal_out_reg <= regx_out[23:16];
3'd3 : int_signal_out_reg <= regx_out[31:24];
3'd4 : int_signal_out_reg <= regx_out[39:32];








3'd0 : int_signal_out_reg <= subcomp_out[7:0];
3'd1 : int_signal_out_reg <= subcomp_out[15:8];
3'd2 : int_signal_out_reg <= subcomp_out[23:16];
3'd3 : int_signal_out_reg <= subcomp_out[31:24];
3'd4 : int_signal_out_reg <= subcomp_out[39:32];








3'd0 : int_signal_out_reg <= shift_out[7:0];
3'd1 : int_signal_out_reg <= shift_out[15:8];
3'd2 : int_signal_out_reg <= shift_out[23:16];
3'd3 : int_signal_out_reg <= shift_out[31:24];
3'd4 : int_signal_out_reg <= shift_out[39:32];





if (mod_sel == 3'b111)
begin
case (bit_sel)
3'd0 : int_signal_out_reg <= lut_out[7:0];
3'd1 : int_signal_out_reg <= lut_out[15:8];
3'd2 : int_signal_out_reg <= lut_out[23:16];
3'd3 : int_signal_out_reg <= lut_out[31:24];
3'd4 : int_signal_out_reg <= lut_out[39:32];









module top_core_dm(int_signal_out, result_ack, address_common, address_ram, address_rom,
overf,
in1, in2, word_num, mod_sel, bit_sel,
fv_ack, clk, reset);
output overf; /* ----- overflow output ----- */
output [7:0] int_signal_out; /* ----- output of the internal signal ----- */
output [7:0] address_ram; /* ----- ram address for latching the feature vector ----- */
output [10:0] address_rom; /* ----- rom address for latching the mean, variance, gaussian
constant,
transition matrix, the recognized word will also be shown at the
rom address bus ----- */
output [4:0] address_common; /* ----- common address for first five bits of address_ram and
address_rom ----- */
output result_ack; /* ----- set to high when the search process finished ----- */
input [5:0] word_num; /* ----- number of words stored in rom ----- */
input fv_ack; /* ----- indicate the start of searching ----- */
input [15:0] in1, in2; /* ----- in1 is the data from ram, in2 is the data from rom ----- */
/* ----- address 8'd31 of ram stores frame_num and will be latched in from
in1 ----- */
input clk, reset;
input [2:0] mod_sel; /* ----- select the internal output of the four signals between mulout,
sumout_cla48, reg48_out1, data_fscore ----- */
input [2:0] bit_sel; /* ----- select the 8 bit outputs within the internal signal ----- */
wire overf_cla16; /* ----- overflow flag of 16bit carry look ahead adder ----- */
wire overf_cla48; /* ----- overflow flag of 48bit carry look ahead adder ----- */
wire [5:0] word_index_out;
wire [1:0] addr_reg16;
wire wr_en_reg16;
wire word_start_index;
wire [47:0] reg16_out;





wire [3:0] in_sel_reg48;
wire [2:0] out_sel_reg48;
wire [15:0] sumout_cla16;
wire [47:0] mulout, sumout_cla48, reg48_out1;
wire out_en_reg48;
wire [7:0] frame_counter;
wire word_end_index;
wire [2:0] last_frame_counter;
wire [6:0] counter_reg48;
wire [47:0] lut_out;
wire [47:0] subcomp_out;
wire [47:0] regx_out1;
wire [47:0] regx_out2;
wire [47:0] regx_out;
wire [47:0] lut_in;






reg [15:0] in1_d, in2_d;
always @(posedge clk or negedge reset)
begin
if (reset == 0)
begin
fv_ack_d <= 0;
in1_d <= 0;




fv_ack_d <= fv_ack;
in1_d <= in1;
in2_d <= in2;
end
end
assign overf = overf_cla16 | overf_cla48;
int_signal_dm int_signal_dm
(.int_signal_out(int_signal_out), .mulout(mulout), .sumout_cla48(sumout_cla48),
.reg48_out1(reg48_out1),
.reg16_out(reg16_out), .regx_out(regx_out), .subcomp_out(subcomp_out),
.shift_out(lut_in), .lut_out(lut_out),




.out_sel(out_sel_cla16), .clk(clk), .reset(reset));
multiplier_dm multiplier_dm
(.mulout(mulout), .in_sel(in_sel_mul),
.in1(sumout_cla16), .counter_cla48(counter_reg48[0]), .clk(clk), .reset(reset));
core_cla48_dm core_cla48_dm
(.sumout(sumout_cla48), .overflow(overf_cla48), .in_sel(in_sel_cla48),



























.out_sel_cla16(out_sel_cla16), .in_sel_cla48(in_sel_cla48), .in_sel_reg48(in_sel_reg48),







.word_num(word_num), .in1(in1_d), .clk(clk), .reset(reset));
reg16_dm reg16_dm






.fv_ack(fv_ack_d), .data_fscore(sumout_cla48), .clk(clk),
.reset(reset));
regx_dm regx_dm
(.dout(regx_out), .dout_x1(regx_out1), .dout_x2(regx_out2), .adder_in(sumout_cla48),
.start(regx_start), .load_adder(regx_load_adder), .store(regx_store),




(.out(subcomp_out), .sel_x2(sel_x2), .in1(regx_out1), .in2(regx_out2),
.en(subcomp_en), .clk(clk), .reset(reset));
lut_dm lut_dm
(.out(lut_out), .in(lut_in), .shift_overf(overf_shift), .clk(clk), .reset(reset));
shift_dm shift_dm
(.shift_num(shift_num), .datain(subcomp_out), .dataout(lut_in), .overflow(overf_s





Appendix II Chip Microphotograph 
[Figure: microphotograph of the fabricated chip]
Appendix III Pin Assignment of the Speech Recognition IC
[Figure: Top View of PGA 100. Pin columns are numbered 13 to 1 from left to right; pin rows are lettered.]
| Pin Name | Pin Number | IN/OUT | Description |
| vdd3allp_01 | B2 | IN | VDD |
| PAD_I_mod_sel_0 | B1 | IN | select an internal block to check its output: bit 0 |
| PAD_I_mod_sel_1 | C2 | IN | select an internal block to check its output: bit 1 |
| PAD_I_mod_sel_2 | C1 | IN | select an internal block to check its output: bit 2 |
| PAD_I_bit_sel_0 | D2 | IN | select 8 bits of an internal block's output: bit 0 |
| PAD_I_bit_sel_1 | D1 | IN | select 8 bits of an internal block's output: bit 1 |
| PAD_I_bit_sel_2 | E2 | IN | select 8 bits of an internal block's output: bit 2 |
| PAD_I_word_num_0 | E1 | IN | word number of the vocabulary: bit 0 |
| gnd3allp_01 | F3 | IN | GND |
| PAD_I_word_num_1 | F2 | IN | word number of the vocabulary: bit 1 |
| PAD_I_word_num_2 | F1 | IN | word number of the vocabulary: bit 2 |
| PAD_I_word_num_3 | G2 | IN | word number of the vocabulary: bit 3 |
| vdd3allp_02 | G3 | IN | VDD |
| PAD_I_word_num_4 | G1 | IN | word number of the vocabulary: bit 4 |
| PAD_I_word_num_5 | H1 | IN | word number of the vocabulary: bit 5 |
| PAD_I_fv_ack | H2 | IN | the start signal |
| gnd3allp_02 | H3 | IN | GND |
| PAD_I_in1_0 | J1 | IN | feature vector input: bit 0 |
| PAD_I_in1_1 | J2 | IN | feature vector input: bit 1 |
| PAD_I_in1_2 | K1 | IN | feature vector input: bit 2 |
| PAD_I_in1_3 | K2 | IN | feature vector input: bit 3 |
| PAD_I_in1_4 | L1 | IN | feature vector input: bit 4 |
| PAD_I_in1_5 | M1 | IN | feature vector input: bit 5 |
| vdd3allp_03 | L2 | IN | VDD |
| PAD_I_in1_6 | N1 | IN | feature vector input: bit 6 |
| PAD_I_in1_7 | M2 | IN | feature vector input: bit 7 |
| PAD_I_in1_8 | N2 | IN | feature vector input: bit 8 |
| gnd3allp_03 | M3 | IN | GND |
| PAD_I_in1_9 | N3 | IN | feature vector input: bit 9 |
| PAD_I_in1_10 | M4 | IN | feature vector input: bit 10 |
| PAD_I_in1_11 | N4 | IN | feature vector input: bit 11 |
| PAD_I_in1_12 | M5 | IN | feature vector input: bit 12 |
| PAD_I_in1_13 | N5 | IN | feature vector input: bit 13 |
| vdd3allp_04 | L6 | IN | VDD |
| PAD_I_in1_14 | M6 | IN | feature vector input: bit 14 |
| PAD_I_in1_15 | N6 | IN | feature vector input: bit 15 |
| PAD_I_in2_0 | M7 | IN | model parameter input: bit 0 |
| gnd3allp_04 | L7 | IN | GND |
| PAD_I_in2_1 | N7 | IN | model parameter input: bit 1 |
| PAD_I_in2_2 | N8 | IN | model parameter input: bit 2 |
| PAD_I_in2_3 | M8 | IN | model parameter input: bit 3 |
| vdd3allp_05 | L8 | IN | VDD |
| PAD_I_in2_4 | N9 | IN | model parameter input: bit 4 |
| PAD_I_in2_5 | M9 | IN | model parameter input: bit 5 |
| PAD_I_in2_6 | N10 | IN | model parameter input: bit 6 |
| gnd3allp_05 | M10 | IN | GND |
| PAD_I_in2_7 | N11 | IN | model parameter input: bit 7 |
| PAD_I_in2_8 | N12 | IN | model parameter input: bit 8 |
| PAD_I_in2_9 | M11 | IN | model parameter input: bit 9 |
| PAD_I_in2_10 | N13 | IN | model parameter input: bit 10 |
| vdd3allp_06 | M12 | IN | VDD |
| PAD_I_in2_11 | M13 | IN | model parameter input: bit 11 |
| PAD_I_in2_12 | L12 | IN | model parameter input: bit 12 |
| PAD_I_in2_13 | L13 | IN | model parameter input: bit 13 |
| PAD_I_in2_14 | K12 | IN | model parameter input: bit 14 |
| PAD_I_in2_15 | K13 | IN | model parameter input: bit 15 |
| gnd3allp_06 | J12 | IN | GND |
| PAD_I_reset | J13 | IN | reset signal |
| PAD_O_overf | H11 | OUT | indicates that an overflow has occurred in the chip |
| PAD_O_address_ram_0 | H12 | OUT | address of the external memory storing feature vector: bit 0 |
| PAD_O_address_ram_1 | H13 | OUT | address of the external memory storing feature vector: bit 1 |
| PAD_O_address_ram_2 | G12 | OUT | address of the external memory storing feature vector: bit 2 |
| vdd3allp_07 | G11 | IN | VDD |
| PAD_O_address_ram_3 | G13 | OUT | address of the external memory storing feature vector: bit 3 |
| PAD_O_address_ram_4 | F13 | OUT | address of the external memory storing feature vector: bit 4 |
| PAD_O_address_ram_5 | F12 | OUT | address of the external memory storing feature vector: bit 5 |
| PAD_O_address_ram_6 | F11 | OUT | address of the external memory storing feature vector: bit 6 |
| PAD_O_address_ram_7 | E13 | OUT | address of the external memory storing feature vector: bit 7 |
| PAD_O_address_rom_0 | E12 | OUT | address of the external memory storing model parameter: bit 0 |
| PAD_O_address_rom_1 | D13 | OUT | address of the external memory storing model parameter: bit 1 |
| gnd3allp_07 | D12 | IN | GND |
| PAD_O_address_rom_2 | C13 | OUT | address of the external memory storing model parameter: bit 2 |
| PAD_O_address_rom_3 | B13 | OUT | address of the external memory storing model parameter: bit 3 |
| PAD_O_address_rom_4 | C12 | OUT | address of the external memory storing model parameter: bit 4 |
| PAD_O_address_rom_5 | A13 | OUT | address of the external memory storing model parameter: bit 5 |
| PAD_O_address_rom_6 | B12 | OUT | address of the external memory storing model parameter: bit 6 |
| PAD_O_address_rom_7 | A12 | OUT | address of the external memory storing model parameter: bit 7 |
| PAD_O_address_rom_8 | B11 | OUT | address of the external memory storing model parameter: bit 8 |
| PAD_O_address_rom_9 | A11 | OUT | address of the external memory storing model parameter: bit 9 |
| PAD_O_address_rom_10 | B10 | OUT | address of the external memory storing model parameter: bit 10 |
| PAD_O_address_common_0 | A10 | OUT | common address of the two external memories: bit 0 |
| vdd3allp_08 | B9 | IN | VDD |
| PAD_O_address_common_1 | A9 | OUT | common address of the two external memories: bit 1 |
| PAD_O_address_common_2 | C8 | OUT | common address of the two external memories: bit 2 |
| PAD_O_address_common_3 | B8 | OUT | common address of the two external memories: bit 3 |
| PAD_O_address_common_4 | A8 | OUT | common address of the two external memories: bit 4 |
| PAD_O_int_signal_out_0 | B7 | OUT | internal block's output: bit 0 |
| gnd3allp_08 | C7 | IN | GND |
| PAD_O_int_signal_out_1 | A7 | OUT | internal block's output: bit 1 |
| PAD_O_int_signal_out_2 | A6 | OUT | internal block's output: bit 2 |
| PAD_O_int_signal_out_3 | B6 | OUT | internal block's output: bit 3 |
| PAD_O_int_signal_out_4 | C6 | OUT | internal block's output: bit 4 |
| PAD_O_int_signal_out_5 | A5 | OUT | internal block's output: bit 5 |
| vdd3allp_09 | B5 | IN | VDD |
| PAD_O_int_signal_out_6 | A4 | OUT | internal block's output: bit 6 |
| PAD_O_int_signal_out_7 | B4 | OUT | internal block's output: bit 7 |
| PAD_O_result_ack | A3 | OUT | done signal |
| PAD_I_clk_s | A2 | IN | clk for the double-mixture model |
| PAD_I_clk | B3 | IN | clk for the single-mixture model |
| PAD_I_dm_swap | A1 | IN | select the double-mixture model or the single-mixture model |
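The `mod_sel` and `bit_sel` pins expose one byte of a 48-bit internal signal on the eight `int_signal_out` pins. A minimal Python sketch of how a test bench might model and reassemble such a read is given below. It is a software model, not the chip's source: the listing shows the `mod_sel` codes `001`, `010`, `011` and `111` explicitly, while the remaining codes are assumed here to follow the order of the module's port list, and `bit_sel = 5` is assumed to select bits `[47:40]`. The names `SIGNALS`, `int_signal_out` and `read_word` are hypothetical.

```python
# Assumed mod_sel ordering: codes 001, 010, 011 and 111 are taken from the
# listing; 000, 100, 101 and 110 are inferred from the port order.
SIGNALS = ("mulout", "sumout_cla48", "reg48_out1", "reg16_out",
           "regx_out", "subcomp_out", "shift_out", "lut_out")

MASK48 = (1 << 48) - 1


def int_signal_out(mod_sel: int, bit_sel: int, signals: dict) -> int:
    """Return the byte that the debug multiplexer would place on int_signal_out."""
    if not 0 <= bit_sel <= 5:
        raise ValueError("bit_sel must select one of the six bytes of a 48-bit word")
    value = signals[SIGNALS[mod_sel & 0x7]] & MASK48
    return (value >> (8 * bit_sel)) & 0xFF


def read_word(mod_sel: int, signals: dict) -> int:
    """Reassemble a full 48-bit internal value from six byte-wide reads."""
    word = 0
    for bit_sel in range(6):
        word |= int_signal_out(mod_sel, bit_sel, signals) << (8 * bit_sel)
    return word
```

Stepping `bit_sel` from 0 to 5 while holding `mod_sel` fixed recovers the whole word, provided the internal signal does not change between reads.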
Appendix IV The Testing Board of the IC 
[Figure: photograph of the testing board]
