A VLSI Architecture for Output Probability and Likelihood Score Computations of HMM-Based Recognition Systems by Kazuhiro Nakamura et al.
A VLSI Architecture for Output Probability and
Likelihood Score Computations of HMM-Based
Recognition Systems
Kazuhiro Nakamura1, Ryo Shimazaki1, Masatoshi Yamamoto1,
Kazuyoshi Takagi2 and Naofumi Takagi2
1Nagoya University
2Kyoto University
Japan
1. Introduction
Due to their effectiveness and efficiency for user-independent recognition, hidden Markov
models (HMMs) are widely used in applications such as speech recognition (word
recognition, connected word recognition and continuous speech recognition), lip-reading and
gesture recognition. Output probability computations (OPCs) of continuous HMMs and
likelihood score computations (LSCs) are the most time-consuming parts of HMM-based
recognition systems.
High-speed VLSI architectures optimized for recognition tasks have been developed to build
well-optimized HMM-based recognition systems (Mathew et al., 2003a;b;
Nakamura et al., 2010; Yoshizawa et al., 2004; 2006; Kim & Jeong, 2007). Yoshizawa et al.
investigated block-wise parallel processing (BPP) for OPCs and LSCs, and proposed a
high-speed VLSI architecture for word recognition (Yoshizawa et al., 2002; 2004; 2006).
Nakamura et al. investigated a variant of BPP, store-based block parallel processing (StoreBPP),
for OPCs, and proposed a high-speed VLSI architecture for OPCs (Nakamura et al., 2010).
Performing both OPCs and LSCs with StoreBPP requires a Viterbi scorer for the StoreBPP
architecture, but none has been presented yet. A straightforward application of a Viterbi scorer
to the StoreBPP architecture requires many registers and reduces the advantage of using
StoreBPP. Different BPPs require different Viterbi scorer architectures. A Viterbi scorer suited
to StoreBPP is required for the development of well-optimized future HMM-based recognition
systems.
In this chapter, we first present fast store-based block parallel processing (FastStoreBPP) for OPCs
and LSCs, together with a Viterbi scorer that supports FastStoreBPP. FastStoreBPP exploits
the full performance of StoreBPP by doubling the bit length of the input to OPCs and LSCs,
e.g., from 8-bit to 16-bit. We demonstrate a high-speed VLSI architecture that supports
FastStoreBPP. Second, we present multiple store-based block parallel processing (MultipleStoreBPP)
for OPCs and LSCs, together with a Viterbi scorer that supports MultipleStoreBPP.
MultipleStoreBPP achieves high performance scalability by further extending the bit length of the
input to OPCs and LSCs, e.g., from 8-bit to 32-bit.
[Figure: block diagram. Feature vectors (voice etc., stored in RAM) and HMM parameters (stored
in ROM) are connected through an exclusive-access bus to the output probability computation
(OPC) circuit, which contains PE1s and register arrays (feature vectors, HMM parameters,
intermediate results). The output probabilities of HMMs are passed to the Viterbi scorer for
likelihood score computation (LSC circuit), which contains PE2s and register arrays (HMM
parameters, intermediate results) and produces likelihood scores (words etc.).]
Fig. 1. Basic structure of HMM-based recognition hardware.
Compared with the StreamBPP (Yoshizawa et al., 2002; 2004; 2006) architecture, our
FastStoreBPP and MultipleStoreBPP architectures have fewer registers and require less
processing time. From a VLSI architectural viewpoint, a comparison shows the efficiency
of the MultipleStoreBPP architecture through its efficient use of processing elements (PEs).
The remainder of this chapter is organized as follows: the structure of HMM-based
recognition systems is described in Section 2, the FastStoreBPP architecture is introduced in
Section 3, the MultipleStoreBPP architecture is introduced in Section 4, the architectures are
evaluated in Section 5, and conclusions are presented in Section 6.
2. HMM-based recognition systems
2.1 HMM-based recognition hardware
Figure 1 shows the basic structure of the relevant part of HMM-based recognition
hardware (Mathew et al., 2003a;b; Nakamura et al., 2010; Yoshizawa et al., 2002; 2004; 2006;
Kim & Jeong, 2007). The OPC circuit and the Viterbi scorer for LSC work together as a
recognition engine. The inputs to the OPC circuit are feature vectors of several dimensions
and HMM parameters. These values are stored in RAM and ROM respectively, as shown
in Fig. 1. The RAM, ROM, OPC circuit and Viterbi scorer interconnect via a single bus,
and memory accesses are exclusive. The OPC circuit produces HMM output probabilities.
The inputs to the Viterbi scorer are these results and HMM parameters. The Viterbi scorer
computes likelihood scores using the Viterbi algorithm. In HMM-based recognition systems,
the most time-consuming tasks are OPCs and LSCs, and the OPC circuit and the Viterbi scorer
accelerate these computations. The OPC circuit and the Viterbi scorer have several register
arrays and PEs for efficient high-speed parallel processing. Feature vectors, HMM parameters
and intermediate results are effectively shared between PEs as shown in Fig. 1. More details
can be found in (Nakamura et al., 2010; Yoshizawa et al., 2002; 2004; 2006).
2.2 OPC of HMMs and LSC with Viterbi algorithm
Let O1, O2, ..., and OT be a sequence of P-dimensional input feature vectors to HMMs, where
Ot = (ot1, ot2, ..., otP), 1 ≤ t ≤ T. T is the number of input feature vectors, and P is the
dimension of the input feature vector. For any input feature vector Ot, the output probability
of N-state continuous HMMs at the j-th state is given by
$$\log b_j(O_t) = \omega_j + \sum_{p=1}^{P} \sigma_{jp}\,(o_{tp} - \mu_{jp})^2 \tag{1}$$
where ωj, σjp and µjp are the parameters of the Gaussian probability density function which
are precomputed and stored in ROM. The OPC circuit computes log bj(Ot) based on Eq. (1),
where 1 ≤ j ≤ N and 1 ≤ t ≤ T. All HMM parameters ωj, σjp, and µjp are stored in ROM
and the input feature vectors are stored in RAM. The values of T, N, P, and the number of
HMMs V differ for each recognition system. For isolated word recognition systems, T, N, P,
and V are 86, 32, 38, and 800, respectively (Yoshizawa et al., 2004; 2006), and for another word
recognition system, T, N, P, and V are 89, 12, 16 and 100 (Yoshizawa et al., 2002).
For output probabilities log bj(Ot), where 1 ≤ j ≤ N and 1 ≤ t ≤ T, the log-likelihood score
log S∗ is given by
$$\log \delta_1(j) = \log \pi_j + |\log b_j(O_1)|, \tag{2}$$
$$\log \delta_t(j) = \min[\log \delta_{t-1}(j-1) + |\log a_{j-1,j}|,\ \log \delta_{t-1}(j) + |\log a_{j,j}|] + |\log b_j(O_t)|, \tag{3}$$
$$\log S^{*} = \min_{1 \le j \le N} [\log \delta_T(j)]. \tag{4}$$
All HMM parameters log πj, log aj−1,j, and log aj,j are stored in ROM, and the Viterbi scorer
computes log δt(j) based on Eqs. (2) and (3). A flowchart of OPCs and LSCs is shown in Fig. 2
(Yoshizawa et al., 2004; 2006). All HMM output probabilities are obtained by calling the
partial computation of log bj(Ot) P · N · T · V times. Each partial computation of log bj(Ot)
performs four arithmetic operations in Eq. (1): an addition, a subtraction and two
multiplications. All likelihood scores are obtained by calling the partial computation of
log δt(j) N · T · V times. Each partial computation of log δt(j) performs three additions in
Eq. (3). The OPC circuit and the Viterbi scorer accelerate these computations. More details
can be found in (Yoshizawa et al., 2002; 2004; 2006).
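As a software reference for Eqs. (1) to (4), the following minimal Python sketch computes one output probability and one likelihood score for a single HMM. The function names and argument layout are illustrative assumptions, not part of the chapter, and the sketch assumes that the first state has no predecessor, so the log δt−1(j − 1) term of Eq. (3) is treated as +∞ there.

```python
import math

def log_output_prob(omega_j, sigma_j, mu_j, o_t):
    """Eq. (1): omega_j + sum over p of sigma_jp * (o_tp - mu_jp)^2."""
    acc = omega_j
    for s, m, o in zip(sigma_j, mu_j, o_t):
        d = o - m            # one subtraction
        acc += s * d * d     # two multiplications and one addition
    return acc

def viterbi_score(log_pi, log_a_self, log_a_prev, log_b):
    """Eqs. (2)-(4) for one HMM.

    log_b[t][j] holds log b_j(O_t); log_a_self[j] holds log a_{j,j};
    log_a_prev[j] holds log a_{j-1,j} (ignored for the first state).
    """
    T, N = len(log_b), len(log_pi)
    # Eq. (2): initial scores
    delta = [log_pi[j] + abs(log_b[0][j]) for j in range(N)]
    for t in range(1, T):
        new = []
        for j in range(N):
            stay = delta[j] + abs(log_a_self[j])
            # Assumption: no predecessor for the first state
            move = delta[j - 1] + abs(log_a_prev[j]) if j > 0 else math.inf
            new.append(min(stay, move) + abs(log_b[t][j]))  # Eq. (3)
        delta = new
    return min(delta)        # Eq. (4)
```

The inner loop of `log_output_prob` is the partial computation that is called P · N · T · V times, and the body of the `j` loop in `viterbi_score` is the partial computation that is called N · T · V times.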
2.3 Block parallel processing for OPCs and LSCs
BPP for OPCs and LSCs was proposed as an efficient high-speed parallel processing method
for HMM-based isolated word speech recognition (Yoshizawa et al., 2002). In BPP, the set of
input feature vectors is called a block, and HMM parameters are effectively shared between
the different input feature vectors for OPC. Recently, BPP has been classified into two types
according to input data flow: StreamBPP and StoreBPP (Nakamura et al., 2010).
A block can be considered as an M × P matrix whose elements are ot′p, where 1 ≤ t′ ≤ M (≤ T)
and 1 ≤ p ≤ P. StreamBPP performs arithmetic operations on the input stream o11, o12, ..., o1P,
o21, o22, ..., o2P, ..., oM1, oM2, ..., oMP (Yoshizawa et al., 2006). StreamBPP performs N OPCs in
parallel with N PE1s (Fig. 1) and obtains N output probabilities log b1(Ot), log b2(Ot), ..., and
log bN(Ot) simultaneously. These N output probabilities are obtained every P clock cycles
and they are fed to the Viterbi scorer for LSCs. A Viterbi scorer that supports the OPCs and
LSCs in StreamBPP was presented in (Yoshizawa et al., 2006), where N LSCs are performed
with N PE2s (Fig. 1). In the Viterbi scorer, N intermediate scores log δt(1), log δt(2), ..., and
log δt(N) are computed simultaneously.
[Figure: flowchart of four nested loops. Loop D iterates over HMMs (v = v + 1 until v > V),
Loop C over input feature vectors (t = t + 1 until t > T), Loop B over states (j = j + 1 until
j > N), and the innermost Loop A over dimensions (p = p + 1 until p > P). Loop A performs the
partial computation of log bj(Ot); when Loop A finishes, the partial computation of log δt(j)
is performed.]
Fig. 2. Flowchart of OPCs and LSCs.
A block can be considered as a set of M input feature vectors whose elements are Ot′, where
1 ≤ t′ ≤ M (≤ T). StoreBPP performs arithmetic operations on locally stored input feature
vectors O1, O2, ..., and OM (Nakamura et al., 2010). StoreBPP performs ⌈M/2⌉ OPCs in
parallel with ⌈M/2⌉ PE1s (Fig. 1) and obtains M HMM output probabilities log bj(Ot′+1),
log bj(Ot′+2), ..., and log bj(Ot′+M). These M HMM output probabilities are obtained every
2 · P clock cycles and they are fed to the Viterbi scorer for LSCs. Different BPPs require
different Viterbi scorer architectures. StoreBPP requires a Viterbi scorer that supports OPCs
in which M HMM output probabilities are computed simultaneously, but such a Viterbi
scorer was not addressed in (Nakamura et al., 2010). A straightforward introduction
of a Viterbi scorer to StoreBPP requires many registers, which reduces the advantage of the
StoreBPP architecture.
3. VLSI architecture for OPCs and LSCs with fast store-based block parallel processing
3.1 Fast store-based block parallel processing
A two-step process was adopted in StoreBPP to compute M HMM output probabilities
log bj(Ot′+1), log bj(Ot′+2), ..., and log bj(Ot′+M), where half of the output probabilities
are computed simultaneously with ⌈M/2⌉ PE1s in each step (Nakamura et al., 2010). In
StoreBPP, the ⌈M/2⌉ parallel computations and a ROM access are performed simultaneously,
where the ROM access reads the two HMM parameters −µj,p+1 and σj,p+1 required
for the next OPC. Because it takes two cycles to read two HMM parameters from ROM, it was
appropriate to use a two-step process in StoreBPP.
StoreBPP performs ⌈M/2⌉ OPCs in parallel by using a register array of size M, where M is
the block size (Nakamura et al., 2010). We improve StoreBPP by reducing the required
register size for performing M OPCs in parallel. We modify the parallel computation of
⌈M/2⌉ OPCs by doubling the bit length of the input to OPC. By this bit length extension,
two HMM parameters can be read simultaneously. We call the modified parallel processing
fast store-based block parallel processing (FastStoreBPP), and we show a pipelined Viterbi scorer
that supports FastStoreBPP. FastStoreBPP performs M OPCs in parallel by using a register
array of size M.
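The difference between the two schemes can be summarized numerically. The following Python sketch only restates the figures given in the text (⌈M/2⌉ PE1s and 2 · P cycles per M output probabilities for StoreBPP, M PE1s and P cycles for FastStoreBPP); the function name `opc_resources` is a hypothetical helper introduced here for illustration.

```python
import math

def opc_resources(scheme, M, P):
    """Return (number of PE1s, clock cycles per M output probabilities)."""
    if scheme == "StoreBPP":
        # Two-step process: one HMM parameter is read from ROM per cycle
        return math.ceil(M / 2), 2 * P
    if scheme == "FastStoreBPP":
        # Doubled input bit length: both parameters are read in one cycle
        return M, P
    raise ValueError(f"unknown scheme: {scheme}")
```

For example, with M = 7 and P = 16, `opc_resources("StoreBPP", 7, 16)` gives 4 PE1s and 32 cycles, while `opc_resources("FastStoreBPP", 7, 16)` gives 7 PE1s and 16 cycles.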
A flowchart of our FastStoreBPP is shown in Fig. 3. The flowchart consists of two tasks,
M-parallel OPC and ⌈M/P⌉-stage pipelined LSC. In Fig. 3, the M-parallel OPC and ⌈M/P⌉-stage
pipelined LSC are framed by dashed and double dashed lines, respectively. The M-parallel
OPC performs M OPCs in parallel with M PE1s. The ⌈M/P⌉-stage pipelined LSC performs
LSC in serial with ⌈M/P⌉ PE2s and ⌈M/P⌉ registers. Loops A′, B′, and D′ correspond to
Loops A, B, and D (Fig. 2), where M HMM output probabilities and M intermediate scores
are computed with M PE1s and ⌈M/P⌉ PE2s. Loop C1 is based on StoreBPP, where Loop C
(Fig. 2) is partially expanded in Fig. 3.
PE11, PE12, ..., and PE1M (Fig. 3) represent M processing elements which perform the partial
computations of log bj(Ot′+1), log bj(Ot′+2), ..., and log bj(Ot′+M) using Eq. (1). M HMM output
probabilities log bj(Ot′+1), log bj(Ot′+2), ..., and log bj(Ot′+M) are simultaneously obtained
every P cycles by Loop A′. In Loop A′, the modified M OPCs and a ROM access for
−µj,p+1 and σj,p+1 are performed simultaneously. The two HMM parameters are
read from ROM simultaneously. These values are needed for the next computation in Loop A′.
Next, the M HMM output probabilities log bj(Ot′+1), log bj(Ot′+2), ..., and log bj(Ot′+M)
are fed to the ⌈M/P⌉-stage pipelined LSC (Fig. 3), which is a new Viterbi scorer introduced for
FastStoreBPP. PE2s represent ⌈M/P⌉ processing elements for computing M intermediate
scores log δt′+1(j), log δt′+2(j), ..., and log δt′+M(j) using Eqs. (2) and (3). In our FastStoreBPP,
Loops E1, E2, ..., and E⌈M/P⌉ (Fig. 3) perform LSC, where the i-th intermediate score log δt′+i(j)
is computed by Loop E⌈i/P⌉. Each Loop Ei′, where 1 ≤ i′ ≤ ⌈M/P⌉ − 1, has a PE2
and computes P intermediate scores sequentially, which requires P clock cycles.
Loop E⌈M/P⌉ has a PE2 and computes M − P · (⌈M/P⌉ − 1) intermediate
scores sequentially by using M − P · (⌈M/P⌉ − 1) clock cycles. All Loops Ei′ are pipelined in
FastStoreBPP, where the last intermediate score log δt′+i′·P(j), obtained by Loop Ei′,
1 ≤ i′ ≤ ⌈M/P⌉ − 1, is fed to the next stage, Loop Ei′+1. Loop A′ and Loops Ei′,
1 ≤ i′ ≤ ⌈M/P⌉, proceed simultaneously, where each Loop Ei′ finishes its computation before
the next M output probabilities are obtained by Loop A′.
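The assignment of intermediate scores to pipeline stages can be sketched as follows. This is an illustrative Python helper (the name `lsc_stage_plan` is an assumption, not taken from the chapter) that lists, for each Loop E stage, the range of score indices i it computes, following the rule that the i-th score is handled by Loop E⌈i/P⌉.

```python
def lsc_stage_plan(M, P):
    """Return a list of (stage, first_i, last_i) for the ceil(M/P) Loop E stages."""
    stages = (M + P - 1) // P          # ceil(M / P) with integer arithmetic
    plan = []
    for s in range(1, stages + 1):
        first = (s - 1) * P + 1        # first score index handled by stage s
        last = min(s * P, M)           # the last stage may handle fewer than P
        plan.append((s, first, last))
    return plan
```

With M = 10 and P = 4, the plan has three stages covering scores 1 to 4, 5 to 8, and 9 to 10, so the last stage computes M − P · (⌈M/P⌉ − 1) = 2 scores.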
3.2 FastStoreBPP architecture for OPCs and LSCs
Our FastStoreBPP architecture is shown in Fig. 4, where we
assume M ≤ P and hence ⌈M/P⌉ = 1. The FastStoreBPP architecture consists of an OPC
circuit and a Viterbi scorer. The architecture has two register arrays (RegO and Regω),
two registers (Regµ and Regσ), and M PE1s for OPCs. Each PE1 consists of two adders
and two multipliers, which are used for computing
$\omega_j + \sum_{p=1}^{P} \sigma_{jp}(o_{tp} - \mu_{jp})^2$. The PE1s in
the FastStoreBPP architecture, the StreamBPP architecture (Yoshizawa et al., 2006), and the
StoreBPP architecture (Nakamura et al., 2010) are identical but differ in number. In addition,
the architecture has three register arrays (RegInδ1, RegLastδ, and RegTmpδj−1 1), three
registers (Regaj,j 1, Regaj−1,j 1, and RegTmpδj 1), and a PE2 for LSCs. PE2 consists of three
adders, two selectors and two comparators, which are used for LSC based on Eqs. (2) and (3).
The PE2s in the FastStoreBPP architecture and the StreamBPP architecture (Yoshizawa et al.,
2006) are identical but differ in number.
[Figure: flowchart with nested Loops D′, C1, B′ and A′, whose exit tests are v ≥ V, t′max ≥ T,
j ≥ N and p ≥ P, respectively, where Loop C1 advances by blocks (t′ = t′max, t′max = t′max + M).
Steps and cycle counts shown: load Ot to RegO for t = t′ + 1, t′ + 2, ..., t′ + M (M · P/2
cycles); load −µ1,1 and σ1,1 to Regµ and Regσ, respectively (1 cycle); load ωj to Regω, and
load log πj to RegTmpδj when t = t′ + 1 == 1 (1 cycle); load log aj,j to Regaj,j, and load
log aj−1,j to Regaj−1,j when 2 ≤ j (1 cycle); M-parallel OPC, in which PE11, ..., PE1M compute
log bj(Ot′+1), ..., log bj(Ot′+M) while −µj,p+1 and σj,p+1 are loaded to Regµ and Regσ
(1 cycle); copy Regω to RegInδ; ⌈M/P⌉-stage pipelined LSC, consisting of Loops E1, ...,
E⌈M/P⌉, each with a PE2, where Loop E1 runs t′′1 from t′ to t′ + P and Loop E⌈M/P⌉ runs
t′′⌈M/P⌉ from t′ + P · (⌈M/P⌉ − 1) to t′ + M.]
Fig. 3. Flowchart of OPCs and LSCs with FastStoreBPP.
[Figure: block diagram. ROM (µ, σ, ω, a, π) and RAM (O) feed the OPC circuit, which contains
RegO, Regµ, Regσ, Regω and PE11, PE12, ..., PE1M, each built from adders and multipliers. The
Viterbi scorer contains RegInδ1, RegTmpδj−1 1, RegLastδ, RegTmpδj 1, Regaj,j 1, Regaj−1,j 1 and
one PE2 built from adders, selectors (SEL) and comparators (CMP).]
Fig. 4. FastStoreBPP architecture for OPCs and LSCs (M ≤ P).
OPC starts by reading M input feature vectors Ot′+1, Ot′+2, ..., and Ot′+M from RAM and
storing them in RegO in the OPC circuit (Fig. 4) based on Loop C1 (Fig. 3). M · P/2 cycles are
required for reading M input feature vectors. Then, the HMM parameters of the v-th HMM,
−µ11, σ11, and ω1, are read from ROM and stored in Regµ, Regσ, and Regω,
respectively, based on Loop C1 and Loop B′ (Fig. 3). For the stored input feature vectors,
M intermediate results of M OPCs are simultaneously computed with the stored HMM
parameters by using M PE1s (Fig. 4) based on Loop A′ (Fig. 3).
The stored HMM parameters are shared by all PE1s, and the obtained M intermediate results
are stored in Regω. At the same time, the two HMM parameters −µj,p+1 and σj,p+1 of the v-th
HMM are read from ROM and stored in Regµ and Regσ, respectively, overwriting the previous
values. These HMM parameters are used in the next M computations, which are performed in
parallel with M PE1s based on Loop A′ (Fig. 3).
M HMM output probabilities, log bj(Ot′+1), log bj(Ot′+2), ..., and log bj(Ot′+M) of the v-th
HMM, are simultaneously obtained every P cycles by M PE1s based on Loop A′ (Fig. 3).
The results are copied from Regω to RegInδ1 to start the LSCs log δt′+1(j), log δt′+2(j), ..., and
log δt′+M(j) and the next M OPCs for the (j + 1)-th state of the v-th HMM, log bj+1(Ot′+1),
log bj+1(Ot′+2), ..., and log bj+1(Ot′+M), based on Loop B′ (Fig. 3). M · N HMM output
probabilities of the v-th HMM are obtained by Loop B′ (Fig. 3). N · T HMM output
probabilities of the v-th HMM are obtained by Loop C1 (Fig. 3), which repeats Loop B′ for each
block of M input feature vectors. N · T · V HMM output
probabilities of all HMMs are obtained by Loop D′ (Fig. 3).
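The data flow of the M-parallel OPC described above can be mimicked in software. The Python sketch below is a behavioral model under stated assumptions, not the circuit itself: the names `reg_o`, `reg_mu`, `reg_sigma` and `reg_omega` stand in for RegO, Regµ, Regσ and Regω, the outer loop models the P clock cycles of Loop A′, and the inner loop models the M PE1s that operate in parallel in hardware. ROM is assumed to hold −µ, as in the text, so each PE1 uses an addition rather than a subtraction.

```python
def m_parallel_opc(reg_o, omega_j, neg_mu_j, sigma_j):
    """Compute log b_j(O_t) for all M stored feature vectors (Eq. (1)).

    reg_o: list of M feature vectors, each of dimension P.
    omega_j, neg_mu_j, sigma_j: parameters of state j (neg_mu_j holds -mu_jp).
    """
    M = len(reg_o)
    P = len(sigma_j)
    reg_omega = [omega_j] * M              # accumulators start at omega_j
    for p in range(P):                     # Loop A': one clock cycle per dimension
        reg_mu, reg_sigma = neg_mu_j[p], sigma_j[p]   # shared by all PE1s
        for i in range(M):                 # M PE1s, parallel in hardware
            d = reg_o[i][p] + reg_mu       # first adder: o_tp - mu_jp
            reg_omega[i] += reg_sigma * d * d   # two multipliers, second adder
    return reg_omega
```

After P iterations, `reg_omega` holds the M output probabilities that the text describes as being copied from Regω to RegInδ1.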
The Viterbi scorer, denoted by double-dashed lines in Fig. 4, performs the ⌈M/P⌉-stage pipelined
LSC, denoted by double-dashed lines in Fig. 3. LSC starts by reading the HMM parameters
of the v-th HMM, log π1 and log a1,1, from ROM and storing them in RegTmpδj 1 and Regaj,j 1,
respectively, based on Loop B′ (Fig. 3). Then, an intermediate score log δ1(1) is computed by
PE2 with the HMM parameter log π1 and the HMM output probability log b1(O1) obtained
using Eq. (1). The obtained intermediate score is stored in both RegTmpδj 1 and RegTmpδj−1 1
(Fig. 4). RegTmpδj 1 stores an intermediate score that is needed in the next computation in
Loop E1 (Fig. 3). RegTmpδj−1 1 stores M intermediate scores log δt′+1(j), log δt′+2(j), ..., and
log δt′+M(j), which are needed in the next LSC for the (j + 1)-th state of the HMM in Loop
B′. After the computation of log δ1(1), M − 1 intermediate scores log δ2(1), log δ3(1), ..., and
log δM(1) are sequentially computed by PE2 using Eq. (3). In this sequential computation, the
last obtained intermediate score log δM(1) is stored in RegTmpδj−1 1 and RegLastδ (Fig. 4).
RegLastδ stores N intermediate scores, namely the last intermediate scores obtained by Loop
E1 during Loop B′ (Fig. 3). These intermediate scores are log δt′+M(1), log δt′+M(2), ..., and
log δt′+M(N) of the v-th HMM, which are required when starting LSC with the new M HMM output
probabilities log bj(Ot′+M+1), log bj(Ot′+M+2), ..., and log bj(Ot′+2·M) at the first computation
in Loop E1 (Fig. 3), given as log δt′+M+1(1), log δt′+M+1(2), ..., and log δt′+M+1(N). A required
intermediate score is read from RegLastδ and stored in RegTmpδj 1 before computation.
Regaj−1,j 1 (Fig. 4) stores an HMM parameter log aj−1,j of the v-th HMM, which is used for
computing log δt(j) based on Eq. (3), when 2 ≤ t and 2 ≤ j.
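The block-wise, state-by-state order of this computation can be checked against Eqs. (2) to (4) in software. The following Python sketch is a simplified behavioral model for the single-stage case, with assumptions that go beyond the text: `reg_last` plays the role of RegLastδ (one score per state, carried between blocks), and `prev_col` is a software stand-in for RegTmpδj−1 that holds log δt−1(j − 1) for each frame of the block (a one-frame shift relative to the register contents described above, chosen to keep the code short). The first state is assumed to have no predecessor.

```python
import math

def blockwise_viterbi(log_pi, log_a_self, log_a_prev, log_b, M):
    """Likelihood score of Eq. (4), computed block by block and state by state."""
    T, N = len(log_b), len(log_pi)
    reg_last = [math.inf] * N                # delta at the last frame of the previous block
    for t0 in range(0, T, M):                # one block of up to M frames
        m = min(M, T - t0)
        prev_col = [math.inf] * m            # delta_{t-1}(j-1) for each frame in the block
        for j in range(N):                   # states are processed one after another
            cur = reg_last[j]                # delta_{t0-1}(j), read before computation
            new_col = [math.inf] * m
            for k in range(m):               # sequential scores within the block
                t = t0 + k
                if t == 0:
                    score = log_pi[j] + abs(log_b[0][j])               # Eq. (2)
                else:
                    stay = cur + abs(log_a_self[j])
                    move = prev_col[k] + abs(log_a_prev[j]) if j > 0 else math.inf
                    score = min(stay, move) + abs(log_b[t][j])         # Eq. (3)
                new_col[k] = cur             # delta_{t-1}(j), needed by state j + 1
                cur = score
            reg_last[j] = cur                # last score of this block for state j
            prev_col = new_col
    return min(reg_last)                     # Eq. (4)
```

Because every score is computed from the same operands regardless of the block size, the result does not depend on M; only the order of evaluation and the amount of buffered state change.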
Our Viterbi scorer that supports FastStoreBPP was presented in Fig. 4 for the case M ≤ P
and ⌈M/P⌉ = 1. Our ⌈M/P⌉-stage pipelined Viterbi scorer, which supports P < M and
1 < ⌈M/P⌉, is shown in Fig. 5. The Viterbi scorer in Fig. 4 is an instance of the generalized
⌈M/P⌉-stage pipelined Viterbi scorer where ⌈M/P⌉ = 1. The ⌈M/P⌉-stage pipelined Viterbi
scorer consists of RegLastδ and ⌈M/P⌉ sub Viterbi scorers. The i-th stage, i.e., the i-th sub
Viterbi scorer, consists of two register arrays (RegInδi and RegTmpδj−1 i), three registers
(RegTmpδj i, Regaj,j i and Regaj−1,j i) and a PE2. Each RegInδi consists of i · P registers,
where i = 1, 2, ..., ⌈M/P⌉ − 1. RegInδ⌈M/P⌉ consists of ⌈M/P⌉ · (M mod P) registers. In each
RegInδi, rows are shifted upward every P clock cycles. Each RegTmpδj−1 i consists of P
registers, where i = 1, 2, ..., ⌈M/P⌉ − 1. RegTmpδj−1 ⌈M/P⌉ consists of M mod P
registers. HMM parameters in Regaj,j i and Regaj−1,j i are copied every P clock cycles to
Regaj,j i+1 and Regaj−1,j i+1, respectively, where i = 1, 2, ..., ⌈M/P⌉ − 1. The
last intermediate score obtained by PE2 based on Loop Ei (Fig. 3), where i = 1, 2, ...,
⌈M/P⌉ − 1, is stored in RegTmpδj−1 i and RegTmpδj i+1 every P clock cycles based on the
⌈M/P⌉-stage pipelined LSC (Fig. 3). The last intermediate score obtained by PE2 based on
Loop E⌈M/P⌉ (Fig. 3) is stored in RegTmpδj−1 ⌈M/P⌉ and RegLastδ every P clock cycles during
Loop B′ (Fig. 3). The stored intermediate scores are required when starting LSC with new M
output probabilities at the first computation by Loop E1 (Fig. 3). The required intermediate
score is read from RegLastδ and stored in RegTmpδj 1 before computation.
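The register sizes stated above can be tallied with a short calculation. The Python sketch below simply applies the per-stage formulas from the text; the function name and the returned dictionary keys are illustrative assumptions. It also assumes that RegLastδ holds N scores, as described for Fig. 4, and that M mod P ≠ 0, since the formulas as written give an empty last stage when M is an exact multiple of P.

```python
def viterbi_scorer_registers(M, P, N):
    """Register counts of the ceil(M/P)-stage pipelined Viterbi scorer."""
    stages = (M + P - 1) // P                     # ceil(M / P)
    rem = M % P                                   # size used by the last stage
    if rem == 0:
        raise ValueError("formulas in the text assume M mod P != 0")
    reg_in = [i * P for i in range(1, stages)] + [stages * rem]
    reg_tmp_prev = [P] * (stages - 1) + [rem]
    scalar = 3 * stages                           # RegTmp(j), Reg a(j,j), Reg a(j-1,j) per stage
    total = sum(reg_in) + sum(reg_tmp_prev) + scalar + N
    return {"RegIn": reg_in, "RegTmpPrev": reg_tmp_prev,
            "scalar": scalar, "RegLast": N, "total": total}
```

For M = 10, P = 4 and N = 32, this gives RegInδ sizes of 4, 8 and 6, RegTmpδj−1 sizes of 4, 4 and 2, nine single registers, and 69 registers in total.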
[Figure: block diagram of ⌈M/P⌉ sub Viterbi scorers connected as a pipeline. Stage i contains
RegInδi, RegTmpδj−1 i, RegTmpδj i, Regaj,j i, Regaj−1,j i and a PE2 built from adders, selectors
(SEL) and comparators (CMP). Stages 1 to ⌈M/P⌉ − 1 use register arrays of width P, the last
stage uses width M mod P, and RegLastδ holds N entries.]
Fig. 5. Pipelined Viterbi scorer for the FastStoreBPP architecture.
4. VLSI architecture for OPCs and LSCs with multiple store-based block parallel processing
FastStoreBPP is obtained from StoreBPP by doubling the bit length of the input to OPC for
reading two HMM parameters simultaneously.
We further extend the bit length of the input to OPC and we obtain multiple store-based block
parallel processing (MultipleStoreBPP).
4.1 Multiple store-based block parallel processing
StoreBPP performs M/2 OPCs in parallel by using a register array of size M and M/2 PE1s,
where the M/2 OPCs are performed for a single HMM (Nakamura et al., 2010). We improve
StoreBPP and further reduce the required register size for performing OPCs in parallel.
We modify the M/2-parallel OPC, in which M/2 OPCs are performed in parallel, to deal with
multiple HMMs by further extending the bit length of the input to OPC. By this bit-length
extension, L′ HMM parameters can be read from ROM simultaneously, so that L′ · M′/2 OPCs
are performed in parallel. We call the modified parallel processing MultipleStoreBPP. Our
MultipleStoreBPP applies the M′/2-parallel OPC, in which M′/2 OPCs are performed in parallel,
to L′ HMMs, so that L′ · M′/2 OPCs are performed in parallel by using a register array of size M′.
A flowchart of our MultipleStoreBPP is shown in Fig. 6. Loops D2′, C1′, B′′, and A′′ in Fig. 6
are based on StoreBPP. In our MultipleStoreBPP, Loop D′ (Fig. 3) is partially expanded as
shown in Fig. 6 for performing L′ · M′/2 OPCs in parallel in Loop A′′. By this expansion,
input feature vectors are effectively shared between different M′/2-parallel OPCs.
The flowchart consists of L′ M′/2-parallel OPCs and L′ ⌈M′/(2 · P)⌉-stage pipelined LSCs.
Each M′/2-parallel OPC and ⌈M′/(2 · P)⌉-stage pipelined LSC are denoted by dashed and
double-dashed lines, respectively, in Fig. 6. Each M′/2-parallel OPC performs M′/2 OPCs
in parallel with the same PE1s as those used in StoreBPP and FastStoreBPP, differing only in
number. Each ⌈M′/(2 · P)⌉-stage pipelined LSC computes likelihood scores based on the
⌈M/P⌉-stage pipelined LSC denoted by double-dashed lines in Fig. 3, with the same PE2s as
those used in FastStoreBPP, differing only in number. Loops A′ and B′ (Fig. 3) correspond to
Loops A′′ and B′′, respectively. By Loop A′′, the output probabilities of L′ HMMs, the
(v′ + 1)-th to (v′ + L′)-th HMMs, are computed with L′ · M′/2 PE1s. These output probabilities
are simultaneously obtained every P clock cycles by Loop A′′. In Loop A′′, first, the L′
M′/2-parallel OPCs and the ROM access for −µj,p+1 of the L′ HMMs are performed
simultaneously. Second, the L′ M′/2-parallel OPCs and the ROM access for σj,p+1 of the L′
HMMs are performed simultaneously. Thus, 2 · L′ HMM parameters are read from ROM using
two cycles. These HMM parameters are needed for the next computation in Loop A′′. Then, the
obtained L′ · M′/2 output probabilities are fed to L′ LSCs, where each LSC uses the same
Viterbi scorer as that introduced for FastStoreBPP, as shown in Fig. 3. These L′ LSCs support
the LSC of the (v′ + 1)-th to (v′ + L′)-th HMMs. Loop A′′ and the L′ LSCs proceed
simultaneously.
[Figure: flowchart with Loops D2′, C1′, B′′ and A′′, whose exit tests are v′max ≥ V, t′max ≥ T,
j ≥ N and p ≥ P, respectively, with updates v′max = v′max + L′ and t′ = t′max,
t′max = t′max + M′, v′ = v′max − L′. Steps and cycle counts shown: load Ot to RegO for
t = t′ + 1, t′ + 2, ..., t′ + M′ (M′ · P/L′ cycles); for L′ HMMs, load −µ1,1 and σ1,1 to Regµ
and Regσ, respectively (2 cycles); for L′ HMMs, load ωj to Regω, and load log πj to RegTmpδj
when t = t′ + 1 == 1 (2 cycles); for L′ HMMs, load aj,j to Regaj,j, and load aj−1,j to
Regaj−1,j when 2 ≤ j (2 cycles); in Loop A′′, for each of the HMMs v = v′ + 1, v′ + 2, ...,
v′ + L′, M′/2 OPCs for t = 1, ..., M′/2 while −µj,p+1 is loaded to Regµ (1 cycle), followed by
M′/2 OPCs for t = M′/2 + 1, ..., M′ while σj,p+1 is loaded to Regσ (1 cycle); for L′ HMMs,
copy Regω to RegInδ; one ⌈M′/(2 · P)⌉-stage pipelined LSC for each of the L′ HMMs.]
Fig. 6. Flowchart of OPCs and LSCs with MultipleStoreBPP.
4.2 MultipleStoreBPP architecture for OPCs and LSCs
Our MultipleStoreBPP VLSI architecture is shown in Fig. 7, where we assume M′ ≤ 2 · P
and hence ⌈M′/(2 · P)⌉ = 1. The MultipleStoreBPP architecture consists of L′ OPC circuits
and L′ Viterbi scorers. The architecture has 2 · L′ + 1 register arrays (RegO, Regσ and
Regω), L′ registers (Regµ), and L′ · M′/2 PE1s for OPCs of L′ HMMs. Each PE1 consists
of two adders and two multipliers, which are used for computing
$\omega_j + \sum_{p=1}^{P} \sigma_{jp}(o_{tp} - \mu_{jp})^2$.
The PE1s in our MultipleStoreBPP and FastStoreBPP architectures and in the architectures of
Yoshizawa et al. (2006) and Nakamura et al. (2010) are identical but differ in number. In
addition, the architecture has 3 · L′ register arrays (RegInδ1, RegLastδ, and RegTmpδj−1 1),
3 · L′ registers (Regaj,j 1, Regaj−1,j 1, and RegTmpδj 1), and L′ PE2s for LSCs of L′ HMMs.
PE2 consists of three adders, two selectors and two comparators, which are used for LSC on
the basis of Eqs. (2) and (3). The PE2s in our MultipleStoreBPP and FastStoreBPP architectures
and in the architecture of Yoshizawa et al. (2006) are identical but differ in number.
OPC starts by reading M′ input feature vectors Ot′+1, ..., and Ot′+M′ from RAM and storing
them in RegO in OPC circuit 1 (Fig. 7) based on Loop C1′ (Fig. 6). M′ · P/L′ cycles are required
for reading the M′ feature vectors. Then, the HMM parameters −µ11, σ11 and ω1 of L′ HMMs,
i.e., the (v′ + 1)-th to (v′ + L′)-th HMMs, are read from ROM and stored in Regµ,
Regσ, and Regω, respectively, based on Loop C1′ and Loop B′′ (Fig. 6). For half of the stored
input feature vectors, Ot′+1, ..., and Ot′+M′/2, L′ · M′/2 intermediate results of L′ OPCs
are simultaneously computed with the stored HMM parameters by using L′ · M′/2 PE1s
(Fig. 7) based on Loop A′′ (Fig. 6). Then, for the other half of the stored input feature vectors,
Ot′+M′/2+1, ..., and Ot′+M′, L′ · M′/2 intermediate results of L′ OPCs are simultaneously
computed with the stored HMM parameters by using the L′ · M′/2 PE1s (Fig. 7) based on Loop
A′′ (Fig. 6). The stored M′ input feature vectors are effectively shared by all OPC circuits.
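The two-phase schedule over the shared feature vectors can be modeled in software. The Python sketch below is a behavioral illustration only: `reg_o` stands in for the shared RegO, each entry of `hmm_params` is a hypothetical tuple (ωj, −µj, σj) for one of the L′ HMMs, and the returned cycle count assumes one clock cycle per half block, which gives 2 · P cycles per state. M′ is assumed to be even.

```python
def multiple_store_opc(reg_o, hmm_params):
    """Compute log b_j(O_t) for M' shared feature vectors and L' HMMs.

    Returns (reg_omega, cycles), where reg_omega[l][i] is the output
    probability of HMM l for stored feature vector i.
    """
    m = len(reg_o)
    half = m // 2                          # M'/2 PE1s per OPC circuit
    P = len(reg_o[0])
    reg_omega = [[omega] * m for omega, _, _ in hmm_params]
    cycles = 0
    for p in range(P):                     # Loop A''
        for frames in (range(0, half), range(half, m)):   # first half, then second half
            for l, (_, neg_mu, sigma) in enumerate(hmm_params):  # L' OPC circuits, parallel in hardware
                for i in frames:           # M'/2 PE1s, parallel in hardware
                    d = reg_o[i][p] + neg_mu[p]
                    reg_omega[l][i] += sigma[p] * d * d
            cycles += 1                    # one clock cycle per half block
    return reg_omega, cycles
```

The same `reg_o` is read by every HMM in the inner loops, which is the sharing of stored input feature vectors among OPC circuits described above.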
In each OPC circuit, denoted by a dashed line in Fig. 7, the two HMM parameters −µj,p and σj,p
are shared by all PE1s, and the obtained first M′/2 intermediate results and second M′/2
intermediate results are stored in Regω using two cycles. At the same time, the two HMM
parameters −µj,p+1 and σj,p+1 are read from ROM and stored in Regµ and Regσ, respectively,
using two clock cycles. These HMM parameters are used in the next M′-parallel computation
in the OPC circuit.
[Figure: block diagram of the L′-set of OPC circuits and Viterbi scorers. ROM (µ, σ, ω, a, π)
and RAM (O) feed OPC circuit 1, OPC circuit 2, ..., OPC circuit L′. RegO is shown once, and
each OPC circuit has its own Regµ, Regσ, Regω and PE11, ..., PE1M′/2, built from adders and
multipliers. Viterbi scorer 1, Viterbi scorer 2, ..., Viterbi scorer L′ each contain RegInδ1,
RegTmpδj−1 1, RegLastδ, RegTmpδj 1, Regaj,j 1, Regaj−1,j 1 and a PE2 built from adders,
selectors (SEL) and comparators (CMP).]
Fig. 7. MultipleStoreBPP architecture for OPCs and LSCs.
L′ · M′ output probabilities, log bj(Ot′+1), ..., and log bj(Ot′+M′) of L′ HMMs, are
simultaneously obtained every 2 · P clock cycles by the L′ OPC circuits based on Loop A′′
(Fig. 6). The results are copied from Regω to RegInδ1 for starting the LSCs of the L′ HMMs
and the next OPCs for the (j + 1)-th states of the L′ HMMs, log bj+1(Ot′+1), ..., and
log bj+1(Ot′+M′), based on Loop B′′ (Fig. 6). L′ · M′ · N output probabilities of the L′ HMMs
are obtained by Loop B′′ with the same M′ input feature vectors Ot′+1, ..., and Ot′+M′.
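The M′-parallel OPC of one state can be sketched in software as follows. This is only an illustration under stated assumptions, not the authors' circuit: it assumes the usual diagonal-covariance form log bj(Ot) = ωj + Σp σjp · (otp − µjp)², with ωj and σjp treated as precomputed constants, and it assumes that M′ is even. The names opc_block, opc_direct, and the cycle counter are hypothetical. M′/2 processing elements handle the first and the second half of the stored block in two cycles per dimension, so one state takes 2 · P cycles.

```python
def opc_block(block, mu, sigma, omega):
    """Log output probabilities of one HMM state for a block of M' vectors.

    block     : list of M' feature vectors, each of length P (M' assumed even)
    mu, sigma : per-dimension parameters of the state (length P)
    omega     : per-state constant (assumed precomputed)
    Returns (list of M' log output probabilities, number of cycles).
    """
    m_prime = len(block)
    half = m_prime // 2              # number of PE1s (assumption: M' is even)
    acc = [omega] * m_prime          # plays the role of Reg-omega
    cycles = 0
    for p in range(len(mu)):         # one (mu, sigma) pair is shared by all PE1s
        for h in range(2):           # first half, then second half of the block
            for pe in range(half):   # these iterations are parallel in hardware
                m = h * half + pe
                diff = block[m][p] - mu[p]
                acc[m] += sigma[p] * diff * diff
            cycles += 1
    return acc, cycles


def opc_direct(o, mu, sigma, omega):
    """Reference: the same log output probability for a single vector."""
    total = omega
    for o_p, mu_p, s_p in zip(o, mu, sigma):
        diff = o_p - mu_p
        total += s_p * diff * diff
    return total
```

Because the same M′ stored vectors are reused for every state, only the two parameters of the current dimension change between iterations of the outer loop.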
Each Viterbi scorer, denoted by double-dashed lines in Fig. 7, performs the ⌈M′/(2 · P)⌉-stage
pipelined LSC, denoted by double-dashed lines in Fig. 3. LSC starts by reading the HMM
parameters of the L′ HMMs, log π1 and log a1,1, from ROM and storing them in RegTmpδj 1
and Regaj,j 1 based on Loop B′′ (Fig. 6). Then, in each Viterbi scorer, an intermediate
score log δ1(1) is computed by PE2 with the HMM parameter and the output probability
obtained using Eq. (1). The obtained intermediate score is stored in both RegTmpδj 1 and
RegTmpδj−1 1 (Fig. 7). RegTmpδj 1 stores an intermediate score that is needed in the next
computation in Loop E1 (Fig. 3). RegTmpδj−1 1 stores the M′ intermediate scores
log δt′+1(j), ..., and log δt′+M′(j), which are needed in the next LSC for the (j + 1)-th
state of the HMM in Loop B′′. After the computation of log δ1(1), the M′ − 1 intermediate
scores log δ2(1), ..., and log δM′(1) are sequentially computed by PE2 on the basis of
Eq. (3). In this sequential computation, the last obtained intermediate score, log δM′(1),
is stored in RegTmpδj−1 1 and RegLastδ (Fig. 7). RegLastδ stores the N intermediate scores
that are obtained last by Loop E1 during Loop B′′. These intermediate scores are
log δt′+M′(1), ..., and log δt′+M′(N) of the v-th HMM, which are required when starting LSC
with the new M′ output probabilities log bj(Ot′+M′+1), ..., and log bj(Ot′+2·M′) at the first
computations in Loop E1, given as log δt′+M′+1(1), ..., and log δt′+M′+1(N). A required
intermediate score is read from RegLastδ and stored in RegTmpδj 1 before computation.
Regaj−1,j 1 (Fig. 7) stores an HMM parameter log aj−1,j, which is used for computing
log δt′(j) on the basis of Eq. (3) when 2 ≤ t′ and 2 ≤ j.
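The role of RegLastδ and RegTmpδj−1 in block-wise scoring can be illustrated with a small software model. Eqs. (1) and (3) are not reproduced in this part of the chapter, so the sketch assumes the standard log-domain Viterbi recurrence for a left-to-right HMM, log δt(j) = max{log δt−1(j) + log aj,j, log δt−1(j−1) + log aj−1,j} + log bj(Ot), with log δ1(j) = log πj + log bj(O1). All function and variable names are hypothetical, and the code models only the data dependencies, not the pipeline timing.

```python
NEG_INF = float("-inf")


def step(stay_score, move_score, log_a_stay, log_a_move, log_b):
    """One PE2-like update: two additions, a comparison, one more addition."""
    return max(stay_score + log_a_stay, move_score + log_a_move) + log_b


def viterbi_full(log_pi, log_a_stay, log_a_move, log_b):
    """Frame-by-frame reference; log_b[t][j], log_a_move[j-1] = log a_{j-1,j}."""
    n = len(log_pi)
    delta = [log_pi[j] + log_b[0][j] for j in range(n)]
    for t in range(1, len(log_b)):
        delta = [
            step(delta[j], delta[j - 1] if j else NEG_INF,
                 log_a_stay[j], log_a_move[j - 1] if j else 0.0, log_b[t][j])
            for j in range(n)
        ]
    return delta


def viterbi_blockwise(log_pi, log_a_stay, log_a_move, log_b, m_block):
    """Same scores, computed block by block and state by state within a block."""
    n = len(log_pi)
    reg_last = [NEG_INF] * n                       # role of RegLast-delta
    for t0 in range(0, len(log_b), m_block):
        times = range(t0, min(t0 + m_block, len(log_b)))
        prev_state = [NEG_INF] * (len(times) + 1)  # no predecessor for j = 0
        for j in range(n):
            cur = [reg_last[j]]                    # score carried over from the last block
            for k, t in enumerate(times):
                if t == 0:
                    cur.append(log_pi[j] + log_b[0][j])
                else:
                    cur.append(step(cur[k], prev_state[k], log_a_stay[j],
                                    log_a_move[j - 1] if j else 0.0, log_b[t][j]))
            reg_last[j] = cur[-1]                  # last score of this state in the block
            prev_state = cur                       # role of RegTmp-delta_{j-1}
    return reg_last
```

In this model the scores of state j − 1 for the whole block, together with its boundary score from the previous block, are all that state j needs, which is why only N boundary scores have to survive between blocks.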
Our Viterbi scorers, which support MultipleStoreBPP, were presented in Fig. 7 as Viterbi
scoreri for the case where M′ ≤ 2 · P and ⌈M′/(2 · P)⌉ = 1. Our ⌈M′/(2 · P)⌉-stage pipelined
Viterbi scorer, which supports 2 · P < M′ and 1 < ⌈M′/(2 · P)⌉, is shown in Fig. 8. The Viterbi
scorers shown in Fig. 7 are instances of the generalized ⌈M′/(2 · P)⌉-stage pipelined Viterbi
scorer where ⌈M′/(2 · P)⌉ = 1. The ⌈M′/(2 · P)⌉-stage pipelined Viterbi scorer consists of a
register array RegLastδ and ⌈M′/(2 · P)⌉ sub Viterbi scorers. The i-th stage, i.e., the i-th
sub Viterbi scorer, consists of two register arrays (RegInδi and RegTmpδj−1 i), three
registers (RegTmpδj i, Regaj,j i, and Regaj−1,j i), and a PE2. Each RegInδi consists of i · P
registers, where i = 1, ..., and ⌈M′/(2 · P)⌉ − 1. RegInδ⌈M′/(2·P)⌉ consists of
⌈M′/(2 · P)⌉ · (M′ mod (2 · P)) registers. In each RegInδi, rows are shifted upward every
P clock cycles. Each RegTmpδj−1 i consists of P registers, where i = 1, ..., and
⌈M′/(2 · P)⌉ − 1. RegTmpδj−1 ⌈M′/(2·P)⌉ consists of M′ mod (2 · P) registers. HMM parameters
in Regaj,j i and Regaj−1,j i are copied every P cycles to Regaj,j i+1 and Regaj−1,j i+1,
respectively, where i = 1, ..., and ⌈M′/(2 · P)⌉ − 1. The last obtained intermediate score by
PE2 based on Loop Ei (Fig. 3), where i = 1, ..., and ⌈M′/(2 · P)⌉ − 1, is stored in
RegTmpδj−1 i and RegTmpδj i+1 every P cycles based on the ⌈M′/(2 · P)⌉-stage pipelined LSC
(Fig. 3). The last obtained intermediate score by PE2 based
on Loop E⌈M′/(2·P)⌉ (Fig. 3) is stored in RegTmpδj−1 ⌈M′/(2·P)⌉ and RegLastδ every P cycles
during Loop B′′ (Fig. 6). The stored intermediate scores are required when starting LSC
with new M′ output probabilities for the first computation in Loop E1 (Fig. 3). The required
intermediate score is read from RegLastδ and stored in RegTmpδj 1 before computation.
[Figure 8 (block diagram): ⌈M′/(2 · P)⌉ sub Viterbi scorers, each with RegInδi, RegTmpδj−1 i,
RegTmpδj i, Regaj,j i, Regaj−1,j i, and a PE2, together with RegLastδ (N entries). Register
arrays are labeled with sizes P and M′ mod (2 · P).]
Fig. 8. Pipelined Viterbi scorer for the MultipleStoreBPP architecture (Viterbi scoreri,
i = 1, ..., L′).
5. Evaluation
We compared the StreamBPP (Yoshizawa et al., 2004; 2006), StoreBPP (Nakamura et al., 2010),
FastStoreBPP (Figs. 3, 4, and 5), and MultipleStoreBPP (Figs. 6, 7, and 8) VLSI architectures.
Table 1 shows the register size of the MultipleStoreBPP, FastStoreBPP, StoreBPP, and StreamBPP
architectures, where xµ, xσ, xω, xo, xa, and xf represent the bit lengths of µjp, σjp, ωj, otp,
ajj, and the output of PE1, respectively. N, P, and M are the number of HMM states,
the dimension of the input feature vector, and the number of input feature vectors in a
block, respectively. M′ and L′ are the number of input feature vectors in a block with
MultipleStoreBPP and the number of HMMs whose output probabilities are simultaneously
computed by Loop A′′ (Fig. 6) with MultipleStoreBPP, respectively. OPC and Viterbi scorer
represent the register sizes of the OPC circuit and the Viterbi scorer, respectively.
Table 2 shows the processing time for computing output probabilities of V HMMs
and likelihood scores with the MultipleStoreBPP, FastStoreBPP, StoreBPP, and StreamBPP
architectures, where L is the number of HMMs whose output probabilities are computed
using the same input feature vectors with StoreBPP (Nakamura et al., 2010). OPC and
Viterbi scorer represent the number of clock cycles for OPC and the additional cycles for LSC,
respectively.
MultipleStoreBPP [bit]
  OPC             P · M′ · xo + (2 · xµ + xσ + M′ · xf) · L′
  Viterbi scorer  L′ · [{N + (2 · P + 1)(⌈(M′ + 1)/(2 · P)⌉ − 1)
                  + 2 · P · Σ_{i=0}^{⌈(M′+1)/(2·P)⌉−1} i
                  + (M′ mod (2 · P)) + 1 + (M′ mod (2 · P)) · ⌈M′/(2 · P)⌉} · xf
                  + 2 · ⌈M′/(2 · P)⌉ · xa]
FastStoreBPP [bit]
  OPC             P · M · xo + xµ + xσ + M · xf
  Viterbi scorer  {N + (P + 1)(⌈(M + 1)/P⌉ − 1) + P · Σ_{i=0}^{⌈(M+1)/P⌉−1} i
                  + (M mod P) + 1 + (M mod P) · ⌈M/P⌉} · xf + 2 · ⌈M/P⌉ · xa
StreamBPP (Yoshizawa et al., 2006) [bit]
  OPC             N · P · xµ + N · P · xσ + N · xf
  Viterbi scorer  (2 · N − 1) · xa + N · xω + N · xf
StoreBPP (Nakamura et al., 2010) [bit]
  OPC             P · M · xo + 2 · xµ + xσ + M · xf
  Viterbi scorer  not available
Table 1. Register size.
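The register-size expressions of Table 1 can be evaluated with a short script. The sketch below uses the parameter values that the evaluation adopts for Table 3 (N = 32, P = 38, xµ = xσ = xo = xa = 8, xf = 24, M = 24, and (M′, L′) = (12, 4)). Two points are assumptions: the block size in the MultipleStoreBPP rows is taken to be M′, and xω = 8 is used because the text does not state xω (this value is the one that reproduces the StreamBPP total reported in Table 3). The function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def multiple_store_bits(N, P, Mp, Lp, x_mu, x_sigma, x_o, x_a, x_f):
    """MultipleStoreBPP: OPC plus Viterbi scorer registers (block size M')."""
    opc = P * Mp * x_o + (2 * x_mu + x_sigma + Mp * x_f) * Lp
    stages = ceil_div(Mp + 1, 2 * P)
    rem = Mp % (2 * P)
    entries = (N + (2 * P + 1) * (stages - 1) + 2 * P * sum(range(stages))
               + rem + 1 + rem * ceil_div(Mp, 2 * P))
    return opc + Lp * (entries * x_f + 2 * ceil_div(Mp, 2 * P) * x_a)


def fast_store_bits(N, P, M, x_mu, x_sigma, x_o, x_a, x_f):
    """FastStoreBPP: OPC plus Viterbi scorer registers."""
    opc = P * M * x_o + x_mu + x_sigma + M * x_f
    stages = ceil_div(M + 1, P)
    rem = M % P
    entries = (N + (P + 1) * (stages - 1) + P * sum(range(stages))
               + rem + 1 + rem * ceil_div(M, P))
    return opc + entries * x_f + 2 * ceil_div(M, P) * x_a


def stream_bits(N, P, x_mu, x_sigma, x_omega, x_a, x_f):
    """StreamBPP: OPC plus Viterbi scorer registers."""
    opc = N * P * x_mu + N * P * x_sigma + N * x_f
    return opc + (2 * N - 1) * x_a + N * x_omega + N * x_f


bits = dict(x_mu=8, x_sigma=8, x_o=8, x_a=8, x_f=24)
print(multiple_store_bits(32, 38, 12, 4, **bits))   # 10432
print(fast_store_bits(32, 38, 24, **bits))          # 9848
print(stream_bits(32, 38, x_mu=8, x_sigma=8, x_omega=8, x_a=8, x_f=24))  # 21752
```

With these inputs the three totals match the register sizes listed in Table 3.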
MultipleStoreBPP [cycle]
  OPC             ⌈V/L′⌉ · {⌈P · M′/L′⌉ + (1 + 2 · P) · N} · ⌈T/M′⌉
  Viterbi scorer  ⌈V/L′⌉ · {2 · N · ⌈T/M′⌉ + N}
FastStoreBPP [cycle]
  OPC             V · {P · ⌈M/2⌉ + (1 + P) · N} · ⌈T/M⌉
  Viterbi scorer  V · {N · ⌈T/M⌉ + N}
StreamBPP (Yoshizawa et al., 2006) [cycle]
  OPC             V · (2 · N · P + N + P · T)
  Viterbi scorer  V · (3 · N − 1)
StoreBPP (Nakamura et al., 2010) [cycle]
  OPC             ⌈V/L⌉ · {P · M + (1 + 2 · P) · L · N} · ⌈T/M⌉
  Viterbi scorer  not available
Table 2. Processing times.
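The cycle counts of Table 2 can be checked in the same way. The following sketch evaluates the OPC and Viterbi scorer expressions for the three architectures that support LSC and adds them, using the values the evaluation adopts for Table 3 (V = 800, N = 32, P = 38, T = 86, M = 24, and (M′, L′) = (12, 4)). The function names are hypothetical, and summing the two rows to obtain a total is an assumption that is consistent with the totals reported in Table 3.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def multiple_store_cycles(V, N, P, T, Mp, Lp):
    """MultipleStoreBPP: OPC cycles plus additional LSC cycles."""
    groups = ceil_div(V, Lp)          # L' HMMs are processed together
    blocks = ceil_div(T, Mp)          # number of M'-vector blocks
    opc = groups * (ceil_div(P * Mp, Lp) + (1 + 2 * P) * N) * blocks
    viterbi = groups * (2 * N * blocks + N)
    return opc + viterbi


def fast_store_cycles(V, N, P, T, M):
    """FastStoreBPP: OPC cycles plus additional LSC cycles."""
    blocks = ceil_div(T, M)
    opc = V * (P * ceil_div(M, 2) + (1 + P) * N) * blocks
    viterbi = V * (N * blocks + N)
    return opc + viterbi


def stream_cycles(V, N, P, T):
    """StreamBPP: OPC cycles plus additional LSC cycles."""
    return V * (2 * N * P + N + P * T) + V * (3 * N - 1)


V, N, P, T = 800, 32, 38, 86
print(multiple_store_cycles(V, N, P, T, Mp=12, Lp=4))  # 4233600
print(fast_store_cycles(V, N, P, T, M=24))             # 5580800
print(stream_cycles(V, N, P, T))                       # 4661600
```

The ratio 4,233,600/4,661,600 is about 0.91, which is the 91% figure quoted for MultipleStoreBPP relative to StreamBPP.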
Table 3 shows the register size, processing time, and number of PEs for computing
output probabilities of 800 HMMs and likelihood scores, where it is assumed that N = 32,
P = 38, T = 86, xµ = 8, xσ = 8, xf = 24, xo = 8, xa = 8, and V = 800. These
values are the same as those used in recent circuit designs for isolated word recognition
(Nakamura et al., 2010; Yoshizawa et al., 2004; 2006). In addition, we assume that M′ = 12
and L′ = 4 for the MultipleStoreBPP architecture. Furthermore, ratios relative to StreamBPP
are shown in Table 3. Compared with the StreamBPP architecture, the MultipleStoreBPP
architecture has fewer registers (48% = 10,432/21,752) and requires less processing time (91%
= 4,233,600/4,661,600). The numbers of PE2s in MultipleStoreBPP and FastStoreBPP are smaller
than that in StreamBPP.
Figure 9 shows the processing time and the number of PEs in the MultipleStoreBPP,
FastStoreBPP, and StreamBPP architectures against the value of M, where M = M′ · L′/2 for
MultipleStoreBPP.
Architecture                         Reg. size [bit]   Proc. time [cycle]   #PE1        #PE2
MultipleStoreBPP (M′, L′) = (12, 4)  10,432 (48%)      4,233,600 (91%)      24 (75%)    4 (13%)
FastStoreBPP (M = 24)                9,848 (45%)       5,580,800 (120%)     24 (75%)    1 (3%)
StreamBPP                            21,752 (100%)     4,661,600 (100%)     32 (100%)   32 (100%)
Table 3. Evaluation of the MultipleStoreBPP, FastStoreBPP, and StreamBPP performance.
[Figure 9 (plot): processing time [Mcycle] (0 to 10) versus the value of M, the block size of
FastStoreBPP (0 to 90), for StreamBPP, FastStoreBPP, and MultipleStoreBPP. Points are annotated
with #PEs = #PE1s + #PE2s: StreamBPP 32+32 throughout; FastStoreBPP 12+1, 16+1, 20+1, ...,
44+2, ..., 84+3; MultipleStoreBPP 12+4, 16+4, 24+4, 32+4, ..., 84+4.]
Fig. 9. Processing time, the number of PEs and value of M.
This graph shows that the processing time of the MultipleStoreBPP architecture is less than
that of the FastStoreBPP architecture. It is also less than that of the StreamBPP architecture
when M is greater than 22.
Figure 10 shows the register size of the MultipleStoreBPP, FastStoreBPP, and StreamBPP
architectures against the value of M (block size). This graph shows that the register size
of MultipleStoreBPP is less than those of the FastStoreBPP and StreamBPP architectures when M
is greater than 36 and less than 74.
[Figure 10 (plot): register size [Kbit] (0 to 40) versus the value of M, the block size of
FastStoreBPP (10 to 90), for StreamBPP, FastStoreBPP, and MultipleStoreBPP.]
Fig. 10. Register size and the value of M.
Table 4 shows the circuit area, clock period, and power dissipation of the OPC and LSC circuits
based on the MultipleStoreBPP and FastStoreBPP architectures. The values are derived from the
reports of the Synopsys Design Compiler (Ver. B-2008.09-SP5), where the target technology
is a 90-nm technology (STARC 90nm) and the power dissipation is obtained with the
report_power command after logic synthesis. In the table, the delay and area represent the
minimum clock period and the area of the circuit, respectively. Power represents the power
dissipation of the circuit operating at a clock frequency of 11 MHz. At this frequency, for
800-word real-time isolated word recognition, the MultipleStoreBPP architecture completes
recognition in 0.2 s for a 1-s speech (T = 86) with V = 800, N = 32, P = 38, M′ = 22,
and L′ = 4.
Arch.                                          area [µm2]   delay [ns]   power [mW]
MultipleStoreBPP (#PE1 = 44, M′ = 22, L′ = 4)  1,042,492    2.8          4.5
FastStoreBPP (#PE1 = 32, M = 32)               849,955      2.7          3.5
FastStoreBPP (#PE1 = 44, M = 44)               1,155,595    2.7          4.8
Table 4. Area, delay and power of OPC and LSC circuits.
Compared with the FastStoreBPP architecture, the MultipleStoreBPP architecture has lower
power and less area, because the MultipleStoreBPP architecture has fewer registers when the
number of PE1s is 44.
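The 0.2-s recognition time quoted for the 11 MHz clock can be cross-checked against the MultipleStoreBPP expressions of Table 2. The sketch below is a back-of-the-envelope calculation, not a timing analysis: it assumes that the total cycle count is the sum of the OPC and Viterbi scorer rows and that the circuit runs at exactly 11 MHz. The function name is hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def multiple_store_cycles(V, N, P, T, Mp, Lp):
    """MultipleStoreBPP cycle count from Table 2 (OPC row plus Viterbi scorer row)."""
    groups = ceil_div(V, Lp)
    blocks = ceil_div(T, Mp)
    opc = groups * (ceil_div(P * Mp, Lp) + (1 + 2 * P) * N) * blocks
    viterbi = groups * (2 * N * blocks + N)
    return opc + viterbi


CLOCK_HZ = 11_000_000  # operating frequency stated for the power figures

cycles = multiple_store_cycles(V=800, N=32, P=38, T=86, Mp=22, Lp=4)
seconds = cycles / CLOCK_HZ
print(cycles)             # 2196000
print(round(seconds, 3))  # 0.2
```

Under these assumptions the 800-word vocabulary is scored in roughly one fifth of the 1-s utterance.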
6. Conclusions
We presented MultipleStoreBPP for OPCs and LSCs and a new VLSI architecture that
supports it. MultipleStoreBPP performs parallel OPCs and pipelined LSCs for multiple
HMMs. Unlike the conventional StoreBPP architecture, the MultipleStoreBPP
architecture supports LSC. Furthermore, compared with the StreamBPP and FastStoreBPP
architectures, the MultipleStoreBPP architecture requires fewer registers and less processing
time. This comparison shows the efficiency of the MultipleStoreBPP architecture.
7. Acknowledgements
This work is supported by the VLSI Design and Education Center (VDEC), the University
of Tokyo in collaboration with Synopsys, Inc. and the Semiconductor Technology Academic
Research Center (STARC).
8. References
B. Mathew; A. Davis & A. Ibrahim. (2003). Perception Coprocessors for Embedded Systems.
Proc. of ESTIMedia, 109 – 116
B. Mathew; A. Davis & Z. Fang. (2003). A Low-Power Accelerator for the SPHINX 3 Speech
Recognition System. Proc. of Int’l Conf. on Compilers, Architecture and Synthesis for
Embedded Systems, 210 – 219
K. Nakamura; M. Yamamoto; K. Takagi & N. Takagi. (2010). A VLSI Architecture for Output
Probability Computations of HMM-Based Recognition Systems with Store-Based
Block Parallel Processing, IEICE TRANS. INF. & SYST., Vol. E93-D, No. 2, 300 – 305
S. Yoshizawa; Y. Miyanaga & N. Yoshida. (2002). On a High-Speed HMM VLSI Module with
Block Parallel Processing. IEICE TRANS. Fundamentals, Vol. J85-A, No. 12, 1440 – 1450
S. Yoshizawa; N. Wada; N. Hayasaka & Y. Miyanaga. (2004). Scalable Architecture for Word
HMM-Based Speech Recognition. Proc. of ISCAS’04, 417 – 420
S. Yoshizawa; N. Wada; N. Hayasaka & Y. Miyanaga. (2006). Scalable Architecture for Word
HMM-based Speech Recognition and VLSI Implementation in Complete System.
IEEE TRANS. ON CIRC. & SYST., Vol. 53, No. 1, 70 – 77
Y. Kim & H. Jeong. (2007). A Systolic FPGA Architecture of Two-Level Dynamic Programming
for Connected Speech Recognition. IEICE TRANS. INF. & SYST., Vol. E90-D, No. 2,
562 – 568
Embedded Systems - High Performance Systems, Applications and
Projects
Edited by Dr. Kiyofumi Tanaka
ISBN 978-953-51-0350-9
Hard cover, 278 pages
Publisher InTech
Published online 16, March, 2012
Published in print edition March, 2012
How to reference
In order to correctly reference this scholarly work, feel free to copy and paste the following:
Kazuhiro Nakamura, Ryo Shimazaki, Masatoshi Yamamoto, Kazuyoshi Takagi and Naofumi Takagi (2012). A
VLSI Architecture for Output Probability and Likelihood Score Computations of HMM-Based Recognition
Systems, Embedded Systems - High Performance Systems, Applications and Projects, Dr. Kiyofumi Tanaka
(Ed.), ISBN: 978-953-51-0350-9, InTech, Available from:
http://www.intechopen.com/books/embedded-systems-high-performance-systems-applications-and-projects/a-vlsi-architecture-for-output-probability-and-likelihood-score-computations-of-hmm-based-recognitio
© 2012 The Author(s). Licensee IntechOpen. This is an open access article
distributed under the terms of the Creative Commons Attribution 3.0
License, which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
