In this paper, a fast and memory-efficient VLSI architecture for output probability computations of continuous Hidden Markov Models (HMMs) is presented. These computations are the most time-consuming part of HMM-based recognition systems. High-speed VLSI architectures with small registers and low power dissipation are required for the development of mobile embedded systems with capable human interfaces. We demonstrate store-based block parallel processing (StoreBPP) for output probability computations and present a VLSI architecture that supports it. When the number of HMM states is adequate for accurate recognition, the proposed architecture requires fewer registers and processing elements and less processing time than conventional stream-based block parallel processing (StreamBPP) architectures. The processing elements used in the StreamBPP architecture are identical to those used in the StoreBPP architecture. From a VLSI architectural viewpoint, a comparison shows that the proposed architecture gains its efficiency from its use of registers for storing input feature vectors and intermediate results during computation.
Introduction
Mobile embedded systems with sophisticated natural human interfaces, such as speech recognition, lip reading, and gesture recognition, are required for the realization of future ubiquitous computing.
Recognition tasks can be implemented either on processors (CPUs and DSPs) or on dedicated hardware (ASICs). Although processor-based approaches offer flexibility, real-time recognition tasks using state-of-the-art recognition algorithms exceed the performance level of current embedded processors and require modern high-performance processors that consume far more power than dedicated hardware [2]-[5]. Dedicated hardware, which is optimized for low-power, real-time recognition tasks, is more suitable for implementing natural human interfaces in low-power mobile embedded systems. Fast and memory-efficient VLSI architectures with a small number of registers and processing elements are required for the development of well-optimized embedded systems with capable future human interfaces. VLSI architectures optimized for recognition tasks with low power dissipation have been developed [2]-[6]. Yoshizawa et al. investigated a block-wise parallel processing method for output probability computations of continuous hidden Markov models (HMMs) and proposed a low-power, high-speed VLSI architecture [2]-[4]. Output probability computations are the most time-consuming part of HMM-based recognition systems. Mathew et al. developed low-power accelerators for the SPHINX 3 [7] speech recognition system [5] and perception accelerators for embedded systems [6].
Well-optimized future HMM-based recognition systems require a VLSI architecture that scales well with the number of HMM states: one that needs only a small number of registers and processing elements even when the number of HMM states is increased for accurate recognition.
In this paper, we present a fast and memory-efficient VLSI architecture for HMM computations using a new block-wise parallel processing method. We introduce store-based block parallel processing (StoreBPP) for HMM computations and present an appropriate VLSI architecture for its implementation. Compared with a conventional stream-based block parallel processing (StreamBPP) architecture [2]-[4], the proposed architecture requires fewer registers and processing elements and less processing time when there is a sufficient number of HMM states for accurate recognition. A comparison demonstrates that the proposed architecture gains its efficiency from its use of registers for storing input feature vectors and intermediate results of the computations. The processing elements used in the StreamBPP and StoreBPP architectures are identical.
The remainder of this paper is organized as follows: the structure of HMM-based recognition systems is described in Sect. 2, StoreBPP and our VLSI architecture are introduced in Sect. 3, the evaluation of the proposed architecture is described in Sect. 4, and conclusions are presented in Sect. 5.
HMM-Based Recognition Systems
HMMs are used in applications such as speech recognition, lip-reading, and gesture recognition. Figure 1 shows the basic structure of HMM-based recognition hardware [2]-[5]. The output probability computation circuit and the Viterbi scorer work together as a recognition engine. The inputs to the output probability computation circuit are feature vectors of several dimensions and model parameters of HMMs. These values are stored in RAM and ROM, respectively. The RAM, the ROM, and the output probability computation circuit are interconnected via a single bus, and memory accesses are exclusive. The output probability computation circuit outputs the results of the output probability computation of HMMs. The Viterbi scorer outputs a likelihood score using the Viterbi algorithm. In HMM-based recognition systems, the most time-consuming task is the output probability computation, and the output probability computation circuit accelerates it.
The output probability computation circuit has several register arrays and processing elements (PEs) for efficient high-speed parallel processing.
Typical applications of our VLSI architecture for the output probability computation circuit are speech recognition systems such as isolated word recognition, connected word recognition, and continuous speech recognition, where feature vectors are extracted from the input speech signal by an external circuit or processor outside the recognition hardware of Fig. 1. We assume that the output probability computation circuit computes the output probability of continuous HMMs. Model parameters of continuous HMMs are precomputed from training samples of words, etc. In isolated word recognition, the results of the output probability computation and the likelihood score lead to a recognition result, which is a word. In connected word recognition and continuous speech recognition, the results of the output probability computation and the likelihood score are used by an external circuit or processor outside the recognition hardware of Fig. 1 to obtain a recognition result, which is a sequence of words or a sentence. Our VLSI architecture is directly applicable to the design of the output probability computation circuit in these systems without a major redesign of the whole hardware of Fig. 1. Applying our VLSI architecture to the output probability computation circuit also opens up a further design optimization that reduces the number of PEs in the Viterbi scorer by introducing several registers in it; this is left for future work.
Output Probability Computation of HMMs
Let $O_1, O_2, \ldots, O_T$ be a sequence of $P$-dimensional input feature vectors to HMMs, where $O_t = (o_{t1}, o_{t2}, \ldots, o_{tP})$, $T$ is the number of input feature vectors, and $P$ is the dimension of the input feature vector. For an input feature vector $O_t$, the output probability of an $N$-state left-to-right continuous HMM at the $j$-th state is given by
$$\log b_j(O_t) = \omega_j + \sum_{p=1}^{P} \sigma_{jp}\,(o_{tp} - \mu_{jp})^2, \tag{1}$$
where $\omega_j$, $\sigma_{jp}$, and $\mu_{jp}$ are the factors of the Gaussian probability density function. The output probability computation circuit (Fig. 1) computes $\log b_j(O_t)$ based on Eq. (1), where all HMM parameters $\omega_j$, $\sigma_{jp}$, and $\mu_{jp}$ are stored in ROM, and the input feature vectors are stored in RAM. The values of $T$, $N$, $P$, and the number of HMMs $V$ differ for each recognition system. For a recent isolated word recognition system [2], [3], $T$, $N$, $P$, and $V$ are 86, 32, 38, and 800, respectively, and for another word recognition system [4], they are 89, 12, 16, and 100, respectively. The recent system thus has eight times as many HMMs as the other, and its number of HMM states is increased from 12 to 32 so that there are enough states for accurate recognition. For a continuous speech recognition system [5], $T$, $N$, $P$, and $V$ are approximately 20, 10, 40, and 50, respectively. Different applications require different output probability computation circuit architectures.
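As an illustration of the computation behind Eq. (1), the following minimal Python sketch assumes that the log output probability has the diagonal-Gaussian form $\omega_j + \sum_p \sigma_{jp}(o_{tp} - \mu_{jp})^2$, with the sign and scaling of the variance term folded into $\sigma_{jp}$. The function names and the toy values are hypothetical and are not taken from the circuit design.

```python
def partial_step(acc, o_tp, mu_jp, sigma_jp):
    """One partial computation: a subtraction, two multiplications, an addition."""
    d = o_tp - mu_jp              # subtraction
    sq = d * d                    # first multiplication
    return acc + sigma_jp * sq    # second multiplication, then addition


def log_output_probability(o_t, omega_j, mu_j, sigma_j):
    """Accumulate log b_j(O_t) over the P dimensions, starting from omega_j."""
    acc = omega_j
    for o_tp, mu_jp, sigma_jp in zip(o_t, mu_j, sigma_j):
        acc = partial_step(acc, o_tp, mu_jp, sigma_jp)
    return acc


# Toy values (hypothetical), P = 3
o_t = [1.0, 2.0, 3.0]
mu_j = [0.0, 2.0, 5.0]
sigma_j = [-0.5, -0.5, -0.25]   # assumed to already contain the negative scaling
omega_j = -1.0
result = log_output_probability(o_t, omega_j, mu_j, sigma_j)
print(result)  # -2.5
```

Starting the accumulator at $\omega_j$ mirrors the idea of keeping $\omega_j$ and the intermediate results in the same storage, so that no separate final addition is needed.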
A flowchart of output probability computations is shown in Fig. 2. All output probabilities are obtained by calling the partial computation of $\log b_j(O_t)$ $P \cdot N \cdot T \cdot V$ times. Each partial computation performs the four arithmetic operations required by Eq. (1): a subtraction, an addition, and two multiplications. Block parallel processing (BPP) for output probability computations was proposed by Yoshizawa et al. as an efficient parallel processing method for word HMM-based speech recognition [2]-[4]. In this method, a set of input feature vectors is called a block, and HMM parameters are effectively shared between different input feature vectors in the computation. Their BPP performs an $N$-parallel computation.
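The sequential computation can be sketched as four nested loops. The sketch below assumes that Loop A runs over the dimensions $p$, Loop B over the states $j$, Loop C over the input feature vectors $t$, and Loop D over the HMMs $v$, nested in that order from the inside out; the nesting order, the dictionary layout of the HMM parameters, and the toy values are assumptions made for illustration.

```python
def output_probabilities(features, hmms):
    """Return table[v][t][j] = log b_j(O_t) for HMM v, plus the partial-call count."""
    calls = 0
    table = []
    for hmm in hmms:                                     # Loop D: HMMs v
        per_vector = []
        for o_t in features:                             # Loop C: vectors t
            per_state = []
            for j, omega_j in enumerate(hmm["omega"]):   # Loop B: states j
                acc = omega_j
                for p, o_tp in enumerate(o_t):           # Loop A: dimensions p
                    d = o_tp - hmm["mu"][j][p]
                    acc += hmm["sigma"][j][p] * d * d
                    calls += 1                           # one partial computation
                per_state.append(acc)
            per_vector.append(per_state)
        table.append(per_vector)
    return table, calls


# Toy data (hypothetical): P = 2, N = 2, T = 2, V = 3
hmm = {"omega": [0, 1], "mu": [[0, 0], [1, 1]], "sigma": [[1, 1], [2, 2]]}
features = [[1, 2], [3, 4]]
table, calls = output_probabilities(features, [hmm, hmm, hmm])
print(calls)  # 24 = P * N * T * V

# Call count for P = 38, N = 32, T = 86, V = 800
full_size_calls = 38 * 32 * 86 * 800
print(full_size_calls)  # 83660800
```

With $P = 38$, $N = 32$, $T = 86$, and $V = 800$, the same product gives 83,660,800 partial computations, which indicates why this stage is worth parallelizing.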
In this paper, we classify BPP into two types according to the input data flow: store-based block parallel processing (StoreBPP) and stream-based block parallel processing (StreamBPP). A block can be seen as a set of $M$ ($\le T$) input feature vectors.
A flowchart of the output probability computations with the conventional StreamBPP [2]-[4] is shown in Fig. 3. PE$i$ represents the $i$-th processing element, which computes $\log b_j(O_t)$ by a subtraction, an addition, and two multiplications for Eq. (1). Loop B (Fig. 2) is expanded as shown in Fig. 3, and $\log b_1(O_t), \log b_2(O_t), \ldots, \log b_N(O_t)$ are computed simultaneously with $N$ PEs. In addition to the $N$-state parallel computation, the same HMM parameters $\mu_{jp}$, $\sigma_{jp}$, and $\omega_j$, $1 \le j \le N$, $1 \le p \le P$, are used repeatedly during Loop C in Fig. 3.
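The StreamBPP data flow can be sketched in software as follows. This is only a behavioral model under stated assumptions: the parameters of all $N$ states of one HMM are held locally, the input feature vectors arrive one at a time, and the inner loop over `j` stands in for the $N$ PEs that operate in lockstep in hardware. The function name and the data layout are hypothetical.

```python
def stream_bpp(features, hmm):
    """Behavioral model of StreamBPP for one HMM.

    features: T vectors of length P, streamed one at a time.
    hmm: dict with omega[N], mu[N][P], sigma[N][P], held for the whole stream.
    Returns out[t][j] = log b_j(O_t).
    """
    n_states = len(hmm["omega"])
    out = []
    for o_t in features:                    # Loop C: vectors stream through
        acc = list(hmm["omega"])            # one accumulator per PE, seeded with omega_j
        for p, o_tp in enumerate(o_t):      # Loop A: one dimension per step
            for j in range(n_states):       # Loop B expanded: N PEs in parallel
                d = o_tp - hmm["mu"][j][p]
                acc[j] += hmm["sigma"][j][p] * d * d
        out.append(acc)
    return out


# Toy data (hypothetical): P = 2, N = 2, T = 2
hmm = {"omega": [0, 1], "mu": [[0, 0], [1, 1]], "sigma": [[1, 1], [2, 2]]}
features = [[1, 2], [3, 4]]
stream_result = stream_bpp(features, hmm)
print(stream_result)  # [[5, 3], [25, 27]]
```

In this model every PE sees the same element $o_{tp}$ but its own $\mu_{jp}$ and $\sigma_{jp}$, so the amount of locally held parameter data grows with $N$.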
A flowchart of the output probability computation with StoreBPP is shown in Fig. 4. The PEs in Figs. 4 and 3 are identical. Loop C in Fig. 2 is partially expanded in Fig. 4, and $\log b_j(O_{t+1}), \log b_j(O_{t+2}), \ldots, \log b_j(O_{t+M/2})$ are computed simultaneously with $M/2$ PEs in Loop C1. After this $M/2$-parallel computation, $\log b_j(O_{t+M/2+1}), \log b_j(O_{t+M/2+2}), \ldots, \log b_j(O_{t+M})$ are computed with the same $M/2$ PEs. In this double $M/2$-parallel computation, the same HMM parameters $\mu_{jp}$ and $\sigma_{jp}$ are used twice, because the parameters are independent of $t$. Furthermore, Loop D (Fig. 2) is divided into Loops D1 and D2 (Fig. 4). The same feature vectors $O_{t+1}, O_{t+2}, \ldots, O_{t+M}$ are used repeatedly during Loop D1, because the input vectors are independent of $v$.
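A behavioral sketch of the StoreBPP data flow is given below, assuming that one block of $M$ vectors ($M$ even) is stored locally and reused for several HMMs, and that each pair $\mu_{jp}$, $\sigma_{jp}$ is fetched once and applied to both halves of the block. The loop over `pe` stands in for the $M/2$ PEs; the function name and data layout are hypothetical.

```python
def store_bpp_block(block, hmms):
    """Behavioral model of StoreBPP for one stored block.

    block: M stored feature vectors of length P (M even).
    hmms: HMMs processed against the same stored block (Loop D1).
    Returns out[v][j][m] = log b_j of block vector m under HMM v.
    """
    m_total = len(block)
    half = m_total // 2
    n_dims = len(block[0])
    out = []
    for hmm in hmms:                              # Loop D1: block reused per HMM
        per_state = []
        for j, omega_j in enumerate(hmm["omega"]):
            acc = [omega_j] * m_total             # intermediate results, one per vector
            for p in range(n_dims):               # Loop A
                mu = hmm["mu"][j][p]              # fetched once ...
                sigma = hmm["sigma"][j][p]
                for base in (0, half):            # ... used for both halves
                    for pe in range(half):        # M/2 PEs in parallel
                        m = base + pe
                        d = block[m][p] - mu
                        acc[m] += sigma * d * d
            per_state.append(acc)
        out.append(per_state)
    return out


# Toy data (hypothetical): P = 2, N = 2, M = 2, one HMM
hmm = {"omega": [0, 1], "mu": [[0, 0], [1, 1]], "sigma": [[1, 1], [2, 2]]}
block = [[1, 2], [3, 4]]
store_result = store_bpp_block(block, [hmm])
print(store_result)  # [[[5, 25], [3, 27]]]
```

Here all PEs share one $\mu_{jp}$ and one $\sigma_{jp}$ at a time and differ only in the stored vector they read, so the locally held parameter data does not grow with $N$.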
A New VLSI Architecture for Output Probability Computation
Our StoreBPP VLSI architecture for output probability computations is shown in Fig. 5. The architecture consists of five register arrays and $M/2$ PEs. RegO stores $M$ input feature vectors $O_{t+1}, O_{t+2}, \ldots, O_{t+M}$. Regμ and Regσ store HMM parameters $-\mu_{jp}$ and $\sigma_{jp}$, respectively. Regμ has space for storing $-\mu_{jp}$ and for prestoring $-\mu_{j,p+1}$, so that the value needed for the computation with $\mu_{j,p+1}$ can be loaded during the computation using $\mu_{jp}$.
Figure 6 shows the flowchart of output probability computations using the StoreBPP architecture. The computation starts by reading $M$ input feature vectors from RAM and storing them in RegO in Loop C1. The HMM parameters of the $v$-th HMM are read from ROM and stored in Regμ, Regσ, and Regω, initially $\mu_{11}$, $\sigma_{11}$, and $\omega_1$. For the first half of the stored input feature vectors $O_{t+1}, O_{t+2}, \ldots, O_{t+M/2}$, $M/2$ intermediate results are simultaneously computed with the stored $\mu_{11}$ and $\sigma_{11}$ by $M/2$ PEs, where the HMM parameters are shared by all PEs. At the same time, the next HMM parameter $\mu_{j,p+1}$ of the $v$-th HMM is read from ROM and stored in Regμ, which still holds $\mu_{11}$ for the next computation using $\mu_{11}$. Then, for the other half of the stored input feature vectors $O_{t+M/2+1}, O_{t+M/2+2}, \ldots, O_{t+M}$, $M/2$ intermediate results are simultaneously computed with the same $\mu_{11}$ and $\sigma_{11}$ by $M/2$ PEs. At the same time, the next HMM parameter $\sigma_{j,p+1}$ of the $v$-th HMM is read from ROM and stored in Regσ, where the old value is overwritten because the computation with $\sigma_{11}$ has finished. In this double $M/2$-parallel computation, the stored HMM parameters $\mu_{11}$ and $\sigma_{11}$ are used twice. In the next double $M/2$-parallel computation, the stored HMM parameters $\mu_{j,p+1}$ and $\sigma_{j,p+1}$ are used twice. The $M$ output probabilities $\log b_j(O_{t+1}), \log b_j(O_{t+2}), \ldots, \log b_j(O_{t+M})$ of the $v$-th HMM are obtained by Loop A. The obtained results are copied from Regω to Regδ so that the next output probability computation, that of state $j+1$, can start.
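The register traffic described for a single state can be mimicked with a small simulation. The sketch below is a simplified model and not the actual circuit: it covers one state $j$ only, it prefetches the next parameter only while another dimension remains, and it does not model the hand-over to state $j+1$ or the copy into Regδ. The variable names `reg_o`, `reg_mu`, `reg_sigma`, and `reg_omega` are hypothetical stand-ins for RegO, Regμ, Regσ, and Regω.

```python
def store_bpp_state(reg_o, omega_j, mu_j, sigma_j):
    """Simulate one state j over a stored block of M vectors (M even).

    Returns the M accumulated results and the number of ROM reads issued.
    """
    m_total, n_dims = len(reg_o), len(mu_j)
    half = m_total // 2
    reg_mu = [-mu_j[0], None]        # [value in use, prestored next value], negated
    reg_sigma = sigma_j[0]
    reg_omega = [omega_j] * m_total  # omega_j plus intermediate results
    rom_reads = 3                    # initial mu, sigma, omega
    for p in range(n_dims):
        has_next = p + 1 < n_dims
        # First half: M/2 PEs compute while the next mu is prestored.
        for pe in range(half):
            d = reg_o[pe][p] + reg_mu[0]      # o + (-mu) realizes o - mu
            reg_omega[pe] += reg_sigma * d * d
        if has_next:
            reg_mu[1] = -mu_j[p + 1]
            rom_reads += 1
        # Second half: same PEs, same parameters, while the next sigma is read.
        for pe in range(half):
            m = half + pe
            d = reg_o[m][p] + reg_mu[0]
            reg_omega[m] += reg_sigma * d * d
        if has_next:
            reg_sigma = sigma_j[p + 1]        # overwritten only after its last use
            rom_reads += 1
            reg_mu[0] = reg_mu[1]             # prestored value becomes current
    return reg_omega, rom_reads


# Toy data (hypothetical): P = 2, M = 2
results, rom_reads = store_bpp_state([[1, 2], [3, 4]], 1, [1, 1], [2, 2])
print(results, rom_reads)  # [3, 27] 5
```

In this model each compute phase is paired with at most one ROM read, which is consistent with the single shared bus assumed for the recognition hardware.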
Evaluation
We compared the proposed StoreBPP architecture with the conventional StreamBPP VLSI architecture [2]-[4] (Fig. 7). The StreamBPP architecture consists of three register arrays and $N$ PEs. Regμ and Regσ store HMM parameters $-\mu_{jp}$ and $\sigma_{jp}$, respectively, and Regω stores HMM parameter $\omega_j$ and intermediate results. The PEs in Figs. 7 and 5 are identical.
Table 1 shows the register size (in bits) of the StoreBPP and StreamBPP architectures, where $x_\mu$, $x_\sigma$, $x_o$, and $x_f$ represent the bit lengths of $\mu_{jp}$, $\sigma_{jp}$, $o_{tp}$, and the output of a PE, respectively. $N$, $P$, and $M$ are the number of HMM states, the dimension of the input feature vector, and the number of input feature vectors in a block, respectively. Table 2 shows the processing time for computing output probabilities of $V$ HMMs with the StoreBPP and StreamBPP architectures, where $T$ is the number of input feature vectors and $L$ is the number of HMMs whose output probabilities are computed with the same input feature vectors during Loop D1 of Fig. 6. Table 3 shows the register size, the processing time, and the number of PEs for computing output probabilities of 800 HMMs, where we assume that $N = 32$, $P = 38$, $T = 86$, $x_\mu = 8$, $x_\sigma = 8$, $x_f = 24$, $x_o = 8$, and $V = 800$, the same values as those used in a recent circuit design for isolated word recognition [2], [3].
The delay times of the control paths differ between the two architectures, but the control path delay is small compared with the data path delay.
Conclusions
We presented StoreBPP for output probability computations together with a new VLSI architecture that implements it. StoreBPP performs arithmetic operations on locally stored input feature vectors. Compared with the conventional StreamBPP architecture, the StoreBPP architecture requires fewer registers and PEs and less processing time when the number of HMM states is large enough for accurate recognition. From a VLSI architectural viewpoint, the comparison shows the efficiency of the proposed architecture. A logic design, a Viterbi scorer for the StoreBPP architecture, and a reconfigurable architecture for both the StreamBPP and StoreBPP architectures are left for future work.
Miyanaga of Hokkaido University for their helpful comments.
