In this paper, we present an efficient architecture for connected word recognition that can be implemented on a field-programmable gate array (FPGA). The architecture consists of a newly derived two-level dynamic programming (TLDP) scheme that uses only bit addition and shift operations. The advantages of this architecture are its spatial efficiency, which accommodates more words within limited space, and the absence of multiplications, which increases computational speed by reducing propagation delays. The architecture is highly regular, consisting of identical and simple processing elements with only nearest-neighbor communication, and external communication occurs at the end processing elements. To verify the proposed architecture, we designed and implemented it, prototyping with Xilinx FPGAs running at 33 MHz.
key words: speech recognition, hidden Markov model (HMM), two-level dynamic programming (TLDP), FPGA
Introduction
Speech recognition is a process that allows a computer to map acoustic speech signals to text. That is, speech recognition converts acoustic speech signals provided by a microphone or a telephone into words, a group of words, or sentences. Recognition results may be used as final results by application fields such as instructions, controls, data inputs, and documentation, and may also be used as inputs of language processing in the field of speech understanding. Furthermore, speech recognition is an attractive technique allowing interactive communication between humans and computers, making computer usage environments more convenient for human beings.
For most speech recognition applications, it is sufficient to produce results in real time, and software solutions that perform recognition in real time already exist. However, to increase the use of speech recognition in embedded systems, we need a speech recognition chip with low power consumption, small size, and low cost. In previous work, dedicated hardware architectures for hidden Markov model (HMM)-based speech recognition were introduced in [1]-[8]. These are summarized in Table 1. The existing architectures are designed for isolated speech recognition. Unfortunately, there is no direct hardware implementation for connected speech recognition.
In this paper, we derive a new architecture for connected word recognition that is based on bit additions and shift operations only, excluding any integer multiplications, using the TLDP [12], [13] algorithm. In connected word recognition, most of the problems arise from the difficulty of reliably determining the word boundaries. TLDP is a well-known speech recognition algorithm that assigns word strings to speech segments. We introduce an efficient linear systolic array architecture that is appropriate for FPGA implementation. The array is highly regular, consisting of identical and simple processing elements. The design is scalable and, since these arrays can be concatenated, easily extensible. A scalable technique provides hardware resources that can cope with variable conditions by making small modifications to the hardware architecture. The present architecture relates to chip design techniques based on ASICs and FPGAs, and allows the realization of small devices with low power consumption and low cost by developing an algorithm optimized to the chip.
The hardwired speech recognition system allows easy installation in a device that uses speech recognition through a small and convenient interface without a computer, and allows real-time speech recognition due to the parallel architecture.
The organization of this paper is as follows. Section 2 derives the TLDP algorithm for connected speech recognition. Section 3 shows the detailed systolic architecture of TLDP. The test results are discussed in Sect. 4 and finally, conclusions are given in Sect. 5.
Background of Two Level Dynamic Programming Algorithm
When word boundaries are unclear, as in connected speech, the TLDP algorithm can be used to find them. A very brief outline is given below. For a more detailed description of TLDP theory, see [11]. The notation here is based on [11].
Basic Principles of TLDP
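The basic idea of the TLDP is to break up the computation of connected speech recognition into two stages. At the first level, the algorithm matches each individual word reference pattern, $R_v$, against an arbitrary portion of the test string, $T$. The utterances $T$ and $R_v$ are expressed as in (1):

$$T = \{t(1), t(2), \ldots, t(M)\}, \qquad R_v = \{r_v(1), r_v(2), \ldots, r_v(N_v)\}, \quad 1 \le v \le V, \tag{1}$$

where $M$ is the number of test frames, $N_v$ is the number of frames in the $v$th reference pattern, and $V$ is the number of reference patterns. For every beginning frame $s$ and ending frame $e$ of the test string, the first level computes the minimum distance

$$\hat{D}(v, s, e) = \min_{w(m)} \sum_{m=s}^{e} d\big(t(m), r_v(w(m))\big), \tag{2}$$

where $d(\cdot, \cdot)$ is a local spectral distance measure, and $w(m)$ is a window for dividing the total frame input for signal analysis during a very short time that is assumed to be stable. We can eliminate $v$ by finding the best match between $s$ and $e$ for any $v$, giving

$$\tilde{D}(s, e) = \min_{1 \le v \le V} \hat{D}(v, s, e), \qquad \tilde{N}(s, e) = \arg\min_{1 \le v \le V} \hat{D}(v, s, e) = \text{best reference index}, \tag{3}$$

thereby significantly reducing the data storage without losing optimality. Given the array of best scores, $\tilde{D}(s, e)$, the second level of the computation pieces together the individual reference pattern scores to minimize the overall accumulated distance over the entire test string. This can be accomplished using dynamic programming as

$$\bar{D}_l(e) = \min_{1 \le s \le e} \left[ \tilde{D}(s, e) + \bar{D}_{l-1}(s - 1) \right], \tag{4}$$

where $\bar{D}_l(e)$ is the distance of the best path ending at frame $e$ using a concatenated sequence of $l$ reference patterns. The best path ending at frame $e$ using exactly $l$ reference patterns is the one with minimum distance, over all possible beginning frames $s$, of the concatenation of the best path ending at frame $s - 1$ using exactly $l - 1$ reference patterns plus the distance, from (3), of the best path from frame $s$ to frame $e$.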
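The two levels described above can be sketched in software as follows. This is a minimal illustration and not the hardware design: the function name `two_level_dp`, the table layout `D_hat[v][s][e]` (first-level distances with 0-based frame indices), the boundary condition that an empty prefix with zero reference patterns costs 0, and the choice of the best `l` up to `L_max` at the last frame are all assumptions made for the example.

```python
import math

def two_level_dp(D_hat, L_max):
    """Sketch of two-level DP over precomputed first-level distances.

    D_hat[v][s][e] is the assumed distance of reference v matched against
    test frames s..e (0-based, s <= e). Returns (distance, word_indices).
    """
    V, M = len(D_hat), len(D_hat[0])
    INF = math.inf

    # First-level reduction: keep only the best reference per (s, e).
    D_tilde = [[INF] * M for _ in range(M)]
    N_tilde = [[-1] * M for _ in range(M)]
    for s in range(M):
        for e in range(s, M):
            for v in range(V):
                if D_hat[v][s][e] < D_tilde[s][e]:
                    D_tilde[s][e] = D_hat[v][s][e]
                    N_tilde[s][e] = v

    # Second level: D_bar[l][k] is the best distance covering the first k
    # frames with exactly l reference patterns (k = e + 1 for end frame e).
    D_bar = [[INF] * (M + 1) for _ in range(L_max + 1)]
    back = [[-1] * (M + 1) for _ in range(L_max + 1)]
    D_bar[0][0] = 0.0  # assumed boundary: nothing matched, zero cost
    for l in range(1, L_max + 1):
        for e in range(M):
            for s in range(e + 1):
                cand = D_tilde[s][e] + D_bar[l - 1][s]
                if cand < D_bar[l][e + 1]:
                    D_bar[l][e + 1] = cand
                    back[l][e + 1] = s  # remember the best start frame

    # Pick the best number of words, then trace back the word sequence.
    best_l = min(range(1, L_max + 1), key=lambda l: D_bar[l][M])
    words, l, end = [], best_l, M
    while l > 0:
        s = back[l][end]
        words.append(N_tilde[s][end - 1])
        end, l = s, l - 1
    words.reverse()
    return D_bar[best_l][M], words
```

The three nested loops of the second level are what the systolic array parallelizes: all end frames `e` can be processed by separate processing elements while the start frame `s` advances once per clock.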
The Systolic Architecture of Two-Level Dynamic Programming
Figure 1 is a basic block diagram of a TLDP. The system includes a first processing element group, a comparison module, a second processing element group, and a backtracking module. The first processing element group includes a plurality of parallel processing elements that have the same configuration, and calculate matching costs by using the HMM algorithm. The comparison module determines the minimum matching cost from the first processing element group, and stores it for later calculation. The second processing element group finds the optimized matching cost with the reference pattern for the total frame by using the minimum value determined by the comparison module, detects the word's end point, and recognizes a connected word. The second processing element group also includes a plurality of parallel processing elements having the same configuration. The backtracking module finds a word arrangement of the reference pattern that corresponds to the speech recognition result based on the calculation result by the second processing element group. In this instance, the first processing element group and the comparison module form the first level dynamic programming (first level DP), while the second processing element group and the backtracking module form the second level dynamic programming (second level DP).
Architecture of First Level DP
The processing elements of the first group use the HMM algorithm and the dynamic programming scheme to calculate the matching costs. For example, the matching cost $PE_{lev.1}(v, p, m)$ at the $p$th processing element, defined in (5), is the matching cost between the test pattern and the $v$th reference pattern during the interval $(s, e) = (p, m)$, where $p \le m \le M$ and $M$ is the dimension of the total frame. Hardware architectures for HMM-based speech recognition were introduced in [1]-[8] and are therefore not presented in this paper. As demonstrated in (5) and Fig. 2, the $p$th processing element sequentially calculates the matching costs from $p$ to $M$ when the start point is given to be $p$. The number of functioning processing elements is $M$; hence, the matching costs from all the start points to all the end points can be calculated. Realizing this process in software therefore requires a matching time of $M^2$ clock cycles, whereas the parallel hardwired configuration of the present architecture generates the same results in $M$ clock cycles, corresponding to the dimension of the total frame.
Figure 2(a) shows the systolic array architecture of the first level DP. The first level DP includes the first processing element group and the comparison module. The first processing element group includes a state input unit and a plurality of parallel processing elements that have the same configuration for calculating the matching cost. The first level DP calculates the matching costs of a test pattern against the reference patterns for each start point and end point by using the HMM algorithm and the dynamic programming scheme, determines the minimum matching cost, and extracts the index of the corresponding reference pattern.
That is, since the start points for comparing the test pattern and the reference pattern are established at different values, the matching costs that have the respective components as start points may be calculated by using M input clock signals when the test pattern has M components.
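The clock-count argument can be illustrated with a small simulation. This is only a sketch under stated assumptions: the per-frame cost `local[m]` and its running sum stand in for the HMM-based matching cost, which is not reproduced here, and the names `serial_costs` and `systolic_costs` are hypothetical. The serial version fills one table cell per step, while the systolic version lets every active processing element update in the same clock.

```python
def serial_costs(local):
    """Fill cost[p][m] one cell at a time; returns (table, steps)."""
    M = len(local)
    cost = [[None] * M for _ in range(M)]
    steps = 0
    for p in range(M):            # start point
        acc = 0
        for m in range(p, M):     # end point
            acc += local[m]       # toy stand-in for the matching cost update
            cost[p][m] = acc
            steps += 1
    return cost, steps

def systolic_costs(local):
    """Same table, but PE p starts at frame p and all PEs share a clock."""
    M = len(local)
    cost = [[None] * M for _ in range(M)]
    acc = [0] * M                 # one accumulator register per PE
    clocks = 0
    for m in range(M):            # one clock per incoming frame
        for p in range(m + 1):    # these updates are concurrent in hardware
            acc[p] += local[m]
            cost[p][m] = acc[p]
        clocks += 1
    return cost, clocks
```

For a frame sequence of length $M$, the serial loop takes $M(M+1)/2$ steps, which grows as $M^2$, while the systolic schedule finishes in $M$ clocks and produces the same table.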
When the state input unit receives a feature vector of a speech signal from the feature vector generator, the HMM parameters, namely the state transition probability distribution $A_v$ and the observation symbol probability distribution $B_{v,m}$, are calculated from the learned probability values and provided to the state input unit. To calculate the matching costs, the HMM parameters are input sequentially to the processing elements with the clock signals (see Fig. 2(a)). The comparison module stores the minimum matching cost from the first processing element group together with the index of the corresponding reference pattern. The minimum matching cost is calculated as

$$C_{memory}(v, p, m) = \min\big\{ C_{memory}(v - 1, p, m),\; PE_{lev.1}(v, p, m) \big\},$$
$$I_{memory}(v, p, m) = \begin{cases} v, & PE_{lev.1}(v, p, m) < C_{memory}(v - 1, p, m), \\ I_{memory}(v - 1, p, m), & \text{otherwise}, \end{cases} \tag{6}$$

where $C_{memory}(v, p, m)$ is the stored matching cost, and $I_{memory}(v, p, m)$ is the corresponding index. As (6) shows, the previously stored minimum matching cost, $C_{memory}(v - 1, p, m)$, is compared to the current input matching cost, $PE_{lev.1}(v, p, m)$, and the lesser one is stored in the memory. That is, the minimum among the matching costs that have been input up to a specific time is stored in the memory. Since the values of $e$ in $PE_{lev.1}(v, p, m)$ are sequentially input from 1 to $M$, the memory in the comparison module is configured as $M$ first-in first-out (FIFO) memories for sequential comparative calculation. As shown in Fig. 2(a), the vertical axis stores the cost of the start point and the horizontal axis stores the cost of the end point. Since the start point cannot be greater than the end point, the available values correspond to those with slash marks in the comparison module of Fig. 2(a). The required information is not only the minimum cost but also the corresponding word index. Therefore, all memory elements store the minimum cost and its index at $v = V$.
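The running-minimum update of the comparison module can be modelled with one FIFO per processing element. The sketch below is an assumption-laden software analogue, not the circuit: the function name `run_comparison_fifo` and the input layout (one cost stream per reference word, one cost per end point) are hypothetical, and ties keep the earlier index.

```python
from collections import deque
import math

def run_comparison_fifo(costs_per_word):
    """Keep, per end point, the minimum cost seen so far and its word index.

    costs_per_word[v] is the sequence of costs that one first-level PE
    emits for reference word v, one value per end point.
    """
    n = len(costs_per_word[0])
    # The FIFO starts with "no cost stored yet" in every slot.
    fifo = deque((math.inf, None) for _ in range(n))
    for v, stream in enumerate(costs_per_word):
        for cost in stream:
            stored_cost, stored_idx = fifo.popleft()   # oldest entry out
            if cost < stored_cost:
                fifo.append((cost, v))                 # new minimum wins
            else:
                fifo.append((stored_cost, stored_idx)) # keep previous
    # After each word the FIFO has rotated exactly once, so slot order
    # still matches end-point order.
    return list(fifo)
```

Because each incoming cost meets exactly the entry for the same end point, a single comparator per FIFO is enough, and after the last word ($v = V$) every slot holds the minimum cost and its index.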
Architecture of Second Level DP
At the second level DP, we compute the distance of the best path ending at frame $e$, $\bar{D}_l(e)$, using a concatenated sequence of $l$ reference patterns as in (4). Figure 3 illustrates an algorithm for finding the optimized matching cost $\bar{D}_l(e)$. The cost of the $p$th second-level DP processing element at $l$ reference patterns is defined in (7). As shown in Fig. 3, the second level finds $\bar{D}_l(e)$ by using the values of $\tilde{D}(s, e)$ and $\bar{D}_{l-1}(s - 1)$ for $1 \le s \le e$, and the number of cases to be compared increases when the value of $e$ increases. That is, as demonstrated in (8) and Fig. 3, the second level adds $\tilde{D}(s, e)$, found by the first level, to $\bar{D}_{l-1}(s - 1)$, the distance of the best path ending at frame $s - 1$ using $l - 1$ reference patterns, and finds the matching costs with the $l$ reference patterns by using the dynamic programming scheme. As shown in Fig. 4(a), the second level includes a second processing element group with a plurality of processing elements that have the same configuration, and a register for storing the matching costs calculated by the respective processing elements. The second level is easy to design and modify since all processing elements have the same configuration.
The remaining task is to describe the internal structure of the second level DP processing element. Figure 4(b) shows the processing element. The trapezoidal block represents a comparator. For $M$ clock cycles, the block chooses the smaller of $\tilde{D}(s, e) + \bar{D}_{l-1}(s - 1)$ and $E_l(s - 1)$. At clock $M + 1$, it updates $\bar{D}_l(s)$ with the register value, $E_l(s)$. Notice that no multiplier is involved in this design, or in other parts of the system. As shown in Fig. 4(b), the processing elements of the second processing element group sequentially receive the value of $\tilde{D}(s, e)$ from the memory module, and concurrently receive the value of $\bar{D}_{l-1}(s - 1)$ calculated and stored in the register. While $M$ clock signals are applied, these two input values, $\tilde{D}(s, e)$ and $\bar{D}_{l-1}(s - 1)$, are transmitted to the processing elements, the minimum, $\bar{D}_l(e)$, is selected among the sums of the two inputs, and the value of $\bar{D}_l(e)$ is output when the $(M + 1)$th clock signal is applied. The output value of $\bar{D}_l(e)$ is stored in the register, and the matching cost with the $(l + 1)$th reference pattern is then calculated. That is, the processing elements repeatedly calculate the matching costs with the reference patterns during the $M$ clock signals and provide updated results to the register at the $(M + 1)$th clock signal. After this has been done up to $L_{max}$, the final matching cost is found by using all the $\bar{D}_l(e)$ values stored in the register.
The backtracking module performs traceback on the reference patterns stored in the memory by using the final matching cost $\bar{D}_l(e)$, and extracts the corresponding reference indices, thereby recognizing the speech signals.
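The behavior of one second-level processing element, an adder followed by a comparator and a register, can be mimicked in a few lines. This is a sketch under assumptions rather than a description of the actual circuit: the class `SecondLevelPE`, the function `second_level_pass`, the 0-based indexing, and the convention that `D_bar_prev[s]` holds the previous-level distance of the best path ending just before start frame `s` are all introduced here for illustration.

```python
import math

class SecondLevelPE:
    """One processing element: adder, comparator, and two registers."""

    def __init__(self):
        self.E = math.inf       # running minimum during the M clocks
        self.D_bar = math.inf   # output register, written at clock M + 1

    def clock(self, d_tilde, d_bar_prev):
        # Add the two inputs and keep the smaller of sum and register.
        self.E = min(d_tilde + d_bar_prev, self.E)

    def latch(self):
        # Clock M + 1: publish the minimum and clear the running register.
        self.D_bar = self.E
        self.E = math.inf
        return self.D_bar

def second_level_pass(D_tilde, D_bar_prev):
    """Run one level l over all end frames; PE number e handles end frame e."""
    M = len(D_tilde)
    pes = [SecondLevelPE() for _ in range(M)]
    for s in range(M):                      # M clock cycles, one per start
        for e, pe in enumerate(pes):        # concurrent in hardware
            if s <= e:                      # a start cannot follow its end
                pe.clock(D_tilde[s][e], D_bar_prev[s])
    return [pe.latch() for pe in pes]       # the (M + 1)th clock

# Hypothetical usage: level 1 starts from an empty prefix of cost 0, and
# each later level shifts the previous output by one frame.
INF = math.inf
D_tilde = [[0, 0, 1, 2], [INF, 0, 1, 1], [INF, INF, 0, 0], [INF, INF, INF, 0]]
level1 = second_level_pass(D_tilde, [0, INF, INF, INF])
level2 = second_level_pass(D_tilde, [INF] + level1[:-1])
```

Only additions and comparisons appear in `clock`, which mirrors the observation that no multiplier is needed at this level.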
System Implementation and Experimental Results
The system configuration is shown in Fig. 5. The speech signal is bandlimited and sampled at 16 kHz with 12 bits. Feature extraction produces a 13-dimensional vector of Mel-frequency cepstral coefficients (MFCCs), each with 12 bits, from the pre-processed data. The HMM parameters are extracted using the feature vectors from the preprocessor and the trained data in memory. The pattern matching element chooses the reference that best matches the parameter set of the input signal by using the HMM and TLDP algorithms.
The system is designed for an FPGA (Xilinx Virtex-II XC2V8000) running at 33 MHz. The entire chip is designed in VHDL and has been fully tested. The following experimental results are all based upon the VHDL simulation. The chip has been simulated extensively with the Cadence simulation tools. It is designed to interface with the PLX9656 PCI chip, which has a maximum clock frequency of 66 MHz. The PLX9656 PCI is used within a PC with a 3.06 GHz Pentium 4 processor. The full design (M = 40) occupies 44,027 of the XC2V8000's slices, equal to 94%, requiring 87,394 LUTs and 52,391 FFs (see Table 2).
The speech data used for testing and training were taken from the database designed by the Speech Technology Research Center of Korea. Both the test and training groups consisted of 10 male and 10 female speakers. The system achieved a 91.4% correctness rate on a vocabulary of more than 500 words.
Conclusion
Hardware needs different algorithms for the same application in terms of performance and quality. Since the algorithms used for hardware and software implementation differ significantly, it will be difficult, if not impossible, to migrate software implementations directly to hardware implementations. We have presented a fast and efficient architecture and implementation of a previously presented TLDP algorithm.
A systolic TLDP was derived and tested with VHDL code simulation. This scheme is fast and reliable since the architectures are highly regular. In addition, the processing can be done in real time owing to the parallel hardware implementation. A full-scale system can be easily obtained by scaling the number of processing elements and the number of words.
