This paper introduces an isolated-word speech recognition system in which the speech signal is acquired in real time. A Half Raised-Sine Function (HRSF) is applied to the MFCC parameters of the audio files, and an improved DTW algorithm is implemented. Simulation results show that, compared with the conventional DTW algorithm, the improved DTW algorithm obtains a better recognition rate and a faster response time. Finally, the system was implemented on an FPGA and yields satisfactory performance.
I. INTRODUCTION
Automatic Speech Recognition is a technology that allows a computer to identify the words that a person speaks into a microphone or telephone and convert them to written text. A fundamental problem in speech recognition is the reasonable selection of phonetic features. Linear prediction (LP) analysis is widely used for speech feature extraction, and many successful application systems have been built on LP technology. However, the linear prediction model is a purely mathematical model that does not take the processing characteristics of the human auditory system into account. Other techniques in use include Linear Predictive Cepstral Coefficients (LPCC), Perceptual Linear Prediction (PLP), Mel-Frequency Cepstral Coefficients (MFCC), and Neural Predictive Coding (NPC) [1], [2], [3]. MFCC is a popular technique because it is based on the known variation of the human ear's critical bandwidth with frequency. MFCC coefficients are obtained by de-correlating the output log energies of a filter bank that consists of triangular filters linearly spaced on the Mel frequency scale. There have been many attempts to enhance the robustness of MFCC features. For example, cepstral mean and variance normalization (CMVN) [4], RASTA filtering [5], temporal structure normalization [6], feature warping [7], MVA processing [8], and the Half Raised-Sine Function (HRSF) [9] are commonly used to enhance MFCC robustness against additive noise and channel distortions.
The dynamic time warping (DTW) algorithm has been widely used in speech recognition, especially for template matching in speaker-independent speech recognition. DTW is a simple and effective method for speech recognition. The algorithm is based on dynamic programming and solves the problem of the varying time scale in the different parts of each word. However, the DTW algorithm needs to match the test utterance against all stored templates and then select the most similar template as the recognition result. The more templates are stored, the more time is needed for recognition, so improving the matching speed is a critical issue for the application of speech recognition systems. This problem can be addressed with an implementation strategy that uses the parallel processing paradigm to reduce the required processing time. Recently, parallel techniques [9]-[11] have been proposed to achieve real-time speech recognition, taking advantage of VLSI technology. This paper is organized as follows. In Section II, the feature extraction methods MFCC and HRSF are explained. Section III deals with the DTW algorithm and presents the improved DTW algorithm. The whole system implemented on FPGA is introduced in Section IV, which includes the simulation test, the FPGA implementation, and the final experimental results on speaker-independent and speaker-dependent speech recognition. Section V gives the conclusions and future work.
II. FEATURE EXTRACTION METHOD

A. Mel-Frequency Cepstral Coefficients (MFCC)
For each tone with an actual frequency f measured in Hz, a subjective pitch is measured on a scale called the 'Mel' scale:

$$f_{mel} = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (1)$$

where $f_{mel}$ is the subjective pitch in Mels corresponding to a frequency f in Hz. This leads to the definition of MFCC, a baseline acoustic feature set for speech and speaker recognition applications.
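As a quick numerical check of the Mel mapping, the sketch below converts between Hz and Mels. It assumes the widely used form f_mel = 2595 log10(1 + f/700); the function names `hz_to_mel` and `mel_to_hz` are illustrative and are not taken from the paper.

```python
import math


def hz_to_mel(f_hz):
    """Map a frequency in Hz to Mels (assumed form: 2595 * log10(1 + f / 700))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel):
    """Inverse mapping, from Mels back to Hz."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)
```

With this form, 1000 Hz maps to approximately 1000 Mels, and the mapping becomes increasingly compressive above that point.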
MFCC coefficients are a set of Discrete Cosine Transform (DCT) decorrelated parameters, which are computed through a transformation of the logarithmically compressed filter-output energies, derived through a perceptually spaced triangular filter bank that processes the Discrete Fourier Transformed (DFT) speech signal.
In (2), the indices run over 0, 1, ..., m-1 and 0, 1, ..., N-1, where m is the number of filters and N is the number of points in a speech frame. Each filter is a simple triangle in the frequency domain with center frequency $f_m$, and the center frequencies are distributed uniformly on the Mel frequency axis. The band-pass filter parameters should be calculated beforehand so that they can be used directly in the calculation of the MFCC parameters. The filterbank designed in this paper contains 24 filters, the speech frame length is 256 samples, and the sampling frequency is 8 kHz. The filterbank waveform is shown in the accompanying figure. The MFCC parameters are computed as follows:
(1) Determine the number of points in each frame of the speech sample sequence; in this paper N = 256. Obtain the discrete power spectrum X(n) by applying the FFT to each frame.
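The remaining steps follow the description of MFCC given earlier in this section: filter-bank energies, logarithmic compression, then a DCT. The sketch below walks through one frame under stated assumptions: the common Mel formula 2595 log10(1 + f/700), a naive DFT in place of a hardware FFT, a DCT-II, and no windowing. All function names are hypothetical; only the 24 filters, N = 256, and 8 kHz come from the text.

```python
import cmath
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel):
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Power spectrum for bins 0..N/2 via a naive DFT (an FFT would be used in practice)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spec.append(abs(acc) ** 2)
    return spec


def mel_filterbank(num_filters, n_fft, fs):
    """Triangular filters whose center frequencies are uniform on the Mel axis."""
    top = hz_to_mel(fs / 2.0)
    # num_filters + 2 edge frequencies, converted to DFT bin indices
    edges = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / fs)) for f in edges]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):           # rising edge
            filt[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi + 1):       # falling edge, peak of 1 at the center bin
            filt[k] = (hi - k) / (hi - mid) if hi > mid else 1.0
        bank.append(filt)
    return bank


def mfcc(frame, bank, num_ceps=12):
    """Log filter-bank energies followed by a DCT-II; returns coefficients 1..num_ceps."""
    spec = power_spectrum(frame)
    # a small floor avoids log(0) for silent frames
    log_e = [math.log(max(sum(h * p for h, p in zip(filt, spec)), 1e-10)) for filt in bank]
    m = len(log_e)
    return [
        sum(log_e[j] * math.cos(math.pi * i * (j + 0.5) / m) for j in range(m))
        for i in range(1, num_ceps + 1)
    ]
```

Because the filter bank depends only on the number of filters, the frame length, and the sampling frequency, it can be built once (for example `mel_filterbank(24, 256, 8000)`) and reused for every frame, which matches the remark that the band-pass filter parameters are calculated beforehand.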
B. Half Raised-Sine Function
The standard MFCC parameters reflect only the static characteristics of the speech, but the human ear is more sensitive to its dynamic characteristics. More detailed speech features can be obtained by differentiating the MFCC acoustic vectors. This approach yields the delta MFCCs (DMFCCs) as the first-order derivatives of the MFCCs. The delta-delta MFCCs (DDMFCCs) are then derived from the DMFCCs as the second-order derivatives of the MFCCs.
The differential parameter is calculated as follows:
Here c and d each denote a parameter of one speech frame, and K is a constant, usually 2, so that the differential parameter is a linear combination of the parameters of the two preceding frames and the two following frames. In practice, the MFCC parameters and their differential components are combined into one vector that serves as the speech parameters of one frame.
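To make the differential step concrete, here is a minimal sketch that assumes the commonly used regression form d(t) = sum_{k=1..K} k (c(t+k) - c(t-k)) / (2 sum_{k=1..K} k^2). This exact normalization is an assumption, as are the edge handling (boundary frames are repeated) and the function name `delta`.

```python
def delta(frames, K=2):
    """First-order differential parameters over +/-K frames.

    frames: list of per-frame coefficient lists, all of the same length.
    Frames beyond either end are replaced by the nearest edge frame (assumption).
    """
    T = len(frames)
    denom = 2.0 * sum(k * k for k in range(1, K + 1))
    out = []
    for t in range(T):
        row = []
        for j in range(len(frames[t])):
            num = 0.0
            for k in range(1, K + 1):
                ahead = frames[min(t + k, T - 1)][j]
                behind = frames[max(t - k, 0)][j]
                num += k * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out
```

Applying `delta` to its own output gives the second-order (delta-delta) parameters, and concatenating the three results frame by frame produces the combined feature vector described above.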
Research shows that the components of the feature vector contribute unequally to the recognition rate. In speech recognition, higher-order MFCC components are more susceptible to the influence of noise than lower-order components. Applying a half raised-sine function adds weight to the small-valued high-order components and reduces the weight of the low-order components, which can easily be disturbed by noise.
The formula is as follows.
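As an illustration only, one form of half raised-sine weighting that appears in the literature is w(i) = 0.5 + 0.5 sin(pi i / p) for i = 0, ..., p-1, where p is the number of cepstral components. Whether this is the exact form used here is an assumption, and the names `hrsf_weights` and `apply_hrsf` are hypothetical.

```python
import math


def hrsf_weights(p):
    """Assumed half raised-sine weights w(i) = 0.5 + 0.5 * sin(pi * i / p), i = 0..p-1."""
    return [0.5 + 0.5 * math.sin(math.pi * i / p) for i in range(p)]


def apply_hrsf(coeffs):
    """Weight one frame of cepstral coefficients component by component."""
    weights = hrsf_weights(len(coeffs))
    return [w * c for w, c in zip(weights, coeffs)]
```

Under this assumed form the lowest-order component is scaled by 0.5 and the weight rises to 1 toward the middle orders, so components are re-balanced rather than discarded.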
III. THE ALGORITHM OF DYNAMIC TIME WARPING
A. The Conventional DTW Algorithm
Suppose the reference template has M frame vectors, $A = \{A_1, A_2, \ldots, A_M\}$, where $A_m$ is the m-th feature vector of the speech, and the test template has N frame vectors, $B = \{B_1, B_2, \ldots, B_N\}$, where $B_n$ is the n-th feature vector of the speech. $D(B(i_n), A(i_m))$ denotes the distance between the $i_m$-th feature vector in A and the $i_n$-th feature vector in B.
The distance is usually the Euclidean distance. The DTW algorithm finds a warping function $i_m = W(i_n)$ that maps the time axis n of the test template nonlinearly onto the time axis m of the reference template and satisfies the following equation:

$$D = \min_{W(i_n)} \sum_{i_n = 1}^{N} D\big(B(i_n), A(W(i_n))\big)$$
When A and B are identical, the mapping is a straight line with slope 1. When A and B are not identical, the points that align the m-th frame of A with the n-th frame of B do not lie on a straight line but form a curve, and this curve corresponds to the warping function. If the frame numbers of the test template are marked on the horizontal axis of a two-dimensional coordinate system and those of the reference template on the vertical axis, the horizontal and vertical lines drawn through the integer coordinates form a grid. In this grid, each point (n, m) is the intersection of frame n of the test template with frame m of the reference template. The DTW algorithm thus amounts to finding a path through a number of grid points; the grid points on the path are the frame pairs for which the frame distance between the two templates is calculated.
In practice, the function $W(n)$ is constrained:

$$W(1) = 1, \qquad W(N) = M, \qquad 0 \le W(n) - W(n-1) \le 2$$
Thus, as shown in Fig. 2, the path must start from the bottom-left corner and end at the upper-right corner. Secondly, to prevent a blind search, the path is not allowed to lean too far toward the horizontal or the vertical axis; usually the smallest slope is 1/2 and the largest slope is 2. The local path constraint is shown in Fig. 3. The grid point preceding $(i_n, i_m)$ can only be $(i_n - 1, i_m)$, $(i_n - 1, i_m - 1)$, or $(i_n - 1, i_m - 2)$.
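The accumulated distance $D_A(n, m)$ is then computed recursively as

$$D_A(n, m) = D\big(B(n), A(m)\big) + \min\big[D_A(n-1, m)\, g(n-1, m),\; D_A(n-1, m-1),\; D_A(n-1, m-2)\big] \qquad (10)$$

where g(n, m) is the nonlinear weighting

$$g(n, m) = \begin{cases} 1, & w(n) \neq w(n-1) \\ \infty, & w(n) = w(n-1) \end{cases} \qquad (11)$$

which guarantees that the optimum path to (n, m) does not stay flat for two consecutive frames. The final desired solution to (9) is the minimum matching distance $D_{min}(N, M)$: during recognition, the test template is matched against each reference template, and the reference template with the minimum matching distance gives the recognition result.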
Thus the DTW algorithm requires on the order of NM distance calculations, and NM sets of combinatorics [(10) and (11)], to obtain the best path and the total distance for each reference pattern. We now consider alternative search techniques which seek to reduce the number of local distance calculations.
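The conventional algorithm described above can be sketched as a small dynamic program. This is a minimal illustration, not the authors' implementation: it assumes Euclidean local distances, allows the predecessors (n-1, m), (n-1, m-1), and (n-1, m-2), and enforces the "no two consecutive flat steps" rule with a per-cell flag that records how the best path arrived. The names `euclidean` and `dtw_distance` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(test, ref):
    """Accumulated DTW distance between a test template and a reference template.

    Returns float('inf') when the slope constraints admit no path to (N, M).
    """
    n_len, m_len = len(test), len(ref)
    inf = float("inf")
    acc = [[inf] * m_len for _ in range(n_len)]
    # flat[n][m] is True when the best path reached (n, m) by a horizontal step
    flat = [[False] * m_len for _ in range(n_len)]
    acc[0][0] = euclidean(test[0], ref[0])
    for n in range(1, n_len):
        for m in range(m_len):
            candidates = []
            if not flat[n - 1][m]:          # g = infinity after a flat step
                candidates.append((acc[n - 1][m], True))
            if m >= 1:
                candidates.append((acc[n - 1][m - 1], False))
            if m >= 2:
                candidates.append((acc[n - 1][m - 2], False))
            if not candidates:
                continue
            best, via_flat = min(candidates, key=lambda c: c[0])
            if best < inf:
                acc[n][m] = best + euclidean(test[n], ref[m])
                flat[n][m] = via_flat
    return acc[n_len - 1][m_len - 1]
```

For recognition, `dtw_distance` would be evaluated once per stored reference template and the template with the smallest result selected, which is why the cost grows with the number of templates.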
B. The Improved DTW Algorithm
We have shown that the conventional DTW algorithm solves the problem of optimally time aligning a test and a reference pattern, at the same time providing a measure of the similarity (distance) between test and reference along the alignment path. By restructuring the entire time alignment problem as a problem in finding the best path through a finite grid of points, we can take advantage of a large class of ordered tree and graph searching algorithms, to find the best path with substantially reduced computation of local distances.
Existing algorithms consider the distance between the vectors $U_i$ and $V_j$: matching points are required to have minimum distance, and the minimum weighted sum of distances is the dynamic similarity measure of the sequences U and V. This paper instead uses the similarity between the vectors $U_i$ and $V_j$ for the calculation. Matching points are found by searching for the maximum similarity, and the maximum sum of similarities is the measure for the sequences U and V. The similarity L(i, j) is given as follows:
Usually L(i, j) <= 1; when L(i, j) = 1, vector U is the same as vector V.
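One measure with the stated property (bounded by 1 and equal to 1 for identical vectors) is the cosine similarity. The sketch below uses it purely as an assumed stand-in for L(i, j); the function name `similarity` is hypothetical.

```python
import math


def similarity(u, v):
    """Cosine similarity between two feature vectors (assumed stand-in for L(i, j))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # an all-zero vector has no direction; report zero similarity in that case
    return dot / norm if norm > 0.0 else 0.0
```

Note that the cosine measure also equals 1 for vectors that differ only by a positive scale factor, so it is insensitive to overall gain.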
Calculating the maximum similarity between a reference pattern of N frames and a test pattern of M frames requires a full N*M matrix, which is time-consuming. We therefore first find several matching points, ordered in time along both sequences. Then we only need to consider the sub-matrices between consecutive matching points, and the total similarity is obtained as the sum over these sub-matrices, where K is the number of matching points.
Thus the N*M matrix is divided into several sub-matrices and the computing time is reduced accordingly.
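The saving from splitting the matrix can be estimated by counting grid cells. The sketch below is only an illustration of that arithmetic: it assumes the matching points are given as (reference frame, test frame) index pairs, that the two corners of the grid are always included, and that only the rectangles between consecutive points are filled. How the matching points are selected is not modeled, and `submatrix_cells` is a hypothetical helper.

```python
def submatrix_cells(n_frames, m_frames, anchors):
    """Count local-similarity evaluations when only the sub-matrices between
    consecutive matching points (plus the two grid corners) are filled."""
    points = [(0, 0)] + sorted(anchors) + [(n_frames - 1, m_frames - 1)]
    total = 0
    for (i0, j0), (i1, j1) in zip(points, points[1:]):
        if i1 < i0 or j1 < j0:
            raise ValueError("matching points must be non-decreasing on both axes")
        total += (i1 - i0 + 1) * (j1 - j0 + 1)
    # each interior matching point is shared by two sub-matrices; count it once
    return total - len(anchors)
```

For a 100 by 100 grid, three evenly spaced matching points reduce the count from 10000 cells to 2650 under these assumptions.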
IV. SYSTEM IMPLEMENTATION
A. Simulation
The simulation experiments were carried out in MATLAB 2010a. The recognizer was trained using a Chinese isolated-word database. The database consists of 10 words uttered by 60 different speakers, 30 male and 30 female, and each speaker pronounced every word 5 times. 2000 utterances from this database were used as training data and the remaining 1000 utterances were used as test data. The sampling rate of the speech signal is 8 kHz, and each frame consists of 256 sample points. 12 MFCCs and the log energy component were used together with their delta and delta-delta coefficients; HRSF was then applied for comparison. Table 1 shows the recognition results of the different approaches on the given dataset. In this research, the recognition performance of the different methods was evaluated. The recognition rates for MFCC (86.3% and 90.5%) are lower than those for HRSF (92.7% and 96.1%), so using HRSF improves performance. From Table 1 we can also conclude that the average recognition time decreases by about 29.5%.
B. FPGA Implementation
An isolated-word recognition system was developed to evaluate the proposed approach. The flowchart of the system is shown in Fig. 4.
The hardware of the system is implemented on Altera's DE2 board, which mainly includes a Cyclone II 2C35 FPGA chip, an EPCS16 serial Flash, 4 MB Flash memory, 512 kB SRAM, 8 MB SDRAM, a MIC, an 8-bit ADC0809, an LCD, a serial communication interface, a PS/2 port, etc.
The audio signal is acquired by the 16-bit audio CODEC module on the board, and a timer interrupt is used to control the system. The system lets the audio module read audio data from the module's buffer and then send it to the DDR SDRAM. When the audio module buffer is empty, processing continues with audio data pre-processing, endpoint detection, MFCC coefficient extraction, and HRSF enhancement. In the training phase, the enhanced MFCC coefficients are stored in Flash as reference templates. In the recognition stage, the coefficients are transferred to the DDR memory as the test template to be matched against the reference templates. The improved dynamic time warping (DTW) algorithm is applied in the pattern matching procedure, and the recognition results are given as the output.
1) Framing & Pre-emphasis
For spectrum or vocal tract analysis, pre-emphasis widens and flattens the spectrum of the signal by boosting the high-frequency part. Because speech is short-time stationary, the speech signal can be divided into frames to reduce the negative effect caused by its time-varying nature.
The equation is shown as (18):

$$\mathrm{sign}(n) = s(n) - \alpha\, s(n-1) \qquad (18)$$

where α is 0.9375, s(n) is the digital speech signal, and sign(n) is the pre-emphasized speech signal. In the FPGA, it can be computed as (19):

$$\mathrm{sign}(n) = s(n) - s(n-1) + \frac{s(n-1)}{16} \qquad (19)$$

The term s(n-1)/16 can be obtained by a data shift, so the whole formula speeds up data processing without reducing data accuracy, because it avoids floating-point, multiplication, and division operations.
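Since 0.9375 = 1 - 1/16, the pre-emphasis filter can be evaluated with one subtraction, one addition, and a 4-bit right shift. The sketch below mirrors that integer arithmetic in software; it assumes integer samples, takes the sample before the first one as 0, and uses the hypothetical name `pre_emphasis_shift`.

```python
def pre_emphasis_shift(samples):
    """Integer pre-emphasis y(n) = s(n) - s(n-1) + (s(n-1) >> 4), i.e. alpha = 15/16.

    samples: iterable of integers (e.g. 16-bit PCM values).
    The arithmetic shift rounds toward minus infinity, so the result can differ
    from the exact value by less than one quantization step.
    """
    out = []
    prev = 0  # s(-1) is taken as 0 (assumption)
    for s in samples:
        out.append(s - prev + (prev >> 4))
        prev = s
    return out
```

When the previous sample is a multiple of 16 the result is exact, for example 32 - 0.9375 * 16 = 17.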
Framing:

$$s_w(n) = s(n)\, w(n) \qquad (20)$$

where s(n) is the original signal, $s_w(n)$ is the framed signal, and w(n) is the window function.
2) Endpoint detection and feature extraction
After sampling and A/D conversion (8 kHz sampling frequency, 16-bit sampling depth), the speech signal is input serially into the FPGA. The FPGA chip has a user-defined hardware module that receives the signal in real time, simultaneously computes the difference from the previous sample and the square of the sample, and stores the results in SRAM. When the memory is full, the module notifies the CPU by interrupt to read the data. The short-time energy and the zero-crossing rate of each frame are calculated according to (21) and (22), and the endpoints of the valid speech data are detected with a double threshold. In (21) and (22), N is the number of sampling points per frame.
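A software sketch of double-threshold endpoint detection is given below. It assumes the usual definitions of short-time energy (sum of squared samples over a frame) and zero-crossing rate (half the sum of absolute sign differences between neighboring samples). The threshold values, the search strategy, and all function names are illustrative assumptions.

```python
def short_time_energy(frame):
    """Sum of squared samples over one frame."""
    return sum(s * s for s in frame)


def zero_crossing_rate(frame):
    """Number of sign changes between neighboring samples in one frame."""
    def sign(x):
        return 1 if x >= 0 else -1
    return sum(abs(sign(frame[n]) - sign(frame[n - 1])) for n in range(1, len(frame))) // 2


def detect_endpoints(frames, e_high, e_low, z_thresh):
    """Return (first, last) frame indices of the speech segment, or None.

    Frames whose energy exceeds e_high are certainly speech; the segment is then
    extended while the energy stays above e_low or the zero-crossing rate stays
    above z_thresh (which helps keep weak unvoiced sounds).
    """
    energy = [short_time_energy(f) for f in frames]
    zcr = [zero_crossing_rate(f) for f in frames]
    strong = [i for i, e in enumerate(energy) if e > e_high]
    if not strong:
        return None
    start, end = strong[0], strong[-1]

    def keep(i):
        return energy[i] > e_low or zcr[i] > z_thresh

    while start > 0 and keep(start - 1):
        start -= 1
    while end < len(frames) - 1 and keep(end + 1):
        end += 1
    return start, end
```

The squared samples and sample-to-sample comparisons used here correspond to the quantities that the hardware module is described as computing on the fly.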
Because the sampling frequency is much lower than the system clock frequency, MFCC coefficient extraction can be completed in the idle time during speech acquisition and endpoint detection. In the training phase, the reference templates are built after MFCC coefficient extraction and stored in the Flash memory. Each reference template contains a flag and a feature array. In the recognition stage, the MFCC parameters extracted in real time and the reference templates are sent to the DTW function module, and the number of the template with the maximum similarity is output as the candidate result.
3) The parallelism of DTW algorithm
The DTW algorithm is realized as a hardware IP core (Fig. 6). After design, compilation, simulation, synthesis, and verification, the IP core is added to the FPGA PLB bus system with the EDK tool for system integration. The IP core contains a queue of processing elements (PEs) (Fig. 7) to realize parallel computing [9] [10], a reference template management unit, a test template management unit, a similarity data management unit, the PEs, and a control unit. The basic building blocks of each PE, shown in Fig. 7, are:
i) A control unit (CU): the CU decodes instructions and then generates the signals that control the various arithmetic elements of the PE. The instruction memory contains the program (which depends on the improved DTW algorithm described in Section III).
ii) An arithmetic unit, which performs the operations required for the computation of the partial sums and the local similarities.
iii) A memory unit consisting of two memory blocks, Memory A and Memory B, whose sizes are equal to p, where p is the dimension of the feature vectors $U_i$, $V_j$.
iv) Registers S0, S1, S2, S3, L1, L2, L3, L4, which hold the partial sums $S_t(k-1)$, $S_t(k)$, $S_t(k+1)$, $S_{t-1}(k+1)$ and the local distances $L_t(k)$, $L_t(k-1)$, $L_t(k+1)$, and $L_{t-1}(k+1)$.
v) Two input ports, In1 and In2, and two output ports, Out1 and Out2, for I/O transfers. In1($P_k$) and In2($P_k$) are the input ports of $P_k$, connected to the output ports Out1($P_{k-1}$) and Out2($P_{k+1}$), respectively. Out1($P_k$) and Out2($P_k$) are the output ports of $P_k$, connected to the input ports In1($P_{k+1}$) and In2($P_{k-1}$), respectively, as shown in Fig. 8.
vi) Each $P_k$ is connected via another port to the RPB bus.
The system can trace the best path at high speed through parallel and pipelined processing [5]. The reference template is connected to one input port of every $PE_k$, while the In2 port of $PE_1$ is connected to the test template. Because too many PEs would consume too many hardware resources, K can be chosen as M/2 (where M is the number for the reference templates) or another appropriate value according to the available SRAM space. The scheme of the parallel DTW computation is shown in Fig. 9.
C. Experimental Results
The experiment was divided into two parts: speaker-dependent speech recognition and speaker-independent speech recognition.
The data for the former were the same as for the simulation test and were sent to the FPGA via the serial port in WAV format. The recognition results were finally displayed on the LCD. For the latter, a Chinese speech sample database was used, which has a high signal-to-noise ratio and thus minimizes the impact of noise on the recognition performance.
The experiments show that the average recognition rate of the system is higher for speaker-dependent recognition (96.47%). Because pronunciation, stress, and speaking rate differ between speakers, the average recognition rate for speaker-independent recognition declines to 87.93%, as shown in Table 2.
V. CONCLUSIONS
In this paper, the MFCC and HRSF methods for feature extraction and the conventional and improved DTW algorithms for isolated-word recognition are presented. The system, including the parallel DTW algorithm, is implemented on an FPGA. The results show that HRSF performs better than MFCC alone, since higher-order MFCC components are more susceptible to the influence of noise than lower-order MFCC components. Compared with the traditional DTW algorithm, the improved DTW algorithm obtains a better recognition rate and a faster response time. The system obtained a better recognition rate for speaker-dependent speech recognition than for speaker-independent speech recognition. The overall recognition rate can be improved by future work on feature extraction and endpoint detection.
