This paper presents the design of a System on Programmable Chip (SoPC) based on a Field Programmable Gate Array (FPGA) for speech recognition, in which Mel-Frequency Cepstral Coefficients (MFCC) are used for speech feature extraction and Vector Quantization (VQ) for recognition. The speech recognition system is implemented in three steps: feature extraction, codebook training, and recognition. In the feature extraction step, the input voice data are transformed into spectral components, from which the main features are extracted using the MFCC algorithm. In the recognition step, the spectral features obtained in the first step are processed and compared with the trained components; VQ is applied in this step. In our experiment, Altera's DE2 board with a Cyclone II FPGA is used to implement the recognition system, which can recognize 64 words. The execution speed of the blocks in the speech recognition system is surveyed by counting the number of clock cycles taken to execute each block. The recognition accuracy is also measured for different parameters of the system. These results on execution speed and recognition accuracy could help the designer choose the best configuration for speech recognition on SoPC.
Introduction
Speech recognition systems are applied in many fields, such as health care, the military, human-computer interaction, avionics technicians… were presented in [7] [8] [9] [10]. Among these, the work presented in [8] proposed a hardware/software co-design method to trade off the performance against the flexibility of the recognition system, while [7] and [10] presented FPGA-based implementations of recognition systems. None of them discusses an optimization method for the MFCC algorithm.
In our work, the MFCC method is used with some proposed modifications to increase the performance of the system. The whole system has been implemented using Altera FPGA technology to be more flexible.
The paper is organized as follows. An overview of the speech recognition system is briefly presented in Section 2. The design and implementation of the proposed speech recognition system as a SoPC (System on Programmable Chip) are discussed in Section 3. Section 4 shows the experimental results. Finally, the conclusion is given in Section 5.
Overview of Speech Recognition
This section gives an overview of speech recognition, as shown in Fig. 1. Audio samples go through the Feature Extraction block, presented in Section 2.1, which retrieves the characteristics of the sound by transforming the audio input into spectral coefficients. These spectral characteristics then go through the Training block to create a codebook for each word. In the recognition step, the Recognition block, which uses the Vector Quantization technique given in Section 2.2, captures the spectral features and decides which word was spoken by comparing them with the trained codebooks.
Vector Quantization process
The Vector Quantization process is described in … In speech recognition, it is common to use the Euclidean distance as in (1) that will be used in the classification stage, labeled as feature vector and used in the recognition step.
Implementation
In our implementation, some blocks are modified to obtain a higher computing speed. This section presents these improvements, and the next section reports the results, in numbers of clock cycles, that demonstrate them.
Feature Extraction implementation
Voice Activation Detection (VAD)
The voice signal recorded through the microphone yields a certain number of samples. In this work, the sampling frequency is 8 kHz and each recording lasts 1 second, corresponding to 8000 samples. However, not all of the 8000 samples are meaningful sound; many of them are silence. So, before features are extracted from the audio samples, the program must extract the significant audio and remove the silence.
As mentioned in the previous section, the audio signal is divided into M segments (i.e., blocks) of L samples each. In this work, we assigned that with , which means for each segment.
Then the energy function will be calculated for each segment by the following formula:
VAD will reject a segment if . In this work, we chose . The value of TH was selected empirically: the test was repeated several times until the chosen value clipped the signal correctly.
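The segment-wise rejection described above can be sketched as follows. This is only an illustration under stated assumptions: the energy of a segment is taken to be the sum of its squared samples, a segment is rejected when its energy falls below the threshold, and the names `segment_energy` and `remove_silence`, the segment length, and the threshold value in the usage note are hypothetical and not taken from the paper.

```python
def segment_energy(segment):
    """Short-time energy of one segment (assumed: sum of squared samples)."""
    return sum(s * s for s in segment)


def remove_silence(samples, segment_len, threshold):
    """Keep only the segments whose energy reaches the threshold.

    samples     -- list of integer audio samples
    segment_len -- number of samples per segment (L in the text)
    threshold   -- energy threshold (TH in the text); the value is an assumption
    A trailing partial segment, if any, is dropped.
    """
    kept = []
    for start in range(0, len(samples) - segment_len + 1, segment_len):
        segment = samples[start:start + segment_len]
        if segment_energy(segment) >= threshold:
            kept.extend(segment)
    return kept
```

For example, with `segment_len=4` and `threshold=1000`, a signal made of a quiet segment, a loud segment, and another quiet segment keeps only the loud one. In practice the threshold would be tuned by trial, as the text describes.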
Pre-emphasis
In the pre-emphasis block, the coefficient takes a value from 0.9 to 1. In theory, its normal value is 0.97. Building the system on SoPC, we modify this coefficient to 15/16.
With this value, Equation 4 is simplified.
The advantage of using 15/16 as the coefficient is expressed in Equation 5: a division by 16 can be realized in a binary computation system by shifting 4 bits to the right.
Using this value, the multiplication step in Equations (3) and (4) is simplified to shift and subtract operations.
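The shift-and-subtract idea can be sketched as below. This is a minimal illustration, assuming the usual pre-emphasis form in which each output is the current sample minus the coefficient times the previous sample; the function name `pre_emphasis_shift` and the choice to pass the first sample through unchanged are assumptions, not details from the paper.

```python
def pre_emphasis_shift(samples):
    """Pre-emphasis with coefficient 15/16, using only shifts and subtractions.

    Assumed form: y[n] = x[n] - (15/16) * x[n-1].
    Since (15/16) * x = x - x/16, and x/16 is a right shift by 4 bits,
    no multiplication is needed. Samples are integers; the shift floors
    the result, so small rounding differences from exact arithmetic are expected.
    """
    if not samples:
        return []
    out = [samples[0]]  # assumption: the first sample has no predecessor
    for n in range(1, len(samples)):
        prev = samples[n - 1]
        out.append(samples[n] - (prev - (prev >> 4)))
    return out
```

On a soft-core processor without a fast hardware multiplier, replacing the multiplication with a shift and a subtraction is the kind of saving the text refers to.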
Discrete Fourier Transform (DFT)
In general, the input samples and the DFT outputs are complex numbers.
The N-point DFT can be calculated using Equations 6 and 7.
If the DFT is computed directly from Equations 6 and 7, it requires many trigonometric calculations, real multiplications, and additions. Direct calculation with the DFT formula therefore incurs a large computational cost and slows down program execution. For this reason, we use the Fast Fourier Transform (FFT) algorithm instead. In addition, we propose a look-up table (LUT) of cosine coefficients, which further increases the computing speed.
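An FFT that reads its twiddle factors from a cosine look-up table can be sketched as follows. This is a minimal illustration and not the authors' Nios II implementation: the radix-2 decimation-in-time structure, the names `make_cos_lut` and `fft_with_lut`, and the trick of reading sine values from the cosine table with a quarter-period index shift are all assumptions.

```python
import math


def make_cos_lut(n):
    """Precompute cos(2*pi*k/n) for k = 0..n-1 (computed once, reused per frame)."""
    return [math.cos(2.0 * math.pi * k / n) for k in range(n)]


def fft_with_lut(x, cos_lut):
    """Iterative radix-2 FFT whose twiddle factors come from a cosine table.

    x       -- sequence of real or complex samples; length must be a power of two >= 4
    cos_lut -- table from make_cos_lut(len(x))
    """
    n = len(x)
    if n < 4 or n & (n - 1) != 0 or len(cos_lut) != n:
        raise ValueError("length must be a power of two >= 4 and match the table")
    a = [complex(v) for v in x]

    # Bit-reversal permutation of the input.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

    # Butterfly stages; no trigonometric function is called here.
    size = 2
    while size <= n:
        half = size // 2
        step = n // size
        for start in range(0, n, size):
            for k in range(half):
                idx = k * step
                c = cos_lut[idx]
                # sin(t) = cos(t - pi/2), i.e. a shift of n/4 table entries.
                s = cos_lut[(idx - n // 4) % n]
                w = complex(c, -s)  # exp(-j * 2*pi * idx / n)
                u = a[start + k]
                v = a[start + k + half] * w
                a[start + k] = u + v
                a[start + k + half] = u - v
        size *= 2
    return a
```

The table is built once, so each transform performs only table reads, multiplications, and additions inside the butterfly loops.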
Magnitude computation
If the conventional formula for the magnitude of a complex number, Equation 8, is used, the calculation is very slow.
In this system, after several tests, we propose using the values 1 and ¼ for the two coefficients of the estimate. This approach reduces the number of calculations with acceptable error.
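One common way to estimate a complex magnitude with two coefficients is the "alpha max plus beta min" rule. The sketch below assumes that the estimate has this form, with the coefficient 1 applied to the larger of the two absolute components and ¼ applied to the smaller; this pairing, the function name `estimate_magnitude`, and the use of integer inputs are assumptions, since the exact formula is not recoverable from the text.

```python
def estimate_magnitude(re, im):
    """Approximate sqrt(re**2 + im**2) without a square root or multiplication.

    Assumed form: max(|re|, |im|) + min(|re|, |im|) / 4,
    where the division by 4 is a right shift by 2 bits.
    Inputs are integers (for example fixed-point FFT outputs).
    """
    a = abs(re)
    b = abs(im)
    big = a if a >= b else b
    small = b if a >= b else a
    return big + (small >> 2)
```

For instance, `estimate_magnitude(300, 400)` gives 475 where the exact magnitude is 500, which illustrates the trade of some accuracy for speed.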
Mel frequency filter bank
The power coefficients of the frame are calculated by (10), which depends on the number of filters, the points of the frame's spectrum, and the coefficients of each filter. When implementing the speech recognition system on SoPC, we propose a rectangular filter bank in the new algorithm instead of the triangular filter bank of the traditional approach, so Equation (10) becomes Equation (11). Because the output characteristic of a rectangular filter is either a "1" or a "0", the multiply and sum operations are simplified to simple "add" and "no add" operations.
No multiplication step is required in the proposed approach, which increases the calculation speed.
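The "add" and "no add" behaviour of a rectangular filter bank can be sketched as follows. This is a simplified illustration: the filters are assumed to be non-overlapping bands defined by a list of bin edges, and the name `rectangular_filter_bank` and the edge values in the usage note are hypothetical. The paper's actual band layout on the Mel scale is not reproduced here.

```python
def rectangular_filter_bank(power_spectrum, band_edges):
    """Sum the spectrum bins that fall inside each rectangular filter.

    power_spectrum -- list of power (or magnitude) values, one per FFT bin
    band_edges     -- increasing bin indices; filter m covers
                      band_edges[m] <= k < band_edges[m + 1] (assumed layout)
    Each filter weight is 1 inside its band and 0 outside, so a bin is
    either added to the accumulator or skipped; nothing is multiplied.
    """
    energies = []
    for low, high in zip(band_edges[:-1], band_edges[1:]):
        total = 0
        for k, value in enumerate(power_spectrum):
            if low <= k < high:  # weight 1: add
                total += value
            # weight 0: no add
        energies.append(total)
    return energies
```

For example, `rectangular_filter_bank([1, 2, 3, 4, 5, 6], [0, 2, 5, 6])` returns `[3, 12, 6]`.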
By adding time derivatives to the basic parameter in the Delta block, the performance of a speech recognition system can be greatly enhanced.
After Feature Extraction block, a 160-sample frame is converted to a vector composed of 26 elements, including 12 cepstral coefficients, 1 energy coefficient and their first order time derivatives.
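How a 13-element base vector grows to 26 elements can be sketched as below. The paper does not give the derivative formula, so this sketch assumes a simple central difference between neighbouring frames, with the first and last frames repeated at the edges; the function name `add_deltas` is hypothetical.

```python
def add_deltas(frames):
    """Append first-order time derivatives to each frame's base coefficients.

    frames -- list of per-frame coefficient lists (e.g. 12 cepstral + 1 energy)
    Assumed delta: (next_frame - previous_frame) / 2, with edge frames repeated.
    Returns a list of vectors twice as long as the input vectors.
    """
    out = []
    last = len(frames) - 1
    for t, frame in enumerate(frames):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, last)]
        delta = [(n - p) / 2.0 for n, p in zip(next_frame, prev_frame)]
        out.append(list(frame) + delta)
    return out
```

With 13 base coefficients per frame, each output vector has 26 elements, matching the size stated above.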
Training and recognition implementation
In this work, codebook size of 128 is considered.
We use the K-means algorithm to train the codebook (Fig. 4).
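Codebook training with K-means can be sketched as follows. This is a generic illustration rather than the authors' training code: the deterministic initialisation from evenly spaced training vectors, the fixed iteration count, and the names `squared_distance` and `train_codebook` are assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train_codebook(vectors, k, iterations=20):
    """Train a codebook of k codewords from feature vectors with K-means.

    vectors    -- list of feature vectors (len(vectors) must be >= k)
    k          -- codebook size (128 in the text; small values work for testing)
    iterations -- number of assign/update passes (assumed fixed)
    """
    step = len(vectors) // k
    # Assumed initialisation: evenly spaced training vectors.
    codebook = [list(vectors[i * step]) for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest codeword.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: squared_distance(v, codebook[i]))
            clusters[nearest].append(v)
        # Update step: move each codeword to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                codebook[i] = [sum(m[d] for m in members) / len(members)
                               for d in range(dim)]
    return codebook
```

One codebook would be trained per word, using the feature vectors extracted from that word's training utterances.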
Features are first extracted from the input speech sample by the MFCC algorithm. Then the VQ distortion of the feature vectors is calculated against each codebook. The word whose codebook gives the smallest distortion is the recognized word (Fig. 5).
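The minimum-distortion decision can be sketched as follows. The sketch assumes that the VQ distortion of an utterance is the average Euclidean distance from each feature vector to its nearest codeword; the averaging, the dictionary layout of the codebooks, and the names `euclidean`, `vq_distortion`, and `recognize` are assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def vq_distortion(features, codebook):
    """Average distance from each feature vector to its nearest codeword."""
    total = 0.0
    for v in features:
        total += min(euclidean(v, c) for c in codebook)
    return total / len(features)


def recognize(features, codebooks):
    """Return the word whose codebook gives the smallest VQ distortion.

    features  -- list of feature vectors from the utterance to identify
    codebooks -- dict mapping each word to its trained codebook
    """
    return min(codebooks, key=lambda word: vq_distortion(features, codebooks[word]))
```

For example, with `codebooks = {"yes": [[0, 0], [1, 1]], "no": [[10, 10], [11, 11]]}`, the utterance `[[0.1, 0.2], [0.9, 1.1]]` is recognized as `"yes"`. The cost of this step grows with the codebook size, the number of words, and the number of feature vectors.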
SoPC implementation
The proposed speech recognition system has been implemented on Altera FPGAs for high performance.
We propose a SoPC architecture for speech recognition system as described in Fig. 6 .
In this architecture, the Nios II processor is the most … The LCD is used to show the recognition result. In particular, the Interval Timer is used to count the number of clock pulses taken to execute each block.
Experimental Results
As mentioned above, we use the Interval Timer to survey the program execution speed of each functional block in the speech recognition system. The system input consists of 2400 speech samples. The clock of the system is.
Feature Extraction
The proposed algorithms presented in Section 3 and the traditional approaches are compared in terms of program execution speed and recognition accuracy. The obtained results are presented in Table 1.
As we can see in Table 1, the pre-emphasis block with is executed fastest. The value in the Table) to increase the speed of execution.
As shown in the … Therefore, the FFT with LUT is proposed to be used in the system.
In the magnitude computation step (results in Table 1), the estimation algorithm computes the magnitude of a complex number much faster than the traditional algorithm, which takes a square root, and the result is almost exact. However, using the magnitude estimation algorithm in the system reduces the recognition accuracy by 3.26%. Replacing the triangular filters with rectangular filters makes the Mel Frequency Filter Bank block more than 46 times faster, as shown in Table 1, but gives a recognition accuracy about 1.19% lower than with triangular filters. Therefore, we propose using rectangular filters in the speech recognition system.
Fig. 7 shows the program execution speed of all blocks in the MFCC-based feature extraction, to help the designer identify which block takes the longest time. The FFT block is the slowest: it requires 94,874,620 clock cycles to process the given input samples. The Cepstrum block also costs many clock cycles because the logarithm in this block has not been optimized.
Vector Quantization
With a codebook size of 128, Vector Quantization is used in the recognition step and it costs 531,067,721 clock cycles.
Recognition accuracy
With the proposed architectures and the parameters stated above, the whole recognition system achieves a recognition accuracy of 87%, using 7,416 utterances recorded from male and female adults in three regions of Vietnam: the North, the Middle, and the South.
Conclusion
In this paper, we propose efficient architectures 
