Modern speech recognition applications are heading towards embedded systems and hand-held devices. The Distributed Speech Recognition (DSR) system architecture emerged to address this kind of application. Most existing implementations of this system are presented in software, with little consideration of the end-product platform on which the system will be deployed. In this paper, an optimized hardware implementation of the front end of the DSR system specified in the basic ETSI Aurora standard, ETSI ES 201 108, is presented as an FPGA prototype, with consideration of migration to structured ASIC in case of mass production. Main design issues and tips are highlighted. Results are presented in terms of hardware resource utilization, comparison of some basic system components to third-party reference designs, and compliance with the Aurora standard.
INTRODUCTION
Human-machine interaction is likely to take place in natural language in future embedded systems and mobile devices. Speech-enabled car navigation, natural language e-learning applications and home automation are among those applications. This brings all the embedded systems design constraints into the speech recognition domain, such as limited hardware, memory, power consumption and cost, which creates the need to re-architect existing speech recognition systems.
Early attempts produced the client-server Network Speech Recognition (NSR) architecture [5], where the voice signal is encoded with normal speech coding techniques at the client front end and sent via a voice channel to the server back end, where the whole recognition process takes place after decoding the voice signal. Communication between the two parts takes place over a wired or wireless channel. This architecture provides a light front-end mobile terminal, which reduces its cost, hardware and power consumption. However, this scheme suffers from two main problems: First, ordinary speech encoding/decoding techniques do not preserve many characteristics needed for speech recognition, since speech codecs care about perceptual quality rather than improving the recognition results, which degrades the Word Error Rate (WER) at the back end [5]. Second, encoding the voice signal exposes it to channel errors during transmission, especially over an unreliable wireless channel [5].
Distributed speech recognition (DSR) architecture emerged to tackle the above problems of NSR. Multi-modal applications are those in which interaction between the user and the computer may take place through keyboard strokes, voice commands or even handwriting. This requires voice and data channels. DSR reduces the two channels to a single data channel over which both voice and data can be transmitted, by sending a parameterized representation of the speech signal (the features vector) instead of coding the voice signal directly as in NSR. This has two advantages: First, the speech signal is not directly encoded and transmitted, which protects it against channel errors and improves the WER significantly over unreliable channels. Second, special encoding and framing algorithms are used that focus on lowering the bit rate (4.8 kbps) rather than preserving the perceptual quality.
ETSI AURORA STANDARD OVERVIEW
The STQ-Aurora DSR group at the European Telecommunications Standards Institute (ETSI) has published four standard specifications featuring the front end side of the DSR. The main algorithms specified are: features extraction (Mel Frequency Cepstral Coefficients, MFCC), compression by split vector quantization, and the framing and error protection algorithms. The front end algorithm block diagram specified in [3] is shown in Figure 2. Three sampling rate configurations are supported (8, 11 and 16 kHz), where the frame length, frame rate and frame overlap vary according to the configured sampling rate. For the three configurations, the input speech frame shift interval is 10 ms. The features vector (14 features) is compressed into 7 indices using the split vector quantization algorithm. The compressed speech frame is then formatted in a multi-frame packet format, where the target data rate out of the front end is 4.8 kbps.
In the following, we present a proposal for a hardware implementation of the front end part of the basic standard.
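The framing figures quoted above can be turned into sample and bit counts with simple arithmetic. The following is a minimal sketch, assuming only the 10 ms frame shift and the 4.8 kbps target rate stated in the text; the function names are illustrative and do not come from [3].

```python
# Framing arithmetic from the figures quoted in the text.
# Function names are hypothetical, chosen for illustration only.

FRAME_SHIFT_MS = 10      # frame shift interval stated in the text
BIT_RATE_BPS = 4800      # target front-end output rate stated in the text


def frame_shift_samples(fs_hz):
    """Number of new input samples per frame at sampling rate fs_hz."""
    return fs_hz * FRAME_SHIFT_MS // 1000


def average_bits_per_frame():
    """Average bit budget per 10 ms frame, overheads included."""
    return BIT_RATE_BPS * FRAME_SHIFT_MS // 1000


if __name__ == "__main__":
    for fs in (8000, 11000, 16000):
        print(fs, "Hz ->", frame_shift_samples(fs), "samples per shift")
    print("average bits per frame:", average_bits_per_frame())
```

At 4.8 kbps, each 10 ms frame has an average budget of 48 bits, which has to cover the 7 quantization indices as well as the header and CRC overheads.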
3.
HARDWARE IMPLEMENTATION
In our design, the Field Programmable Gate Array (FPGA) design style is chosen for prototyping, and migration to structured Application Specific Integrated Circuit (ASIC) is chosen for mass production. The customization capabilities offered by this style give flexibility in optimizing the hardware utilization and area in the target chip, which reduces the final product cost. In addition, this style is characterized by its low power consumption, which makes it suitable for hand-held battery-powered devices.
The design proposed here is optimized either for memory resources or for processing time. As appears in Figure 2, the algorithm has many complex components, like the Hamming window, LogE, FFT, LOG, DCT and many others. Those components contain non-linear trigonometric, logarithmic and other complex functions that can either be calculated on the fly or stored in a look-up table (LUT), like the Hamming window factors or the trigonometric functions, which requires extra memory. Some hardware optimized numerical algorithms exist to calculate such functions, like the COordinate Rotation DIgital Computer (CORDIC) [7], which is used extensively in the memory optimized solution to calculate complex functions instead of storing their results in a LUT, at the expense of extra processing time. In contrast, the time optimized solution stores the LUT of such functions, which reduces the required processing time.
The idea of having two solutions comes from the relaxed time constraint on the system, where the frame shift interval specified in [3] is 10 ms. The net processing time available for the front end part is 9.16 ms after removing the header and CRC overheads, which is relaxed compared to present-day chip frequencies and hence gives room for optimizing hardware by using hardware optimized but time consuming algorithms like CORDIC. On the other hand, if time is critical for the user of the chip, the time optimized option is also available.
In the next sub-sections, some main components of the system are discussed.
2.1
CORDIC Core
The CORDIC algorithm is used extensively in this design for two main reasons. First, its hardware implementation is highly optimized, as it utilizes only adders/subtractors, shift registers and one look-up table. This simple hardware can perform many complex functions, covering trigonometric, hyperbolic and linear functions, which are the three types of the algorithm. Second, the accuracy of the result is high after a small number of iterations, with simple convergence constraints. The main target of the algorithm is to rotate an input complex vector by a certain angle; this is called the rotation mode. The other mode is the vectoring mode, where the input vector is aligned to the x-axis. The combination of the three types with the two modes of the algorithm gives a very wide range of complex functions. A simple, configurable hardware implementation is presented in [7]. Table 1 and Table 2 show the usage of the CORDIC in our system and the corresponding configuration. Table 1 shows the usage of CORDIC in both the time and memory optimized solutions, while Table 2 shows the extra usages in the memory optimized solution only. For more information about how the functions are calculated, please refer to [7].
In the time optimized solution, only two CORDIC cores are needed: the first calculates the LogE feature, and the other is re-used between the FFT magnitude, the natural logarithm of the Mel-filter output and the hardware divider. In the memory optimized solution, three cores are needed: one for the Hamming window, another for the LogE feature, and the last one is re-used between the FFT, the Mel-output natural logarithm, the hardware divider and the DCT.
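The two modes described above can be illustrated for the circular (trigonometric) type with a short floating-point model. This is a minimal sketch and not the configurable core of [7]: the iteration count of 16 and all names are assumptions, and the multiplications by powers of two stand in for the shift operations of a fixed-point hardware implementation.

```python
import math

N_ITER = 16  # assumed iteration count, for illustration only

# The single look-up table of the algorithm: elementary rotation angles.
ANGLES = [math.atan(2.0 ** -i) for i in range(N_ITER)]

# Constant gain accumulated by the micro-rotations, compensated once.
GAIN = 1.0
for _i in range(N_ITER):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * _i))


def cordic_rotate(x, y, angle):
    """Rotation mode: rotate (x, y) by `angle` radians (|angle| < ~1.74)."""
    z = angle
    for i in range(N_ITER):
        d = 1.0 if z >= 0 else -1.0          # drive the residual angle to 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return x / GAIN, y / GAIN


def cordic_vector(x, y):
    """Vectoring mode: align (x, y) with the x-axis (requires x > 0).

    Returns (magnitude, angle of the input vector).
    """
    z = 0.0
    for i in range(N_ITER):
        d = -1.0 if y >= 0 else 1.0          # drive y to 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return x / GAIN, z


if __name__ == "__main__":
    print(cordic_rotate(1.0, 0.0, 0.5))      # approximately (cos 0.5, sin 0.5)
    print(cordic_vector(3.0, 4.0))           # approximately (5, atan2(4, 3))
```

Rotating the unit vector (1, 0) yields cosine and sine in one pass, while the vectoring mode yields a magnitude, which is how a single core can plausibly be shared between window, FFT magnitude and DCT computations.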
Hamming window
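The Hamming window equation is [3]:
$$S_w(n) = \left\{0.54 - 0.46\cos\left(\frac{2\pi (n-1)}{N-1}\right)\right\} S_{pe}(n), \quad 1 \le n \le N$$
where N is the speech frame length in samples, n is the sample index, S_pe is the pre-emphasis filter result and S_w is the windowed sample.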
In the memory optimized solution, a CORDIC module [7] is used to calculate the cosine in the above equation with every new sample, and hence the window factor is calculated.
In the time optimized solution, only half of the window is stored in a LUT ROM, and the other half is deduced from it by symmetry.
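The half-window storage of the time optimized solution relies on the symmetry of the window around its centre. The sketch below assumes the usual Hamming form w = 0.54 - 0.46 cos(2*pi*k/(N-1)) with a zero-based index k, and uses N = 200 purely as an illustrative frame length; the function names are hypothetical.

```python
import math


def hamming_half_lut(n_len):
    """First half of the window coefficients, as they could sit in a ROM."""
    half = (n_len + 1) // 2
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n_len - 1))
            for k in range(half)]


def hamming_coeff(lut, n_len, k):
    """Coefficient for zero-based index k, mirroring the stored half."""
    if k < len(lut):
        return lut[k]
    return lut[n_len - 1 - k]      # w(k) == w(N - 1 - k)


if __name__ == "__main__":
    n_len = 200                    # assumed frame length, for illustration
    lut = hamming_half_lut(n_len)
    print(len(lut), "stored coefficients for a", n_len, "sample window")
```

The mirrored read only changes the ROM address, so the second half of the window costs no extra memory and no extra arithmetic.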
Fast Fourier Transform (FFT)
The basic FFT equation is [3] :
$$bin_k = \left|\sum_{n=0}^{FFTL-1} S_w(n)\, e^{-j n k \frac{2\pi}{FFTL}}\right|, \quad k = 0, \ldots, FFTL-1.$$
where bin_k is the magnitude of the resulting FFT coefficients, and FFTL is the length of the FFT result vector. The split radix-2 algorithm was used for FFT calculation [1], [2] and [6]. In the memory optimized solution of the FFT, shown in Figure 3, the twiddle factor complex multiplication involved in the butterfly operations is interpreted as a vector rotation, since multiplication by a complex exponential is equivalent to rotating the multiplied complex vector by the argument of the exponential. Here the hardware optimized CORDIC core [7] is utilized, as shown in the architecture of Figure 3. This optimized core reduces the hardware utilization and memory requirements.
On the other hand, the time optimized solution stores the twiddle factors in a LUT ROM whose length equals the length of the FFT vector, and performs complex multiplication of the FFT vectors by the complex exponential, which requires extra memory and hardware multipliers. However, instead of storing all the twiddle factors, only the first quadrant values of the cosine function are stored, and the rest of the wave is deduced from them. The sine values of the complex exponential can also be deduced from the cosine values. This reduces the memory requirements by 75%. The time optimized architecture is similar to the one in Figure 3, with the CORDIC core replaced by the LUT ROM of the twiddle factors.
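The quarter-wave storage described above can be sketched as follows. This is an illustrative model under stated assumptions, not the paper's ROM layout: it stores the FFTL/4 cosine samples of the first quadrant, treats the quadrant boundary (cosine equal to zero) as a special case, and derives the sine from the cosine through a quarter-period index shift. All names are hypothetical.

```python
import cmath
import math


def quarter_cos_lut(n):
    """Cosine samples cos(2*pi*k/n) for the first quadrant, k = 0 .. n/4 - 1."""
    return [math.cos(2.0 * math.pi * k / n) for k in range(n // 4)]


def cos_from_lut(lut, n, k):
    """cos(2*pi*k/n) for any integer k, using only the first-quadrant table."""
    q = n // 4
    k %= n
    if k > n // 2:            # cos(2*pi - a) = cos(a)
        k = n - k
    if k == q:                # quadrant boundary: cos(pi/2) = 0
        return 0.0
    if k < q:
        return lut[k]
    return -lut[n // 2 - k]   # cos(pi - a) = -cos(a)


def sin_from_lut(lut, n, k):
    """sin(a) = cos(a - pi/2), i.e. the same table read n/4 entries earlier."""
    return cos_from_lut(lut, n, k - n // 4)


def twiddle(lut, n, k):
    """Twiddle factor exp(-j*2*pi*k/n) rebuilt from the quarter-wave table."""
    return complex(cos_from_lut(lut, n, k), -sin_from_lut(lut, n, k))


if __name__ == "__main__":
    n = 256                   # assumed FFT length, for illustration
    lut = quarter_cos_lut(n)
    print(len(lut), "stored values instead of", n)
    print(twiddle(lut, n, 32), cmath.exp(-2j * math.pi * 32 / n))
```

With 64 stored values in place of 256, the table is one quarter of the full length, which matches the 75% reduction quoted in the text; the price is a small amount of address and sign logic.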
Discrete Cosine Transform (DCT)
The DCT basic equation is [3]:
$$C_i = \sum_{j=1}^{23} f_j \cos\left(\frac{\pi i}{23}\,(j - 0.5)\right), \quad 0 \le i \le 12$$
where C_i is the 13-dimensional result vector of the DCT, and f_j is the result of the non-linear transformation applied to the Mel filter bank outputs. The same discussion as for the Hamming window applies here: the memory optimized solution calculates the cosine values using the CORDIC core [7], while the time optimized solution stores the cosine values in a LUT ROM.
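The LUT-based variant of the DCT can be sketched as a table of precomputed cosines followed by multiply-accumulate operations. The sketch assumes the form C_i = sum over j = 1..23 of f_j cos(pi*i*(j - 0.5)/23) for i = 0..12, which is consistent with the 13 coefficients and 23 filter bank channels mentioned in the text; the names are illustrative.

```python
import math

NUM_CEPSTRA = 13   # size of the DCT result vector stated in the text
NUM_FILTERS = 23   # number of Mel filter bank outputs (assumed form of [3])

# Cosine table as it could be stored in a LUT ROM by the time optimized
# solution: one row per cepstral coefficient, one column per filter output.
DCT_LUT = [
    [math.cos(math.pi * i / NUM_FILTERS * (j - 0.5))
     for j in range(1, NUM_FILTERS + 1)]
    for i in range(NUM_CEPSTRA)
]


def dct_cepstra(f):
    """Cepstral coefficients C_0 .. C_12 from 23 log filter bank outputs."""
    if len(f) != NUM_FILTERS:
        raise ValueError("expected %d filter bank outputs" % NUM_FILTERS)
    return [sum(fj * c for fj, c in zip(f, row)) for row in DCT_LUT]


if __name__ == "__main__":
    flat = [1.0] * NUM_FILTERS
    print(dct_cepstra(flat))   # energy only in C_0 for a flat input
```

A flat input makes a convenient sanity check, because every coefficient except C_0 sums to zero. In a memory optimized variant, each table entry would instead be produced on the fly, for example by a CORDIC rotation.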
Memory manager
This component is implementation specific, and is not mentioned in [3]. It controls the access to the system RAM and ROM. The RAM memories used in the system are:
• Input samples RAM: this memory is managed as a circular buffer to handle the frame overlapping requirements. It has a size of N (the speech frame length).
• I RAM: this memory is used to store the real part of the intermediate FFT radix-2 stages. It holds the real input and output of the FFT. It has a size of FFTL (the FFT length).
• Q RAM: same as I RAM, but for the imaginary part of the intermediate FFT radix-2 stages.
The last two memories are re-used in the Mel-filter and DCT components after the FFT is finished. The system also has some ROM memories to hold constant values:
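The circular buffer management of the input samples RAM can be modelled in a few lines. This is a behavioural sketch under assumptions, not the memory manager of the design: the class name, the frame-ready condition and the read-out order are illustrative choices.

```python
class InputSamplesRam:
    """Behavioural model of an N-sample RAM managed as a circular buffer."""

    def __init__(self, frame_len, frame_shift):
        self.frame_len = frame_len
        self.frame_shift = frame_shift
        self.ram = [0] * frame_len
        self.write_addr = 0      # next address to be overwritten
        self.written = 0         # total number of samples received

    def push(self, sample):
        """Store one sample; return a full frame when one is ready, else None."""
        self.ram[self.write_addr] = sample
        self.write_addr = (self.write_addr + 1) % self.frame_len
        self.written += 1
        filled = self.written >= self.frame_len
        if filled and (self.written - self.frame_len) % self.frame_shift == 0:
            # After the write, write_addr points at the oldest sample,
            # so reading N addresses from there yields the frame in order.
            return [self.ram[(self.write_addr + i) % self.frame_len]
                    for i in range(self.frame_len)]
        return None


if __name__ == "__main__":
    buf = InputSamplesRam(frame_len=4, frame_shift=2)
    for s in range(8):
        frame = buf.push(s)
        if frame is not None:
            print(frame)
```

Each new frame overwrites only the oldest `frame_shift` samples, so the overlapping part of the previous frame stays in place and the RAM never needs more than N words.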
SYSTEM EVALUATION
The presented design is evaluated along three axes. First, the system hardware utilization and time performance are presented. Second, some of the main system components that are usually used in benchmarking are compared to third-party reference designs and other designs. Finally, the compliance of the system output is validated against the official standard test vectors provided by the ETSI.
3.1
Hardware utilization and processing time performance
Table 3: Hardware utilization of the memory and time optimized solutions over the three configurations.
Table 4 shows the frame processing time performance as a percentage of the net allowed frame time after removing the header and CRC overheads (9.16 ms). A chip frequency of 100 MHz (after synthesis) is assumed.
Table 4: Processing time performance of the memory and time optimized solutions.
Tables 3 and 4 show that the time optimized solution gives better results in terms of logic gate utilization and time performance. On the other hand, the memory optimized solution is better in terms of memory usage, although the two solutions are not far apart in this respect.
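The percentages reported in Table 4 follow from a fixed cycle budget per frame. As a minimal sketch, the budget can be derived from the 9.16 ms net frame time and the assumed 100 MHz clock quoted above; the helper names and the example cycle count are illustrative and are not measurements from this design.

```python
NET_FRAME_TIME_US = 9160   # 9.16 ms net frame time, in microseconds
CLOCK_MHZ = 100            # assumed post-synthesis clock, cycles per microsecond


def cycle_budget():
    """Clock cycles available to process one frame."""
    return NET_FRAME_TIME_US * CLOCK_MHZ


def time_utilization_percent(cycles_used):
    """Frame processing time as a percentage of the net allowed frame time."""
    return 100.0 * cycles_used / cycle_budget()


if __name__ == "__main__":
    print(cycle_budget())                      # 916000 cycles per frame
    # Hypothetical cycle count, used only to show the calculation.
    print(time_utilization_percent(91600))     # 10.0
```

A budget of 916,000 cycles per frame is what makes the slower, iterative CORDIC-based solution affordable.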
Comparison to other designs
In this section the FFT and CORDIC modules are evaluated against the Altera reference designs and other designs. Altera reference designs are available from Altera for use as IP cores (called MegaCore functions or megafunctions), with documentation available on the Altera web site, www.altera.com.
FFT vs. other designs
The FFT component is commonly used to benchmark Digital Signal Processing (DSP) systems. The results in Table 5 show that the FFT design presented here outperforms the Altera reference design in terms of utilization of logic elements and memory bits, while both utilize four 18x18 multipliers. In terms of time performance, the local FFT design takes only 256 cycles to finish, while the reference design takes 1628 cycles, which is about 6.36 times as many cycles as the local FFT design. The design in [9] is very efficient in terms of clock cycle count; however, this comes at the expense of hardware, memory and multiplier resources. The enhanced performance of the FFT design proposed here is due to the following reasons:
• The optimized usage of memory, especially in the LUT of the sine and cosine factors as discussed in 2.3.
• Fixing the internal signal lengths to 16 bits (less than 18) optimizes the usage of the embedded multipliers (18 x 18) on the FPGA.
• Using the dual-port RAM feature of the FPGA enables simultaneous memory access during the FFT stages, which improved the time performance by 50%.
• Using the split-radix FFT algorithm with only one butterfly core and iterating on it greatly reduced the hardware utilization.
• Pipelining between the FFT components (bit reversal, butterfly, twiddle factors and address generator) improved the time performance.
CORDIC vs. Altera's reference design
The results in Table 6 show that the local CORDIC design takes the same number of clock cycles to finish as the reference design. However, in terms of logic element utilization, the reference CORDIC design uses about 2.4 times as many logic elements as the local CORDIC design.
3.3
Compliance to the ETSI Aurora Basic Standard
The ETSI provides reference high-level C code together with reference test vectors of about 8 s of continuous speech, which represents about 813 speech frames at the 10 ms frame shift, so that proprietary implementations can be tested against them to prove compliance with the Aurora standard. The design presented here was tested against those results. Table 7 shows that the output of the system under test is correct to the third decimal place. Note that the average error reported is the error between the features after the quantization block in the reference system and in the system under test.
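The kind of comparison described above can be sketched as an average absolute error between two feature streams. This is only an illustration of the metric: the tolerance of 1e-3, the function names and the sample values are assumptions made for the example, and they are not the ETSI test procedure or results from this design.

```python
def average_abs_error(reference, under_test):
    """Mean absolute difference between two equally long feature sequences."""
    if not reference or len(reference) != len(under_test):
        raise ValueError("sequences must be non-empty and of equal length")
    total = sum(abs(r - t) for r, t in zip(reference, under_test))
    return total / len(reference)


def matches_to_third_decimal(reference, under_test, tolerance=1e-3):
    """True when the average error stays below the assumed tolerance."""
    return average_abs_error(reference, under_test) < tolerance


if __name__ == "__main__":
    # Made-up values standing in for reference and fixed-point features.
    ref = [1.0, -2.5, 0.25]
    dut = [1.0004, -2.5003, 0.2498]
    print(average_abs_error(ref, dut))
    print(matches_to_third_decimal(ref, dut))
```

In practice the two sequences would come from the reference C code and from the hardware output for the same test vectors.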
RELATED WORK
A similar implementation of the front end module of the Aurora ETSI system is presented in [8]. Note that the design in [8] covers only the features extraction part, so the comparison made here does not include the quantization and framing components of the front end client.
Table 8: Comparison between the proposed front end design and the one in [8].
The usage of algorithms with low hardware resource utilization, like CORDIC, reduced the hardware resources and DSP units in the design proposed in this paper. Also, limiting the fixed-point length of the internal signals to 16 bits enabled the use of the embedded multipliers and DSP MAC units of the FPGA (18 bits wide). Other platform dependent optimizations, like using the dual-port RAM capability and the DSP MAC units, improved the utilization. Finally, re-using some components, and using single processing cores and iterating on them in complicated operations like the FFT, Mel filtering and DCT, optimized the resource utilization. The design in [8] is more concerned with re-usability, so some of the platform capabilities might not be exploited to the same extent as in the design proposed here, which is more customized.
CONCLUSION
In this paper, we presented a hardware solution to implement the front end part of a distributed speech recognition system, taking the front end of the ETSI Aurora basic standard as a reference. The hardware platform chosen for implementation is FPGA for prototyping and structured ASIC for mass production. Two solutions were presented: memory optimized and time optimized. Hardware optimized algorithms like CORDIC were used in the design, especially in the memory optimized solution, to reduce the ROM needed to store constant values by calculating them instead. Results show that the design can fit in a low cost Cyclone III 10K gates FPGA. Two main components, the FFT and the CORDIC, were used to benchmark the system against corresponding reference designs provided by Altera and against other designs. The result of the comparison is strongly in favour of our system in terms of hardware resources or time performance. Finally, compliance with the reference standard being implemented (the basic ETSI Aurora standard) is shown by comparing the final system output to the reference standard output over about 8 s of continuous speech. The result of the comparison shows agreement between the fixed-point and reference outputs to the third decimal place.
