This paper describes the design and implementation results of an adaptive noise canceller for building robust speech enhancement interfaces. The algorithm performs very well in real-time applications. Its main disadvantage is that it requires several division operations, which have a high computational cost. In addition, the accuracy of the algorithm is critical in fixed-point representation because of the wide range between the upper and lower bounds of the variables involved. To address this problem, the accuracy is studied and, according to the results obtained, a specific word-length is adopted for each variable. The algorithm has been implemented for Altera and Xilinx FPGAs using high-level synthesis tools. The results for a fixed 40-bit format for all the variables and for a specific word-length for each variable are analyzed and discussed.
INTRODUCTION
The Adaptive Noise Canceller described here is intended for building Robust Speech Enhancement Interfaces to be used in adverse environments with high noise levels whose spectral power characteristics change continuously. This pre-processing step has several applications, such as Robust Speech Recognition interfaces for Command-Driven systems (advanced videogames, virtual reality applications), man-machine communication in computer-aided manufacturing, intercommunication systems in factories, aircraft cockpit communications, etc. These environments are characterized by high noise levels (more than 95 dB) produced by customers' chatting, ambient music, sirens, motors, etc. There are different techniques to reduce noise levels [1] [2]. Adaptive Noise Cancellation for non-stationary environments in the time domain is an adequate technique.
This work is funded by grant TIC2006-12887-C02-00 and projects HESPERIA, 11310L (CCG06-UPM/INF-2) and SIB3MATI (FIT-3600000-2007-32).
The implementation of these algorithms has traditionally been carried out with general-purpose DSP microprocessors using floating-point arithmetic. These implementations minimize round-off errors but tend to be limited in processing speed because they usually have only one or a few processing units available. In DSP microprocessors the word-length is fixed by the hard-wired architecture, whereas in reconfigurable computing the size of each variable may be customized to get the best trade-off among numerical precision, speed, size and power consumption. Reconfigurable computing designs have been reported to achieve up to 500 times speed-up and 70% energy savings over microprocessor implementations for specific applications [3].
The most difficult task in translating an algorithm written in MATLAB for a general-purpose processor or DSP microprocessor into an algorithm optimized for custom logic is the floating-point to fixed-point conversion, because of accuracy problems [4]. The problem of word-length optimization is NP-hard [5], and different approaches have been adopted and tools developed for its treatment [4].
The recording scheme is based on a two-microphone structure, as shown in Fig. 1: one microphone for the noisy speech (primary, x(n)) and the other for the noise itself (reference, ra(n)). The speech source is assumed to be well separated from the reference microphone to avoid crosstalk. The noise is estimated by a lattice filter, which is adapted by its own estimation errors. Its backward residuals are used, in combination with the noise estimate generated, to adapt the weights of a ladder filter. Clean speech is then obtained as the error output of the ladder filter.
Variable initialization is shown in [...]. The aspects with the greatest influence on the computational complexity of the algorithm are the sampling frequency (11025 Hz, enough for speech) and the number of stages of the lattice filter (14, required to support two microphones separated by 20 cm). With these characteristics an average cancellation of 6 to 12 dB is obtained. The filter is recursive in the order of the lattice filter (m) and in time (n). The convergence rate of the algorithm is good, and its computational complexity is of order N. The algorithm is computed in three stages: initialization, lattice and ladder.
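The two figures quoted above can be put in perspective with a short calculation. The following C sketch is only a sanity check and not part of the original design: it assumes a speed of sound of 343 m/s (a value not given in the paper), and the function names are hypothetical. It converts the 20 cm microphone spacing into a propagation delay measured in samples at 11025 Hz, and gives the time available per sample.

```c
#include <assert.h>

/* Propagation delay between two microphones, expressed in samples.
 * c_m_per_s is the speed of sound; 343 m/s is an assumed value. */
double delay_in_samples(double distance_m, double fs_hz, double c_m_per_s)
{
    assert(distance_m >= 0.0 && fs_hz > 0.0 && c_m_per_s > 0.0);
    return distance_m / c_m_per_s * fs_hz;
}

/* Time available to process one sample, in microseconds. */
double sample_period_us(double fs_hz)
{
    assert(fs_hz > 0.0);
    return 1.0e6 / fs_hz;
}
```

Under that assumption, delay_in_samples(0.20, 11025.0, 343.0) is about 6.4 samples, so a 14-stage filter spans the inter-microphone delay with margin, and sample_period_us(11025.0) is about 90.7 microseconds per sample.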
The lattice filter computation is the most expensive part of the noise removal algorithm, as shown in Table 2. It begins with n = 0 and computes the updates for m = 0, 1, ..., N-2, with N = 14. The first step is to update the Parcor coefficient for the next stage and, from it, to calculate the reflection coefficients. The forward and backward errors and the forward and backward residual errors are evaluated next. Finally, the adaptive parameter is estimated.
Forward and backward errors:
$b_{m+1}(n+1) = b_m(n) + K^{b}_{m+1}(n)\, f_m(n+1)$
Forward and backward residual errors
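The order recursion for the forward and backward errors can be illustrated with a short C sketch. It is a minimal floating-point version of a textbook two-coefficient lattice recursion, not the authors' code: the coefficient adaptation (Parcor update, residual errors, adaptive parameter) is left out, the sign convention is an assumption, and the names lattice_state, lattice_step, k_f and k_b are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define LATTICE_ORDER 14  /* N in the text */

typedef struct {
    double k_f[LATTICE_ORDER - 1];  /* forward reflection coefficients (held fixed here) */
    double k_b[LATTICE_ORDER - 1];  /* backward reflection coefficients (held fixed here) */
    double b_prev[LATTICE_ORDER];   /* backward errors of the previous time instant */
} lattice_state;

/* Process one reference sample r and write the backward errors of all
 * orders into b_out. Coefficient adaptation is intentionally omitted. */
void lattice_step(lattice_state *s, double r, double b_out[LATTICE_ORDER])
{
    assert(s != NULL && b_out != NULL);

    double f = r;   /* forward error of order 0 */
    b_out[0] = r;   /* backward error of order 0 */

    for (size_t m = 0; m + 1 < LATTICE_ORDER; ++m) {
        /* Both updates use the delayed backward error of order m. */
        double f_next = f + s->k_f[m] * s->b_prev[m];
        b_out[m + 1] = s->b_prev[m] + s->k_b[m] * f;
        f = f_next;
    }

    /* Store the current backward errors for the next time instant. */
    for (size_t m = 0; m < LATTICE_ORDER; ++m)
        s->b_prev[m] = b_out[m];
}
```

The loop runs over m = 0 to N-2, matching the range given in the text. In the full algorithm the reflection coefficients would be recomputed at every stage before being used, which is where the divisions mentioned in the paper arise.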
The ladder filter calculates the gain factor, estimates the noise and produces the clean signal as a final result (see Table 3). It begins with n = 0 and computes the updates for m = 0, 1, ..., N-1, with N = 14.
Gain factor: [...] $r^{b}_{m}(n)$ [...]
Estimate noise: $x^{e}_{m}(n) = x^{e}_{m-1}(n) + b_m(n)\, g_m(n)$
Clean signal: $e_{m+1}(n+1) = e_m(n+1) + g_m(n)\, b_m(n+1)$
The algorithm demands 9 [...] (see Table 4). As the final implementation of the algorithm is to be carried out in reconfigurable logic using high-level synthesis methodologies, the restriction of the synthesis tools to integer data types must be taken into special consideration because of its implications for the computation accuracy of the algorithm [7].
In Table 5 it can be observed that the three most significant figures remain unchanged, while changes appear in the last significant figure, in the 10^-4 to 10^-6 positions. Thus, to capture the influence of this last significant figure, the adaptive parameter must be scaled by 10^6 or, with the hardware implementation of the scale factor in mind, by 2^20. The reflection coefficients and the adaptation step were scaled in the same proportion as the adaptive parameter. By the same analysis, the gain factor must be scaled by 10^4 or 2^14.
Taking these scale factors into account, an exhaustive simulation study was carried out to adjust the number of bits for each variable (NB). This number was adjusted according to the lower and upper bounds obtained during the computation of the algorithm for all the commands included in the proprietary database mentioned before. Results were validated by estimating the errors between the clean signals obtained in floating-point format and those obtained when the variables are treated as integer numbers including the scaling factor. The clean waveform was also listened to in order to evaluate subjectively the intelligibility of the commands. Table 6 summarizes the optimal word-length for each variable and its associated scale factor. To give an idea of the quality of the results, the words "down" and "eight" corrupted by noise are shown in Fig. 2a. The clean signal obtained after floating-point computation is shown in Fig. 2b, and the clean signal obtained using the word-length and parameters from Table 3 is presented in Fig. 2c. Comparing the clean trace obtained with floating-point arithmetic with the one obtained with the optimally adjusted word-length, it can be concluded that the results are interchangeable.
The algorithm was described in ANSI C and automatically translated into VHDL by means of the CATAPULT-C tool from Mentor Graphics [8]. The resulting VHDL code was then synthesized with Quartus II from Altera and ISE from Xilinx. The results presented next correspond to two word-length cases. The first case considers a fixed 40-bit data format for all the variables involved in the algorithm, because it is the longest data format needed after optimization. The second one uses the word-length adjusted ad hoc for each variable after optimization, according to Table 6.
Table 7 shows the Altera results for the device EP2S15F484C3 from the Stratix family. There is a reduction of 37% in ALUTs, 41.5% in registers and 32.3% in memory bits. These reductions, however, seem to imply a 68.7% increase in the DSP blocks needed. The frequency increases by 5.1%, and a significant reduction of 30.8% is achieved in dynamic power. Concerning the Xilinx results shown in Table 8, a similar saving percentage is found for function generators, CLB slices and Dffs, with 28.1% for RAM blocks. The DSP blocks show the same tendency as in the Altera case, increasing by 100%. No significant differences in frequency and power dissipation were observed. The Altera and Xilinx results cannot be strictly compared, because the FPGAs used in the implementations have different characteristics and the synthesis, optimization and mapping modules of the tools may not use the same strategies.
Physical resource demand in the optimal case shows the same tendency for both tools: a reduction in the combinational and memory parts and an increase in the number of DSP blocks. This increase is natural as the number of bits decreases, because the tool can map functionality more easily onto DSP units according to their number of bits, 9 for Altera and 48 for Xilinx. The maximum clock frequency and the dynamic power show better behaviour for Altera than for Xilinx.
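The scaling and word-length selection discussed in the accuracy study can be illustrated with a short C sketch. It is a minimal example under stated assumptions, not the code used in the paper: only the scale exponents 2^20 and 2^14 come from the text, while the function names, the rounding rule (round to nearest) and the use of a 64-bit container for words of up to 40 bits are hypothetical choices.

```c
#include <assert.h>
#include <stdint.h>

/* Scale exponents taken from the text: 2^20 for the adaptive parameter,
 * reflection coefficients and adaptation step; 2^14 for the gain factor. */
#define ADAPTIVE_SHIFT 20
#define GAIN_SHIFT     14

/* Convert a real value to a scaled integer, rounding to nearest (assumed rule). */
int64_t to_fixed(double value, unsigned shift)
{
    assert(shift < 62);
    double scaled = value * (double)((int64_t)1 << shift);
    return (int64_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Convert a scaled integer back to a real value. */
double from_fixed(int64_t fixed, unsigned shift)
{
    assert(shift < 62);
    return (double)fixed / (double)((int64_t)1 << shift);
}

/* Smallest two's-complement word length that holds every integer in [lo, hi]. */
unsigned bits_required(int64_t lo, int64_t hi)
{
    assert(lo <= hi);
    unsigned bits = 1;
    while (bits < 64) {
        int64_t max = ((int64_t)1 << (bits - 1)) - 1;
        int64_t min = -max - 1;
        if (lo >= min && hi <= max)
            break;
        ++bits;
    }
    return bits;
}
```

For instance, to_fixed(0.5, ADAPTIVE_SHIFT) yields 524288, that is 0.5 x 2^20, and the quantization error after a round trip is at most 2^-21. Feeding the scaled lower and upper bounds observed in simulation into bits_required gives one possible way to choose the number of bits (NB) per variable.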
CONCLUSION
A study on word-length optimization of a speech enhancement noise-cancelling filter has been presented. The optimization has been carried out taking a set of spoken commands from a database as a reference. Initially, the upper and lower bounds of the variables involved in the algorithm were determined in floating-point calculation. These initial results show that the most critical variable is the filter adaptation step μ_m(n). The procedure used for this variable serves as a model to scale the rest of the variables. To properly optimize the length of each individual variable, an exhaustive simulation with all the spoken commands has been carried out. Comparing the clean trace produced with floating-point arithmetic with the one obtained using an optimally adjusted word-length, it can be concluded that the results are comparable. Finally, the longest data format after optimization was implemented for all the variables and contrasted with the data format optimized for each one of them. The quality of the results shows a high dependency on the tools and implementation devices when design methodologies based on high-level synthesis are used.
