The acoustic echo path is modeled using multiple small adaptive filters rather than one long adaptive filter. A new approach is proposed that decomposes the long adaptive filter into multiple low-order sub-filters whose error signals are independent of each other. This independence allows the sub-filters to adapt in parallel, which achieves our goal of increasing the convergence rate. Simulation results show that the proposed decomposed least-mean-square (LMS) adaptive algorithm converges significantly faster than the original long adaptive filter. The proposed algorithm is also compared with the decomposition-of-error technique, a multiple sub-filter approach used for acoustic echo cancellation. That technique uses multiple adaptive sub-filters whose error signals depend on each other, so parallelism is not achieved and, as a result, convergence is slower. Our proposed technique differs in that it relies on the independence of the error signals to obtain a faster convergence rate. The steady-state error of the proposed technique remains as high as that of the decomposition-of-error technique, but it is small compared with that of one long adaptive filter, as our simulation results show. A hardware implementation of the proposed technique using field-programmable gate arrays (FPGAs) is also presented. Filtering data in real time requires dedicated hardware to meet demanding timing requirements. If the statistics of the signal are not known, adaptive filtering algorithms can be used to estimate the signal statistics iteratively. The acoustic echo path is modeled using three adaptive sub-filters of order 10, each with a fixed step size of 0.05/3.
A sinusoidal input signal with additive white Gaussian noise (AWGN) at different signal-to-noise ratios (SNRs) is used to examine our approach.
Introduction:
The increase in data bandwidth for telecommunications has created a need for high-quality audio teleconferencing. Echo cancellers are a common feature of teleconferencing systems that operate "hands-free", whereby the users at each end of the conference can freely interact with each other. The purpose of an acoustic echo canceller in these applications is to reduce the amount of sound transmitted by the far end of a teleconference that returns to it. A common approach for estimating the impulse response of the acoustic echo path is the LMS algorithm. Acoustic echo originates from the coupling between the loudspeaker and the microphone in hands-free telephony and teleconferencing [1]. On systems that process data in real time, performance is often limited by the processing capability of the system [2]. Therefore, evaluating different architectures to determine the most efficient one is an important task.
Digital signal processing (DSP) has revolutionized the manner in which we manipulate data. The DSP approach has many advantages over traditional methods, and the devices used are inherently reconfigurable, leading to many possibilities. Modern computational power has given us the ability to process tremendous amounts of data in real time. DSP is found in a wide variety of applications, such as filtering, speech recognition, image enhancement, data compression, and neural networks, as well as functions that are impractical for analog implementation, such as linear-phase filters [3]. Signals from the real world are naturally analog in form and must therefore first be discretely sampled for a digital computer to understand and manipulate them. The signals are discretely sampled and quantized, and the data is represented in binary format so that the noise margin is overcome. This makes DSP algorithms insensitive to thermal noise.
Further, DSP algorithms are predictable and repeatable, bit for bit, given the same inputs. This has the advantage of easy simulation and short design time. Additionally, if a prototype is shown to function correctly, then subsequent devices will also function correctly. There are many advantages to hardware that can be reconfigured with different programming files. Dedicated hardware can provide the highest processing performance but is inflexible to change. Reconfigurable hardware devices offer both the flexibility of computer software and the ability to construct custom high-performance computing circuits [2]. The hardware can swap out configurations based on the task at hand, effectively multiplying the amount of physical hardware available. In space applications, it may be necessary to install new, unforeseen functionality into a system. For example, satellite applications need to be able to adjust to changing operational requirements [4]. With a reconfigurable chip, functionality that was not predicted at the outset can be uploaded to the satellite when needed.
A simplified model of the acoustic echo path is developed based on the idea that the propagation delay is caused by the finite speed of the sound wave, while reflections experience attenuation of high-frequency components and some energy loss. The response of the acoustic echo path is broken into frames according to the reflections received at the microphone, using the segmentation approach for modeling non-stationary processes [5]. Many signal processing applications call for adaptive filters with very long impulse responses. In acoustic echo cancellation, thousands of finite impulse response (FIR) filter coefficients may be required to model the echo path sufficiently. Moreover, the input data are often very strongly correlated, which causes slow convergence of most adaptation algorithms, such as the well-known normalized least-mean-square algorithm.
The requirements are particularly demanding for high-quality and/or multi-channel audio reproduction, so more sophisticated algorithms that take the input signal correlations into account have to be used [6]. Acoustic echo cancellers are usually realized as adaptive FIR filters requiring thousands of coefficients to model the echo return path accurately. This leads to an excessive computational burden and a slower convergence rate. One way to mitigate the slow convergence and computational cost of a long adaptive filter is decomposition. The idea is to distribute the load of adjusting a long adaptive filter among multiple low-order sub-filters, each updated individually by a separate adaptive algorithm. It is generally found that an adaptive LMS algorithm of lower order converges faster [7]. In most cases, the eigenvalue spread of the autocorrelation matrix decreases as the order of the filter decreases, except for white input [8]. We present a model of the acoustic echo path based on the segmentation approach. A new algorithm is proposed that decomposes the long adaptive filter into multiple low-order sub-filters that adapt in parallel. Simulation results show that the proposed decomposed LMS adaptive algorithm significantly improves the convergence rate.
Previous Approaches To Acoustic Echo Cancellation
In this section, we discuss different approaches used for acoustic echo cancellation and the main problems with each.
Echo Cancellation Using One Long Adaptive Filter
This approach, shown in figure (1), uses filtering algorithms that estimate the impulse response of the acoustic echo path, h(n), and filter the incoming signal from the far end, x(n) [9, 10]. The near-end input, d(n), e.g., from a microphone, contains both the far-end sound and the new near-end sound. The far-end sound is convolved with the estimate of h(n) and subtracted from y(n) before being sent to the far end. The main problems are that the loudspeaker-room-microphone path must be identified in order to cancel the acoustic echo, d(n), and that thousands of coefficients are required to model the echo return path accurately. These lead to an excessive computational burden and a slower convergence rate.
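As a minimal sketch of the single-filter approach, the following Python code identifies a short echo path with a standard LMS update written in the common form w ← w + μ·e·x. The function names (`lms_identify`, `fir`), the four-tap path `h_true`, the white-noise far-end signal, and the step size are illustrative assumptions, not values taken from this paper.

```python
import random


def fir(x, h):
    """Filter x with impulse response h (zero initial state)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def lms_identify(x, d, order, mu):
    """Standard LMS system identification; returns weights and error sequence."""
    w = [0.0] * order
    buf = [0.0] * order                  # delay line, newest sample first
    errors = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))       # estimated echo
        e = dn - y                                        # residual echo
        w = [wi + mu * e * bi for wi, bi in zip(w, buf)]  # LMS update
        errors.append(e)
    return w, errors


random.seed(0)
h_true = [0.8, -0.4, 0.2, 0.1]                     # hypothetical echo path
x = [random.gauss(0.0, 1.0) for _ in range(4000)]  # assumed white far-end signal
d = fir(x, h_true)                                 # echo at the microphone
w, errors = lms_identify(x, d, order=len(h_true), mu=0.02)
```

With a realistic echo path the same loop would need thousands of taps, which is the computational and convergence burden described above.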
Decomposition of Error Technique
This approach uses multiple adaptive sub-filters whose error signals are dependent on each other. The idea of decomposing the input signal vector and the weight vector into sub-vectors was first presented in [11]. Here, decomposition means partitioning the single long adaptive filter into multiple smaller sub-filters. Each sub-filter is updated by an individual adaptive algorithm. Different adaptive algorithms result depending on how the error signal is generated. The error signal can be obtained at each sub-filter stage and used for that stage's update. This arrangement is shown in figure (2).
Proceedings of the 6 th ICEENG Conference, 27-29 May, 2008 EE134 -6

Figure (2): Decomposition of error technique
The adaptation factor is chosen separately for each sub-filter. In the different-error mode, the algorithm appears to be fast at the beginning, but the steady-state error is high [12]. The main problems of this approach are its slower convergence rate and its large steady-state error.
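One way to picture dependent error signals is a cascade in which each stage adapts on what the previous stage left over. The sketch below assumes e_1(n) = d(n) − y_1(n) and e_i(n) = e_(i−1)(n) − y_i(n); this chaining, the function name `decomposed_error_lms`, the six-tap path, and the step size are assumptions made for illustration and are not taken from [11] or [12].

```python
import random


def fir(x, h):
    """Filter x with impulse response h (zero initial state)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def decomposed_error_lms(x, d, n_sub, sub_order, mu):
    """Cascaded-error sub-filter LMS: stage i adapts on e_i = e_(i-1) - y_i."""
    w = [[0.0] * sub_order for _ in range(n_sub)]
    buf = [0.0] * (n_sub * sub_order)    # shared delay line, newest first
    final_errors = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        e = dn
        for i in range(n_sub):           # stage i needs the error of stage i-1
            seg = buf[i * sub_order:(i + 1) * sub_order]
            y = sum(wi * si for wi, si in zip(w[i], seg))
            e = e - y
            w[i] = [wi + mu * e * si for wi, si in zip(w[i], seg)]
        final_errors.append(e)
    return w, final_errors


random.seed(1)
h_true = [0.8, -0.4, 0.3, 0.2, -0.1, 0.05]         # hypothetical echo path
x = [random.gauss(0.0, 1.0) for _ in range(5000)]  # assumed white input
d = fir(x, h_true)
w, errors = decomposed_error_lms(x, d, n_sub=3, sub_order=2, mu=0.01)
```

Because the inner loop carries `e` from one stage to the next, the stages cannot be evaluated independently, and the early stages keep adapting on an error that still contains the later segments' contribution.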
The Proposed Approach
The drawbacks of the previous approaches are their slower convergence rate and their large steady-state error. Our proposed approach tries to overcome these drawbacks. The main idea of our algorithm is to use multiple adaptive sub-filters whose error signals are independent of each other. In the proposed technique, the error signals $e_1(n)$, $e_2(n)$ and $e_3(n)$, shown in figure (3), are obtained by comparing the output of the FIR filter that represents the impulse response of the echo path in the near-end physical environment with the outputs of the multiple low-order adaptive sub-filters. These error signals are then averaged, using the average block, before being sent to the far-end system. Because the error signals are independent of each other, the sub-filters can adapt in parallel, which yields a faster convergence rate. This solves the problem of the slower convergence rate, but the steady-state error remains the same as that of the decomposition-of-error approach. This steady-state error is small compared with that of one long adaptive filter.
Figure (3): The proposed approach
From figure (3) we note that the error signals $e_1(n)$, $e_2(n)$ and $e_3(n)$ are independent of each other, as discussed above. The average block averages these three error signals before they are sent to the far-end system. Each sub-filter can be updated by an individual adaptive algorithm. The adaptation factor is chosen separately for each sub-filter.
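The structure described here can be sketched as follows, under stated assumptions: each sub-filter reads its own segment of a shared delay line and adapts on its own error e_i(n) = d(n) − y_i(n), and the three errors are averaged. The three sub-filters of order 10 and the step size 0.05/3 follow the values quoted in this paper; the segment assignment, the 1 kHz tone, the noise-free input, the 30-tap path `h_echo`, and all function names are hypothetical.

```python
import math


def fir(x, h):
    """Filter x with impulse response h (zero initial state)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def parallel_subfilter_lms(x, d, n_sub=3, sub_order=10, mu=0.05 / 3):
    """Each sub-filter adapts on its own error e_i(n) = d(n) - y_i(n)."""
    w = [[0.0] * sub_order for _ in range(n_sub)]
    buf = [0.0] * (n_sub * sub_order)     # shared delay line, newest first
    avg_errors = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        errs = []
        for i in range(n_sub):            # no data passes between iterations
            seg = buf[i * sub_order:(i + 1) * sub_order]
            y = sum(wi * si for wi, si in zip(w[i], seg))
            e = dn - y
            w[i] = [wi + mu * e * si for wi, si in zip(w[i], seg)]
            errs.append(e)
        avg_errors.append(sum(errs) / n_sub)   # the average block
    return w, avg_errors


fs = 16000.0                                       # sampling rate used in the paper
x = [math.sin(2 * math.pi * 1000.0 * n / fs) for n in range(3000)]  # assumed tone
h_echo = [0.5 * (0.9 ** k) for k in range(30)]     # hypothetical echo path
d = fir(x, h_echo)                                 # output of the FIR echo model
w, avg_errors = parallel_subfilter_lms(x, d)
```

Since no iteration of the inner loop uses a result from another, the three updates could run concurrently in hardware. AWGN is left out of this sketch so that the decay of the averaged error is easy to see.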
Note that the FIR filter must be obtained by system identification of the loudspeaker-room-microphone path. Acoustic echo can result from a combination of direct acoustic coupling and multipath effects, in which the sound wave is reflected from various surfaces and then picked up by the microphone. The reflected sound signal experiences attenuation, propagation delay and energy loss. The model of the acoustic echo path cannot be static unless the person and the objects in the environment do not move or change. Taking the characteristics of the reflected sound signal into account, a model for multiple reflections together with the direct path from loudspeaker to microphone can be obtained. The attenuation constants depend upon the size of the room and the surfaces from which the reflections occur.
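A multi-reflection model of this kind can be sketched as an impulse response made of a direct path plus delayed, attenuated, low-pass-filtered reflections. The helper names, the reflection delays, the attenuation values, the three-tap smoothing filter, and the speed of sound of 343 m/s are illustrative assumptions; only the 16,000 Hz sampling rate comes from the simulation setup in this paper.

```python
def delay_in_samples(extra_path_m, fs_hz=16000.0, speed_of_sound=343.0):
    """Propagation delay of a reflection, rounded to whole samples."""
    return int(round(extra_path_m / speed_of_sound * fs_hz))


def echo_path(reflections, direct_gain=1.0):
    """Build an impulse response from (delay, attenuation, lowpass_taps) tuples."""
    length = max([1] + [delay + len(taps) for delay, _, taps in reflections])
    h = [0.0] * length
    h[0] = direct_gain                    # direct loudspeaker-to-microphone path
    for delay, atten, taps in reflections:
        for k, t in enumerate(taps):      # each reflection is a short FIR frame
            h[delay + k] += atten * t
    return h


smooth = [0.25, 0.5, 0.25]                # hypothetical low-pass frame
h = echo_path([(8, 0.6, smooth), (20, 0.3, smooth)])
```

Each tuple corresponds to one frame of the segmented response, so the frame lengths give the natural orders for the sub-filters.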
Convergence Behavior of The Proposed Approach
Here we discuss only the adaptive filter portion of the acoustic echo canceller, not other components such as the double-talk detector and the residual echo suppressor. We assume that the person in the near-end physical environment is silent.
The input time series, x(n), is assumed to be a sinusoidal function with AWGN at different SNRs. In the expression for the microphone output, $H_j$ is the Wiener solution of the $j$th sub-filter, $X_j(n) = [x(n-(k_1+k_2+\dots+k_j)),\ x(n-(k_1+k_2+\dots+k_j+1)),\ \ldots]^T$ is the input sequence, M is the number of iterations taken into account, and $L_j$ is the length of the low-pass FIR filter corresponding to each reflection. Also, $H_0 = 1.0$ and $X_0(n) = x(n)$. The ambient noise is assumed independent of the sequence $X_j(n)$, and its variance, the minimum mean square error, is the expectation of the squared noise. $E[\cdot]$ represents the expectation operation. The order of each adaptive sub-filter is taken to be the same as the coefficient length of the corresponding reflection path in exact modeling.
The output of the adaptive sub-filter is given by
The error signals are defined as
The LMS adaptation of each sub-filter, $W_i(n)$, is given as
We define the weight error vector $V_i(n) = W_i(n) - H_i$.
Then the mean square error (MSE) under the well-known assumptions is given as
Note that for the MSE to converge, it is necessary that $E[V_i(n)]$ and
This gives the essential condition for convergence, but a tighter bound on the step size can still be obtained to confirm the stability and convergence of the sub-filters. Therefore we obtain the correlation matrix of the weight error vector by post-multiplying (7)
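As a rough numerical companion to the step-size discussion, the sketch below uses a standard textbook result rather than this paper's own bound: for the update W(n+1) = W(n) + μ·e(n)·X(n), convergence in the mean requires 0 < μ < 2/λ_max, and because λ_max cannot exceed the trace of the input autocorrelation matrix (filter order times input power), μ < 2/(order × power) is a more conservative practical limit. The function name and the noise-free 1 kHz test tone are assumptions.

```python
import math


def conservative_step_bound(x, order):
    """Upper limit 2 / (order * input power) for the LMS step size."""
    power = sum(v * v for v in x) / len(x)
    return 2.0 / (order * power)


fs = 16000.0
# 1600 samples cover exactly 100 periods of the assumed 1 kHz tone.
x = [math.sin(2 * math.pi * 1000.0 * n / fs) for n in range(1600)]
bound_sub = conservative_step_bound(x, order=10)    # one order-10 sub-filter
bound_long = conservative_step_bound(x, order=30)   # one order-30 long filter
mu_paper = 0.05 / 3                                 # step size quoted in the paper
```

Under these assumptions the quoted step size sits well inside the limit for an order-10 sub-filter, and the limit itself shrinks as the filter order grows, which is consistent with the motivation for using low-order sub-filters.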
Loadable Coefficient Filter Taps
The heart of any digital filter is the filter tap. This is where the multiplications take place, and it is therefore the main bottleneck in implementation. Many different schemes for fast multiplication in FPGAs have been devised, such as distributed arithmetic, serial-parallel multiplication, and Wallace trees [13], to name a few. Some, such as the distributed arithmetic technique, are optimized for situations where one of the multiplicands remains constant, and are referred to as constant coefficient multipliers (KCMs) [14]. Though this holds for standard digital filters, it is not the case for an adaptive filter, whose coefficients are updated with each discrete time sample. Consequently, an efficient digital adaptive filter demands taps with a fast variable coefficient multiplier (VCM). A VCM can, however, obtain some of the benefits of a KCM by being designed as a KCM that can reconfigure itself. In this case it is known as a dynamic constant coefficient multiplier (DKCM) and is a middle ground between KCMs and VCMs [14]. A DKCM offers the speed of a KCM and the reconfigurability of a VCM, although it uses more logic than either. This is a necessary price to pay, however, for an adaptive filter.
A Multiplier
An approach to multiplication that uses a full adder, an 8-bit ripple-carry adder, a positive-edge-triggered D flip-flop with asynchronous clear, an 8-bit register, an 8-bit multiplexer, a zero detector, a variable-width shift register and a Moore state machine is shown in Figure (4). This figure shows a schematic that describes the interconnection of all the components of the multiplier. Notice that the schematic comprises two halves: an 8-bit-wide datapath section (consisting of the registers, adder, multiplexer, and zero detector) and a control section (the finite-state machine). The arrows in the schematic denote the inputs and outputs of each component. VHDL has strict rules about the direction of connections.
Figure (4): The multiplier used for our proposed approach
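The listed components (adder, registers, multiplexer, zero detector, shift registers, state machine) are the usual parts of a sequential shift-and-add multiplier. The Python model below is a behavioral sketch under that assumption; it is not derived from the schematic, and the function name, the unsigned operands, and the truncation of the result to the 8-bit datapath width are hypothetical choices.

```python
def shift_add_multiply(a, b, width=8):
    """Unsigned shift-and-add multiply; the result is kept to `width` bits."""
    mask = (1 << width) - 1
    multiplicand = a & mask          # shifted left once per cycle
    multiplier = b & mask            # shifted right once per cycle
    acc = 0                          # accumulator register
    cycles = 0
    while multiplier != 0:           # zero detector tells the FSM to stop
        if multiplier & 1:           # multiplexer: add multiplicand or nothing
            acc = (acc + multiplicand) & mask
        multiplicand = (multiplicand << 1) & mask
        multiplier >>= 1
        cycles += 1
    return acc, cycles


product, cycles = shift_add_multiply(13, 11)
```

In this model the number of cycles depends on the position of the highest set bit of the multiplier, which is one reason a zero detector is useful in the control loop.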
Hardware Verification
The proposed design was thoroughly tested on the FPGA. Both the VHDL design that uses only the FPGA fabric and the hybrid designs, which use the FPGA fabric for filtering and the PowerPC for the training algorithm, were tested. To test the validity of the hardware results, an Avnet Virtex-II Pro Development kit with a Xilinx XC2VP20 FPGA was used. This board can plug directly into the PCI slot of a host computer for fast data transfer over the PCI bus. The included PCI Utility enabled this as well as quick FPGA reconfiguration over the PCI bus.
Simulation Results
Matlab Simulation
We examined our approach using a sinusoidal input signal, x(n), with AWGN at different SNRs. The echo path impulse response, h(n), in the near-end physical environment was measured at a distance of 1.0 ft using a computer loudspeaker and a unidynamic microphone. The sampling frequency of the simulation is 16,000 Hz. We assume that the low-pass FIR filter which represents the impulse response of the echo path
Figure (8): a) The input signal, b) the output of the FIR filter, c) the average output of the three adaptive filters, for SNR = 20 dB
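A test input of the kind used in these simulations, a sinusoid with AWGN at a chosen SNR, can be generated as in the sketch below. The 16,000 Hz sampling rate and the 20 dB SNR follow the text; the 1 kHz tone frequency, the unit amplitude, the seed, and the function names are assumptions.

```python
import math
import random


def noisy_sine(n_samples, freq_hz, fs_hz, snr_db, seed=0):
    """Unit-amplitude sinusoid plus white Gaussian noise at the requested SNR."""
    rng = random.Random(seed)
    signal_power = 0.5                       # power of a unit-amplitude sinusoid
    noise_std = math.sqrt(signal_power / 10 ** (snr_db / 10.0))
    clean = [math.sin(2 * math.pi * freq_hz * n / fs_hz)
             for n in range(n_samples)]
    noisy = [s + rng.gauss(0.0, noise_std) for s in clean]
    return clean, noisy


def measured_snr_db(clean, noisy):
    """SNR estimated from the clean signal and the added noise."""
    ps = sum(s * s for s in clean)
    pn = sum((y - s) ** 2 for s, y in zip(clean, noisy))
    return 10.0 * math.log10(ps / pn)


clean, noisy = noisy_sine(16000, 1000.0, 16000.0, snr_db=20.0)
snr = measured_snr_db(clean, noisy)
```

Repeating the call with other `snr_db` values gives the family of inputs needed to compare convergence across noise levels.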
The MSE, defined in equation (6), versus the number of iterations M in the proposed algorithm is compared with that of the previous techniques for different SNRs, as shown in figures (9) through (12). From figures (9) through (12), we conclude that our proposed technique needs a small number of iterations to reach the steady-state error, whereas the decomposition-of-error technique and the single long adaptive filter need a large number of iterations. Our proposed technique therefore achieves a faster convergence rate. This is expected, as it uses multiple adaptive sub-filters with independent error signals that adapt in parallel. This is, as discussed before, the key to the improvement of our proposed technique over the previous techniques. Our simulations also show that the steady-state error of our proposed approach is the same as that of the decomposition-of-error technique and is small compared with that of one long adaptive filter.
The algorithms for adaptive filtering were coded in Matlab and experimented with to determine optimal parameters, such as the learning rate for the LMS algorithm. Next, the algorithms were converted to a fixed-point representation and, finally, coded for the Virtex-II Pro. The algorithms were converted so that all internal calculations are done in a fixed-point number representation. This is necessary because the embedded PowerPC has no floating-point unit (FPU) and FPGAs do not natively support floating point either. Although an FPU could be designed in an FPGA, FPUs are resource intensive and can therefore feasibly support only sequential operations. Doing so, however, would fail to take full advantage of the FPGA's major strength, which is high parallelization. Figures (13) through (16) show the timing diagrams of the proposed approach at different SNRs.
We perform a rounding operation on the signals before converting them into binary format. In figures (13) through (16), "desired" is the output of the FIR filter after conversion to binary format, and "data out" is the average output of the adaptive filters after conversion to binary format. Note that these binary numbers begin to converge toward each other after 1 µs; this is due to the rounding operation that we performed on the signals before the conversion into binary format. Table (2) gives the device utilization on the 2VP2fg256 for our proposed approach, showing that the approach has an efficient realization.
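The round-then-convert step can be illustrated with a small fixed-point helper. The word length is not stated in the text, so the 8-bit signed format with 7 fractional bits used here is an assumption, as are the function names and the choice to saturate out-of-range values.

```python
def to_fixed(value, frac_bits=7, total_bits=8):
    """Round to a signed fixed-point code, saturating at the word limits."""
    # Python's round() resolves exact ties to the nearest even integer.
    scaled = int(round(value * (1 << frac_bits)))
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, scaled))


def to_bits(code, total_bits=8):
    """Two's-complement bit string of a fixed-point code."""
    return format(code & ((1 << total_bits) - 1), "0{}b".format(total_bits))


def from_fixed(code, frac_bits=7):
    """Real value represented by a fixed-point code."""
    return code / float(1 << frac_bits)


code = to_fixed(0.3)
bits = to_bits(code)
rounding_error = abs(from_fixed(code) - 0.3)
```

For in-range values the rounding error in this format is at most half of one least significant bit, that is 1/256, so two signals that differ by less than this can map to the same binary word.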
Conclusions:
Acoustic echo cancellation based on multiple adaptive sub-filters, as in our proposed technique, shows improved performance. The technique achieves a faster convergence rate than both the decomposition-of-error method and a single long adaptive filter. The key to this improvement is that our proposed algorithm relies on the independence of the error signals, which allows the sub-filters to adapt in parallel.
The least-mean-square algorithm was found to be an efficient training algorithm for FPGA-based adaptive filters. The resources used by the proposed approach show the efficiency of this algorithm for FPGA realization. The issue of whether to train in hardware or software is based on power specifications and depends on the complete system being designed. While the extra power consumed would make the PowerPC seem unattractive, as part of a larger embedded system it could be practical. If many processes can share the PowerPC, then the extra power would be offset by the extra hardware that it makes unnecessary. With no microprocessor, a finite-state machine for timing as well as a memory interface is needed, and these will consume more power, although still less than the PowerPC. Lastly, the microprocessor can be used to easily swap out software training algorithms for application testing and evaluation. Embedded microprocessors within FPGAs are opening up many new possibilities for hardware engineers, although they require a new design process. The future of embedded system-on-chip design will involve more precisely determining the optimal hardware and software tradeoffs for the functionality needed.
