ABSTRACT
Introduction
The performance of real-time data processing is often limited by the processing capability of the system. Therefore, evaluating different digital signal processing platforms to determine the most efficient one is an important task. There have been many discussions regarding the preference for Digital Signal Processors (DSPs) or Field Programmable Gate Arrays (FPGAs) in real-time noise cancellation. The purpose of this work is to study the features of DSPs and FPGAs with respect to their power consumption, speed, architecture and cost. DSP is found in a wide variety of applications, such as filtering, speech recognition, image enhancement, data compression and neural networks, as well as analog linear-phase filters. Signals from the real world are received in analog form and then discretely sampled so that a digital computer can understand and manipulate them. Hardware that can be reconfigured with different programming has many advantages. Reconfigurable hardware devices offer both the flexibility of computer software and the ability to construct custom high-performance computing circuits. In space applications, it may be necessary to install new, previously unforeseen functionality into a system. For example, satellite applications need to adjust to changing operational requirements. With a reconfigurable chip, functionality that was not predicted at the outset can be uploaded to the satellite when needed. To test adaptive noise cancelling, the least mean square (LMS) approach has been used. Besides the standard LMS algorithm, the modified algorithms proposed by Stefano [1] and by Das [2] have been implemented for noise cancellation, giving the opportunity to compare both platforms with respect to their speed, noise, architecture, cost, and power.
Adaptive Filter Design on Motorola DSP56300
Adaptive filters have the ability to adjust their own parameters and coefficients automatically. Hence, their design requires little or no prior knowledge of the input signal or noise characteristics of the system. Adaptive filters have two inputs, x(n) and d(n), which are usually correlated in some manner. Figure 1 gives the basic concept of the adaptive filter. The filter's output y(n), which is computed with the parameter estimates, is compared with the input signal d(n). The resulting prediction error e(n) is fed back through a parameter adaptation algorithm that produces a new estimate for the parameters, and as the next input sample is received, a new prediction error can be generated. The adaptive filter features minimum prediction error. Two aspects of the adaptive filter are its internal structure and its adaptation algorithm. Its internal structure can be either that of a nonrecursive (FIR) filter or that of a recursive (IIR) filter. Adaptation algorithms can be divided into two major classes: gradient algorithms and nongradient algorithms. A gradient algorithm is used to adjust the parameters of the FIR filter. The least mean square (LMS) algorithm is the most widely applied gradient algorithm. It adjusts the filter's parameters to minimize the mean-square error between the filter's output y(n) and the desired response input d(n) [3]. When an adaptive filter is implemented on the DSP56300 processor, the following processor features are used: an address pointer to mimic FIFO (First-In-First-Out)-like shifting of the RAM data; modulo addressing to provide wrap-around data buffers; a multiply/accumulate (MAC) instruction to both multiply two operands and add the product to a third operand in a single instruction cycle; data moves in parallel with the MAC instruction to keep the multiplier running at 100% capacity; and the Repeat Next Instruction (REP) to provide compact filter code.
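The feedback loop described above can be illustrated with a minimal Python sketch of an LMS-adapted FIR filter. It is only an illustration under stated assumptions, not the DSP56300 implementation: the function name `lms_filter`, the two-tap test system and the step size are hypothetical, and the update uses the common form w ← w + 2µ e(n) x(n).

```python
import random


def lms_filter(x, d, num_taps=4, mu=0.05):
    """Adapt an FIR filter so that its output y(n) tracks the input d(n)."""
    w = [0.0] * num_taps        # filter parameters (coefficients)
    state = [0.0] * num_taps    # x(n), x(n-1), ..., newest sample first
    outputs, errors = [], []
    for xn, dn in zip(x, d):
        state = [xn] + state[:-1]                      # shift in the new sample
        y = sum(wi * si for wi, si in zip(w, state))   # filter output y(n)
        e = dn - y                                     # prediction error e(n)
        # Gradient step: move each parameter along e(n) * x(n - i).
        w = [wi + 2.0 * mu * e * si for wi, si in zip(w, state)]
        outputs.append(y)
        errors.append(e)
    return w, outputs, errors


# Hypothetical test: d(n) is produced by an unknown two-tap system.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
d = [0.5 * x[n] - 0.25 * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]
w, y, e = lms_filter(x, d, num_taps=2, mu=0.05)
```

In this noise-free setting the parameters approach the coefficients of the assumed system and the prediction error shrinks towards zero.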
The processor's capability to perform modulo addressing allows an address register (Rn) value to be incremented (or decremented) and yet remain within an address range of size L, where L is defined by a lower and an upper boundary. For the adaptive FIR filter, L is the number of coefficients (taps). The value L-1 is stored in the processor's Modifier Register (Mn). The upper address boundary is calculated by the processor and is not stored in a register. When modulo addressing is used, the Address Register (Rn) points to a modulo data buffer located in X-Memory and/or Y-Memory. The address pointer (Rn) is not required to point at the lower address boundary; it can point anywhere within the defined modulo address range L. If the address pointer increments past the upper address boundary (base address plus L-1 plus 1), it will wrap around to the base address. Modulo Register M1 is programmed to the value NTAPS-1 (modulo NTAPS). Address Register R1 is programmed to point to the state variable modulo buffer located in X-Memory. Modulo Register M4 is programmed to the value NTAPS-1. Address Register R4 is programmed to point to the coefficient buffer located in Y-Memory. Given that the FIR filter algorithm has been executing for some time and is ready to process the input sample x(n) in the Data ALU Input Register X0, the address in R4 is the base address (lower boundary) of the coefficient buffer. The address in R1 is M, where M is greater than or equal to the lower boundary and less than or equal to the upper boundary of the X-Memory buffer. The X-Memory map for the filter states, the Y-Memory map for the coefficients, and the contents of the processor's A and B Accumulators and Data ALU Input Registers X0, X1, Y0 and Y1 are shown in Figure 2.
The CLR instruction clears the A-Accumulator and simultaneously moves the input sample x(n) from the Data ALU's Input Register X0 to the X-Memory location pointed to by Address Register R1, and moves the first coefficient from the Y-Memory location pointed to by Address Register R4 to the Data ALU's Input Register Y0. Both Address Registers R1 and R4 are automatically incremented by one at the end of the CLR instruction (post-incremented). The REP instruction regulates execution of NTAPS-1 iterations of the MAC instruction. The MAC instruction multiplies the filter state variable in X0 by the coefficient in Y0, adds the product to the A-Accumulator and simultaneously moves the next state variable from the X-Memory location pointed to by Address Register R1 to the Input Register X0, and moves the next coefficient from the Y-Memory location pointed to by Address Register R4 to Input Register Y0. Both Address Registers R1 and R4 are automatically incremented by one at the end of the MAC instruction (post-incremented).
During the execution of the filter algorithm, Address Register R4 is post-incremented a total of NTAPS times: once in conjunction with the CLR instruction and NTAPS-1 times (due to the REP instruction) in conjunction with the MAC instruction. Since the modulus for R4 is NTAPS and R4 is incremented NTAPS times, the address value in R4 wraps around and points to the coefficient buffer's lower boundary location [3]. Address Register R1 is also post-incremented a total of NTAPS times: once in conjunction with the CLR instruction and NTAPS-1 times (due to the REP instruction) in conjunction with the MAC instruction. At the beginning of the algorithm, the input sample x(n) is moved from the Data ALU Input Register X0 to the X-Memory location pointed to by R1. Since the modulus for R1 is NTAPS and R1 is incremented NTAPS times, the address value in R1 wraps around and points to the state variable buffer's X-Memory location M. The MACR instruction calculates the final tap of the filter algorithm and performs convergent rounding of the result. The data move portion of this instruction loads the input sample x(n) into the B-Accumulator. At the end of the MACR instruction, the accumulator contains the filter output sample y(n) as shown in Figure 3.
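The pointer behaviour described above can be mimicked in a few lines of Python. This is a rough sketch rather than DSP56300 code: the class `ModuloFir` and its attribute names are hypothetical, ordinary lists stand in for the X-Memory and Y-Memory buffers, and the `%` operator stands in for the hardware modulo addressing. The final pointer decrement corresponds to the FIFO-like pointer adjustment mentioned in this section.

```python
class ModuloFir:
    """FIR filter whose state buffer is addressed modulo NTAPS."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)           # stands in for the Y-Memory coefficient buffer
        self.ntaps = len(self.coeffs)
        self.states = [0.0] * self.ntaps     # stands in for the X-Memory state buffer
        self.r1 = 0                          # state pointer (role of R1, modulus NTAPS)

    def step(self, x):
        r1, r4 = self.r1, 0                  # r4 starts at the coefficient base address
        acc = 0.0                            # CLR: clear the accumulator ...
        self.states[r1] = x                  # ... and store x(n) where R1 points
        for _ in range(self.ntaps):          # one multiply/accumulate per tap
            acc += self.states[r1] * self.coeffs[r4]
            r1 = (r1 + 1) % self.ntaps       # post-increment with wrap-around
            r4 = (r4 + 1) % self.ntaps
        # After NTAPS increments both pointers have wrapped back to their start.
        assert r1 == self.r1 and r4 == 0
        # Step the state pointer back by one so that the next input sample
        # overwrites the oldest state instead of shifting the whole buffer.
        self.r1 = (r1 - 1) % self.ntaps
        return acc


fir = ModuloFir([1.0, 2.0, 3.0])
impulse_response = [fir.step(v) for v in [1.0, 0.0, 0.0, 0.0]]
```

Feeding a unit impulse returns the coefficients in order, which is a quick check that the wrap-around addressing pairs each state with the right coefficient.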
The two MOVE instructions transfer the loop gain K to the data register Y1 and the error sample e(n) to the Data Input Register X1. The first MOVE instruction in the "do loop" transfers the parameter b_i(n) to the A-Accumulator and the filter state x(n-i) to the Data Input Register X0. Address Register R1 is incremented by one to point to the next filter state. The MAC instruction multiplies the filter state, in X0, by the product of the loop gain and the error sample, in Y1, and adds the product to the A-Accumulator. The result in the A-Accumulator is the updated parameter b_i(n+1). The second MOVE instruction in the "do loop" transfers the parameter b_i(n+1) to the Y-Memory location pointed to by the Address Register R4. R4 is incremented by one to point to the next filter parameter as shown in Figure 4. The LUA instruction decrements R1 by one, and R1 then points to the state variable buffer's X-Memory location M-1. When the algorithm is executed again, a new (next) input sample x(n+1) will overwrite the value in X-Memory location M-1. Thus FIFO-like shifting of the filter state variables is accomplished by adjusting the R1 address pointer as shown in Figure 5. Consider the problem of finding the linear minimum mean square estimate (LMMSE) of a zero-mean signal vector, S, from a noisy zero-mean data vector, X = S + N, where N denotes the additive noise vector. An LMMSE of S is given in Equation (1), where A denotes a matrix of filter coefficients as given in Equation (2).
Here, $C_{SS}$ and $C_{nn}$ denote the covariance matrices of signal and noise, respectively. Notice that if X has a non-zero mean vector, μ, Equation (1) becomes:
For point-wise processing of a non-stationary signal with a local mean, $\mu_S$, and local variance, $\sigma_S^2$, and with the noise assumed to be zero-mean and white with variance $\sigma_n^2$, the point-wise LMMSE will be given by:
$\sigma_n^2$ is constant, while $\sigma_S^2$ and $\mu_S$ vary with the time index, k. Thus the filtered estimate at time k can be written as:
where $\mu_S(k)$ and $\sigma_S^2(k)$ denote the time-varying estimates of the local mean and local variance of S(k). An improved version of Lee's adaptive Wiener filter has been proposed by Das [4]. The main contributions of this algorithm include a better technique for estimation of the noise variance, and incorporation of a data window for adaptive filtering. Lee's adaptive Wiener filter suffers from two major drawbacks. First, it requires prior knowledge of the noise power and second, its performance deteriorates when the signal-to-noise ratio (SNR) is low and the noise power is imprecisely known. The improved Wiener filter incorporates two modifications. First, the de-noising performance of the filter is improved by introducing a non-rectangular window to process weighted data samples and second, a scheme for online estimation of the noise power is incorporated, which is based on analyzing the power spectral density, S(ω), of the data. Assuming that the observed data consists of predominantly low-frequency signal components and additive white noise, S(ω) can be modeled as a sum of the spectral density of the signal and a constant, $\sigma_n^2$, which represents the variance of the noise. The estimated $\sigma_n^2$ is the average value of the high-frequency section of S(ω) [2]. The improved Wiener filter can be derived in a fashion similar to that of Lee's Wiener filter, but Equation (2) now takes the form S = AWX, where A denotes a matrix of filter coefficients, and W is a (diagonal) data weighting matrix. The LMMSE of S is now given by Equation (6), where $X_W = WX$, and the point-wise LMMSE is obtained similarly from the weighted data.
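As a rough illustration of the point-wise filter, the following Python sketch uses the commonly quoted local-statistics form, estimate = local mean + gain × (sample − local mean) with gain = σ_S² / (σ_S² + σ_n²). This form is assumed here because the equations themselves are not reproduced in this section. The function name `lee_wiener`, the window length and the optional `weights` argument (a simple stand-in for a non-rectangular data window, not Das's exact formulation) are all hypothetical, and the noise variance is taken as a given parameter rather than estimated from the power spectral density.

```python
def lee_wiener(x, noise_var, half_width=2, weights=None):
    """Point-wise adaptive Wiener estimate using local mean and variance."""
    n_win = 2 * half_width + 1
    if weights is None:
        weights = [1.0] * n_win              # rectangular window
    if len(weights) != n_win:
        raise ValueError("weights must have 2 * half_width + 1 entries")
    out = []
    for k in range(len(x)):
        start = k - half_width
        # Keep only the window positions that fall inside the data record.
        pairs = [(weights[j - start], x[j])
                 for j in range(start, k + half_width + 1)
                 if 0 <= j < len(x)]
        wsum = sum(w for w, _ in pairs)
        mean = sum(w * v for w, v in pairs) / wsum               # local mean
        var_x = sum(w * (v - mean) ** 2 for w, v in pairs) / wsum
        var_s = max(var_x - noise_var, 0.0)                      # local signal variance
        total = var_s + noise_var
        gain = var_s / total if total > 0.0 else 0.0
        out.append(mean + gain * (x[k] - mean))
    return out
```

Where the local variance is no larger than the noise variance the gain drops to zero and the output is just the local mean; where the signal dominates, the gain approaches one and the sample passes through almost unchanged.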
FPGAs Adaptive Filter Design
The efficient realization of complex algorithms on FPGAs requires a familiarity with their specific architectures. The modifications needed to implement an algorithm on an FPGA and also the specific architectures for adaptive filtering and their advantages are given below.
FPGA Realization Issues
FPGAs are ideally suited for the implementation of adaptive filters. However, there are several issues that need to be addressed. When performing software simulations of adaptive filters, calculations are normally carried out with floating point precision. Unfortunately, the resources required of an FPGA to perform floating point arithmetic are normally too large to be justified. Another concern is the filter tap itself. Numerous techniques have been devised to efficiently calculate the convolution operation when the filter's coefficients are fixed in advance. For an adaptive filter whose coefficients change over time, these methods will not work or need to be modified significantly [5]. The reconfigurable filter tap is the most important issue for a high performance adaptive filter architecture, and as such it will be discussed at length.
Finite Precision Effects
Although computing floating point arithmetic in an FPGA is possible, it is usually accomplished with the inclusion of custom floating point units, which are costly in terms of logic resources. Therefore, only a small number of floating point units can be used in the entire design, and they must be shared between processes. This does not take full advantage of the parallelization that is possible with FPGAs and is therefore not the most efficient method. All calculations should therefore be mapped into fixed point only, but this can introduce some errors. The main errors in DSP include ADC quantization error, coefficient quantization error, overflow error caused by an impermissible word length, and round off error. The other three issues will be addressed later.
Scale Factor Adjustment
A suitable compromise for dealing with the loss of precision when transitioning from a floating point to a fixed-point representation is to keep a limited number of decimal digits. Normally, two to three decimal places are adequate, but the number required for a given algorithm to converge must be found through experimentation. Suppose, for example, that software simulations of a digital filter show that two decimal places are sufficient for accurate data processing. This can easily be obtained by multiplying the filter's coefficients by 100 and truncating to an integer value. Dividing the output by 100 recovers the anticipated value. Since multiplying and dividing by powers of two can be done easily in hardware by shifting bits, a power of two can be used to simplify the process. In this case, one would multiply by 128, which would require seven extra bits in hardware. If it is determined that three decimal digits are needed, then ten extra bits would be needed in hardware, while one decimal digit requires only four bits. For simple convolution, multiplying by a preset scale and then dividing the output by the same scale has no effect on the calculation. For a more complex algorithm, there are several modifications that are required for this scheme to work [6]. The first change needed to maintain the original algorithm's consistency requires dividing by the scale constant any time two previously scaled values are multiplied together. Consider, for example, the values a and b and the scale constant s; the scaled integer values are represented by $a \cdot s$ and $b \cdot s$. To multiply these values requires dividing by s to correct for the $s^2$ term that would be introduced and recover the scaled product: $(a \cdot s)(b \cdot s)/s = ab \cdot s$.
Likewise, division must be corrected with a subsequent multiplication. It should now be evident why a power of two is chosen for the scale constant, since multiplication and division by power of two results in simple bit shifting. Addition and subtraction require no additional adjustment. The aforementioned procedure must be applied with caution, however, and does not work in all circumstances. While it is perfectly legal to apply to the convolution operation of a filter, it may need to be tailored for certain aspects of a given algorithm. Consider the tap-weight adaptation equation for the LMS algorithm in Equation (9).
where μ is the learning rate parameter; its purpose is to control the speed of the adaptation process. The LMS algorithm is convergent in the mean square provided that μ satisfies the condition in Equation (10), where $\lambda_{max}$ is the largest eigenvalue of the correlation matrix $R_x$ of the filter's input. Typically μ is a fractional value, and its product with the error term has the effect of keeping the algorithm from diverging. If µ is blindly multiplied by some scale factor and truncated to a fixed-point integer, it will take on a value greater than one. The effect will be to make the LMS algorithm diverge, as its inclusion will now amplify the added error term. The heuristic adopted in this case is to divide by the inverse value, which will be greater than one. Similarly, division by values smaller than one should be replaced by multiplication with the inverse. The outputs of the algorithm will then need to be divided by the scale to obtain the true output. The following algorithm describes the fixed point conversion:
Determine Scale: through simulations, find the needed accuracy (# decimal places).
Scale = accuracy rounded up to a power of two.
Multiply all constants by scale.
Divide by scale whenever two scaled values are multiplied together.
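The scaling rules above can be sketched in Python with plain integers. The helper names (`bits_for_decimals`, `to_fixed`, `fixed_mul`, `fixed_div`, `apply_small_gain`) are hypothetical, and the sketch assumes a scale of 128 (seven extra bits) together with a learning rate that is a negative power of two, so that dividing by its inverse becomes a right shift.

```python
def bits_for_decimals(decimal_places):
    """Extra bits needed so that 2**bits is at least 10**decimal_places."""
    return (10 ** decimal_places - 1).bit_length()


SCALE_BITS = bits_for_decimals(2)   # two decimal places -> 7 bits
SCALE = 1 << SCALE_BITS             # scale constant s = 128


def to_fixed(value):
    """Multiply a real constant by the scale and truncate to an integer."""
    return int(value * SCALE)


def fixed_mul(a_s, b_s):
    """Multiply two scaled integers, then shift right to remove the extra factor of s."""
    return (a_s * b_s) >> SCALE_BITS


def fixed_div(a_s, b_s):
    """Divide two scaled integers, shifting left first so the quotient stays scaled."""
    return (a_s << SCALE_BITS) // b_s


def apply_small_gain(value_s, inverse_gain_bits):
    """Apply a gain of 2**-inverse_gain_bits by dividing by its inverse (a right shift)."""
    return value_s >> inverse_gain_bits


product = fixed_mul(to_fixed(0.5), to_fixed(0.25))    # scaled form of 0.125
quotient = fixed_div(to_fixed(0.5), to_fixed(0.25))   # scaled form of 2.0
```

Dividing `product` or `quotient` by `SCALE` recovers the real-valued result. Note that Python's right shift rounds negative values towards minus infinity, which may differ from a particular hardware truncation scheme.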
Training Algorithm Modification
The training algorithms for the adaptive filter need some minor modifications in order to converge for a fixed-point implementation. Changes to the LMS weight update equation were discussed in the previous section.
Specifically, the learning rate µ and all other constants should be multiplied by the scale factor. When µ is adjusted it takes the form in Equation (11). With the µ modification, the weight update equation can be modified as in Equation (12). Figure 6 shows the direct and Figure 7 shows the transposed FIR structures for a three tap filter. The relevant nodes have been labeled A, B and C for a data flow analysis. Each filter has three coefficients, labeled h_0[n], h_1[n] and h_2[n]. The coefficients' subscript denotes the relevant filter tap, and the index n represents the time index, which is required since adaptive filters adjust their coefficients at every time instance.
The direct FIR structure is shown in Figure 6 and the output y at any time n is given by Equation (13), where nodes B and C are described in Equations (14) and (15) respectively.
The transposed FIR structure is shown in Figure 7 and the output y at any time n is given below.
Compared with the direct FIR output, the difference in the [n-k] index of the coefficient indicates that the filters produce equivalent output only when the coefficients don't change with time. This means that if the transposed FIR architecture is used, the LMS algorithm will not converge in the same way as with the direct implementation [7]. The change needed was to account for the weights as shown in Equation [...]. [...] traditional LMS [...], due to the [...] transposed form FIR, when [...] required.
The direct form FIR has a delay determined by the depth of the output adder tree, which is dependent on the filter's order. The transposed FIR, on the other hand, has a delay of only one multiplier and one adder, regardless of the filter length. It is therefore advantageous to use the transposed form for FPGA implementation to achieve maximum bandwidth.
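The difference between the two structures can be checked numerically. The Python sketch below assumes the usual data flow for each form: the direct form computes y[n] = Σ h_k[n] x[n-k], while the transposed form stores partial sums in registers, so each product uses the coefficient value that was current when its input sample arrived. The function names and the `coeff_at` callback (returning the coefficient list at time n) are hypothetical.

```python
def direct_fir(x, coeff_at):
    """Direct form: every tap uses the coefficients of the current time n."""
    y = []
    for n in range(len(x)):
        h = coeff_at(n)
        y.append(sum(h[k] * (x[n - k] if n - k >= 0 else 0.0)
                     for k in range(len(h))))
    return y


def transposed_fir(x, coeff_at):
    """Transposed form: products are formed on arrival and carried in registers."""
    ntaps = len(coeff_at(0))
    regs = [0.0] * (ntaps - 1)     # delay registers (nodes B and C for three taps)
    y = []
    for n in range(len(x)):
        h = coeff_at(n)
        products = [h[k] * x[n] for k in range(ntaps)]
        y.append(products[0] + (regs[0] if regs else 0.0))
        # Each register takes its own tap product plus the next register's old value.
        regs = [products[k + 1] + (regs[k + 1] if k + 1 < ntaps - 1 else 0.0)
                for k in range(ntaps - 1)]
    return y


x = [1.0, 2.0, 3.0, 4.0, 5.0]
fixed = lambda n: [1.0, 2.0, 3.0]            # coefficients constant in time
varying = lambda n: [1.0, float(n), 0.0]     # h_1 changes at every sample
same = direct_fir(x, fixed) == transposed_fir(x, fixed)
differ = direct_fir(x, varying) != transposed_fir(x, varying)
```

With fixed coefficients the two outputs agree sample for sample; once a coefficient changes over time, the transposed form effectively applies h_1[n-1] and h_2[n-2], and the outputs diverge.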
Implementing Adaptive Noise Filter with FPGAs
Adaptive noise filtering techniques are applied to low frequency signals such as voice, and to high frequency signals such as video streams, modulated data, and multiplexed data coming from an array of sensors. Unfortunately, in all high frequency and high speed applications, a software implementation of adaptive noise filtering usually doesn't meet the required processing speed, unless a high end DSP processor is used. A convenient solution is a dedicated hardware implementation using a Field Programmable Gate Array (FPGA). In this case the limiting factor is the number of multiplications required by the adaptive noise cancellation algorithm. By using a novel modified version of the LMS algorithm, the proposed implementation allows the use of a reduced number of hardware multipliers. Moreover, experimental data showed that the modified algorithm achieves the same or even better performance than the standard LMS version. There are many possible implementations for an adaptive noise filter, but the most widely used employs a Finite Impulse Response (FIR) digital filter, whose coefficients are iteratively updated using the LMS algorithm. The algorithm is described in Equations (24) to (26), leading to the evaluation of the FIR output, the error, and the weights update.
In the above equations, X_i is a vector containing the reference noise samples, D_i is the primary input signal, W_i is the filter weights vector at the i-th iteration, and e_i is the error signal. The µ coefficient is often empirically chosen to optimize the learning rate of the LMS algorithm. The hardware implementation of the algorithm in an FPGA device is not trivial, since the FIR filter does not have constant coefficients, so multipliers cannot be synthesized by using a look-up table (LUT) based approach, which would otherwise be straightforward in an FPGA architecture. Multipliers with changing inputs instead need to be built by using a significantly greater number of internal logic resources (either elementary logic blocks or embedded multipliers). In an Nth order filter the algorithm requires at least 2N multiplications and 2N additions. Note that the factor 2µ is usually chosen to be a power of two so that it can be executed by shifting. The multiplication count makes a fully parallel hardware implementation of the algorithm impractical as the value of N grows, due to the huge number of multipliers required. In order to reduce the complexity of the algorithm, the weights update expression (Equation (26)) is simplified as in Equation (27).
As a consequence the weights are updated using a factor proportional to the error and the sign of the current reference noise sample, instead of its value. This implies that weights can be updated by using an addition (or subtraction) instead of a multiplication. This simplified algorithm requires only N multiplications and 2N additions. However, the simplification of the weights update rule usually results in worse learning performance, i.e. in a slower adaptation capability of the filter. To overcome this weakness, and significantly improve the learning characteristics, a dynamic learning rate coefficient α has been used. Generally this can be done by updating it with an adaptive rule, or by using a heuristic function. Simulations of the above mentioned method show that a dynamic learning rate gives an advantage not only in the learning characteristics, but also in the accuracy of the final solution (in terms of improvement of the signal to noise ratio of the steady state solution). The product αe_i is used to update all weights; only one additional multiplication is required.
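A minimal Python sketch of the simplified update may help here. It assumes the update has the form W_{i+1} = W_i + α e_i sign(X_i), which is inferred from the description above rather than copied from Equation (27). The learning-rate schedule `learning_rate` is a hypothetical 1/n-style decay from 0.1 to a floor of 0.001, not the piecewise linear curve used in the implementation, and the two-tap noise path in `demo_noise_path` is invented for the check.

```python
import random


def sign(v):
    """Return -1, 0 or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def learning_rate(i, start=0.1, floor=0.001, tau=10.0):
    """Hypothetical decaying learning rate: start / (1 + i / tau), limited below."""
    return max(floor, start / (1.0 + i / tau))


def sign_data_lms(reference, primary, num_taps=8):
    """Noise canceller whose weights move by +/- (alpha * e) per tap."""
    w = [0.0] * num_taps
    taps = [0.0] * num_taps
    cleaned = []
    for i, (x, d) in enumerate(zip(reference, primary)):
        taps = [x] + taps[:-1]
        y = sum(wk * xk for wk, xk in zip(w, taps))   # N multiplications (FIR output)
        e = d - y                                     # error = cleaned output sample
        delta = learning_rate(i) * e                  # the one additional multiplication
        # In hardware each weight only adds or subtracts delta;
        # the multiplication by sign() below is just a compact way to write that.
        w = [wk + delta * sign(xk) for wk, xk in zip(w, taps)]
        cleaned.append(e)
    return w, cleaned


def demo_noise_path(n=10000, seed=0):
    """Identify a hypothetical noise path [0.6, -0.3] from white reference noise."""
    rng = random.Random(seed)
    reference = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    primary = [0.6 * reference[i] - 0.3 * (reference[i - 1] if i else 0.0)
               for i in range(n)]
    weights, _ = sign_data_lms(reference, primary, num_taps=4)
    return weights
```

With white reference noise and no additional signal in the primary input, the weights should settle close to the assumed noise path, while every weight update costs only an addition or subtraction.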
Architecture for Implementation on FPGA
The architecture of the adaptive noise filter based on the modified LMS algorithm is shown in Figure 8. It was designed to implement a 32 tap adaptive noise filter in a medium density FPGA device. It has a modular and scalable structure composed of 8 parallel stages, each one capable of executing 1 to 4 multiply and accumulate (MAC) operations and weights updates. By controlling the number of operations performed by each block it is possible to implement an adaptive filter whose order can range from 8 to 32. In the first case, by exploiting maximum parallelism, the filter is capable of processing a data sample per clock cycle. In the other cases 2 to 4 clock cycles are required. Some of the FPGA's internal RAM blocks were used to implement the tap delays and to store the weight coefficients. Each weights update block is mainly composed of an adder/subtractor accumulator. The weights update coefficients Δ_i are computed by a separate block, which also handles the learning rate update function, following the above mentioned heuristic algorithm, and implements its multiplication with the error signal. By slightly modifying this unit, a more sophisticated adaptive function can easily be obtained, thus enhancing the performance of the adaptive noise filtering for non-stationary signals.
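The relation between filter order and throughput in this architecture can be written down as a small Python sketch. It assumes that the taps are spread evenly over the 8 stages and that an order which is not a multiple of 8 is rounded up to the next whole cycle; both function names are hypothetical.

```python
def cycles_per_sample(order, stages=8, max_macs_per_stage=4):
    """Clock cycles needed per input sample when taps are shared across the stages."""
    if not stages <= order <= stages * max_macs_per_stage:
        raise ValueError("order must lie between 8 and 32 for this structure")
    return -(-order // stages)      # ceiling division


def samples_per_second(order, clock_hz):
    """Sample throughput for a given filter order and clock frequency."""
    return clock_hz / cycles_per_sample(order)
```

For example, under these assumptions a 32 tap filter clocked at 50 MHz would accept one sample every four cycles, i.e. 12.5 million samples per second.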
Simulations and Results
Adaptive noise filters have been implemented on DSPs and FPGAs. A Motorola DSP56303 has been used as the DSP platform, while Xilinx Spartan III boards have been used to implement the FPGA adaptive noise filtering. Matlab Simulink has been used to test the effectiveness and correctness of the adaptive filters before hardware implementation.
Matlab Simulink Simulations and Results
To test the theory and see visually the improvements proposed by Das, the adaptive filters proposed by Lee and by Das have been compared through Matlab Simulink (see Figure 9). The target Simulink model is responsible for code generation, whereas the host Simulink model is responsible for testing. The host drives the target model with heavy wavelet noisy test data consisting of 4096 samples generated from the wnoise function in Matlab. Matlab's fdatool is used for designing the bandpass filter to color the noise source. A colored Gaussian noise is then added to the input test signal. This noisy signal and the reference noise are the inputs to the terminals of the LMS filter Simulink block. The test covers filtering of a heavy sine noisy signal consisting of 4096 samples per frame. Figure 11 shows the comparison between the Wiener filter proposed by Das and Lee's Wiener filter in terms of signal to noise ratio. As can be seen from Figure 11, the performance of the Das proposal is higher than that of Lee's Wiener filter. The improved adaptive Wiener filter provides an SNR improvement of 2.5 to 4 dB compared to Lee's adaptive Wiener filter.
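For reference, an SNR figure of this kind can be computed from a clean test signal and a filter output as sketched below in Python. The function name `snr_db` and the definition used (signal power divided by the power of the residual error, in decibels) are assumptions, since the exact measurement procedure is not spelled out here.

```python
import math


def snr_db(clean, estimate):
    """SNR in dB: power of the clean signal over power of the residual error."""
    signal_power = sum(s * s for s in clean)
    error_power = sum((s - v) ** 2 for s, v in zip(clean, estimate))
    if error_power == 0.0:
        return float("inf")
    return 10.0 * math.log10(signal_power / error_power)
```

The SNR improvement of one filter over another is then the difference between the two values obtained on the same test signal.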
Motorola DSP56300 Results
The DSP system consists of two analog-to-digital (A/D) converters and two digital-to-analog (D/A) converters. The DSP56303EVM evaluation module is used to provide and control the DSP56300 processor, the two A/D converters, and the two D/A converters. The left analog input signal x(t) consists of the desired input signal s(n) plus a white noise signal w(n). The left analog input signal x(t) is first digitized using the A/D converter on the evaluation board. The DSP processor then executes the adaptive filter algorithm to process the left digitized input signal x(n), generating the left and right output signals y_1(n) and y_2(n). The left output signal y_1(n) is the error signal. The right output signal y_2(n) is the filtered version of the left digitized input signal x(n), which is an estimate of the desired input signal s(n). The two D/A converters on the evaluation board are then used to convert the left and right digital output signals y_1(n) and y_2(n) to the left and right analog output signals y_1(t) and y_2(t).
Figure 11. SNR performance comparison between Lee and Das proposals
The continuous analog signal was sampled at a rate of twice the highest frequency present in the spectrum of the sampled analog signal in order to accurately recreate the analog audio signal from the discrete samples. The analog audio signal was mixed with noise using a sum block, modelling the corruption that is bound to occur when the audio signal passes through the channel. The noise, however, was first low-pass filtered using a finite impulse response filter to make it finite in bandwidth. The FIR noise filter was observed to have little or no significant effect on the signal with noise. The information bearing signal is a sine wave of 0.055 cycles/sample, as shown in Figure 12. The noise picked up by the secondary microphone is the input for the adaptive filter, as shown in Figure 13. The noise that corrupts the sine wave is a low-pass filtered version of this noise, as shown in Figure 14. The sum of the filtered noise and the information bearing signal is the desired signal for the adaptive filter. Figure 15 shows that the adaptive filter converges and follows the desired filter response. The filtered noise should be completely subtracted from the signal-plus-noise combination, and the error signal should contain only the original signal. The results can be seen in Figures 12 to 16.
Xilinx Spartan III Results
The algorithms for adaptive filtering were coded in Matlab and experimented with to determine optimal parameters such as the learning rate for the LMS algorithm. After the parameters had been determined, the algorithms were coded for Xilinx in VHDL.
Standard LMS Algorithm Results
The desired signal was corrupted by a higher frequency sinusoid and random Gaussian noise with a signal to noise ratio of 5.86 dB. The input signal can be seen in Figure 17. A direct form FIR filter of length 32 is used to filter the input signal. The adaptive filter is trained with the LMS algorithm with a learning rate µ. The results have shown that the standard LMS algorithm removes the noise from the signal [...] the next section. The timing analyzer has shown [...] 17 M[...].
Modified LMS Algorithm Results
The noise reduction obtained by both the standard LMS algorithm and the modified algorithm, as applied to a stationary signal composed of 3 frequencies and corrupted by random Gaussian noise with a signal to noise ratio of 5.86 dB, was studied. Both algorithms used a 16 bit fixed point representation for data and filter coefficients [14]. The frequency spectra of the original signal, the standard LMS filter output, and the modified LMS filter output are given in Figure 19. The modified LMS used a dynamic learning rate coefficient α based on a heuristic function formerly proposed by Widrow [8], consisting of a 1/n decaying function; the coefficients were approximated by a piecewise linear curve, starting from the value 0.1 down to 0.001 (in about 1000 steps). This heuristic function achieved a faster convergence and less gradient noise. It has proved to be effective when applied to stationary signals. On the other hand, the standard LMS used a static learning rate, with the best performance obtained by setting the µ parameter equal to 0.05. The two algorithms reported noise attenuation greater than 40 dB and 36 dB respectively. The two learning characteristics can be seen in Figure 20. [...] algorithms [...] and noises showed similar simulation results. The adaptive noise filtering was implemented using a 16 bit 2's complement fixed point representation for samples and weights. As can be seen in Figure 5, the floor planned design required 1776 slices (logic blocks) of the 3072 available (about 57%), and allowed a running clock frequency of 50 MHz (with a non-optimized, fully automatic place & route process). By comparison, a standard LMS implementation would require 2750 slices (89%) and would run at less than 25 MHz (due mainly to routing congestion). The Assembly file used for the simulation is given in Appendix A. The assembly code is provided elsewhere [26].
Conclusions
As discussed in the previous chapters, adaptive noise filtering applications can be implemented both in DSP processors like the Motorola DSP56300 series and in Field Programmable Gate Arrays such as Xilinx Spartan III boards. In high performance signal processing applications, FPGAs have several advantages over high end DSP processors. The literature survey has shown that high-end FPGAs have a huge throughput advantage over high performance DSP processors for certain types of signal processing applications. FPGAs use highly flexible architectures, which can be their greatest advantage over regular DSP processors. However, FPGAs come with a hardware cost. The flexibility comes with a great number of gates, which means more silicon area, more routing and higher power consumption. DSP processors are highly efficient for common DSP tasks, but in a DSP only a tiny fraction of the silicon area is dedicated to computation; most of the area is designated for instruction codes and data moving. In high performance signal processing applications like video processing, FPGAs can take highly parallel architectures and offer much higher throughput than DSP processors. As a result, an FPGA's overall energy consumption may be significantly lower than that of DSP processors, in spite of the fact that its chip level power consumption is often higher. DSP processors can consume 2-3 watts, while FPGAs can consume in the order of 10 watts. With the pipeline technique, more computation area and more gates, FPGAs can process more channels at the same time. Thus power consumption per channel is significantly less in FPGAs [15]. DSPs are a specialized form of microprocessor, while FPGAs are a form of highly configurable hardware. In the past, the usage of DSPs has been nearly ubiquitous, but with the needs of many applications outstripping the processing capabilities (MIPS) of DSPs, the use of FPGAs has become very prevalent. It has generally come to be expected that all software (DSP code is considered a type of software) will contain some bugs and that the best that can be done is to minimize them.
Common DSP software bugs are caused by: failure of interrupts to completely restore processor state upon completion; non-uniform assumptions regarding processor resources by multiple engineers simultaneously developing and integrating disparate functions; blocking of a critical interrupt by another interrupt or by an uninterruptible process; undetected corruption or non-initialization of pointers; failure to properly initialize or disable circular buffering addressing modes; memory leaks, that is, the gradual consumption of available volatile memory due to failure of a thread to release all memory when finished; dependency of DSP routines on specific memory arrangements of variables; use of special DSP "core mode" instruction options in core; conflict or excessive latency between peripheral accesses, such as DMA, serial ports, L1, L2, and external SDRAM memories; corrupted stack or semaphores; subroutine execution times dependent on input data or configuration; mixture of "C" or high-level language subroutines with assembly language subroutines; and pipeline restrictions of some assembly instructions [15]. Both FPGA and DSP implementation routes offer the option of using third party implementations for common signal processing algorithms, interfaces and protocols. Each offers the ability to reuse existing IP in future designs. FPGAs are a more native implementation for most DSP algorithms. Figures 21 and 22 give the block diagrams of the DSP and FPGA respectively. The Motorola DSP56300 series can only do one arithmetic computation and two move instructions at a time. In the case of FPGAs, however, each task can be computed by its own configurable core with a designated input and output interface.
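The architectural contrast above, together with the power figures quoted earlier in this section, can be put into rough numbers. The following Python sketch is only an idealized model and not a measurement: it assumes an FIR-style filter in which every multiply/accumulate (MAC) unit completes one tap per clock and nothing else limits the rate. The function names, the tap count of 32 and the channel count of 8 are hypothetical values chosen for illustration; the clock and wattage figures are the ones quoted in this paper.

```python
def max_sample_rate_hz(clock_hz, taps, parallel_macs=1):
    """Idealized sample rate of a `taps`-tap filter.

    Assumes each MAC unit finishes one tap per clock cycle and ignores
    instruction fetch, memory, codec and routing limits.
    """
    if taps <= 0 or parallel_macs <= 0:
        raise ValueError("taps and parallel_macs must be positive")
    cycles_per_sample = -(-taps // parallel_macs)  # ceiling division
    return clock_hz / cycles_per_sample


def power_per_channel(chip_watts, channels):
    """Chip-level power shared evenly by the channels processed in parallel."""
    if channels <= 0:
        raise ValueError("channels must be positive")
    return chip_watts / channels


def break_even_channel_ratio(fpga_watts, dsp_watts):
    """FPGA-to-DSP channel ratio above which the FPGA uses less power per channel."""
    return fpga_watts / dsp_watts


TAPS = 32             # hypothetical filter length
DSP_WATTS = 2.5       # midpoint of the 2-3 W range quoted in the text
FPGA_WATTS = 10.0     # "on the order of 10 watts" in the text
FPGA_CHANNELS = 8     # hypothetical number of parallel channels

if __name__ == "__main__":
    # One MAC unit at the 100 MHz clock from the manufacturer manual.
    print("DSP, serial MAC:   ", max_sample_rate_hz(100e6, TAPS), "samples/s")
    # One multiplier per tap at the roughly 50 MHz FPGA running speed.
    print("FPGA, parallel MAC:", max_sample_rate_hz(50e6, TAPS, parallel_macs=TAPS), "samples/s")
    print("FPGA W per channel:", power_per_channel(FPGA_WATTS, FPGA_CHANNELS))
    print("Break-even ratio:  ", break_even_channel_ratio(FPGA_WATTS, DSP_WATTS))
```

Under these assumptions the FPGA needs fewer watts per channel once it handles more than about four times as many channels as the DSP. The real break-even point depends on the devices and the design.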
Speed Comparison
Speed determines the computation time and is also one of the most important selling points in the market. In adaptive filters the parameters are updated with each iteration, and after each iteration the error between the filter output and the desired signal gets smaller. After some number of iterations the error approaches zero and the desired signal is achieved. According to the manufacturer's manuals, the Motorola DSP56300 series has a CPU clock of 100 MHz, but the achievable speed depends on the instruction fetch, the computation speed and also the speed of the peripherals. On the DSP56303EVM board the audio codec runs at 24.57 MHz; this clock speed is determined by an external crystal. On the other hand, the Xilinx Spartan 3 has a maximum clock frequency of 125 MHz, but this speed can be reduced by the number of instructions and gates and by congestion in the routing of the signals. Both of the modified adaptive noise filtering applications take about 200-250 iterations to cancel the noise and achieve the desired signal. In the case of the Motorola DSP processor, because of the lower actual clock speed, causality conditions and the speed limitation coming from the audio codec part of the board, the running speed is 20 MHz. For both of the modified LMS algorithms, in the case of the FPGAs the running speed is around 50 MHz. As discussed in the previous section, this is because the FPGA's flexibility and reconfigurable gates allow the clock to be faster.
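The iterative behaviour described above, in which the coefficients are updated on every sample and the error shrinks, can be illustrated with a minimal sketch of the standard LMS noise canceller. This is not the modified algorithm of Stefano [1] or of Das [2], and it is not the code that ran on either platform. The test signal, the two-tap noise path, the filter length and the step size `mu` are all hypothetical choices made for the example.

```python
import math
import random


def lms_noise_canceller(primary, reference, num_taps=4, mu=0.01):
    """Standard LMS update; returns the error signal e(n) and the final weights.

    primary   : d(n), wanted signal plus noise
    reference : x(n), noise correlated with the noise in d(n)
    e(n) = d(n) - y(n) is the cleaned output of the canceller.
    """
    weights = [0.0] * num_taps
    delay_line = [0.0] * num_taps  # newest reference sample first
    errors = []
    for d_n, x_n in zip(primary, reference):
        delay_line = [x_n] + delay_line[:-1]
        y_n = sum(w * x for w, x in zip(weights, delay_line))
        e_n = d_n - y_n
        # Gradient step: w <- w + 2 * mu * e(n) * x(n - k)
        weights = [w + 2.0 * mu * e_n * x for w, x in zip(weights, delay_line)]
        errors.append(e_n)
    return errors, weights


def make_test_signals(num_samples=2000, seed=0):
    """Sine wave corrupted by filtered white noise (illustrative data only)."""
    rng = random.Random(seed)
    noise_path = (0.6, -0.3)  # hypothetical path from noise source to primary input
    signal, primary, reference = [], [], []
    previous_x = 0.0
    for n in range(num_samples):
        s_n = math.sin(2.0 * math.pi * 0.01 * n)
        x_n = rng.uniform(-1.0, 1.0)
        v_n = noise_path[0] * x_n + noise_path[1] * previous_x
        signal.append(s_n)
        primary.append(s_n + v_n)
        reference.append(x_n)
        previous_x = x_n
    return signal, primary, reference


def mean_square(values):
    return sum(v * v for v in values) / len(values)


if __name__ == "__main__":
    signal, primary, reference = make_test_signals()
    cleaned, weights = lms_noise_canceller(primary, reference)
    residual = [e - s for e, s in zip(cleaned, signal)]
    print("residual noise power, first 100 samples:", mean_square(residual[:100]))
    print("residual noise power, last 500 samples: ", mean_square(residual[-500:]))
```

The number of iterations needed to converge depends on the step size and on the input power, so the sample counts in this sketch should not be compared with the 200-250 iterations reported for the modified algorithms.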
General Conclusion
As discussed in the previous sections, we have shown the differences between DSP processors and FPGAs. As far as power and cost are concerned, DSP processors in general have lower power consumption, which makes them suitable for battery powered applications such as audio. These voice applications are straightforward and do not require sophisticated pipelining and parallel moves. They include various filter applications, which are used especially in voice transmission lines and cell phones. When it comes to high frequency, high speed applications, DSP processors and boards have limitations compared to FPGAs, which are much faster. FPGAs can also offer more channels, so the cost per channel is lower than for DSPs. In addition, partitioning of the FPGA can offer more throughput than DSP processors. Thus FPGAs can handle multiple tasks when their controls and finite state machines are configured correctly.
According to our study, the final conclusion is that for simple audio applications like adaptive noise cancelling, the Motorola DSP56300 is more beneficial, because the requirements of audio applications are met by DSP processors. They are also more power efficient and can be used for battery powered devices. But when adaptive noise filtering is considered in high speed applications like video streaming and multiplexed array signals, FPGAs offer a faster approach and thus they are more suitable for high frequency applications.
Future Work
In the future, adaptive noise filtering can be implemented in high frequency applications, such as noise removal from video streaming and from multiplexed data arrays. These applications may first be implemented on FPGAs with Verilog HDL or VHDL. After the application has been verified, the hardware code can be converted to a netlist, and through Synopsys a custom ASIC design can be created. The ASIC design and FPGA design may then be compared with respect to cost, power, architecture, noise removal and speed. These comparisons would help us make a more educated choice for future applications.
