Abstract-Adaptive filters are used in many applications of digital signal processing; digital communications and digital video broadcasting are just two examples. The GSFAP algorithm discussed in this paper converges faster than the popular NLMS at only slightly higher complexity. The paper deals with a floating-point-like implementation of the algorithm in FPGA hardware. We present an optimized core for GSFAP, built using logarithmic arithmetic, which provides very low-cost multiplication and division. The design is crafted to make efficient use of the pipelined logarithmic addition units. The resulting GSFAP core can be clocked at more than 80 MHz on the one-million-gate Xilinx XC2V1000-4 device. It can be used to implement filters of orders 20 to 1000 with a sampling rate exceeding 50 kHz. For comparison, we implemented a similar NLMS core. Although it is slightly smaller than the GSFAP core and allows a higher signal sampling rate (around 70 kHz) for the corresponding filter orders, GSFAP has adaptation properties that are much superior to those of NLMS, and our core can provide very sophisticated adaptive filtering capabilities for resource-constrained embedded systems.
I. INTRODUCTION
Adaptive filters are widely used in digital signal processing (DSP) for countless applications in telecommunications, digital broadcasting, etc.
While a wide variety of filtering algorithms have been proposed in the literature, the most important categories are perhaps those based on least mean squares (LMS) [1] (with the power-normalized sibling NLMS [2]) and recursive least squares (RLS) [3], [2]. Algorithms based on LMS are fast and simple to implement, but they suffer from slow convergence. The RLS-based algorithms converge faster, but they are usually considered too computationally expensive, particularly in applications like echo cancellation where filters with up to several hundred taps are required. More recently, a category of algorithms based on affine projection (AP) [4], sometimes referred to as the generalized NLMS, has been developed, which provides a compromise between the slow convergence of LMS and the computational complexity of RLS.
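In order to reduce the computational complexity of the original affine projection algorithm (APA) [4], a fast version (FAP) [5] was derived. FAP, however, suffers from numerical stability problems arising from the use of fast RLS (FRLS) [6], [7] in the algorithm.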
A number of variants have been proposed which solve the numerical stability problems while maintaining the advantages of FAP. The "modified" FAP [8], [9] uses the matrix inversion lemma employed in the classical RLS algorithm, thus avoiding the problems with fast RLS but at the cost of greater computational requirements. Conjugate gradient (CG) FAP [10] builds on the results of the modified FAP, and uses the conjugate gradient method [11] to deal with the matrix inversion. The FAP-based algorithm that we believe to be the most suitable for hardware implementation is the Gauss-Seidel (GS) FAP [12], which replaces the CG method with the Gauss-Seidel method [13]. This algorithm has all the advantages of modified FAP and CGFAP, but has lower computational complexity, allowing an efficient implementation with fewer hardware resources.
The complexity of LMS-based algorithms is $O(L)$, typically $2L + 1$ multiply-accumulate (MAC) operations per iteration, where $L$ is the filter order. The complexity of NLMS is similar: $2L + 3$ MAC operations and 1 division per iteration. In contrast, the complexity of the RLS-based algorithms is $O(L^2)$. The memory requirements of RLS are also much higher than those of (N)LMS. Fast versions of the RLS algorithm with complexity $O(L)$ exist, which partly solve the complexity issues. The RLS lattice algorithm requires $18L$ operations and the Fast Transversal Filter (FTF) requires $8L$ operations, or $9L$ in the stabilized form. This, however, comes at the expense of problems with numerical stability. The complexity of FAP is $2L + O(N)$, where $N$ is referred to as the projection order. All other FAP variants mentioned above have complexity $2L + O(N^2)$; GSFAP is the most efficient of them, requiring $2L + N^2 + 4N - 1$ MAC operations and $N$ divisions per sample period. The projection order is almost always very much smaller than the filter order, i.e. $N \ll L$, so the time complexity is usually dominated by $L$ rather than $N^2$.
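The operation counts quoted above can be tabulated for a concrete configuration. The following is a minimal sketch that only evaluates the formulas given in the text; the function name `macs_per_iteration` is hypothetical, and the divisions (1 for NLMS, $N$ for GSFAP) are noted in comments rather than counted.

```python
def macs_per_iteration(L: int, N: int) -> dict:
    """MAC counts per iteration, taken from the formulas quoted in the text."""
    return {
        "LMS": 2 * L + 1,
        "NLMS": 2 * L + 3,                   # plus 1 division
        "GSFAP": 2 * L + N * N + 4 * N - 1,  # plus N divisions
    }


# Configuration used later in the paper: filter order 1000, projection order 9.
counts = macs_per_iteration(L=1000, N=9)
overhead = counts["GSFAP"] / counts["NLMS"]  # relative MAC cost of GSFAP vs. NLMS

for name, macs in counts.items():
    print(f"{name:6s} {macs} MACs")
print(f"GSFAP/NLMS MAC ratio: {overhead:.3f}")
```

For $L = 1000$ and $N = 9$ the $N^2 + 4N - 1$ term adds only 116 MACs to the 2000 that come from the filter order, which illustrates why $L$ dominates when $N \ll L$.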
In this paper we develop an optimized core for FPGA implementation of the GSFAP [12] algorithm. To reduce the resource requirements of the floating-point computations, we represent numbers using the logarithmic number system (LNS) [14], [15]. To evaluate the design, a configurable 1000-
1-4244-0383-9/06/$20.00 ©2006 IEEE
[Fig. 1: Summary of the GSFAP algorithm. Recoverable step labels: 0) Initialization; 2) Update $p_k$ (Gauss-Seidel iteration), with the loop for (i = 0; i < N; i = i + 1); 3) Compute $e_k$.]
II. THE GSFAP ALGORITHM
A brief review of the GSFAP algorithm is now given. The GSFAP algorithm is summarized in Fig. 1. In step 4, the normalized estimation error vector $\varepsilon_k$ and consequently the alternate coefficient vector $w_k$ are updated.
The "old" excitation signal vector $u_{k-N+1}$ and the lowermost element of the newly updated vector $\varepsilon_k$, denoted $\varepsilon_{N-1,k}$, are used to update the alternate coefficient vector $w_k$.
The parameter $\mu$ is the relaxation factor, which represents the algorithm step-size parameter. The algorithm is stable for $0 < \mu < 2$. The parameter $\delta$ is the regularization parameter that prevents the correlation matrix $R$ from becoming singular.
It is usually set in the range $10^{-7} < \delta < 1$, depending on the input signal.
III. LOGARITHMIC ARITHMETIC
In order to maintain accuracy of the algorithm in the FPGA implementation, we decided to implement the computations using a floating-point-like logarithmic arithmetic. The parameters of the library are briefly presented in this section.
The logarithmic number system (LNS) was chosen in order to reduce resource requirements and to achieve short latency compared to other floating-point solutions. Logarithmic multiplication and division require only very simple logic. Although addition and subtraction are more complex in LNS, recent advances have made them feasible in small FPGA devices. We use the High Speed Logarithmic Arithmetic (HSLA) cores described in [15]. Table I shows the resource requirements of our LNS units in comparison to Underwood's [16] highly optimized IEEE single-precision floating-point units. The major disadvantage of LNS arithmetic is the number of Block RAMs used by the ADD/SUB unit for storing the look-up tables. These units are always instantiated in pairs. While the resource requirements for a pair of LNS ADD/SUB pipes are significantly higher than for a pair of the floating-point cores, LNS multiplier units need only a small fraction of the area of the floating-point multipliers. The most common operations in many matrix algorithms are multiplication and addition. When we sum the
In Figure 2, a comparison between the resource utilization of the LNS and Underwood's arithmetic is given for a single MACC unit and for the implementation of the LMS, NLMS and GSFAP algorithms, using the Xilinx Virtex XC2V1000 device. As we will see in Section IV, an efficient implementation of GSFAP requires 2 add, 4 multiply and 1 division units. The figure clearly demonstrates the advantages of using LNS arithmetic rather than classical floating point. It should be mentioned that the considerable number of Block RAMs used by the LNS ADD/SUB unit is not an issue in our case, since the algorithms' internal structures (input and weight vectors, correlation matrix, etc.) easily fit into the remaining Block RAMs.
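The cost asymmetry described above (cheap multiplication and division, expensive addition and subtraction) follows directly from how LNS represents a number. The sketch below is an illustration only, not the HSLA format: it assumes a value is stored as a sign bit plus a floating-point base-2 logarithm of its magnitude, and it evaluates the addition and subtraction functions with `math.log2` where the hardware would use look-up tables. The names `Lns`, `encode`, `decode`, `mul`, `div` and `add` are hypothetical.

```python
import math
from typing import NamedTuple


class Lns(NamedTuple):
    neg: bool   # sign bit
    e: float    # log2 of the magnitude; -inf encodes zero


def encode(x: float) -> Lns:
    return Lns(x < 0, math.log2(abs(x)) if x != 0 else -math.inf)


def decode(v: Lns) -> float:
    mag = 2.0 ** v.e            # 2.0 ** -inf evaluates to 0.0
    return -mag if v.neg else mag


def mul(a: Lns, b: Lns) -> Lns:
    # Multiplication is one addition of logarithms plus a sign XOR.
    return Lns(a.neg != b.neg, a.e + b.e)


def div(a: Lns, b: Lns) -> Lns:
    # Division is one subtraction of logarithms (b is assumed non-zero).
    return Lns(a.neg != b.neg, a.e - b.e)


def add(a: Lns, b: Lns) -> Lns:
    if a.e < b.e:
        a, b = b, a             # make a the operand of larger magnitude
    if b.e == -math.inf:
        return a                # adding zero
    d = b.e - a.e               # d <= 0
    if a.neg == b.neg:
        # Same signs: log2(|a| + |b|) = a.e + log2(1 + 2^d)
        return Lns(a.neg, a.e + math.log2(1.0 + 2.0 ** d))
    if d == 0.0:
        return Lns(False, -math.inf)   # exact cancellation gives zero
    # Opposite signs: log2(|a| - |b|) = a.e + log2(1 - 2^d)
    return Lns(a.neg, a.e + math.log2(1.0 - 2.0 ** d))


product = decode(mul(encode(3.0), encode(-4.0)))
quotient = decode(div(encode(3.0), encode(4.0)))
total = decode(add(encode(3.0), encode(-4.0)))
```

The two non-linear functions $\log_2(1 + 2^d)$ and $\log_2(1 - 2^d)$ are the ones that have to be tabulated, which is consistent with the Block RAM usage of the ADD/SUB unit noted in the text.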
IV. ARCHITECTURE
In this section we present the architecture of our GSFAP core. We describe both the mapping of the algorithm onto the LNS arithmetic units and the mapping of the data structures onto the Block RAMs. The algorithm employs one LNS addition/subtraction unit (with two separate, parallel pipelines, ADD/SUB A and B), four LNS multipliers (MUL A, B, C, and D) and one LNS divider (DIV A). Non-scalar data structures, i.e. vectors and a matrix, are stored in Virtex-2 Block RAMs.
The top-level architecture of the GSFAP algorithm is depicted in Fig. 3. The blocks in the diagram correspond to individual steps in the algorithm presented in Fig. 1. Fig. 4 shows the cycles during which different parts of the design are active in the GSFAP-based filter with the parameters $L = 1000$ (filter order) and $N = 9$ (projection order).
Our hardware implementation of $R$ and its update is similar to using a circular buffer. We call this process "re-indexing". The actual state of the buffer is kept in the register R-state, which is decremented by the value $N + 1$ in each iteration. The "R update" is implemented in the following way. The values of vector Lk are read from UX and the multiplication [...] Fig. 5.
One of the key modules of the algorithm is the Gauss-Seidel (GS) solver used to calculate the vector $p_k$ (step 2 of the algorithm in Fig. 1). Each element $p_{i,k}$ of the vector $p_k$ depends on previously computed elements, $p_{0 \ldots i-1,k}$ and $p_{i+1 \ldots N-1,k-1}$. Therefore the individual elements cannot be updated simultaneously. Instead we use a pipelined architecture in which the computations using the available values of $p_k$ are performed first, and the value that depends on the previous iteration is processed last. [...] Fig. 6(a), and the result forms the dot product of two vectors of length $N - 1$.
Recall that vector $b$ has the value 1 in its first element, and all other elements have the value 0. Because of this, we need to subtract the dot product from 1 in the first step, but in all other steps we subtract it from zero, which we can implement by simply negating the sign bit. Considering the ADD/SUB unit latency, and that negating the sign bit costs virtually nothing, we save $9(N - 1)$ clock cycles for a complete "P update". The result is then divided by the corresponding diagonal element of matrix $R$, and the resulting value is finally written to the PEPS Block RAM. It is important to recall that LNS multiplication and division are fast and cheap, so the resulting hardware is highly efficient.
The next step of the algorithm is to compute the filter output $y_k$ and the estimation error $e_k$, as shown in step 3 in Fig. 1. The dot product $\varepsilon_{k-1}^T R_{0,k}$ is calculated just after the vector $p$ has been updated, using port A of the ADD/SUB unit and the MUL A and B units. The resulting value is then multiplied by the scalar $\mu$ to obtain $\mu \varepsilon_{k-1}^T R_{0,k}$. As depicted in Fig. 3, the "long" dot product $u_k^T w_{k-1}$ (of two vectors whose length is given by the filter order $L$) is calculated in parallel with the previously described blocks. It employs the MUL D and the ADD/SUB B units. The last two steps of GSFAP are the "EPS update" and the "WW[0,1] update". They have a similar structure, so only the update of the vector $w$ is described. The hardware structure of this stage is shown in Fig. 6(b). In order to minimize the latency of the GSFAP core, we split the vector $w$ into two vectors, which are stored in separate Block RAMs, WW0 and WW1. This allows us to use two independent pipelines to update them, as depicted in the figure. This stage utilizes both pipelines of the ADD/SUB unit and the MUL B and C units. At the end of this step one GSFAP iteration is complete and the core is ready to acquire new data samples.
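The "P update" described above is a standard Gauss-Seidel sweep on the system $R p = b$ with $b = [1, 0, \ldots, 0]^T$. The following software sketch mirrors that description under stated assumptions: it uses ordinary floating point rather than LNS, the function name `gauss_seidel_sweep` and the $3 \times 3$ test matrix are hypothetical, and the sweep is repeated many times on a fixed matrix only to show convergence, whereas the loop in step 2 of Fig. 1 suggests a single sweep per sample.

```python
def gauss_seidel_sweep(R, p):
    """One in-place Gauss-Seidel sweep for R p = b with b = [1, 0, ..., 0]."""
    n = len(p)
    for i in range(n):
        # Dot product of two vectors of length n - 1: row i of R without its
        # diagonal entry, against p[0..i-1] (already updated in this sweep)
        # and p[i+1..n-1] (still holding the values from the previous sweep).
        acc = sum(R[i][j] * p[j] for j in range(n) if j != i)
        # b[0] = 1 needs a true subtraction; every other b[i] is 0, so the
        # right-hand side is just the negated dot product (a sign flip in LNS).
        rhs = (1.0 - acc) if i == 0 else -acc
        p[i] = rhs / R[i][i]      # divide by the diagonal element of R
    return p


# Hypothetical, diagonally dominant stand-in for the correlation matrix.
R = [[4.0, 1.0, 0.5],
     [1.0, 3.0, 0.2],
     [0.5, 0.2, 2.0]]
p = [0.0, 0.0, 0.0]
for _ in range(50):
    gauss_seidel_sweep(R, p)

# Residual R p - b; it should be close to zero once the sweeps have converged.
residual = [sum(R[i][j] * p[j] for j in range(3)) - (1.0 if i == 0 else 0.0)
            for i in range(3)]
```

The data dependence inside the loop (each `p[i]` uses the freshly written `p[0..i-1]`) is the reason the elements cannot be computed simultaneously.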
V. RESULTS AND CONCLUSIONS
We have created separate GSFAP cores for LNS 32- and 19-bit precision. The parameters of the cores are fully configurable. It is possible to change both the filter order $L$ (for values $20 \le L \le 1000$) and the filter parameters $\mu$ and $\delta$. However, modifying the value of $N$ requires minor architectural changes. To demonstrate performance, we fixed the parameters $L = 1000$ and $N = 9$. In this configuration, a full iteration of the GSFAP algorithm takes 1597 cycles and performs 4227 logarithmic operations (2.64 ops/cycle). For comparison we have also created similar cores that implement the NLMS algorithm, which also come in 32- and 19-bit variations and are fully configurable. NLMS is more regular and less computationally intensive than GSFAP. With the corresponding filter parameters, it can perform a full iteration in 1088 clock cycles, and performs 4008 logarithmic operations (3.68 ops/cycle).
Our cores were developed on a Xilinx XC2V6000-4 (6-million gate, speed grade 4) FPGA and on a much smaller Xilinx XC2V1000-4 (1-million gate, speed grade 4) device. Table II shows the parameters of the developed cores. All designs can be clocked at 80 MHz. At this clock speed, the GSFAP and NLMS designs perform over 210M and 294M log-operations per second, respectively (M log-operations per second are equivalent to MFLOPS). The GSFAP 32-bit core occupies only a small fraction (14%) of the 6-million gate XC2V6000-4 device. On the 1-million gate XC2V1000-4, it uses a very large percentage of the available resources; in particular, it consumes 9900 slices. For the maximum filter length ($L = 1000$), the filter is able to perform noise/echo cancellation on signals at a sampling rate of more than 50 kHz. The corresponding NLMS core is some 15% smaller and can operate on signals at a sampling rate of around 73 kHz. The FPGAs used to test the implementations are speed grade 4 devices. More expensive speed grade 6 devices would allow even faster clock speeds, on the order of 100 MHz, while Virtex-4 devices would be faster still.
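The sampling rates and operation rates quoted above follow from the clock frequency and the cycle counts per iteration. The sketch below reproduces that arithmetic, assuming one input sample is processed per iteration and that iterations run back to back with no idle cycles; the function name `core_rates` is hypothetical.

```python
def core_rates(clock_hz: float, cycles_per_iter: int, ops_per_iter: int):
    """Return (sampling rate in Hz, ops per cycle, ops per second)."""
    sample_rate_hz = clock_hz / cycles_per_iter   # one sample per iteration
    ops_per_cycle = ops_per_iter / cycles_per_iter
    ops_per_second = clock_hz * ops_per_cycle
    return sample_rate_hz, ops_per_cycle, ops_per_second


CLOCK_HZ = 80e6  # clock rate quoted for all designs

# Cycle and operation counts quoted for L = 1000, N = 9.
gsfap = core_rates(CLOCK_HZ, cycles_per_iter=1597, ops_per_iter=4227)
nlms = core_rates(CLOCK_HZ, cycles_per_iter=1088, ops_per_iter=4008)

for name, (fs, opc, ops) in (("GSFAP", gsfap), ("NLMS", nlms)):
    print(f"{name}: {fs / 1e3:.1f} kHz, {opc:.2f} ops/cycle, {ops / 1e6:.1f}M ops/s")
```

Under these assumptions the results agree with the figures in the text: slightly above 50 kHz and 210M operations per second for GSFAP, and roughly 73 kHz and 294M operations per second for NLMS.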
To demonstrate a practical application of adaptive filters, we developed an echo cancellation example. In this case the adaptive filter is used to suppress the echo generated by an unknown system, typically a room or a car cabin. Real room impulse responses and speech signals were used in our experiments. The results of using the LNS 32-bit implementations of the NLMS and GSFAP algorithms for this task are shown in Figure 7. We used input and echo signals sampled at 16 kHz, adaptive filters of length $L = 500$ and the GSFAP projection order $N = 9$. The step-size parameter was set to $\mu = 1$ for both NLMS and GSFAP. The left-hand picture of the figure shows the echo signal to be suppressed and the convergence rates of NLMS and GSFAP. The right-hand picture shows the residual echo for both algorithms. The variance of the residual echo, which can also be used as a measure of the quality of an adaptive algorithm, is $5.79 \times 10^{-4}$ for the NLMS adaptive filter and $6.32 \times 10^{-5}$ for GSFAP. Although the NLMS core is slightly smaller than the GSFAP core and can process signals at higher sampling rates, the experiments show that GSFAP has adaptation properties that are much superior to those of NLMS, and that our core can provide very sophisticated adaptive filtering capabilities for resource-constrained embedded systems.
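The echo cancellation setup can be illustrated with the NLMS baseline, whose update is standard: $w \leftarrow w + \mu e_k x_k / (\delta + x_k^T x_k)$. The sketch below is not the experiment reported above. It assumes a synthetic white-noise input and a hypothetical 4-tap echo path instead of real room responses and speech, runs in ordinary floating point, and does not implement GSFAP. The names `nlms_identify` and `h` are hypothetical.

```python
import random


def nlms_identify(u, d, L, mu=1.0, delta=1e-6):
    """Run NLMS over input u and desired signal d; return (weights, errors)."""
    w = [0.0] * L
    x = [0.0] * L                        # last L input samples, newest first
    errors = []
    for u_k, d_k in zip(u, d):
        x = [u_k] + x[:-1]               # shift the new sample into the window
        y_k = sum(wi * xi for wi, xi in zip(w, x))       # filter output
        e_k = d_k - y_k                                  # estimation error
        norm = delta + sum(xi * xi for xi in x)          # regularized input power
        g = mu * e_k / norm
        w = [wi + g * xi for wi, xi in zip(w, x)]        # normalized LMS update
        errors.append(e_k)
    return w, errors


random.seed(0)
h = [0.5, -0.3, 0.2, 0.1]                # hypothetical echo path (assumption)
u = [random.gauss(0.0, 1.0) for _ in range(2000)]
# Echo signal: the input convolved with the echo path, with no added noise.
d = [sum(h[j] * u[k - j] for j in range(len(h)) if k - j >= 0)
     for k in range(len(u))]

w, e = nlms_identify(u, d, L=len(h), mu=1.0)
max_weight_error = max(abs(wi - hi) for wi, hi in zip(w, h))
```

In this noise-free, white-input case NLMS identifies the echo path almost exactly. The slower NLMS convergence reported in the text arises with speech input, which is strongly correlated.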
