This brief presents a novel pipelined architecture for low-power, high-throughput, and 
I. Introduction
Adaptive filters are widely used in several digital signal processing applications. The tapped-delay-line finite impulse response (FIR) filter whose weights are updated by the Widrow-Hoff least mean square (LMS) algorithm is the most popular adaptive filter, owing both to its simplicity and to its satisfactory convergence performance. The direct-form configuration on the forward path of the FIR filter results in a long critical path because an inner product must be computed to obtain each filter output. Therefore, when the input signal has a high sampling rate, the critical path of the structure must be reduced so that it does not exceed the sampling period. In recent years, the multiplier-less distributed arithmetic (DA)-based technique has gained substantial popularity for its high-throughput processing capability and regularity, which result in cost-effective and area-time-efficient computing structures. A hardware-efficient DA-based design of the adaptive filter has been suggested that uses two separate lookup tables (LUTs) for filtering and weight update.
This brief proposes a novel DA-based architecture for low-power, low-area, and high-throughput pipelined implementation of an adaptive filter with very low adaptation delay. The contributions of this brief are as follows.
1) Throughput rate is significantly increased by a parallel LUT update.
2) Further enhancement of throughput is achieved by concurrent implementation of filtering and weight updating.
3) Conventional adder-based shift accumulation is replaced by a conditional carry-save accumulation of signed partial inner products to reduce the sampling period. The use of the proposed signed carry-save accumulation also helps to reduce the area complexity of the proposed design.
4) Reduction of power consumption is achieved by using a fast bit clock for carry-save accumulation but a much slower clock for all other operations.
5) The auxiliary control unit for address generation is not required in the proposed structure.
II. Review of LMS Adaptive Algorithms
During each cycle, the LMS algorithm computes a filter output and an error value that is equal to the difference between the desired response and the current filter output. The estimated error is then used to update the filter weights in every training cycle. The weights of the LMS adaptive filter during the nth iteration are updated according to the following equations:

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \mu \, e(n) \, \mathbf{x}(n) \tag{1a}$$

where

$$e(n) = d(n) - y(n) \tag{1b}$$

$$y(n) = \mathbf{w}^{T}(n) \, \mathbf{x}(n) \tag{1c}$$

The input vector x(n) and the weight vector w(n) at the nth training iteration are respectively given by

$$\mathbf{x}(n) = [x(n), x(n-1), \ldots, x(n-N+1)]^{T} \tag{2a}$$

$$\mathbf{w}(n) = [w_0(n), w_1(n), \ldots, w_{N-1}(n)]^{T} \tag{2b}$$

Here d(n) is the desired response, and y(n) is the filter output of the nth iteration. e(n) denotes the error computed during the nth iteration, which is used to update the weights, μ is the convergence factor, and N is the filter length.
In the case of pipelined designs, the feedback error e(n) becomes available only after a certain number of cycles, called the "adaptation delay." Pipelined architectures therefore use the delayed error e(n − m) for updating the current weight instead of the most recent error, where m is the adaptation delay. The weight-update equation of such a delayed LMS adaptive filter is given by

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \mu \, e(n-m) \, \mathbf{x}(n-m) \tag{3}$$
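The update rules reviewed in this section can be illustrated with a minimal floating-point sketch. It is only a behavioural model, not the hardware structure of this brief; the function name `delayed_lms`, the 4-tap test system `h`, the step size 0.05, and the delay of 2 are assumptions chosen for illustration. Setting `delay=0` gives the standard LMS update, while `delay=m` uses e(n − m) and x(n − m).

```python
import random

def delayed_lms(x, d, num_taps, mu, delay=0):
    """Adapt an FIR filter with the (delayed) LMS rule; delay=0 is plain LMS."""
    w = [0.0] * num_taps
    history = []  # (e(n), x(n)) for every past iteration
    for n in range(len(x)):
        # x(n) = [x(n), x(n-1), ..., x(n-N+1)], with zeros before the first sample
        x_vec = [x[n - k] if n >= k else 0.0 for k in range(num_taps)]
        y = sum(wk * xk for wk, xk in zip(w, x_vec))  # filter output y(n)
        history.append((d[n] - y, x_vec))             # error e(n) = d(n) - y(n)
        if n >= delay:
            e_old, x_old = history[n - delay]         # e(n-m) and x(n-m)
            w = [wk + mu * e_old * xk for wk, xk in zip(w, x_old)]
    return w, [e for e, _ in history]

# Hypothetical test case: identify a known 4-tap system from noise-free data.
random.seed(0)
h = [0.5, -0.3, 0.2, 0.1]
x = [random.gauss(0.0, 1.0) for _ in range(3000)]
d = [sum(h[k] * x[n - k] for k in range(len(h)) if n >= k) for n in range(len(x))]

w_plain, _ = delayed_lms(x, d, num_taps=4, mu=0.05, delay=0)
w_delayed, _ = delayed_lms(x, d, num_taps=4, mu=0.05, delay=2)
```

In this noise-free setting both weight vectors should approach `h`; the delayed variant tolerates a smaller range of step sizes, which is one reason a low adaptation delay is desirable.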
III. DA-Based Approach for Inner-Product Computation
In each cycle, the LMS adaptive filter needs to perform an inner-product computation, which accounts for most of the critical path. For simplicity of presentation, let the inner product of (1c) be given by

$$y = \sum_{k=0}^{N-1} w_k \, x_k \tag{4}$$
where $w_k$ and $x_k$ for 0 ≤ k ≤ N − 1 form the N-point vectors corresponding to the current weights and the N most recent input samples, respectively. Assuming L to be the bit width of the weight, each component of the weight vector may be expressed in two's complement representation as

$$w_k = -w_{k0} + \sum_{l=1}^{L-1} w_{kl} \, 2^{-l} \tag{5}$$
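where $w_{kl}$ denotes the lth bit of $w_k$. Substituting (5), we can write (4) in an expanded form as

$$y = -\sum_{k=0}^{N-1} x_k \, w_{k0} + \sum_{k=0}^{N-1} x_k \left[ \sum_{l=1}^{L-1} w_{kl} \, 2^{-l} \right] \tag{6}$$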
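To convert the sum-of-products form of (4) into a distributed form, the order of summations over the indices k and l in (6) can be interchanged to have

$$y = -\sum_{k=0}^{N-1} x_k \, w_{k0} + \sum_{l=1}^{L-1} 2^{-l} \left[ \sum_{k=0}^{N-1} x_k \, w_{kl} \right] \tag{7}$$

and the inner product given by (7) can be computed as

$$y = \sum_{l=1}^{L-1} 2^{-l} \, y_l - y_0 \tag{8}$$

where $y_l = \sum_{k=0}^{N-1} x_k \, w_{kl}$. Since each bit $w_{kl}$ is either zero or one, the partial sum $y_l$ can take $2^N$ possible values. If all the $2^N$ possible values of $y_l$ are precomputed and stored in a LUT, the partial sums $y_l$ can be read out from the LUT using the bit sequence {$w_{kl}$} as address bits for computing the inner product.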
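The LUT-based evaluation described above can be sketched in a few lines. This is an integer model under stated assumptions rather than the circuit of this brief: the weights are taken as L-bit two's complement integers (the fractional weights scaled by $2^{L-1}$), the bit position `pos` is counted from the LSB, and the names `build_lut` and `da_inner_product` are hypothetical.

```python
def build_lut(x):
    """Precompute all 2^N partial sums; entry a adds x[k] for every set bit k of a."""
    n = len(x)
    return [sum(x[k] for k in range(n) if (a >> k) & 1) for a in range(1 << n)]

def da_inner_product(weights, x, bits):
    """Return sum(w * x) using only LUT reads, shifts, and additions.

    weights: integers in [-2**(bits-1), 2**(bits-1)), read as two's complement.
    """
    lut = build_lut(x)
    acc = 0
    for pos in range(bits):                      # pos 0 is the LSB slice
        address = 0
        for k, w in enumerate(weights):
            # Python's >> is arithmetic, so this yields the two's complement bit
            address |= ((w >> pos) & 1) << k
        partial = lut[address]                   # partial inner product of this slice
        if pos == bits - 1:
            acc -= partial * (1 << pos)          # the MSB (sign) slice is subtracted
        else:
            acc += partial * (1 << pos)
    return acc

# Hypothetical example with N = 4 and L = 4
example_w = [3, -4, 7, -8]
example_x = [5, -2, 9, 1]
example_y = da_inner_product(example_w, example_x, bits=4)
```

Here `example_y` equals the direct sum of products, 3·5 + (−4)(−2) + 7·9 + (−8)·1 = 78, obtained without any multiplier.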
The inner product of (8) can therefore be calculated in L cycles of LUT-read operations, one for each of the L bit slices {$w_{kl}$} for 0 ≤ l ≤ L − 1, each followed by a shift accumulation, as shown in Fig. 1. Since the shift accumulation in Fig. 1 involves a significant critical path, we perform the shift accumulation using a carry-save accumulator, as shown in Fig. 2. The bit slices of vector w are fed one after another, in least significant bit (LSB) to most significant bit (MSB) order, to the carry-save accumulator. However, the negative (two's complement) of the LUT output needs to be accumulated in the case of the MSB slice. Therefore, all the bits of the LUT output are passed through XOR gates with a sign-control input which is set to one only when the MSB slice appears as address. The XOR gates thus produce the one's complement of the LUT output for the MSB slice; the remaining "1" of the two's complement is supplied as an input carry when the sum and carry words are finally added.
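The sign-controlled carry-save accumulation can be checked with a short behavioural model. This is a sketch under assumptions and not the circuit of Fig. 2: instead of shifting the accumulator, each LUT output is left-aligned in a wide register of `width` bits, the XOR is applied to that aligned word, and the carry word is stored already shifted by one position. The function names and the example values are hypothetical.

```python
def carry_save_accumulate(partials, width):
    """Accumulate LUT outputs given in LSB-to-MSB slice order (last one is the MSB slice).

    Returns a (sum_word, carry_word) pair; no carry is propagated inside the loop.
    """
    mask = (1 << width) - 1
    s, c = 0, 0
    last = len(partials) - 1
    for pos, partial in enumerate(partials):
        operand = (partial * (1 << pos)) & mask   # aligned two's complement word
        if pos == last:
            operand ^= mask                       # sign control = 1: one's complement
        # 3:2 compression: per-bit sum and majority carry, both from the old s and c
        s, c = s ^ c ^ operand, (((s & c) | (s & operand) | (c & operand)) << 1) & mask
    return s, c

def final_add(s, c, width):
    """Carry-propagate addition with input carry 1, returned as a signed integer."""
    word = (s + c + 1) & ((1 << width) - 1)
    return word - (1 << width) if word >> (width - 1) else word

# Hypothetical slices: LUT outputs for weights [3, -4, 7, -8] and inputs [5, -2, 9, 1]
slice_outputs = [14, 14, 7, -1]
sum_word, carry_word = carry_save_accumulate(slice_outputs, width=16)
filter_output = final_add(sum_word, carry_word, width=16)
```

The only carry-propagating addition is the final one, which is why the accumulation loop can run on a short bit-clock period; the input carry of 1 completes the two's complement of the MSB slice.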
IV. DA-Based Adaptive Filter Structure
The computation of adaptive filters of large orders needs to be decomposed into small adaptive filtering blocks since the DA-based implementation of the inner product of long vectors requires a very large LUT [3]. Therefore, we describe the DA-based structures of small- and large-order LMS adaptive filters separately in the next two subsections.
A. Structure of Small-Order Adaptive Filter
The structure of the DA-based adaptive filter of length N = 4 is shown in Fig. 4. It consists of a four-point inner-product block and a weight-increment block along with additional circuits for the computation of the error value e(n) and the control word t for the barrel shifters. The four-point inner-product block [shown in Fig. 5(a)] includes a DA table consisting of an array of 15 registers, which stores the partial inner products $y_l$ for 0 < l ≤ 15, and a 16 : 1 multiplexor (MUX) to select the content of one of those registers.
Bit slices of weights A = {$w_{3l}$ $w_{2l}$ $w_{1l}$ $w_{0l}$} for 0 ≤ l ≤ L − 1 are fed to the MUX as control in LSB-to-MSB order, and the output of the MUX is fed to the carry-save accumulator (shown in Fig. 2). After L bit cycles, the carry-save accumulator has shift-accumulated all the partial inner products and generates a sum word and a carry word of (L + 2) bits each. The carry and sum words are shift-added with an input carry "1" to generate the filter output, which is subsequently subtracted from the desired output d(n) to obtain the error e(n). The magnitude of the computed error is decoded to generate the control word t for the barrel shifter; the logic used to generate t is shown in Fig. 5(c). The convergence factor μ is usually taken [...] The barrel shifter shifts the input values $x_k$ for k = 0, 1, . . ., N − 1 by an appropriate number of locations (determined by the location of the most significant one in the estimated error). The barrel shifter thus yields the desired increments to be added to or subtracted from the current weights. The sign bit of the error is used as the control for the adder/subtractor cells such that, when the sign bit is zero or one, the barrel-shifter output is respectively added to or subtracted from the corresponding current value in the weight register.
B. Structure of Large-Order Adaptive Filter
The inner-product computation of (4) can be decomposed into Q = N/P (assuming that N = PQ) small adaptive filtering blocks of filter length P as

$$y = \sum_{k=0}^{P-1} w_k \, x_k + \sum_{k=P}^{2P-1} w_k \, x_k + \cdots + \sum_{k=N-P}^{N-1} w_k \, x_k \tag{9}$$

Each of these P-point inner-product computation blocks will accordingly have a weight-increment unit to update P weights. The structure for N = 16 and P = 4 is shown in Fig. 6. The (L + 2)-bit sum and carry words produced by the four blocks are added by two separate binary adder trees. Four carry-in bits should be added to the sum words which are the outputs of the four 4-point inner-product blocks. Since the carry words have double the weight of the sum words, two carry-in bits are set as input carry at the first level of the binary adder tree of carry words, which is equivalent to the inclusion of four carry-in bits in the sum words. It should be noted that the truncation does not affect the performance of the adaptive filter very much since the proposed design needs only the location of the most significant one of μe(n).
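The block decomposition and the carry-in folding described above can be checked with a small arithmetic sketch. It is illustrative only: the function names are hypothetical, each block is assumed to use its own table with $2^P$ addresses (including the all-zero address), and each block output is modelled as sum + 2·carry + 1 following the statement that carry words have double the weight of sum words.

```python
def block_inner_product(w, x, block_len):
    """Split an N-point inner product into Q = N / P block inner products."""
    if len(w) != len(x) or len(w) % block_len != 0:
        raise ValueError("vectors must have equal length N = P * Q")
    block_outputs = [
        sum(wk * xk for wk, xk in zip(w[s:s + block_len], x[s:s + block_len]))
        for s in range(0, len(w), block_len)
    ]
    return sum(block_outputs), block_outputs

def lut_entries(filter_len, block_len):
    """Total LUT addresses when every P-point block has its own 2^P-address table."""
    return (filter_len // block_len) * (1 << block_len)

def merge_carry_save(sum_words, carry_words):
    """Add Q block outputs, each worth sum + 2*carry + 1, with two adder trees.

    The Q unit carry-ins are folded into the carry tree as Q / 2 (Q assumed even).
    """
    q = len(sum_words)
    return sum(sum_words) + 2 * (sum(carry_words) + q // 2)

# Hypothetical data for N = 16 and P = 4
w16 = list(range(-8, 8))
x16 = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5, 8, 9, -7, 9, 3]
y16, y_blocks = block_inner_product(w16, x16, block_len=4)
```

For N = 16, `lut_entries(16, 4)` gives 64 addresses in total against 65536 for a single undivided table, which is the motivation for the decomposition.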
V. Proposed DA With Carry Save Adder BEC
The Carry Select Adder (CSLA) is one of the fastest adders and is used in many data-processing processors to perform fast arithmetic functions. The regular CSLA computes each group of bits twice, using one ripple carry adder (RCA) with Cin = 0 and another with Cin = 1, so there is scope for reducing its area and power consumption. The basic idea of this work is to use a Binary to Excess-1 Converter (BEC) instead of the RCA with Cin = 1 in the regular CSLA to achieve lower area and power consumption. The main advantage of the BEC logic comes from its smaller number of logic gates compared with the n-bit Full Adder (FA) structure. Table 1 shows the comparison of synthesis results of the DA structures in terms of area and power. Area and power are significantly reduced.
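The BEC substitution can be illustrated with a bit-level sketch. It is a functional model under assumptions, not the synthesized design: the 16-bit width, the group size of 4, and all function names are hypothetical, and every group (including the first) is built as a BEC group for simplicity. Bit lists are LSB-first.

```python
def ripple_carry_add(a_bits, b_bits, carry_in):
    """Add two LSB-first bit lists with a chain of full adders."""
    out, c = [], carry_in
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ c)
        c = (a & b) | (a & c) | (b & c)
    return out, c

def binary_to_excess1(bits):
    """Add one to an LSB-first bit list using only an XOR/AND chain (no full adders)."""
    out, lower_all_ones = [], 1
    for b in bits:
        out.append(b ^ lower_all_ones)   # bit flips when all lower bits are one
        lower_all_ones &= b
    return out

def bec_csla_group(a_bits, b_bits, carry_in):
    """One carry-select group: a single RCA (Cin = 0) plus a BEC for the Cin = 1 case."""
    sum0, cout0 = ripple_carry_add(a_bits, b_bits, 0)
    plus1 = binary_to_excess1(sum0 + [cout0])        # (n+1)-bit BEC input
    sum1, cout1 = plus1[:-1], plus1[-1]
    return (sum1, cout1) if carry_in else (sum0, cout0)   # MUX driven by the real carry

def bec_csla_add(a, b, width=16, group=4):
    """Add two unsigned integers group by group; returns (sum modulo 2**width, carry_out)."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    result, carry = 0, 0
    for start in range(0, width, group):
        s_bits, carry = bec_csla_group(
            a_bits[start:start + group], b_bits[start:start + group], carry)
        for i, bit in enumerate(s_bits):
            result |= bit << (start + i)
    return result, carry
```

Within a group, the BEC replaces a second chain of n full adders by an (n + 1)-bit XOR/AND chain, which is where the gate-count saving mentioned above comes from.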
VI. Conclusion
We have suggested an efficient pipelined architecture for low-power, high-throughput, and low-area implementation of a DA-based adaptive filter. Throughput rate is significantly enhanced by parallel LUT update and concurrent processing of the filtering and weight-update operations. We have also proposed a carry-save accumulation BEC scheme of signed partial inner products for the computation of the filter output. From the Xilinx synthesis results, we find that the proposed design consumes, on average, less power and less area than our previous DA-based FIR adaptive filter for filter lengths N = 16 and 32. Offset binary coding, which is popularly used to halve the LUT size for area-efficient implementation of DA, can be applied to our design as well.
