Design and Verification of Low Power DA-Adaptive Digital FIR Filter  by Mankar, Pranav J. et al.
 Procedia Computer Science  79 ( 2016 )  367 – 373 
1877-0509 © 2016 Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license 
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the Organizing Committee of ICCCV 2016
doi: 10.1016/j.procs.2016.03.048 
        7th International Conference on Communication, Computing and Virtualization 2016 
Design and Verification of Low Power DA-Adaptive Digital FIR Filter
Pranav J. Mankar (a), Ajinkya M. Pund (b), Kunal P. Ambhore (c), Shubham C. Anjankar (d)
(a) Assistant Professor, Department of Electronics, GHRCE, Nagpur, 440016, India
(b) Assistant Professor, Department of Electronics, GHRCE, Nagpur, 44006, India
(c) Department of Electronics & Telecomm., MIT College of Engineering, Pune, 411038, India
(d) Assistant Professor, Department of Electronics, Ramdeobaba College of Engineering and Management, Nagpur, 440013, India
 
Abstract 
A unique pipelined architecture for the low-area, low-power, and high-throughput implementation of an adaptive filter based on
distributed arithmetic (DA) is presented in this paper. DA is used to design bit-level architectures for vector–vector
multiplication, with direct application to the convolution required by digital filters. Bit-serial operations and look-up
tables (LUTs) are used to implement high-throughput filters that need only one cycle per bit of resolution, regardless of
filter length. The LUT is updated in parallel, and the filtering and weight-update operations are performed concurrently to
increase the throughput rate of the adaptive FIR filter. Conditional signed carry-save accumulation is used in place of
conventional adder-based shift accumulation to reduce the sampling period and the area complexity. However, adaptive DA
filters also require the LUTs to be recalculated for each adaptation, which can nullify the performance advantages of DA
filtering. The system is designed in Xilinx ISE 9.1 using Verilog HDL and simulated using ModelSim 6.3. The behaviour of the
system is verified using MATLAB 13. DA adaptive filters are advantageous over digital signal processing microprocessors in
terms of total area and power consumption.
 
Keywords: Adaptive filter, Distributed Arithmetic, lookup table (LUT), Xilinx ISE 9.1, MATLAB 13, ModelSim 6.3.
1. Introduction 
Most portable electronic devices, such as cellular phones, PDAs, and hearing aids, require digital signal processing
(DSP) for high performance. Owing to the growing demand for implementing sophisticated DSP algorithms, low-cost
designs, i.e. designs with low area and low power, are needed to keep these hand-held devices small while maintaining
good performance. Dedicated multipliers are generally expensive in terms of chip area and are frequently replaced by
multiplier-free implementation methods. Distributed arithmetic (DA) stores the sums of scaled coefficients in a series
of LUTs with binary inputs used as addresses [4][6]. Using DA, the multiply-accumulate (MAC) operations are replaced
by adding the entries read from the LUTs.
       In various applications, such as echo cancellation, time-varying noise needs to be removed from the desired signal.
Hence, the coefficients of adaptive filters are updated based on the input samples [2][3]. This adaptation makes a
low-cost DA-based implementation challenging, because of the high computational workload of updating the LUTs that
store the sums of the scaled coefficients. The least mean square (LMS) algorithm of Widrow and Hoff, one of the
best-known adaptive algorithms, is used here to update the weights of a tapped-delay-line finite impulse response (FIR)
filter; it is simple in design and has satisfactory convergence performance [5]. The inner-product computation needed to
obtain the filter output makes the critical path very long, so for signals with a high sampling rate the critical path of
the design must be shortened to meet the sampling period.
       Recently, the multiplier-less distributed arithmetic (DA) technique [1] has been used to achieve high processing
capability and regularity, resulting in cost-effective and area–time efficient computing structures. Allred et al. [4]
used two separate lookup tables (LUTs), one for filtering and another for weight updating, to obtain a hardware-efficient
DA-based adaptive filter design. Guo and DeBrunner [7], [8] improved the design in [4] by using only one LUT for both
filtering and weight updating. However, the structures in [4]–[8] do not support high sampling rates, since they need
several cycles of LUT updates for each new sample. An efficient architecture for a high-speed DA-based adaptive filter
with very low adaptation delay has recently been proposed in [9]. Following that approach, a low-power, low-area, and
high-throughput pipelined implementation of an adaptive filter with very low adaptation delay is presented here. The
contributions of this paper are as follows.
1) The throughput rate is significantly increased by updating the LUT in parallel.
2) The throughput is further enhanced by performing filtering and weight updating concurrently.
3) Conventional adder-based shift accumulation is replaced by conditional carry-save accumulation of signed partial
inner products to reduce the sampling period. With carry-save accumulation, the bit-cycle period amounts to the memory
access time plus a 1-bit full-adder time (instead of the ripple-carry addition time). Carry-save accumulation also
reduces the area complexity of the design.
4) Power consumption is reduced by using a fast bit clock for carry-save accumulation but a much slower clock for all
other operations.
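The idea behind carry-save accumulation in item 3 can be illustrated with a short Python model. This is only a sketch under simplifying assumptions: it works on non-negative integers, it omits the conditional sign handling and the shifting of the proposed accumulator, and the function names `carry_save_step` and `carry_save_accumulate` are hypothetical. Each step uses only per-bit full-adder logic, so no carry ripples across the word; a single carry-propagating addition is deferred to the end.

```python
def carry_save_step(s, c, v):
    """One 3:2 compression step: per-bit full adders, no carry propagation."""
    new_s = s ^ c ^ v                            # bitwise sum of the three words
    new_c = ((s & c) | (s & v) | (c & v)) << 1   # bitwise carries, weighted by 2
    return new_s, new_c


def carry_save_accumulate(values):
    """Accumulate values while keeping the result as a (sum, carry) pair."""
    s, c = 0, 0
    for v in values:
        s, c = carry_save_step(s, c, v)
    return s + c   # the only carry-propagating addition


print(carry_save_accumulate([5, 9, 14, 3]))
```

The invariant is that `s + c` always equals the running total, which is why the per-step delay does not depend on the word length.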
2. Distributed Arithmetic FIR Filter 
     A discrete-time linear finite-impulse response (FIR) filter generates the output y[n] as a sum of delayed and
scaled input samples x[n]:
$$ y[n] = \sum_{i=0}^{K-1} w_i \, x[n-i] \qquad (1) $$
A typical digital implementation requires K multiply-accumulate (MAC) operations per output sample. A digital signal
processor with a single processing unit completes this operation in O(K) clock cycles, given a single instruction for
each MAC plus data fetch, address generation, and loop control. Thus, the system clock has to run at least K times faster
than the rate at which the signal is sampled, and often as much as 5K times faster. For systems where the maximum system
clock speed is limited by power consumption or other constraints, the throughput of the FIR filter, defined as the number
of signal samples processed per second, is similarly limited. This limitation may become severe for large filter sizes (large K).
Although employing multiple processing units improves the throughput, the corresponding increase in logic 
complexity, on-chip area and power consumption may render such implementations unattractive. Alternatively, the
filtering operation may be implemented in a bit-serial fashion. This implementation of the filter, known as distributed
arithmetic (DA), achieves higher throughput (faster computation) and lower logic complexity at the cost of increased
memory usage. Advances in memory technology have reduced memory sizes, which makes the DA implementation of digital
filters an attractive option. DA was first introduced by Croisier et al. [10] and further developed by Peled and
Liu [11].
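For reference, the conventional MAC-based computation that DA replaces can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the paper; the function name `fir_direct` is hypothetical, and samples before the start of the input are assumed to be zero. The inner loop makes the K multiply-accumulate operations per output sample explicit.

```python
def fir_direct(weights, samples):
    """Direct-form FIR: y[n] = sum_i w_i * x[n - i], assuming x[n] = 0 for n < 0."""
    K = len(weights)
    out = []
    for n in range(len(samples)):
        acc = 0
        for i in range(K):                 # K multiply-accumulate steps per output
            if n - i >= 0:
                acc += weights[i] * samples[n - i]
        out.append(acc)
    return out


# Feeding a unit impulse returns the filter weights themselves.
print(fir_direct([1, 2, 3], [1, 0, 0, 0]))
```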
3. Basics of LMS Adaptive Algorithm 
The weights are updated in each cycle using the LMS algorithm. The error is estimated as the difference between the
desired response and the current filter output, and the weights for the (n+1)th iteration are given by the following
equations:
$$ \mathbf{w}(n+1) = \mathbf{w}(n) + \mu \, e(n) \, \mathbf{x}(n) \qquad (2) $$
$$ e(n) = d(n) - y(n) $$
$$ y(n) = \mathbf{w}^{T}(n) \, \mathbf{x}(n) $$
The input vector x(n) and the weight vector w(n) at the nth training iteration are given, respectively, by:
$$ \mathbf{x}(n) = [x(n), x(n-1), \ldots, x(n-N+1)]^{T} $$
$$ \mathbf{w}(n) = [w_0(n), w_1(n), \ldots, w_{N-1}(n)]^{T} $$
Here d(n) is the desired response and y(n) is the filter output at the nth iteration, e(n) is the error computed during
the nth iteration and used to update the weights, μ is the convergence factor, and N is the filter length.
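The update in (2) can be written as a few lines of floating-point Python. This is a behavioural sketch only, not the fixed-point hardware described later; the function names `lms_step` and `lms_identify` are hypothetical, and input samples before the start of the record are assumed to be zero.

```python
def lms_step(w, x_vec, d, mu):
    """One LMS iteration: returns the updated weights, the output y(n) and the error e(n)."""
    y = sum(wi * xi for wi, xi in zip(w, x_vec))             # y(n) = w^T(n) x(n)
    e = d - y                                                # e(n) = d(n) - y(n)
    w_next = [wi + mu * e * xi for wi, xi in zip(w, x_vec)]  # w(n+1) = w(n) + mu e(n) x(n)
    return w_next, y, e


def lms_identify(x, d, N, mu):
    """Run LMS over the sequences x and d with an N-tap filter, starting from zero weights."""
    w = [0.0] * N
    errors = []
    for n in range(len(x)):
        # x(n) = [x(n), x(n-1), ..., x(n-N+1)], with zeros before the first sample
        x_vec = [x[n - i] if n - i >= 0 else 0.0 for i in range(N)]
        w, _, e = lms_step(w, x_vec, d[n], mu)
        errors.append(e)
    return w, errors
```

When d(n) is generated by a fixed FIR filter driven by the same input and μ is small enough, the weights returned by `lms_identify` approach the coefficients of that filter.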
4. Basic Block Schematic
 
 
 
 
 
 
 
 
 
 
 
 
 
                                   
                        
                     
Fig. 1. Proposed structure of the DA-based adaptive filter of length N = 4
 
The proposed structure consists of a four-point inner-product block and a weight-increment block, along with additional
circuits for computing the error value e(n) and the control word t for the barrel shifters. The four-point inner-product block includes
a DA table, consisting of an array of 15 registers that stores the partial inner products y_l for 0 < l ≤ 15, and a 16:1
multiplexer (MUX) that selects the content of one of those registers. Bit slices of the weights, A = {w3l w2l w1l w0l} for
0 ≤ l ≤ L − 1, are fed to the MUX as control in LSB-to-MSB order, and the output of the MUX is fed to the carry-save
accumulator (shown in Fig. 2). After L bit cycles, the carry-save accumulator has shift-accumulated all the partial inner
products and generates a sum word and a carry word of (L + 2) bits each. The carry and sum words are shift-added
with an input carry of “1” to generate the filter output, which is then subtracted from the desired output d(n)
to obtain the error e(n). The magnitude of the computed error is decoded to generate the control word t for the barrel
shifter. The barrel shifter shifts the input values x_k for k = 0, 1, ..., N−1 by the appropriate number of locations
(determined by the location of the most significant one in the estimated error).
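One way to read the barrel-shifter step is as a multiplier-free approximation of the weight increment μ·e(n)·x_k. The Python sketch below illustrates that reading under assumptions that are not stated in the text: the error and inputs are integers, μ is a power of two (2^−mu_shift), and |e(n)| is rounded down to the power of two given by its most significant one. The function name `shift_weight_increment` and the parameter `mu_shift` are hypothetical.

```python
def shift_weight_increment(e, x, mu_shift):
    """Approximate mu * e * x by shifting x (assumes mu = 2**-mu_shift, integer e and x)."""
    if e == 0:
        return 0
    p = abs(e).bit_length() - 1     # position of the most significant one in |e|
    shift = p - mu_shift            # net shift applied to x, playing the role of the control word
    mag = x << shift if shift >= 0 else x >> -shift
    return mag if e > 0 else -mag   # the sign of the error sets the sign of the increment


# e = 5 is treated as 4 (bit 2); with mu = 1/2 the increment for x = 3 is 3 * 4 / 2.
print(shift_weight_increment(5, 3, 1))
```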
4.1. DA-Based Approach for Inner-Product Computation
The LMS adaptive filter needs to perform an inner-product computation in each cycle, and this computation accounts for
most of the critical path. For simplicity of presentation, let the inner product be given by
$$ y = \sum_{k=0}^{N-1} w_k \, x_k \qquad (3) $$
where w_k and x_k for 0 ≤ k ≤ N − 1 form the N-point vectors corresponding to the current weights and the N most recent
input samples, respectively. Let the bit width of the weights be L, so that each component of the weight vector can be
expressed in two's complement representation as
$$ w_k = -w_{k0} + \sum_{l=1}^{L-1} w_{kl} \, 2^{-l} \qquad (4) $$
where w_{kl} denotes the lth bit of w_k. Substituting (4) into (3) and exchanging the order of summation gives the
distributed form
$$ y = -\sum_{k=0}^{N-1} w_{k0} \, x_k + \sum_{l=1}^{L-1} 2^{-l} \left[ \sum_{k=0}^{N-1} w_{kl} \, x_k \right] \qquad (5) $$
which can be written as
$$ y = \left( \sum_{l=1}^{L-1} y_l \, 2^{-l} \right) - y_0, \qquad y_l = \sum_{k=0}^{N-1} w_{kl} \, x_k \qquad (6) $$
Since each element of the N-point bit sequence {w_{kl}} for 0 ≤ k ≤ N − 1 is either zero or one, the partial sum y_l for
l = 0, 1, ..., L − 1 can take 2^N possible values. If these values are precomputed and stored in a LUT, each y_l can be
read out using the bit sequence as the address. The inner product of (6) can therefore be calculated in L cycles, each
consisting of a LUT read for one of the L bit slices followed by a shift accumulation.
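The computation in (6) can be checked numerically with a small Python model. This is a sketch under stated assumptions, not the hardware design: the weights are L-bit two's-complement fractions in [−1, 1) that are exactly representable, the LUT holds the 2^N sums of the inputs x_k and is addressed by a bit slice of the weights, and the function names `build_lut`, `to_bits` and `da_inner_product` are hypothetical.

```python
def build_lut(x):
    """Precompute all 2**N partial sums; address bit k selects whether x[k] is included."""
    n = len(x)
    return [sum(x[k] for k in range(n) if (addr >> k) & 1) for addr in range(2 ** n)]


def to_bits(w, L):
    """Bits [w_0, ..., w_{L-1}] of w = -w_0 + sum_l w_l * 2**-l (w_0 is the sign bit)."""
    code = round(w * 2 ** (L - 1)) % (2 ** L)     # unsigned L-bit two's-complement code
    return [(code >> (L - 1 - l)) & 1 for l in range(L)]


def da_inner_product(weights, x, L):
    """Evaluate y = (sum_{l>=1} y_l * 2**-l) - y_0 using one LUT read per bit slice."""
    lut = build_lut(x)
    bits = [to_bits(w, L) for w in weights]
    y = 0.0
    for l in range(L):
        addr = sum(bits[k][l] << k for k in range(len(x)))   # l-th bit slice as address
        y_l = lut[addr]
        y += -y_l if l == 0 else y_l * 2.0 ** (-l)           # sign-bit slice is subtracted
    return y


weights = [0.5, -0.25, 0.375, -0.875]
x = [3, -1, 4, 2]
print(da_inner_product(weights, x, 4), sum(w * v for w, v in zip(weights, x)))
```

For N = 4 the LUT has 16 entries, and the loop performs L = 4 reads, one per bit slice, regardless of the number of taps.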
5. Synthesis Result 
The proposed design is coded in the Verilog hardware description language, and synthesis is performed in Xilinx ISE
for filter lengths N = 4 and N = 16 to determine the area and power complexity. The power analysis is performed with the
XPower Analyzer tool of Xilinx ISE. Fig. 2 shows the design summary of the whole design, and Fig. 4 shows a snapshot of
the power report. The design summary and power report show that the proposed structure has lower area and power
consumption than the previous designs [4], [7], [9]. The design is verified through a MATLAB interface. A MATLAB script
generates an input signal with added noise, and the Verilog design is connected to the MATLAB interface through the
ModelSim simulator, which runs the Verilog code to produce the test bench waveform. The output of the interface is the
filtered signal with the noise removed, which confirms that the filter works properly. Fig. 6 shows the MATLAB results.
 
Fig. 2. Synthesis report

Fig. 3. RTL view of the design

Fig. 4. Power report in XPower Analyzer

Fig. 5. Test bench using ModelSim

Fig. 6. MATLAB output

References 
1. S. A. White, “Applications of distributed arithmetic to digital signal processing: A tutorial review,” IEEE ASSP Mag., vol. 6, pp.
4–19, Jul. 1989.

2. B. Farhang-Boroujeny, Adaptive Filters: Theory and Applications. Chichester, U.K.: Wiley, 1998.

3. S. Haykin, Adaptive Filter Theory. Upper Saddle River, NJ: Prentice-Hall, 1996.

4. D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, “LMS adaptive filters using distributed arithmetic for high
throughput,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 7, pp. 1327–1337, Jul. 2005.

5. S. Haykin and B. Widrow, Least-Mean-Square Adaptive Filters. Hoboken, NJ, USA: Wiley, 2003.

6. D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, “An FPGA implementation for a high throughput adaptive
filter using distributed arithmetic,” in Proc. 12th Annu. IEEE Symp. Field-Programmable Custom Comput. Mach., 2004, pp. 324–325.

7. R. Guo and L. S. DeBrunner, “Two high performance adaptive filter implementation schemes using distributed arithmetic,” IEEE
Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 9, pp. 600–604, Sep. 2011.

8. R. Guo and L. S. DeBrunner, “A novel adaptive filter implementation scheme using distributed arithmetic,” in Proc. Asilomar
Conf. Signals, Syst., Comput., Nov. 2011, pp. 160–164.

9. P. K. Meher and S. Y. Park, “High-throughput pipelined realization of adaptive FIR filter based on distributed arithmetic,” in VLSI
Symp. Tech. Dig., Oct. 2011, pp. 428–433.

10. Croisier, D. J. Esteban, M. E. Levilion, and V. Rizo, “Digital Filter for PCM Encoded Signals,” U.S. Patent 3 777 130, Apr. 1973.

11. A. Peled and B. Liu, “A new hardware realization of digital filters,” IEEE Trans. Acoustics, Sound, Signal Process., vol. ASSP-22,
no. 4, pp. 456–462, Dec. 1974.

12. D. J. Allred, V. Krishnan, W. Huang, and D. Anderson, “Implementation of an LMS adaptive filter on an FPGA employing
multiplexed multiplier architecture,” in Proc. Asilomar Conf. Signals, Systems, Computers, Nov. 2003, pp. 918–921.
  
 
 
 
