Abstract: This study describes the design of a high-speed FIR filter using parallel prefix adders and a factorized multiplier. The fundamental components of any high-speed FIR filter are adders, multipliers and delay elements. Parallel prefix adders are well suited to meeting the constraints of high-speed performance and low power consumption. This study focuses on the design of a new Parallel Prefix Adder (PPA) and a new multiplier cell, called the factorized multiplier, based on a minimal-depth algorithm; their functional characteristics are compared with existing architectures in terms of delay and area. The performance of the proposed PPA and multiplier is evaluated for bit sizes of 8, 16, 32 and 64. The filter coefficients are obtained with a Hamming window using a MATLAB program. The proposed FIR filter using the new PPA and factorized multiplier has been prototyped on an XC3S1600EFG320 device on the Spartan-3E platform (90 nm process) using the Integrated Synthesis Environment (ISE). A reduction of nearly 14% in slice utilization and a speed improvement of 34% have been obtained for the FIR filter using the new PPA and factorized multiplier.
INTRODUCTION
FIR filters are used to remove unwanted signal components by applying a discrete transfer function, defined by a set of filter coefficients, to the input. They are widely used in DSP applications such as image processing, filtering, decimation and interpolation. The general purpose of an FIR filter is to modify the characteristics of a signal in the time and frequency domains. The basic concepts of FPGA-based FIR filters have been covered in the literature, where FIR filters are reported to be implemented in both systolic and non-systolic architectures. The core elements of any FIR filter are adders, multipliers and delay elements, and adders are fundamental components in many other applications as well. For implementing a high-speed FIR filter, parallel prefix adders are more suitable than ripple carry, carry save, carry select and carry look-ahead adders (Uma et al., 2012).
High-speed parallel prefix adders are the preferred option whenever a high-speed circuit is required. In the literature, parallel prefix adders trade off the number of logic levels, fan-out and wiring tracks. Conditional-sum addition (Sklansky, 1960) is a fast addition scheme offering a logarithmic speed-up. It has the minimum logic depth (log2 N) and needs the fewest routing tracks, but its large fan-out adversely affects both area and circuit speed. The Kogge-Stone adder (Kogge and Stone, 1973) is the fastest prefix tree: the time complexity of its carry computation is O(log n). It is the most commonly used parallel prefix topology. Its main features are uniform fan-out and a regular structure with minimum logic depth.
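The logarithmic carry computation of the Kogge-Stone topology can be illustrated with a small bit-level software model. This is only a sketch in Python, not the Verilog used in this study; the function name `kogge_stone_add` and the list-based representation of the generate/propagate signals are assumptions made for illustration.

```python
def kogge_stone_add(a, b, n=8):
    """Add two n-bit unsigned integers (carry-in 0) with a Kogge-Stone prefix network.

    Returns (sum including the carry-out bit, number of prefix levels used).
    """
    # Bit-level generate and propagate signals, index 0 = least significant bit.
    g = [(a >> j) & (b >> j) & 1 for j in range(n)]
    p = [((a >> j) ^ (b >> j)) & 1 for j in range(n)]

    G, P = g[:], p[:]
    levels = 0
    d = 1
    while d < n:
        # At each level every position j >= d merges with the group ending at j - d.
        G_next, P_next = G[:], P[:]
        for j in range(d, n):
            G_next[j] = G[j] | (P[j] & G[j - d])
            P_next[j] = P[j] & P[j - d]
        G, P = G_next, P_next
        d *= 2
        levels += 1

    # G[j] is now the carry out of bit j; sum bit j uses the carry into bit j.
    carry_in = [0] + G[:-1]
    total = sum((p[j] ^ carry_in[j]) << j for j in range(n))
    return total | (G[n - 1] << n), levels
```

For example, `kogge_stone_add(200, 100)` returns `(300, 3)`: the 8-bit addition completes in log2 8 = 3 prefix levels.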
The Brent-Kung adder (Richard, 1982) is a parallel-prefix form of the carry look-ahead adder. It has a high logic depth, but it is considered one of the better tree adders for minimizing wiring tracks, fan-out and gate count, and it has the minimum possible number of nodes. The Ladner-Fischer (Richard and Fischer, 1980) prefix structure requires less implementation area but has comparatively unbounded fan-out. Han and Carlson (1987) implemented a hybrid adder derived from the Brent-Kung and Kogge-Stone algorithms; this type of adder gives a good balance between logic depth and fan-out. The Knowles (2001) adder family lies between the Kogge-Stone and Sklansky adders. Efficient carry look-ahead adder architectures based on parallel-prefix computation with a triple-carry operator have also been presented (Patel and Boussakta, 2007; Sabyasachi and Khatri, 2008).
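The depth versus node-count trade-off of the Brent-Kung structure can be made concrete by building the network in software and counting its prefix cells. The sketch below is a Python model written for illustration only; the names `brent_kung_prefix` and `brent_kung_add` are hypothetical, and the operand width is assumed to be a power of two.

```python
def brent_kung_prefix(g, p):
    """Run a Brent-Kung prefix network over per-bit (g, p) lists (index 0 = LSB).

    len(g) must be a power of two. Returns (carries, node_count, level_count),
    where carries[j] is the carry out of bit j for a carry-in of 0.
    """
    n = len(g)
    G, P = list(g), list(p)
    nodes = levels = 0

    # Up-sweep: build groups of width 2, 4, 8, ... ending at every 2d-th bit.
    d = 1
    while d < n:
        for j in range(2 * d - 1, n, 2 * d):
            G[j] |= P[j] & G[j - d]
            P[j] &= P[j - d]
            nodes += 1
        levels += 1
        d *= 2

    # Down-sweep: fill in the remaining positions from the completed groups.
    d = n // 4
    while d >= 1:
        for j in range(3 * d - 1, n, 2 * d):
            G[j] |= P[j] & G[j - d]
            P[j] &= P[j - d]
            nodes += 1
        levels += 1
        d //= 2

    return G, nodes, levels


def brent_kung_add(a, b, n=8):
    """Add two n-bit unsigned integers using the prefix network above."""
    g = [(a >> j) & (b >> j) & 1 for j in range(n)]
    p = [((a >> j) ^ (b >> j)) & 1 for j in range(n)]
    carries, _, _ = brent_kung_prefix(g, p)
    carry_in = [0] + carries[:-1]
    total = sum((p[j] ^ carry_in[j]) << j for j in range(n))
    return total | (carries[-1] << n)
```

For 8 bits this model uses 11 prefix cells in 5 levels, compared with the minimum possible depth of log2 8 = 3 levels; that is the high-depth, low-node-count behaviour described above.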
The next fundamental module of an FIR filter is the high-speed multiplier. Among the different classes of multipliers, the array multiplier (Sheplie, 2004; Ching, 2005) is the simplest in terms of area and power consumption. Its delay is high, however, because the topology uses full-adder structures that form carry chains across the stages of the multiplier. This disadvantage can be overcome by using carry-save adders arranged as a Wallace tree (Wallace, 1964). The Wallace tree multiplier reduces the ripple-carry delay in the internal adder circuits, thereby reducing the propagation delay of the circuit. Baugh and Wooly (1973) present a modified Booth algorithm with a radix-4 cellular array modular multiplier circuit. This type of multiplier reduces the number of iterations by using a pipeline structure and a direct radix-2 implementation of Montgomery.
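The way carry-save addition avoids carry chains can be sketched at the word level. The Python model below is a simplification, not a gate-accurate Wallace tree: it reduces three partial products to two with no carry propagation and leaves a single carry-propagating addition for the end. The function names are hypothetical.

```python
def carry_save_add(x, y, z):
    """3:2 compression: return (sum_word, carry_word) with x + y + z == sum + carry.

    Every bit position is computed independently, so no carry ripples.
    """
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word


def csa_multiply(a, b, n=8):
    """Multiply a by an n-bit b by reducing partial products with carry-save adders."""
    # One shifted copy of a for every set bit of b.
    rows = [a << i for i in range(n) if (b >> i) & 1]
    while len(rows) > 2:
        reduced = []
        i = 0
        while i + 3 <= len(rows):
            s, c = carry_save_add(rows[i], rows[i + 1], rows[i + 2])
            reduced += [s, c]
            i += 3
        reduced += rows[i:]  # rows that did not fit into a group of three
        rows = reduced
    # Only this final addition needs a carry-propagating adder.
    return sum(rows)
```

In hardware, the final two-operand addition is where a fast adder such as a parallel prefix adder would be used.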
This study describes the design of a high-speed FIR filter using a new parallel prefix adder and a factorized multiplier to meet the constraints of high-speed performance and low power consumption.
MATHEMATICAL MODELING OF PARALLEL PREFIX ADDER
This section presents the implementation and simulation results of the proposed delay-area-efficient PPA and of the multiplier circuit designed using the factorization method.
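Parallel prefix adders allow a more efficient implementation of the carry look-ahead technique; they are essentially two-level carry look-ahead adders. Addition in a PPA is usually expressed in terms of the carry generation signal $g_j$, carry propagation signal $p_j$, carry signal $c_j$ and sum signal $s_j$ at each bit position ($1 \le j \le n$). For operand bits $a_j$ and $b_j$:

$$g_j = a_j b_j, \qquad p_j = a_j \oplus b_j, \qquad c_j = g_j + p_j c_{j-1}, \qquad s_j = p_j \oplus c_{j-1}$$

The extended carry generation and propagation signals over consecutive bits are computed as:

$$G_{[i:k]} = G_{[i:j]} + P_{[i:j]}\, G_{[j-1:k]}, \qquad P_{[i:k]} = P_{[i:j]}\, P_{[j-1:k]}, \qquad i \ge j > k$$

with $G_{[j:j]} = g_j$ and $P_{[j:j]} = p_j$.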
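The group generate/propagate computation can be checked with a short software model. The sketch below assumes the standard definitions g_j = a_j AND b_j and p_j = a_j XOR b_j and the usual prefix operator (G, P) o (G', P') = (G + P G', P P'); the function names are hypothetical. Because the operator is associative, the groups may be merged in any order, which is what allows tree-shaped prefix networks.

```python
from itertools import product


def bit_signals(a, b, n):
    """Return per-bit generate and propagate lists (index 0 = LSB)."""
    g = [(a >> j) & (b >> j) & 1 for j in range(n)]
    p = [((a >> j) ^ (b >> j)) & 1 for j in range(n)]
    return g, p


def combine(left, right):
    """Prefix operator: merge a more significant group with the adjacent
    less significant group, each given as a (G, P) pair."""
    g_left, p_left = left
    g_right, p_right = right
    return (g_left | (p_left & g_right), p_left & p_right)


def add_serial_prefix(a, b, n):
    """Add two n-bit numbers by applying the prefix operator bit by bit."""
    g, p = bit_signals(a, b, n)
    acc = (g[0], p[0])
    carries = [acc[0]]  # carry out of bit 0
    for j in range(1, n):
        acc = combine((g[j], p[j]), acc)
        carries.append(acc[0])
    carry_in = [0] + carries[:-1]
    total = sum((p[j] ^ carry_in[j]) << j for j in range(n))
    return total | (carries[-1] << n)


def is_associative():
    """Exhaustively check associativity of combine over all (G, P) values."""
    values = list(product((0, 1), repeat=2))
    return all(
        combine(combine(x, y), z) == combine(x, combine(y, z))
        for x in values for y in values for z in values
    )
```

The serial evaluation above has linear depth; parallel prefix adders compute the same group signals in a tree, and differ only in how the merges are arranged.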
Figure 1 shows the two basic components: the g-p generator (G, P) and the g generator (G), denoted by black and grey cells respectively.
The proposed adder is a hybrid obtained by combining the Ladner-Fischer and Han-Carlson adders. The first two stages follow the Han-Carlson topology and the remaining stages follow the Ladner-Fischer topology. These networks have low gate counts. The design concentrates on a parallel prefix network of minimal depth. The main limitation of the Ladner-Fischer adder is that the lateral fan-out of the prefix cells doubles at every level. Because this drawback can adversely affect performance, additional buffers are used. It can be eliminated through the near-minimum-depth prefix algorithm of (Richard and Fischer, 1980). The proposed adder adopts the best characteristics of both adders. Its depth is of the order of log N and the numbers of nodes are N/2-1 and N/4. The total number of computational nodes is N-1+(log2 N-2)N/4. This adder gives the most effective performance in terms of delay and area. The proposed adder for 8, 16, 32 and 64 bits is depicted in Fig. 2.

The performance of the proposed PPA and multiplier is evaluated for bit sizes of 8, 16, 32 and 64. The adders were implemented on an XC3S1600EFG320 target device on the Spartan-3E platform (90 nm process) using the Integrated Synthesis Environment (ISE). Each adder was modeled in Verilog HDL using structural data-flow modeling, with the optimization target set to speed. Table 1 presents the simulated results of the existing adders and the proposed adder in terms of combinational path delay and total slice utilization.
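The node-count expression quoted above, N-1+(log2 N-2)N/4, can be evaluated for the four bit widths considered in this study. The short Python helper below is illustrative only; the function name is hypothetical and the width is assumed to be a power of two.

```python
def proposed_node_count(n):
    """Evaluate N - 1 + (log2 N - 2) * N / 4 for a power-of-two width N >= 4."""
    if n < 4 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 4")
    log2_n = n.bit_length() - 1  # exact log2 for powers of two
    return n - 1 + (log2_n - 2) * n // 4


# Widths evaluated in the study.
node_counts = {n: proposed_node_count(n) for n in (8, 16, 32, 64)}
```

This gives 9, 23, 55 and 127 computational nodes for 8, 16, 32 and 64 bits respectively.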
PROPOSED MULTIPLIER CIRCUIT USING FACTORIZATION METHOD
The multiplier circuit is implemented using the factoring method, in which numbers are multiplied by factoring one of them into smaller parts. For example, 41×99 = 41×(100-1) = 4100-41 = 4059. For the binary multiplication of A and B, A is factored into smaller numbers. In designing an 8-bit multiplier cell, it is easiest to factor the number in terms of 8, which simplifies the computation by reducing it to shift operations. The factorization method also reduces the number of adders used in the proposed multiplier cell. The pseudo code for this multiplier cell is presented in Table 2 and its simulation result is shown in Fig. 3. Table 3 compares the array, Wallace and Booth multipliers with the proposed work. The simulation results show that the proposed multiplier cell uses fewer slices and has a shorter critical path delay than the existing multipliers.

Table 2: Pseudo code of the proposed multiplier cell
Begin
  If (A <= 2^N)
    tQ ← 2^N - A
    tR ← {B, 000}   //Concatenate the value of B with 3 zeros
    tC ← B * tQ
    P = tR - tC
  else
    tQ ← A - 2^N
    tR ← {B, 000}   //Concatenate the value of B with 3 zeros
    tC ← B * tQ
    P = tR + tC
  End If
End
[Table 1: Combinational path delay and total slice utilization of the Sklansky, Brent-Kung, Kogge-Stone, Riyaz, Sabyasachi Das and proposed adders]
For example, consider the design of an 8-bit multiplier circuit:

A = 0110, B = 0101, for (A <= 1000):
  0101 * (1000 - 0010) = 0101*1000 - 0101*0010

A = 1100, B = 1010, for (A > 1000):
  1010 * (1000 + 0100) = 1010*1000 + 1010*0100
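The factoring scheme illustrated by these two examples can be modeled in a few lines. The sketch below is a Python model for illustration, not the Verilog multiplier cell of this study; it assumes a factoring base of 2^3 = 8, and the helper `shift_add` (a hypothetical name) stands in for the smaller product B * tQ using shifts and adds only.

```python
def shift_add(b, q):
    """Multiply b by a small non-negative q using shifts and adds only."""
    acc, shift = 0, 0
    while q:
        if q & 1:
            acc += b << shift
        q >>= 1
        shift += 1
    return acc


def factorized_multiply(a, b, n=3):
    """Multiply a by b by factoring a around 2**n (8 when n = 3)."""
    base = 1 << n
    t_r = b << n  # tR = {B, 000}: B shifted left by n bits
    if a <= base:
        t_q = base - a  # a = 2**n - tQ
        return t_r - shift_add(b, t_q)
    t_q = a - base  # a = 2**n + tQ
    return t_r + shift_add(b, t_q)
```

With the operands above, `factorized_multiply(0b0110, 0b0101)` evaluates 40 - 10 = 30 and `factorized_multiply(0b1100, 0b1010)` evaluates 80 + 40 = 120.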
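IMPLEMENTATION OF FIR FILTER
FIR filter design using various parallel prefix adders: The output of an N-tap FIR filter with coefficients h(k) is described by:

$$y(n) = \sum_{k=0}^{N-1} h(k)\, x(n-k)$$

where x(n) is the input sequence. The filter's Z transform is:

$$H(z) = \sum_{k=0}^{N-1} h(k)\, z^{-k}$$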
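A direct-form evaluation of the N-tap FIR output can be sketched as follows, assuming the standard convolution sum y(n) = sum over k of h(k) x(n-k). This is a behavioural Python model in which a list plays the role of the delay line; the function name is hypothetical and the model says nothing about the hardware timing.

```python
def fir_filter(x, h):
    """Direct-form FIR filter: y(n) = sum_k h(k) * x(n - k), zero initial state."""
    delay_line = [0] * len(h)  # delay_line[k] holds x(n - k)
    y = []
    for sample in x:
        # Shift the delay line by one position and insert the new sample.
        delay_line = [sample] + delay_line[:-1]
        # One multiply-accumulate per tap.
        y.append(sum(hk * xk for hk, xk in zip(h, delay_line)))
    return y
```

Feeding a unit impulse returns the coefficients themselves: `fir_filter([1, 0, 0, 0, 0], [1, 2, 3, 4])` gives `[1, 2, 3, 4, 0]`.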
Different filter designs have been reported in the literature (Wei et al., 2008; Jiang and Bao, 2010). The basic architecture of a linear FIR filter is shown in Fig. 4. The structure consists of adders, multipliers, filter coefficients (h_0, h_1, ..., h_{n-1}) and delay elements. Each delay element is implemented as a D flip-flop, and the filter coefficients are stored in ROM. A processing element is implemented with an adder and a multiplier, and the processing element block is instantiated as many times as required. The filter coefficients are computed with a Hamming window using a MATLAB program. A coefficient may be a signed fractional number, which must be converted into unsigned binary and scaled to a coefficient width of 5 before it is stored in ROM. The top-level block of the 4-tap FIR filter is shown in Fig. 5.
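The conversion of fractional coefficients into 5-bit unsigned ROM words can be sketched as below. The exact scaling rule is not given here, so the sketch makes loud assumptions: coefficients are non-negative, they are scaled by 2^5 = 32, rounded to the nearest integer and clamped to the 5-bit range. The function names and the example coefficient values are hypothetical, not taken from this study.

```python
def quantize_coefficients(coeffs, width=5):
    """Scale fractional coefficients to unsigned integers of the given bit width.

    Assumption: scale by 2**width, round half up, clamp to 2**width - 1.
    """
    scale = 1 << width
    max_code = scale - 1
    codes = []
    for c in coeffs:
        if c < 0:
            raise ValueError("this sketch assumes non-negative coefficients")
        codes.append(min(int(c * scale + 0.5), max_code))
    return codes


def rom_word(codes, address, width=5):
    """Return the binary string stored at a ROM address."""
    return format(codes[address], "0{}b".format(width))


# Hypothetical coefficient values, chosen only to illustrate the scaling.
example_codes = quantize_coefficients([0.0625, 0.4375, 0.4375, 0.0625])
```

Here `example_codes` is `[2, 14, 14, 2]` and `rom_word(example_codes, 1)` is `"01110"`.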
RESULTS AND DISCUSSION
The proposed 4-tap FIR filter is simulated with different parallel prefix adders using Xilinx ISE 12.1, with constraints set for speed and area optimization. The filter coefficients are obtained with the Hamming windowing technique for the following specification: filter order 4; window type Hamming; sampling frequency fs = 42000; cutoff frequency fc = 10600; input bit size 32 bits. The proposed FIR filter using the new PPA and factorized multiplier has been prototyped on an XC3S1600EFG320 device on the Spartan-3E platform (90 nm process) using the Integrated Synthesis Environment (ISE). The simulation and synthesis reports are presented in Fig. 6 and 7. The synthesis report shows that the longest combinational path between any two registers goes through just one processing element. The synthesis tool reports a minimum clock period of 2.256 ns, which allows the filter to run at 300 MHz with the proposed adder. Table 4 compares the filter designs using various parallel prefix adders with the design using the proposed adder and multiplier.
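The Hamming-window coefficient calculation for this specification can be approximated with a windowed-sinc design. The sketch below is in Python rather than MATLAB and is not the program used in this study; it assumes 4 taps, a low-pass response and normalization to unity gain at DC, and the function name is hypothetical.

```python
import math


def hamming_lowpass(num_taps, fc, fs):
    """Windowed-sinc low-pass design with a Hamming window (num_taps >= 2).

    Assumption: coefficients are normalized so that they sum to 1.
    """
    m = num_taps - 1
    wc = 2.0 * fc / fs  # cutoff normalized to the Nyquist frequency
    taps = []
    for n in range(num_taps):
        x = wc * (n - m / 2.0)
        # Ideal low-pass impulse response: wc * sinc(x).
        ideal = wc if x == 0 else wc * math.sin(math.pi * x) / (math.pi * x)
        # Hamming window: 0.54 - 0.46 cos(2 pi n / M).
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / m)
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


coefficients = hamming_lowpass(4, 10600, 42000)
```

The resulting coefficients are symmetric (linear phase) and positive, so under these assumptions they can be scaled to unsigned ROM words without a sign bit.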
CONCLUSION
This study presented the implementation of a highly efficient FIR filter design with a delay-area-efficient parallel prefix adder and a multiplier circuit based on the factoring method with a minimal-depth algorithm; their functional characteristics were compared with existing architectures in terms of delay and area. The performance of the proposed PPA and multiplier was evaluated for bit sizes of 8, 16, 32 and 64. The proposed FIR filter using the new PPA and factorized multiplier has been prototyped on an XC3S1600EFG320 device in the Spartan-3E family (90 nm process). The FIR filter design with the proposed structure achieves approximately 14% area reduction and 34% delay reduction compared with designs using other existing parallel prefix adders.
