The objective of this paper is to reduce the hardware complexity of higher-order FIR filters with symmetric coefficients. The aim is to design efficient Fast Finite-Impulse Response (FIR) Algorithms (FFAs) for parallel FIR filter structures, under the constraint that the number of filter taps is a multiple of 2. The L = 4 parallel implementation is discussed briefly. The parallel FIR filter structure based on the proposed FFA technique has been implemented with carry-save and ripple-carry adders for further optimization. Silicon area is reduced by replacing bulky multipliers with adders, namely ripple-carry and carry-save adders. For example, for a six-parallel 1024-tap filter, the proposed structure saves 14 multipliers at the expense of 10 adders, whereas for a six-parallel 512-tap filter, the proposed structure saves 108 multipliers at the expense of 10 adders. Overall, the proposed parallel FIR structures can yield significant hardware savings for symmetric coefficients over the existing FFA parallel FIR filter, especially when the filter is very long.
Introduction
A significant area of research in VLSI system design is area-efficient, high-speed data path logic. Digital filters are among the most widely used fundamental blocks in DSP systems, ranging from multimedia signal processing to wireless mobile communications. FIR filters are used in high-frequency applications such as multimedia signal processing, whereas other applications, such as Multiple Input Multiple Output (MIMO) systems used in cellular wireless communications, require high throughput with a low-power circuit. The transition bandwidth of the filter determines its order: as the transition band becomes narrower, the order of the FIR filter increases.
A higher-order digital filter is used in video ghost cancellers for broadcast television, where it reduces the effect of multipath signal echoes. Hence, an optimized parallel Fast FIR Algorithm that restrains hardware complexity is required when the order of the filter becomes very large. In the proposed FFA, power consumption is reduced by applying the pipelining and parallel processing concepts used in VLSI DSP applications. Pipelining shortens the critical path by inserting pipelining latches along the data path, whereas parallel processing is a powerful technique because it can be used either to increase the throughput of an FIR filter or to reduce its power consumption. Parallel processing increases the sampling rate by replicating hardware so that multiple inputs are processed in parallel and multiple outputs are generated simultaneously. As the parallel block size L increases, hardware complexity and cost also increase, which limits practical implementation. Previous work aims at reducing the hardware complexity of the parallel FIR filter [1] - [7]. The proposed FFA structure overcomes the constraint that the hardware implementation cost of a parallel FIR filter increases linearly with block size L. Fast FIR algorithms (FFAs) introduced in [1] - [4] use approximately (2L-1) sub-filter blocks, each of length N/L, to implement an L-parallel filter. Area complexity is thus optimized by reducing the number of bulky multipliers from L x N to (2N - N/L) using the FFA technique.
The Iterated Short Convolution (ISC) based linear convolution structure is transposed to obtain a new hardware-efficient FFA filter structure, which saves a large amount of hardware cost, especially when the FIR filter is long [7]. Small-sized filtering structures are constructed [6] - [10] based on fast linear convolution, and a long convolution is then decomposed into several short convolutions, i.e., larger block-sized filtering structures can be constructed through iterations of the small-sized filtering structures.
However, in both categories, the symmetry of the filter coefficients has not yet been taken into consideration in the design of the structures, although exploiting it can lead to significant savings in hardware complexity and cost. In this paper, we provide a new parallel FIR filter structure based on FFA that uses advantageous polyphase decompositions. Compared to the existing FFA fast parallel FIR filter structure, it reduces the number of bulky multiplications in the sub-filter section by exploiting the inherent symmetry of the coefficients. This paper is organized as follows. A brief introduction to existing FFAs is given in Section II. In Section III, the proposed parallel FIR filter structures are presented. Section IV investigates the complexity and comparisons. In Section V, the hardware implementation and the experimental results are described. Finally, Section VI presents the conclusions.
Fast FIR Algorithm (FFA)
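In general, the output of an N-tap FIR filter can be expressed as
$$y(n) = \sum_{i=0}^{N-1} h(i)\, x(n-i), \qquad n = 0, 1, 2, \ldots, \infty \tag{1}$$
where {x(n)} is an infinite-length input sequence and {h(i)} are the length-N FIR filter coefficients. Then, the traditional L-parallel FIR filter can be derived using polyphase decomposition as
$$\sum_{p=0}^{L-1} Y_p(z^L)\, z^{-p} = \sum_{q=0}^{L-1} X_q(z^L)\, z^{-q} \sum_{r=0}^{L-1} H_r(z^L)\, z^{-r} \tag{2}$$
where $X_q = \sum_{k} z^{-k} x(Lk+q)$, $H_r = \sum_{k} z^{-k} h(Lk+r)$ and $Y_p = \sum_{k} z^{-k} y(Lk+p)$ for p, q, r = 0, 1, ..., L-1.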
A. Existing 2 x 2 FFA Technique (L = 2 parallel)
From (2), a two-parallel filter can be mathematically expressed as [1] - [3] .

Equations (3) and (4) describe the traditional two-parallel filter structure, which requires four length-N/2 FIR sub-filter blocks, two post processing adders, and in total 2N multipliers and (2N - 2) adders. However, (4) can be rewritten in the reduced form (5).
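For reference, the two-parallel relations discussed here are commonly written as follows in the FFA literature. The notation (polyphase components H_0, H_1, X_0, X_1, Y_0, Y_1 in z^2) is an assumption, and the grouping of terms in the original (3)-(5) may differ; the product counts (four versus three) match the sub-filter counts quoted in this section.

```latex
\begin{align}
% traditional form: four sub-filter products
Y_0 + z^{-1} Y_1 &= (H_0 + z^{-1} H_1)(X_0 + z^{-1} X_1) \\
Y_0 &= H_0 X_0 + z^{-2} H_1 X_1, \qquad Y_1 = H_0 X_1 + H_1 X_0 \\
% reduced form: three sub-filter products, one pre- and three post-processing adders
Y_0 &= H_0 X_0 + z^{-2} H_1 X_1, \qquad
Y_1 = (H_0 + H_1)(X_0 + X_1) - H_0 X_0 - H_1 X_1
\end{align}
```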
The implementation of (5) requires three FIR sub-filter blocks of length N/2, one preprocessing and three post processing adders, 3N/2 multipliers, and 3(N/2 - 1) + 4 adders.
B. Existing 3 x 3 FFA Technique (L = 3 parallel)
By a similar approach, the three-parallel FIR filter using FFA can be expressed as in (7) [2].
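For reference, a three-parallel FFA is commonly written in the form below. The polyphase notation (H_0..H_2, X_0..X_2, Y_0..Y_2 in z^3) is assumed here, and the exact grouping used in (7) may differ; the six distinct products and three input pre-additions agree with the counts given for (7).

```latex
\begin{align}
Y_0 &= H_0 X_0 - z^{-3} H_2 X_2
       + z^{-3}\big[(H_1 + H_2)(X_1 + X_2) - H_1 X_1\big] \\
Y_1 &= \big[(H_0 + H_1)(X_0 + X_1) - H_1 X_1\big]
       - \big(H_0 X_0 - z^{-3} H_2 X_2\big) \\
Y_2 &= (H_0 + H_1 + H_2)(X_0 + X_1 + X_2)
       - \big[(H_0 + H_1)(X_0 + X_1) - H_1 X_1\big]
       - \big[(H_1 + H_2)(X_1 + X_2) - H_1 X_1\big]
\end{align}
```

Expanding the brackets gives Y_0 = H_0X_0 + z^{-3}(H_1X_2 + H_2X_1), Y_1 = H_0X_1 + H_1X_0 + z^{-3}H_2X_2 and Y_2 = H_0X_2 + H_1X_1 + H_2X_0, which is the direct three-parallel polyphase result.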
The hardware implementation of (7) requires six length-N/3 FIR sub-filter blocks, three preprocessing and seven post processing adders, 2N multipliers, and 2N + 4 adders, which reduces the hardware cost by approximately one third relative to the traditional three-parallel filter. The implementation obtained from (7) is shown in Fig. 2. The three-parallel filter can be expressed in matrix form as
C. Conventional 4 x 4 Parallel FFA FIR Filter
According to (2), a four-parallel filter can be expressed as in (8) [2].
The hardware implementation of (8) requires nine length-N/4 FIR sub-filter blocks, three preprocessing and fifteen post processing adders, 9N/4 multipliers, and (9N / 4) + 9 adders, which reduces the hardware complexity by approximately 20 %. The hardware description of the existing four-parallel FIR filter using the FFA is shown in Fig. 3.
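The reduction from four sub-filters to three in the two-parallel FFA of subsection A can be checked numerically. The following is a minimal Python sketch, assuming the usual polyphase split into even- and odd-indexed samples and the standard three-product form Y0 = H0 X0 + z^-2 H1 X1, Y1 = (H0 + H1)(X0 + X1) - H0 X0 - H1 X1. The helper names `convolve` and `ffa2` are illustrative and do not come from the paper.

```python
def convolve(a, b):
    """Linear convolution (polynomial product) of two sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def ffa2(h, x):
    """Two-parallel FFA sketch; assumes len(h) and len(x) are both even.

    Returns the even-indexed and odd-indexed output samples, computed
    with three half-length sub-filters instead of four.
    """
    h0, h1 = h[0::2], h[1::2]          # polyphase components of the filter
    x0, x1 = x[0::2], x[1::2]          # polyphase components of the input
    p0 = convolve(h0, x0)              # H0 X0
    p1 = convolve(h1, x1)              # H1 X1
    hs = [a + b for a, b in zip(h0, h1)]   # H0 + H1 (can be precomputed)
    xs = [a + b for a, b in zip(x0, x1)]   # X0 + X1 (one preprocessing adder)
    p2 = convolve(hs, xs)              # (H0 + H1)(X0 + X1)
    # z^-2 at the input rate is a one-sample delay at the block rate.
    y0 = [a + b for a, b in zip(p0 + [0], [0] + p1)]
    y1 = [c - a - b for a, b, c in zip(p0, p1, p2)]
    return y0, y1


h = [1, 2, 3, 4, 4, 3, 2, 1]
x = [5, -1, 2, 0, 3, 7]
y = convolve(h, x)                     # direct-form reference
y0, y1 = ffa2(h, x)
print(y0 == y[0::2], y1 == y[1::2])
```

Each call to `convolve` inside `ffa2` stands for one length-N/2 sub-filter block, so the sketch uses 3N/2 coefficient multiplications per pair of outputs rather than 2N.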
Proposed FFA structures for symmetric convolutions
The main objective of the proposed structures is to obtain as many sub-filter blocks with symmetric coefficients as possible, so that half the multiplications in each such sub-filter block can be reused for all of its taps. This mirrors the fact that a single FIR filter with a set of odd or even symmetric coefficients requires only half the filter length in multiplications. Therefore, for an N-tap L-parallel FIR filter, the total number of saved multipliers is the number of sub-filter blocks that contain symmetric coefficients times half the number of multiplications in a single sub-filter block, N/(2L) [11].
A. Proposed FFA for L=2 parallel FIR
From (4), a two-parallel FIR filter can also be written as (9).
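One way to write a two-parallel filter in which both H_0 + H_1 and H_0 - H_1 appear as sub-filters is sketched below. This is a reconstruction under the usual polyphase notation, which is an assumption here, and the exact grouping in (9) may differ.

```latex
\begin{align}
Y_0 &= \Big\{ \tfrac{1}{2}\big[(H_0 + H_1)(X_0 + X_1) + (H_0 - H_1)(X_0 - X_1)\big]
        - H_1 X_1 \Big\} + z^{-2} H_1 X_1 \\
Y_1 &= \tfrac{1}{2}\big[(H_0 + H_1)(X_0 + X_1) - (H_0 - H_1)(X_0 - X_1)\big]
\end{align}
```

Expanding the two products gives (H_0 + H_1)(X_0 + X_1) + (H_0 - H_1)(X_0 - X_1) = 2(H_0X_0 + H_1X_1) and (H_0 + H_1)(X_0 + X_1) - (H_0 - H_1)(X_0 - X_1) = 2(H_0X_1 + H_1X_0), so the expressions reduce to Y_0 = H_0X_0 + z^{-2}H_1X_1 and Y_1 = H_0X_1 + H_1X_0.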
Fig. 4: Proposed FIR filter for two-parallel implementation
For a set of even symmetric coefficients, (9) yields one more sub-filter block containing symmetric coefficients than (5), the existing FFA parallel FIR filter. Fig. 4 shows the implementation of the proposed two-parallel FIR filter based on (9). An example is given here for a clearer perspective.
Example: Consider an N-tap FIR filter with a set of symmetric coefficients {h(0), h(1), h(2), ..., h(N-1)}, where h(k) = h(N-1-k), i.e., h(0) = h(N-1), h(1) = h(N-2), and so on up to the middle pair h(N/2-1) = h(N/2). Applying these coefficients to the proposed two-parallel FIR filter structure [11], the top two sub-filter blocks will be given as
As observed from the above example, two of the three sub-filter blocks in the proposed two-parallel FIR filter structure (9), H_0 - H_1 and H_0 + H_1, have symmetric coefficients, which means that each of these sub-filter blocks can be realized as in Fig. 4 with only half the number of multipliers. Each multiplier output serves two taps. Note that the transposed direct-form FIR filter is employed. Compared to the existing FFA two-parallel FIR filter structure, the proposed FFA structure yields one more sub-filter block with symmetric coefficients. However, this comes at the price of more adders in the preprocessing and post processing blocks; two additional adders are required for L = 2.
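The symmetry property of the two sub-filters can be illustrated with a short Python sketch. It assumes an even-length symmetric filter, the even/odd polyphase split, and a decomposition built from the products (H0 + H1)(X0 + X1), (H0 - H1)(X0 - X1) and H1 X1; the function names are illustrative only. Here H0 + H1 comes out symmetric and H0 - H1 antisymmetric, and both allow coefficient pairs to share one multiplier.

```python
def convolve(a, b):
    """Linear convolution (polynomial product) of two sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def proposed_ffa2(h, x):
    """Two-parallel sketch using sum and difference sub-filters.

    Assumes len(h) and len(x) are even. Returns the even and odd output
    phases together with the sum and difference coefficient sets.
    """
    h0, h1 = h[0::2], h[1::2]
    x0, x1 = x[0::2], x[1::2]
    h_sum = [a + b for a, b in zip(h0, h1)]     # H0 + H1
    h_diff = [a - b for a, b in zip(h0, h1)]    # H0 - H1
    x_sum = [a + b for a, b in zip(x0, x1)]     # X0 + X1
    x_diff = [a - b for a, b in zip(x0, x1)]    # X0 - X1
    s = convolve(h_sum, x_sum)
    d = convolve(h_diff, x_diff)
    p1 = convolve(h1, x1)                       # H1 X1
    # (s + d) / 2 equals H0 X0 + H1 X1, so subtracting H1 X1 leaves H0 X0.
    h0x0 = [(a + b) / 2 - c for a, b, c in zip(s, d, p1)]
    y0 = [a + b for a, b in zip(h0x0 + [0], [0] + p1)]
    y1 = [(a - b) / 2 for a, b in zip(s, d)]    # H0 X1 + H1 X0
    return y0, y1, h_sum, h_diff


h = [1, 2, 3, 4, 4, 3, 2, 1]                    # h(k) = h(N-1-k)
x = [5, -1, 2, 0, 3, 7]
y0, y1, h_sum, h_diff = proposed_ffa2(h, x)
y = convolve(h, x)
print(y0 == y[0::2], y1 == y[1::2])
print(h_sum, h_diff)
```

In hardware the factor 1/2 would typically be absorbed into the stored coefficients rather than computed at run time; the sketch applies it explicitly only to keep the algebra visible.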
B. 3 x 3 Proposed FIR structure using FFA (L=3)
With a similar approach, from (7), a three-parallel FIR filter can also be written as (12). Fig. 5 shows the implementation of the proposed three-parallel FIR filter. When the number of symmetric coefficients N is a multiple of 3, the proposed three-parallel FIR filter structure presented in (12) provides four sub-filter blocks with symmetric coefficients in total, whereas the existing FFA parallel FIR filter structure has only two out of six sub-filter blocks. Therefore, for an N-tap three-parallel FIR filter, the proposed structure saves N/3 multipliers relative to the existing FFA structure. However, the proposed three-parallel FIR structure also brings an overhead of seven additional adders in the preprocessing and post processing blocks.
(13)
Symmetric Convolutions Based Proposed Cascaded FFA Structures
The proposed parallel FIR structure enables the reuse of multipliers in parts of the sub-filter blocks but it also brings more adder cost in preprocessing and post processing blocks. When cascading the proposed FFA parallel FIR structures for larger parallel block factor L the increase of adders can become larger.
Therefore, rather than applying the proposed FFA FIR filter structure to all the decomposed sub-filter blocks, the existing FFA structures, which have more compact preprocessing and post processing blocks, are employed for those sub-filter blocks that contain no symmetric coefficients, whereas the proposed FIR filter structures are applied to the remaining sub-filter blocks with symmetric coefficients.
For example, an (m-by-m) FFA can be cascaded with an (n-by-n) FFA to produce an (m x n)-parallel filtering structure. The set of FIR filters that result from the application of the (m-by-m) FFA are further decomposed, one at a time, by the application of the (n-by-n) FFA. The resulting set of filters will be of length N / (m x n). When cascading the FFAs, it is important to keep track of both the number of multipliers and the number of adders required for the filtering structure. The number of required multipliers is calculated as shown in equation (14).
$$M = \frac{N}{\prod_{i=1}^{r} L_i} \prod_{i=1}^{r} M_i \tag{14}$$
where r is the number of FFAs used, L_i is the block size of the FFA at step i, M_i is the number of filters that result from the application of the i-th FFA, and N is the length of the filter. The number of required adders is calculated as
$$A = \sum_{i=1}^{r} A_i \left[\prod_{j=i+1}^{r} L_j\right] \left[\prod_{k=1}^{i-1} M_k\right] + \left(\prod_{i=1}^{r} M_i\right) \left(\frac{N}{\prod_{i=1}^{r} L_i} - 1\right) \tag{15}$$
where A_i is the number of pre/post-processing adders required by the i-th FFA. Consider the case of cascading two (2-by-2) FFAs. The resulting 4-parallel filtering structure would require a total of 9N/4 multipliers and 20 + 9 (N/4 - 1) adders for implementation.
The reduced complexity 4-parallel filtering structure represents a hardware (area) savings of nearly 44 % when compared to 4N multipliers required in the traditional 4-parallel FIR filtering structure.
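The bookkeeping for cascaded FFAs can be sketched in Python. The closed forms used below are the ones commonly given for cascaded FFAs and are an assumption here: multipliers = (N / prod L_i) * prod M_i, and adders = sum over stages of A_i * (product of later L_j) * (product of earlier M_k), plus prod M_i * (N / prod L_i - 1). The stage tuples `FFA2` and `FFA3` take their sub-filter and pre/post-processing adder counts from the figures quoted earlier in the text; the function name is hypothetical.

```python
from math import prod

# Each stage is (L_i, M_i, A_i): block size, number of sub-filters produced,
# and number of pre/post-processing adders of that FFA.
FFA2 = (2, 3, 4)    # three sub-filters, 1 pre + 3 post adders
FFA3 = (3, 6, 10)   # six sub-filters, 3 pre + 7 post adders


def cascade_cost(n_taps, stages):
    """Return (multipliers, adders) for an N-tap filter built by cascading FFAs."""
    block = prod(L for L, _, _ in stages)
    if n_taps % block:
        raise ValueError("n_taps must be a multiple of the total block size")
    sub_len = n_taps // block                  # length of each final sub-filter
    n_sub = prod(M for _, M, _ in stages)      # number of final sub-filters
    multipliers = n_sub * sub_len
    pre_post = 0
    for i, (_, _, A) in enumerate(stages):
        later_L = prod(L for L, _, _ in stages[i + 1:])
        earlier_M = prod(M for _, M, _ in stages[:i])
        pre_post += A * later_L * earlier_M
    adders = pre_post + n_sub * (sub_len - 1)  # plus adders inside sub-filters
    return multipliers, adders


N = 1024
mults, adds = cascade_cost(N, [FFA2, FFA2])
print(mults, adds)                  # 9N/4 and 20 + 9(N/4 - 1) for this N
print(1 - mults / (4 * N))          # fraction of multipliers saved versus 4N
```

For two cascaded (2-by-2) stages the pre/post-processing term evaluates to 4*2 + 4*3 = 20, which agrees with the adder count stated above.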
The proposed cascading process for the four-parallel FIR filter (L = 4) and its realization are shown in Fig. 7. From Fig. 7, it is clear that the proposed four-parallel FIR structure gains three more sub-filter blocks containing symmetric coefficients than the existing FFA one, which means that 3N / 8 multipliers can be saved for an N-tap FIR filter, at the price of 11 additional adders in the preprocessing and post processing blocks. By this cascading approach, parallel FIR filter structures with larger block factor L can be realized. The proposed six-parallel FIR filter results in six more symmetric sub-filter blocks than the existing FFA, equivalent to N/2 multipliers saved for an N-tap FIR filter, at the expense of an additional 32 adders. Likewise, the proposed eight-parallel FIR filter leads to seven more symmetric sub-filter blocks than the existing FFA, equivalent to 7N/16 multipliers saved for an N-tap filter, with an overhead of 54 additional adders.
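The multiplier savings quoted for each block size follow from one rule: every additional symmetric sub-filter block saves N/(2L) multipliers. The sketch below tabulates this in Python, using the counts of extra symmetric blocks and extra adders stated in the text for L = 2, 3, 4, 6 and 8. The dictionary and function names are illustrative, and the sketch assumes N is a multiple of 2L so that the count is an integer.

```python
# block size L -> (extra symmetric sub-filter blocks, extra pre/post adders),
# as stated in the text for the proposed structures versus the existing FFA.
QUOTED_OVERHEAD = {2: (1, 2), 3: (2, 7), 4: (3, 11), 6: (6, 32), 8: (7, 54)}


def saved_multipliers(n_taps, block_size):
    """Return (multipliers saved, adders added) for an N-tap, L-parallel filter."""
    extra_blocks, extra_adders = QUOTED_OVERHEAD[block_size]
    if n_taps % (2 * block_size):
        raise ValueError("this simple count assumes n_taps is a multiple of 2*L")
    # Each symmetric block of length N/L saves half of its N/L multipliers.
    return extra_blocks * n_taps // (2 * block_size), extra_adders


for L in sorted(QUOTED_OVERHEAD):
    print(L, saved_multipliers(576, L))
```

Because the adder overhead is fixed per structure while the saving scales with N, the trade becomes more favourable as the filter gets longer.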
Hardware complexity Analysis and comparison
When an L-parallel FIR filter comes with a set of symmetric coefficients of length N, the number of required multipliers for the proposed parallel FIR filter structures is provided by (16) and (17).
Case 1:
Here, L_i is the small parallel block size of the i-th FFA, such as a (2 x 2) or (3 x 3) FFA, r is the number of FFAs used, M_i is the number of sub-filter blocks resulting from the i-th FFA, and S is the number of sub-filter blocks containing symmetric coefficients. The number of required adders in the sub-filter section is given as
$$\left(\prod_{i=1}^{r} M_i\right) \left(\frac{N}{\prod_{i=1}^{r} L_i} - 1\right).$$
A comparison between the proposed and the existing FFA structures for even symmetric coefficients, for different filter lengths and different levels of parallelism, is summarized in Table I. A comparison between the proposed structures and other structures for a 144-tap FIR filter with parallel block sizes 4 and 8 is shown in Table II.
Filter lengths compared: 36-tap, 81-tap, 512-tap, 1024-tap.
FPGA Implementation Analysis
The existing and proposed FFA structures are implemented in Verilog HDL, targeting a Xilinx VirtexE FPGA device, for filter lengths of 12 and 27 with a word length of 8 bits. The comparison results are tabulated.
Conclusions
The efficient parallel FIR filter structures proposed in this paper are advantageous for symmetric convolutions, i.e., filters with symmetric coefficients. The proposed parallel FFA technique is suitable for both even and odd numbers of taps, provided the number of taps is a multiple of 2 or 3. Multipliers consume the major share of hardware area in a parallel FIR implementation. The proposed structure exploits the even symmetry of the coefficients and saves a significant amount of multiplier area at the expense of additional adders in the preprocessing and post processing blocks of the FIR filter. The proposed structures are implemented using ripple-carry and carry-save adders, where the carry-save based FFA technique reduces the hardware complexity by approximately 10%.
Since an adder occupies less area than a multiplier, it is advantageous to exchange multipliers for adders in terms of hardware cost. Moreover, the number of additional adders stays the same as the FIR filter becomes longer, whereas the number of saved multipliers grows with the length of the filter. The longer the FIR filter, the more the proposed structure can save relative to the existing FFA structure. In this paper, we have proposed new parallel FIR structures based on polyphase decompositions for symmetric convolutions, which compare favorably with the existing FFA structure in terms of hardware consumption.
