ABSTRACT In the design of multiplierless finite impulse response (FIR) filters, tremendous efforts have been made to reduce the number of adders in the multiplier block in order to reduce the overall chip area and power consumption. However, fewer adders in the multiplier block do not necessarily lead to lower power consumption, since the structural adders dominate the power consumption of an FIR filter circuit. In this paper, we propose a power-oriented optimization method for linear phase FIR filters. In the proposed algorithm, the power index, which is the average adder depth of the structural adders, is used as the optimization objective in the discrete coefficient search. Gate-level simulation of benchmark filters shows that the proposed technique produces filters that consume less power than those obtained by the best available algorithms, which aim to minimize the number of adders. The power savings over existing designs can be as much as 19.6%.
I. INTRODUCTION
The finite impulse response (FIR) filter is one of the most important building blocks in many digital signal processing (DSP) circuits and systems. In high performance applications, FIR filters can be implemented using dedicated hardware such as an application specific integrated circuit (ASIC). In these applications, the transposed direct form filter structure is preferred over the direct form filter structure due to its inherently pipelined accumulation section.
For very large scale integration (VLSI) implementation of fixed-coefficient FIR filters, the resource-hungry multipliers can be realized by a multiple constant multiplication (MCM) block using shift and add/subtract operations, as shown in Fig. 1(a). The adders/subtracters (both referred to as adders in the remainder of the paper for convenience) in the MCM block are referred to as multiplier block adders (MBAs). The products are then accumulated using the structural adders (SAs) in the product accumulation block (PAB), also shown in Fig. 1(a). The symbol ''<<'' represents the left shift operation. In the MCM block, common subexpressions can be shared among all the multiplications (as shown in Fig. 1(b)) such that the circuit area and power consumption can be significantly reduced. Therefore, the efficient implementation of MCM blocks has been the focus of multiplierless FIR filter design.
For a given filter specification, the multiplierless FIR filter design methods can be classified into two categories. The methods in the first category divide the overall design procedure into two steps. The first step is to obtain a continuous coefficient set by linear programming optimization and to quantize this continuous coefficient set within a given wordlength to obtain a discrete coefficient set. After that, an MCM algorithm is applied to the discrete coefficient set to synthesize the adder-and-shift network. However, since the discrete coefficients are designed and synthesized separately (designed first and synthesized at a later stage), the resultant circuit is far from optimal in terms of hardware complexity and power consumption. For a given filter specification, there may exist thousands of discrete coefficient sets that satisfy it; in most cases, it is therefore possible to find other sets of coefficients that also meet the specification. Therefore, instead of designing the discrete coefficient set and the MCM block separately, many researchers have gone back to the filter design process to incorporate the MCM algorithms into the discrete coefficient set optimization procedure. These methods, referred to as MCM and FIR filter design joint optimization methods [1]-[11], constitute the second category of multiplierless FIR filter designs. The power consumption and hardware cost of the FIR filters designed by the methods in the second category are much lower than those of the filters generated by the methods in the first category [7], [8].
Existing research [12], [13] has shown that the power consumption of FIR filters is more closely related to the adder depth (AD) than to the adder cost. This is because the overall number of transitions in the addition, which determines the power consumption of the adder, is related to the adder depth. Therefore, for the design methods in the first category, the average adder depth (AAD), being the average value of the adder depth over all adders, has been taken into consideration in some MCM algorithms [14]-[16]. However, the algorithms in the second category [1]-[11] have so far aimed only at the minimum adder cost. Although the technique in [10] can impose a maximum logic depth constraint, it has one obvious limitation: a design with the minimum number of adders achieved under the maximum logic depth constraint does not guarantee the minimum AAD for the overall design. Moreover, it is worth mentioning that the AAD is generally used as a power index for low power design only when the number of adders is comparable [16].
In this paper, we propose a linear phase FIR filter design algorithm that aims to find a coefficient set which can be synthesized into an adder-and-shift network with the lowest AAD of SAs. The proposed method is based on a branch and bound tree search. Since the adder cost of the SAs is generally determined by the filter order, which is fixed, a lower AAD will lead to lower power consumption of the SAs. As will be discussed in Section II, the power consumption of the SAs dominates the power consumption of the filtering circuit. The overall power consumption of an FIR filter can thus be significantly reduced if the AAD of SAs is minimized.
Design examples show that the proposed algorithm can generate designs with much lower power consumption.
The rest of the paper is organized as follows. Section II analyzes the power consumption of the SA part of FIR filters and proposes a new power index, i.e., the AAD of SAs, to guide the discrete coefficient optimization process. Section III presents a tree search algorithm which traverses a pre-defined search space to obtain the coefficient set with the lowest AAD of SAs. Five design examples are given in Section IV to show the advantages of the proposed algorithm for the design of low power multiplierless FIR filters. Conclusions are drawn in Section V.
II. ANALYSIS OF THE POWER CONSUMPTION OF THE SA BLOCK AND THE PROPOSED POWER INDEX
In digital CMOS circuits, the dynamic power, consisting mostly of switching power, is the dominant part of the total power dissipation when circuits are working. The switching power of a node in digital circuits can be estimated as [17]
where $\alpha$ is the switching activity factor, $C_L$ is the load capacitance, $V_{DD}$ is the power supply voltage and $f$ is the clock frequency. The total switching power of the circuit is given by
where $M$ is the total number of nodes, referred to as the logic complexity factor in the remainder of the paper. In the expression for the total switching power, the switching activity factor $\alpha$ and the logic complexity factor $M$ are determined by the filter implementation, while the other parameters are defined by the foundry process and the circuit specification.
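For reference, the conventional CMOS switching-power model that matches the symbols defined above can be sketched as follows. The per-node index $i$ on $\alpha$ and $C_L$, and the absence of a factor of 1/2, are assumptions; the exact form depends on how the activity factor is defined in [17].

```latex
P_{sw} = \alpha \, C_L \, V_{DD}^{2} \, f,
\qquad
P_{total} = \sum_{i=1}^{M} \alpha_i \, C_{L,i} \, V_{DD}^{2} \, f .
```

Under this model, for a fixed process and clock frequency, the only handles available to the filter designer are the activity factors and the number of nodes $M$, which is why the following subsections examine these two quantities for the SA block.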
A. SWITCHING ACTIVITY OF STRUCTURAL ADDERS
Here, we assume that there is no pipeline stage between the MCM adders and the SAs. Although pipelining would reduce the switching activity of the SA block, the extra inserted registers would also consume considerable power. Different activity models have been proposed to analyze the switching activities of MBAs at bit level [17]-[19]. In [18], the spatial correlation of the inputs of each adder is taken into consideration for accurate switching activity estimation of MCM blocks. It has been shown (e.g., in [18, Fig. 3]) that adders with a high depth tend to have higher switching activity. This is because the useless glitches caused by different delay paths propagate along the adder stages. Although such glitches do not cause any circuit error, they can contribute up to 70% of the switching activity in digital circuits [20]. For SAs in FIR filters, there is no existing model to accurately estimate the switching activities. One of the inputs of an SA comes from the output of a register, and there is no glitch propagation from this input. The other input comes from the output of the MCM block, so the glitches generated in the MCM block are propagated to the SA. The switching activity of the SAs is therefore generally higher than that of the MBAs.
B. LOGIC COMPLEXITY OF THE SA BLOCK
The number of SAs needed for the PAB is determined by the number of non-zero filter coefficients. For an N th order FIR filter (implemented in transposed direct form), if the ith filter coefficient is non-zero, the width of the corresponding SA can be expressed as [21] 
where $W_X$ is the word-length of the input signal, $h_k$ is the $k$th filter coefficient and $l_i$ is the number of left shifts needed to generate $h_i$ from the corresponding positive odd integer (fundamental) in the MCM block implementation. Therefore, the total number of full adders (FAs) needed for the SA block is given by
As we can see from (3) and (4), the word-length of the structural adders needs to cover the range expansion of the intermediate accumulation results, and the logic complexity of the PAB increases with the number of non-zero filter coefficients. Fig. 2(a) compares the number of MBAs and the number of SAs for 8 commonly referenced benchmark filters [22] with the number of taps ranging from 36 to 441. The filters are named F suffixed with the filter length for easy reference. The MCM blocks of the filters are synthesized using the C1 algorithm proposed in [14]. As we can see, the SA blocks require many more adders than the MCM blocks. Moreover, the word-lengths of the SAs are generally larger than those of the MBAs. Therefore, as shown in Fig. 2(b), the logic complexity (in terms of FA count) of the SA block is substantially higher than that of the MCM block for all the filters.
C. POWER CONSUMPTION OF THE SA BLOCK
According to the above analysis, SAs in transposed direct form FIR filters not only contribute most of the logic complexity, but also have higher switching activity.
Moreover, in the PAB, there are $N$ registers working at the clock frequency $f$. It can therefore be concluded that the power consumption of the SA block will dominate the total power of the filtering circuit. Circuit simulation has been performed to validate this conclusion. We have implemented the filter ''F108'' in Verilog HDL and synthesized it using Synopsys Design Compiler with a 65 nm CMOS standard cell library. Gate-level simulation using PrimeTime has been performed to estimate the power consumption of the filter at a clock frequency of 200 MHz. The results are listed in Table 1. As can be seen, the power consumption of the PAB is substantially higher than that of the MCM block. The contribution of the MCM block is even lower than that of the registers in the PAB. As shown in Fig. 3, the PAB contributes 75.7% of the total power, of which 62.4% is from the SAs and 13.3% is from the registers in the PAB. The MCM block itself dissipates only 7.7% of the total power.
D. THE PROPOSED POWER INDEX OF SA BLOCK
As discussed in Section II-C, the power consumption of the SA block dominates the power consumption of an FIR filter. Therefore, in this subsection, a novel power index which can indicate the power consumption of the SA block is proposed.
Existing research [16] shows that if the adder costs of two networks are comparable, the one with the lower AAD generally has lower power consumption. For multiplierless FIR filter design, different discrete coefficient sets may result in a significant difference in the MBA cost of the MCM block, but the number of SAs is comparable if the order of the filter is fixed. The reason is that the number of SAs, denoted as $N_{SA}$, is determined only by the filter order $N$ and the number of zero coefficients $N_{zero}$, i.e., $N_{SA} = N - N_{zero}$. In most cases, the number of zeros in a coefficient set is much smaller than the filter order, i.e., $N \gg N_{zero}$, such that the number of SAs can be considered as a constant number $N$. Therefore, the AAD of SAs can be used as an indicator of the overall power consumption of the SA block. The AAD of SAs, denoted as $\mathrm{AAD}_{SA}$, can be computed as
where $AD(i)$ is the adder depth of the $i$-th SA in the delay chain. The adder depth of the SAs is determined by the synthesis of each coefficient, as shown in Fig. 4. It can be seen from Fig. 4 that the adder depth of the SA corresponding to coefficient 21 can be either 3 or 4. Since our goal is to minimize the adder depth of the SA part, it is necessary to ensure that each SA is at its minimum adder depth (MAD). According to [15], if the number of non-zero digits of coefficient $h(i)$ expressed in canonic signed digit (CSD) form is $NZ(h(i))$, the minimum achievable adder depth of the adder network realizing this coefficient is $\lceil \log_2 NZ(h(i)) \rceil$. Since the SA adds one more adder stage on top of this network, the MAD of the SA corresponding to coefficient $h(i)$ is given by
$$\mathrm{MAD} = \lceil \log_2 NZ(h(i)) \rceil + 1. \tag{6}$$
If the coefficient $h(i)$ is 0, no SA is needed and the adder depth is 0 in that case. Let $\mathrm{MAD}_{SA}(i)$ be the MAD of the SA corresponding to the coefficient $h(i)$. If each SA is ensured to be realized at its MAD, the AAD of the SA part can be expressed as
In the case of linear phase FIR filters, the coefficients are symmetrical, with $h(N - i) = \pm h(i)$, i.e., the AD of the $i$-th SA is the same as that of the $(N - i)$-th SA. Hence, $\mathrm{AAD}_{SA}$ can also be expressed as
for even $N$, and
$+ \lceil \log_2 NZ(h(0)) \rceil + 1)$, (9)
for odd $N$. The goal of this paper is to find a coefficient set $h(i)$, $i = 0, 1, 2, \ldots, N$, that has the smallest $\mathrm{AAD}_{SA}$.
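The power index can be illustrated with a minimal Python sketch. It counts the non-zero CSD digits of an integer coefficient, applies the MAD rule $\lceil \log_2 NZ(h(i)) \rceil + 1$, and averages the result. The function names are hypothetical, and averaging over one SA per non-zero coefficient is a simplifying assumption rather than the exact normalization used in the paper.

```python
from typing import Sequence


def csd_nonzero_digits(value: int) -> int:
    """Count the non-zero digits of |value| in canonic signed digit (CSD) form."""
    n = abs(value)
    count = 0
    while n:
        if n & 1:
            # Pick digit +1 when n % 4 == 1 and -1 when n % 4 == 3, so that
            # no two adjacent digits are non-zero (the CSD property).
            digit = 2 - (n & 3)
            n -= digit
            count += 1
        n >>= 1
    return count


def mad_sa(coefficient: int) -> int:
    """Minimum adder depth of the SA for one coefficient: ceil(log2(NZ)) + 1."""
    if coefficient == 0:
        return 0  # no SA is needed for a zero coefficient
    nz = csd_nonzero_digits(coefficient)
    # For nz >= 1, (nz - 1).bit_length() equals ceil(log2(nz)).
    return (nz - 1).bit_length() + 1


def aad_sa(coefficients: Sequence[int]) -> float:
    """Average MAD over the SAs (assumption: one SA per non-zero coefficient)."""
    depths = [mad_sa(h) for h in coefficients if h != 0]
    return sum(depths) / len(depths) if depths else 0.0
```

For the coefficient 21 mentioned above, the CSD form has three non-zero digits, so `mad_sa(21)` evaluates to 3, the smaller of the two depths shown for that coefficient in Fig. 4.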
III. THE PROPOSED DISCRETE COEFFICIENT OPTIMIZATION TECHNIQUE GUIDED BY THE PROPOSED POWER INDEX OF SA BLOCK
Using the power index (AAD of SAs) discussed in Section II, a tree search algorithm is introduced in this section to optimize FIR filters so as to achieve the smallest AAD of SAs. A depth-first tree search is used to optimize the filter coefficients, and cut-off schemes are exploited to accelerate the search. This section introduces the tree search procedure, the optimization objective function and the cut-off schemes of the proposed algorithm.
A. DEPTH-FIRST TREE SEARCH PROCEDURE USED IN THE FIR FILTER DESIGN
For the FIR filter design problem, the root of the tree is the optimal continuous coefficient set. The root node is then expanded by quantizing a selected coefficient to a discrete value. All child nodes further produce their own child nodes by quantizing another coefficient to a discrete value. The process continues until it reaches a leaf, which contains only discrete coefficients. The search is a recursive procedure as follows:
1) The root is generated by solving the linear programming problem formulated in the next subsection. The current node is set to the root node and the search depth D is initialized to 0.
2) A child of the current node is generated by fixing the D-th coefficient to a new discrete value.
• If such a child node exists, the current node is replaced by the generated child node, the tree depth is increased by 1, and the search goes to step 3.
• If such a child node does not exist, i.e., the D-th coefficient has already been fixed to all of its possible discrete values, there are two cases: a) if the current node is not the root node, the search backtracks by setting D = D − 1, the current node is replaced by its parent, and step 2 is repeated; b) if the current node is the root node, the tree search is complete and the search program terminates.
3) First pruning check: compute the AAD of SAs of the current node and check whether a descendant of this node can still be an optimum solution in terms of the AAD of SAs (i.e., whether the objective function value of the node is smaller than that of the current best solution). If yes, go to step 4; otherwise, the current node is pruned, the search backtracks to its parent node with the search depth decreased by 1, and it goes to step 2. The way to compute the AAD of SAs is discussed in Section III-C.
4) Since one more coefficient has been fixed to a discrete value, the remaining coefficients need to be reoptimized. The reoptimization is again formulated as a linear programming problem, which is discussed in the next subsection.
5) Second pruning check: check whether the current node meets the filter ripple specification. If it does, go to step 6. Otherwise, the current node is cut off, the current node is replaced by its parent node, the search depth is decreased by 1, and the search goes to step 2.
6) If all coefficients of the current node are discrete values, i.e., a leaf node has been reached, go to step 7; otherwise, the tree grows by going to step 2.
7) Update the best solution if the leaf node's AAD of SAs is smaller.
8) The parent node replaces the current node and the search moves backward by setting D = D − 1; go to step 2. If no such parent node exists, the whole search is terminated.
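The control flow of the steps above can be sketched as a generic depth-first branch and bound in Python. This is only an illustration under stated assumptions: the callbacks `candidates`, `lower_bound` and `feasible` are hypothetical stand-ins for the discrete value generation, the objective estimate of Section III-C and the linear programming reoptimization with ripple check, and recursion replaces the explicit backtracking on D.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def branch_and_bound(
    num_coeffs: int,
    candidates: Callable[[List[int]], Sequence[int]],
    lower_bound: Callable[[List[int]], float],
    feasible: Callable[[List[int]], bool],
) -> Tuple[Optional[List[int]], float]:
    """Depth-first search that fixes one coefficient per tree level."""
    best: Optional[List[int]] = None
    best_cost = float("inf")

    def visit(fixed: List[int]) -> None:
        nonlocal best, best_cost
        if len(fixed) == num_coeffs:
            # Leaf: every coefficient is discrete, so the bound is exact.
            cost = lower_bound(fixed)
            if cost < best_cost:
                best, best_cost = list(fixed), cost
            return
        for value in candidates(fixed):
            fixed.append(value)  # generate a child node
            # First pruning check (objective), then second check (feasibility).
            if lower_bound(fixed) < best_cost and feasible(fixed):
                visit(fixed)
            fixed.pop()  # backtrack to the parent node

    visit([])
    return best, best_cost


# Toy problem (not a filter design): choose three values from {1, 2, 3}
# whose sum is at least 6 while minimising the sum.
N_VARS = 3
toy_best, toy_cost = branch_and_bound(
    N_VARS,
    candidates=lambda fixed: (1, 2, 3),
    # Unfixed variables are bounded from below by their smallest value, 1.
    lower_bound=lambda fixed: sum(fixed) + (N_VARS - len(fixed)) * 1,
    # A node is kept only if the constraint can still be met by its descendants.
    feasible=lambda fixed: sum(fixed) + (N_VARS - len(fixed)) * 3 >= 6,
)
```

In the toy run the search keeps the first leaf that reaches the optimum and prunes every later node whose bound is not strictly smaller, which mirrors the comparison against the current best solution in step 3.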
B. LINEAR PROGRAMMING FORMULATION FOR THE FIR FILTER DESIGN
The zero-phase frequency response of a linear phase FIR filter of order $N$ can be expressed as [23]:
where $\mathrm{Trig}(\omega, n)$ is an appropriate trigonometric function depending on the parity of $N$ and the symmetry of the filter. To find the continuous coefficient set $h(n)$, a linear programming problem is formulated as
where $\delta$ is the peak ripple and $b$ is a floating passband gain. $\delta_p$, $\delta_s$, $\omega_p$ and $\omega_s$ are the given passband ripple, stopband ripple, passband edge and stopband edge, respectively. $b_l$ and $b_u$ are two constants defining the lower bound and upper bound of the passband gain. In this paper, they are chosen to be 0.7 and 1.4, respectively. The optimal continuous coefficient set, which is the root of the proposed search tree, can be obtained by solving the above linear programming problem. During the tree search, when a coefficient is fixed to a discrete value, the remaining unfixed coefficients need to be reoptimized. The reoptimization problem can be formulated as
where $D$ is the tree depth of the current node and $h_k^{fix}$ is the discrete value of the $k$-th filter coefficient.
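As an illustration of the ripple check applied to a fully discrete coefficient set, the following Python sketch evaluates the zero-phase response on a frequency grid. It is restricted to the symmetric, even-order case, where the trigonometric function reduces to a cosine; the function names, the grid density and the way the passband gain normalizes the response are assumptions made for this sketch and are not taken from the paper's formulation.

```python
import math
from typing import Sequence, Tuple


def zero_phase_response(h: Sequence[float], omega: float) -> float:
    """Zero-phase response of a symmetric FIR filter of even order len(h) - 1."""
    order = len(h) - 1
    if order % 2 != 0:
        raise ValueError("this sketch only covers even-order symmetric filters")
    mid = order // 2
    # Centre tap plus one cosine term per symmetric coefficient pair.
    return h[mid] + 2.0 * sum(
        h[n] * math.cos((mid - n) * omega) for n in range(mid)
    )


def peak_ripples(
    h: Sequence[float],
    omega_p: float,
    omega_s: float,
    gain: float,
    grid_points: int = 512,
) -> Tuple[float, float]:
    """Peak passband and stopband deviations of the gain-normalised response."""
    pass_dev = 0.0
    stop_dev = 0.0
    for k in range(grid_points + 1):
        w = math.pi * k / grid_points
        value = zero_phase_response(h, w) / gain
        if w <= omega_p:
            pass_dev = max(pass_dev, abs(value - 1.0))
        elif w >= omega_s:
            stop_dev = max(stop_dev, abs(value))
    return pass_dev, stop_dev


def meets_spec(
    h: Sequence[float],
    omega_p: float,
    omega_s: float,
    delta_p: float,
    delta_s: float,
    gain: float,
) -> bool:
    """Check the sampled response against the passband and stopband ripples."""
    pass_dev, stop_dev = peak_ripples(h, omega_p, omega_s, gain)
    return pass_dev <= delta_p and stop_dev <= delta_s
```

For example, the three-tap set `[1, 2, 1]` with `gain=4.0` has the response $(1 + \cos\omega)/2$ after normalization, so `meets_spec` succeeds for band edges $0.1\pi$ and $0.9\pi$ with ripples of 0.03 and fails with ripples of 0.01. A grid check of this kind only samples the response, so a denser grid may be needed near the band edges.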
C. OPTIMIZATION OBJECTIVE FUNCTION AND COST ESTIMATION OF EACH NODE
The objective of the optimization is to minimize the AAD of SAs. When the AAD of SAs is the same, the hardware cost of the SAs is minimized. At bit level, the hardware cost of the SAs is determined by the number of full adders. To control the hardware cost, the proposed optimization objective function is therefore given by
$$\mathrm{Obj} = W \cdot \mathrm{AAD}_{SA} + FA_{str} \tag{13}$$
where $FA_{str}$ is the full adder cost of all SAs and $W$ is a weighting factor. By setting $W$ to a very large constant, for example 10000, the AAD of SAs is optimized with higher priority than the total number of full adders. The value of $FA_{str}$ is estimated using (3). It should be noted that during the tree search, only some of the coefficients are fixed to discrete values, and thus $\mathrm{AAD}_{SA}$ and $FA_{str}$ cannot be computed exactly. More specifically, when $h(j)$ is fixed, the adder depth and full adder cost of the SAs for coefficients $h(i)$, $i = 1, \ldots, j$, can be computed exactly, whereas the rest cannot, since the values of the unfixed coefficients $h(j+1)$ to $h(N/2)$ or $h(N/2-1)$ are required for the computation in (8), (9) and (3). For this reason, in the proposed algorithm, the adder depth of the SAs and the number of FAs corresponding to the unfixed coefficients are estimated using their lower bounds, i.e., 1 and $W_X$, respectively, so as not to sacrifice optimality. Here $W_X$ is the word-length of the input signal.
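The node cost estimate described above can be sketched in Python as follows. The function name and argument layout are hypothetical; the per-coefficient adder depths and full adder counts of the fixed coefficients are taken as inputs rather than derived from (3), and the symmetric pairing of coefficients is ignored for brevity.

```python
from typing import Sequence


def node_objective(
    fixed_depths: Sequence[int],
    fixed_fa_counts: Sequence[int],
    num_unfixed: int,
    input_wordlength: int,
    weight: float = 10000.0,
) -> float:
    """Estimate Obj = W * AAD_SA + FA_str for a partially fixed node.

    Unfixed coefficients contribute their lower bounds: an adder depth of 1
    and input_wordlength full adders each.
    """
    total = len(fixed_depths) + num_unfixed
    if total == 0:
        raise ValueError("the node must cover at least one coefficient")
    depth_sum = sum(fixed_depths) + num_unfixed * 1
    fa_sum = sum(fixed_fa_counts) + num_unfixed * input_wordlength
    return weight * (depth_sum / total) + fa_sum
```

With two fixed coefficients of depths 3 and 2 using 12 and 11 full adders, two unfixed coefficients and an 8-bit input, the estimate is 10000 × 1.75 + 39 = 17539. Because every unfixed term enters at its lower bound, fixing further coefficients can only keep or raise this value, which is what allows the first pruning check to discard a node safely.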
D. SYNTHESIS OF THE DISCRETE COEFFICIENT SET
The above tree search is synthesis independent, as the objective function given in (13) and the cut-off schemes rely on the coefficient values only. Once the discrete coefficient set has been found, it is synthesized into an adder-and-shift network using an MCM optimization algorithm. Since we assume that each coefficient is synthesized using the minimum number of adder steps to ensure that each SA is at its MAD, the MCM algorithm used here must satisfy this requirement. Currently, there exist many MCM algorithms [15], [16], [24] which can synthesize each coefficient with the minimum number of adder steps. In this paper, the algorithm proposed in [16] is adopted.
IV. DESIGN EXAMPLES
Five benchmark filters are designed to show the superiority of the proposed algorithm. The filter specifications are listed in Table 2. Since the existing multiplierless FIR filter discrete coefficient optimization algorithms [1]-[11] all aim to minimize the adder cost in order to indirectly minimize the power consumption, we compare the proposed algorithm with the best of them [10].
The actual power consumption of the five filter architectures is simulated. The IC technology used is the TSMC 65 nm standard cell library, and Design Compiler, a gate-level tool, is used to measure the area and the power consumption of each architecture at a clock frequency of 200 MHz. For fair comparison, no area or timing optimization is applied during the gate-level synthesis using Design Compiler. The number of adders, the AAD, the area and the power consumption are summarized in Table 3.

TABLE 3. Comparison of results between the proposed algorithm and the algorithm in [10]. NA is the number of adders.

Table 3 shows that all designs obtained using the proposed algorithm have lower power consumption than those obtained using SHI's algorithm [10]. The power saving can be as much as 19.6% for the filter Y1. Table 3 also shows that the adder cost of the proposed algorithm is slightly higher than that of SHI's algorithm. The areas of the circuits obtained using the proposed algorithm are slightly larger, but the increase is negligible compared to the decrease in power consumption. The overall performance of the proposed designs is therefore superior to that of the existing techniques in terms of power consumption.
V. CONCLUSION
In this paper, a low power FIR filter design method is proposed. Instead of searching for coefficient sets that minimize the number of adders as traditional algorithms do, our technique aims to find a coefficient set that minimizes the AAD of SAs. A tree search algorithm is developed to search for the discrete coefficients. An optimization objective function and cut-off schemes tailored to the design of low power FIR filters are proposed. Design examples show that the proposed algorithm can generate designs with much lower power consumption than existing algorithms at the cost of a slight increase in chip area.
