In this paper we present a bit-oriented architecture for the digital Finite Impulse Response (FIR) filter with adaptive coefficients. Instead of using multipliers, the FIR operation is regarded as the summation of a large number of operands. Our architecture utilizes two new word-level operand compressors, namely the six-to-triple and six-to-double compressors. With both compressors, multiple operands can be merged efficiently without long carry propagation. Using such compressors as the basic building blocks, an FIR filter can be implemented as a fast and regular structure that lends itself to automation.
Introduction
A digital FIR filter performs frequency shaping or linear prediction on a discrete-time input sequence $\{x_0, x_1, x_2, \ldots\}$ [4]. Mathematically, a sample $y_i$ of the output sequence $\{y_0, y_1, y_2, \ldots\}$ can be expressed as a linear combination of the input samples over a window of time:
$$y_i = a_0 x_i + a_1 x_{i-1} + \cdots + a_{n-1} x_{i-n+1},$$
where $\{a_0, a_1, a_2, \ldots, a_{n-1}\}$ are the coefficients and $n$ is the tap number or the dimension of the filter.
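As a minimal illustration of this definition, the following Python sketch evaluates one output sample directly. The function name `fir_output` and the convention that samples before the start of the sequence are treated as zero are assumptions made for this example only.

```python
def fir_output(a, x, i):
    """Return y_i = a_0*x_i + a_1*x_{i-1} + ... + a_{n-1}*x_{i-n+1}.

    Assumption: samples with a negative index are taken as zero.
    """
    return sum(a[j] * x[i - j] for j in range(len(a)) if i - j >= 0)


a = [1, 2, 3]       # n = 3 coefficients
x = [4, 5, 6, 7]    # input sequence
print([fir_output(a, x, i) for i in range(len(x))])
```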
The adaptive filter is a special type of FIR filter that is commonly used in many applications such as communication and multimedia signal processing [2]. For example, it has been used intensively in echo cancellation, noise reduction, channel equalization, and audio compression. Unlike their fixed-coefficient counterparts, the coefficients of such filters change dynamically in response to the environment.
In terms of implementation, numerous architectures for FIR filters have been proposed in the literature. In general, they can be divided into three major types.
(1) The basic architecture, including the direct form and the transposed form, treats the FIR as a sequence of multiplications and accumulations using word-level multipliers and adders [6]. (2) The multiplier-less architecture uses only adders. Canonic signed digit representation and common sub-expression sharing techniques can be exploited to reduce the hardware cost [7] [8]. These approaches are usually quite cost effective. However, they require fixed coefficients and thus are not applicable to adaptive filters. (3) The bit-oriented architecture regards the FIR as a sequence of additions and shifts (as will be discussed in detail later) [1].
On top of these structures, the poly-phase decomposition offers a pre-processing mechanism to reduce the overall computation by mathematical manipulation. Also, the hardware in any architecture can be folded or unfolded arbitrarily to explore the trade-offs among different design criteria such as power dissipation, area, and speed [5] [6].
Among the above architectures, the multiplier-less one has been viewed as one of the most cost-effective. However, it cannot be applied to the adaptive filter because it assumes fixed coefficients. Therefore, in this work we propose a low-power architecture for adaptive filters based on the bit-oriented concept. The FIR filtering is performed by inner-product operations, in which each input sample is examined bit by bit. To improve the speed, we design six-to-triple and six-to-double compressors as the basic kernels to render a highly regular tree structure. This structure is highly scalable in that it can sum a large number of operands efficiently. Such a scheme is also highly flexible: it can be folded to reduce hardware or unfolded to improve speed. Experimental results show that the glitches that cause significant power dissipation in multiplier-based FIR implementations can be reduced dramatically, leading to an overall 44% power reduction using our generator based on the proposed architecture.
The rest of this paper is organized as follows. Section 2 provides the preliminaries. Section 3 introduces the new carry-save architecture using the six-to-triple compressors. Section 4 introduces the new local-carry architecture using the six-to-double compressors. Section 5 presents the experimental results and Section 6 provides the conclusions. 
Preliminaries
Fig. 1 shows the inner-product view of an FIR filter. To produce one output sample $y_i$, $n$ multiplications are required. The coefficients and the input data are assumed to be $\{c_0, \ldots, c_{n-1}\}$ and $\{x_i, \ldots, x_{i-n+1}\}$, respectively. In the sequel, we also assume that the word-length of the input data is $m$.
To do the computation in a bit-oriented manner, each multi-bit input sample is first expanded into a bit-level representation.
For example, $x_0$ is now represented as $(x_{00}, x_{01}, \ldots, x_{0,m-1})$, with the first subscript denoting the time-domain index of the data and the second subscript denoting the bit index.
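Assuming that the tap number $n = 3$ and the input data word-length $m = 4$, the output sample $y_2$ is originally expressed as:
$$y_2 = c_2 x_0 + c_1 x_1 + c_0 x_2.$$
Then its bit-oriented form is expressed as:
$$y_2 = \sum_{k=0}^{3} 2^{-k} \left( c_2 x_{0k} + c_1 x_{1k} + c_0 x_{2k} \right).$$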
In this expression, the computation is performed by scanning one data bit at a time. The weight of a data bit depends on the fixed-point definition of the number. In this example, it ranges from $2^{0}$ to $2^{-3}$. When checking one data bit, three word-level operands determined by the coefficients may need to be added to the partial product. In the sequel, this is referred to as bit-slice computation. This computation could be slow because the number of operands to be accumulated grows linearly with the tap number $n$.
Example 2:
Fig. 2 shows the contents of the table and the required processing hardware [6]. In general, one inner-product computation takes k clock cycles, each of which consumes one bit slice of data. In terms of hardware, it needs a simple accumulator, a shifter, and a look-up table. Although it is quite efficient when the tap number $n$ is relatively small, the size of the look-up table could explode quickly as $n$ grows.
Suppose that the $k$-th bit slice of data under check is $(x_{0k}, x_{1k}, x_{2k}) = (101)_2$.
Then we can express the outcome of the bit-slice computation as:
$$c_2 \cdot 1 + c_1 \cdot 0 + c_0 \cdot 1 = c_2 + c_0.$$
This outcome has been pre-stored in the table entry indexed by $(x_{0k}, x_{1k}, x_{2k}) = (101)_2 = 5$. Hence, it can simply be retrieved from the table by taking 5 as the index.
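The table-based bit-slice scheme can be sketched in Python as follows. This is only an illustration under stated assumptions: the inputs are unsigned integers with integer bit weights $2^k$ (rather than the fractional weights of the example), `coeffs[j]` is taken to be the coefficient that multiplies `window[j]`, and all function names are hypothetical.

```python
def build_table(coeffs):
    """Pre-store the sum of selected coefficients for every bit-slice pattern.

    The index is the slice (x_0k, x_1k, ..., x_{n-1,k}) read as a binary
    number with x_0k as the most significant bit.
    """
    n = len(coeffs)
    return [
        sum(c for j, c in enumerate(coeffs) if (index >> (n - 1 - j)) & 1)
        for index in range(1 << n)
    ]


def lut_inner_product(coeffs, window, m):
    """Inner product using one table look-up per bit slice (m slices)."""
    table = build_table(coeffs)
    acc = 0
    for k in range(m):
        index = 0
        for x in window:                    # gather bit k of every sample
            index = (index << 1) | ((x >> k) & 1)
        acc += table[index] << k            # shift by the slice weight, accumulate
    return acc


coeffs = [3, 5, 7]      # illustrative values, paired with window[0..2]
window = [9, 4, 13]     # three 4-bit unsigned samples
direct = sum(c * x for c, x in zip(coeffs, window))
assert lut_inner_product(coeffs, window, m=4) == direct
```

The table has $2^n$ entries, which is why this approach only suits small tap numbers.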
Architecture
The previous bit-oriented architecture does not scale easily. One simple way to overcome this problem is to do the bit-slice computation on the fly, instead of doing the table look-up. However, the computation time could be longer. We need techniques to speed up the bit-slice computation.
Primitive Structure for Bit-Slice Computation
The tree structure shown in Fig. 3 is well known to offer higher speed for associative operations on multiple operands.
Assume that there are eight taps $\{c_0, c_1, \ldots, c_7\}$ and the input bit slice under processing is denoted as $(x_{0k}, x_{1k}, \ldots, x_{7k})$. Then, each input bit $x_{ik}$ controls a multiplexer that selects coefficient $c_{7-i}$ or 0 as an input operand to the subsequent multiple-operand adder. It is apparent that this primitive structure is still relatively slow due to the long carry propagation in every basic two-operand adder.
One way to break the carry chain is to use the popular carry-save structure. That is, the carry signals are propagated to the next level of adders instead of being propagated to the higher bit position immediately within each adder. If successful, the propagation delay of the carry chain can be reduced to constant time O(1), instead of O(m), or O(log m) if a high-speed carry look-ahead structure is used, where m is the operand's word-length. However, this may require converting the entire adder tree into a bit-by-bit structure and using (3,2) or even more complex compressors as in the classical Wallace tree structure [11], leading to a highly irregular structure that is harder to generate automatically.
Six-To-Triple Compressor
To overcome the above dilemma, our first attempt is to derive a new carry-free compressor such that the overall word-level adder tree structure can be retained. A compressor fulfilling the above requirements is shown in Fig. 4. Unlike other compressors that generate outputs of only two different weights, this compressor takes six bits of the same weight as inputs and produces three outputs with weights of 1, 2, and 4, respectively. Therefore, it is referred to as the six-to-triple compressor, or 6T compressor. In terms of functionality, it counts the number of 1's in the six inputs and represents the result as a 3-bit output. Fig. 4 shows the detailed structure of this compressor. It consists of 3 full-adder cells (FA) and 1 half-adder cell (HA).
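The counting behaviour of the 6T compressor can be checked with a small bit-level model. The text only states the cell count (3 FA and 1 HA); the particular wiring below is an assumption chosen to match that count, and may differ from the actual connections in Fig. 4.

```python
from itertools import product


def half_adder(a, b):
    """Return (sum, carry) of two bits."""
    return a ^ b, a & b


def full_adder(a, b, c):
    """Return (sum, carry) of three bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_6t(bits):
    """Count the ones among six equal-weight bits; return (w1, w2, w4)."""
    a, b, c, d, e, f = bits
    s1, k1 = full_adder(a, b, c)      # level 1
    s2, k2 = full_adder(d, e, f)      # level 1
    w1, k3 = half_adder(s1, s2)       # level 2: weight-1 output
    w2, w4 = full_adder(k1, k2, k3)   # level 3: weight-2 and weight-4 outputs
    return w1, w2, w4


# Exhaustive check over all 64 input patterns.
for bits in product((0, 1), repeat=6):
    w1, w2, w4 = compressor_6t(bits)
    assert w1 + 2 * w2 + 4 * w4 == sum(bits)
```

With this wiring the longest path passes through three cells (FA, HA, FA).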
Using 6T compressors, an adder for six input operands can be built by cascading them as shown in Fig. 5. This adder is called the 6T adder for short. It takes two triple-digit operands, A and B, or equivalently six operands, as inputs and produces a triple-digit sum, S. Note that the propagation delay of this carry-free 6T adder is independent of its word-length because there is no carry propagation across the 6T compressors.
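A word-level model makes the carry-free property concrete: every step below is a bitwise operation applied to all bit positions at once, and nothing ripples along the word. This is a sketch under assumptions: operands are non-negative Python integers, a triple-digit number is held as three "planes" `(p1, p2, p4)`, and the names `csa`, `adder_6t`, and `add_triples` are illustrative rather than taken from the paper.

```python
def csa(a, b, c):
    """Word-level row of full adders: (sum word, carry word), carries unshifted."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def adder_6t(operands):
    """Merge six non-negative integers into three planes (p1, p2, p4).

    Bit i of p1, p2, p4 is the weight-1, weight-2, weight-4 output of the
    compressor at bit position i, so the planes form one triple-digit number.
    """
    a, b, c, d, e, f = operands
    s1, k1 = csa(a, b, c)
    s2, k2 = csa(d, e, f)
    p1, k3 = s1 ^ s2, s1 & s2          # row of half adders
    p2, p4 = csa(k1, k2, k3)
    return p1, p2, p4


def triple_value(triple):
    """Numeric value of a triple-digit number."""
    p1, p2, p4 = triple
    return p1 + (p2 << 1) + (p4 << 2)


def add_triples(A, B):
    """Add two triple-digit numbers by feeding their six aligned planes to one 6T adder."""
    aligned = [A[0], A[1] << 1, A[2] << 2, B[0], B[1] << 1, B[2] << 2]
    return adder_6t(aligned)


A = adder_6t([13, 7, 255, 1, 0, 100])   # six operands summing to 376
B = adder_6t([9, 9, 9, 9, 9, 9])        # six operands summing to 54
S = add_triples(A, B)
assert triple_value(S) == triple_value(A) + triple_value(B)
```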
In the new architecture based on the 6T compressor, most partial results use triple digits. That is, a digit has three bits of weights 1, 2, and 4, respectively. Only in the last clock cycle are the overlapping bits among digits combined through a vector-merging adder (VMA), just as in the carry-save multiplier. As demonstrated in Fig. 6, we assume that there are only five digits $(S_4 S_3 S_2 S_1 S_0)$, each of which has three bits. The weights of these bits range from $2^0$ to $2^6$. For each weight position, there could be 1, 2, or 3 bits. We use full adders (FA) or half adders (HA) in the first stage to compress bits of the same weight into a sum bit plus a carry output bit in a carry-save manner. In the second stage, a typical binary adder with carry propagation is then used to generate the final binary number $(q_7 q_6 q_5 q_4 q_3 q_2 q_1 q_0)_2$.
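The two-stage vector merging can be modelled in a few lines. The sketch assumes that a triple-digit number is stored as three integers `p1`, `p2`, `p4` holding the weight-1, weight-2, and weight-4 bit of every digit; this storage layout and the function name are assumptions for illustration.

```python
def vector_merge(p1, p2, p4):
    """Convert a triple-digit number (three planes) to an ordinary binary integer."""
    # Align the planes so that equal bit positions carry equal weights.
    a, b, c = p1, p2 << 1, p4 << 2
    # Stage 1: one FA or HA per weight position, leaving a sum word and a carry word.
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    # Stage 2: the only carry-propagating addition.
    return sum_word + carry_word


# Five digits with every bit set: weights span 2^0 .. 2^6, and the result fits in 8 bits.
q = vector_merge(0b11111, 0b11111, 0b11111)
print(format(q, "08b"))
```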
The main reason why the proposed 6T adders can be plugged into the tree-based architecture for multiple-operand merging without disturbing the regularity is that they have an integer compression ratio of 2 (i.e., six inputs can be merged into three). This is a unique property that is not found in the numerous compressors proposed previously in the literature, such as the (3, 2), (5, 3), or (7, 4) compressors. A compressor with an integer compression ratio can lead to a more regular structure, as shown in the following example.
Example 3:
Consider the situation of compressing 24 word-level operands into 3 operands. Using traditional carry-save adders built from (3, 2) compressors leads to the block diagram in Fig. 7(a). It can be seen that there are irregularities, indicated as dummy blocks, in the tree structure. On the other hand, using the proposed 6T compressors yields a more regular structure with much simpler routing and higher scalability, as shown in Fig. 7(b). Such a property is essential for implementing an FIR generator.
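The irregularity can be made visible by counting operands level by level. The sketch below uses a simple greedy schedule (fill as many adders as possible at each level and bypass the rest), which is an assumption; the exact arrangement in Fig. 7(a) may differ.

```python
def reduction_levels(count, fan_in, fan_out, target):
    """Greedy level-by-level schedule.

    Returns a list of (operands_after_level, operands_bypassed) pairs.
    """
    levels = []
    while count > target:
        groups, bypassed = divmod(count, fan_in)
        if groups == 0:
            raise ValueError("too few operands left to fill one adder")
        count = groups * fan_out + bypassed
        levels.append((count, bypassed))
    return levels


csa_tree = reduction_levels(24, fan_in=3, fan_out=2, target=3)
t6_tree = reduction_levels(24, fan_in=6, fan_out=3, target=3)
print("(3,2) tree:", csa_tree)
print("6T tree:   ", t6_tree)
```

Under this schedule the (3, 2) tree needs six levels and bypasses operands at several of them, whereas the 6T tree halves the operand count at each of its three levels with nothing left over.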
The Bit-oriented Architecture
The overall bit-slice computation can now be transformed into a carry-free structure as shown in Fig. 8. For simplicity, the input multiplexers have been omitted. This architecture can be further divided into three layers.
The first layer uses normal carry-save adders, denoted as 3D, to combine three operands as double-digit numbers. Note that this layer has only one level.
The second layer uses carry-free 6T adders to merge three double-digit numbers as a triple-digit sum. Similar to the first layer, there is only one level in this layer.
The third layer uses carry-free 6T adders to perform the tree-based triple-digit numbers merging. This layer could have multiple levels, depending on the number of operands being summed up simultaneously.
It is worth mentioning that the new bit-slice computation only produces a triple-digit result. After every input bit slice has been processed, the carry signals in the final triple-digit representation are then propagated using a vector-merging adder. In other words, if each bit-slice computation takes one clock cycle, then the total number of clock cycles for a single iteration (i.e., for producing one output sample) would be equal to the number of bits in a word plus the number of clock cycles for the vector merging that involves carry propagation.
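The overall flow (multiplexing, carry-free merging, shift-and-accumulate, and one final vector merging) can be simulated as follows. This is a simplified sketch, not the exact three-layer structure of Fig. 8: it assumes unsigned operands, feeds every operand directly into 6T adders (skipping the 3D first layer), pads with zeros where an adder is not full, processes the most significant slice first, and charges one clock cycle to the vector merging. All names are illustrative.

```python
def csa(a, b, c):
    """Word-level row of full adders: (sum word, carry word), carries unshifted."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def adder_6t(ops):
    """Six operands in, three aligned operands with the same total out."""
    a, b, c, d, e, f = ops
    s1, k1 = csa(a, b, c)
    s2, k2 = csa(d, e, f)
    p1, k3 = s1 ^ s2, s1 & s2
    p2, p4 = csa(k1, k2, k3)
    return [p1, p2 << 1, p4 << 2]


def compress_to_three(operands):
    """Tree of 6T adders; unused adder inputs are tied to zero."""
    ops = list(operands)
    while len(ops) > 3:
        ops += [0] * (-len(ops) % 6)
        ops = [w for g in range(0, len(ops), 6) for w in adder_6t(ops[g:g + 6])]
    return ops + [0] * (3 - len(ops))


def bit_slice_fir(coeffs, window, m):
    """Return (inner product, clock cycles) for m-bit unsigned samples."""
    acc = [0, 0, 0]                     # accumulator kept in redundant form
    cycles = 0
    for k in reversed(range(m)):        # most significant slice first
        selected = [c if (x >> k) & 1 else 0 for c, x in zip(coeffs, window)]
        shifted = [p << 1 for p in acc]  # shifter: previous result doubles
        acc = compress_to_three(selected + shifted)
        cycles += 1
    result = sum(acc)                   # vector merging: the only carry propagation
    cycles += 1                         # assumption: merging takes one cycle
    return result, cycles


coeffs = [3, 1, 4, 1, 5, 9, 2, 6]       # coeffs[j] multiplies window[j]
window = [10, 7, 0, 15, 8, 3, 12, 5]    # eight 4-bit samples
result, cycles = bit_slice_fir(coeffs, window, m=4)
assert result == sum(c * x for c, x in zip(coeffs, window))
```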
Performance Analysis
In this sub-section, we estimate the performance gain. Here, the sample period denotes the processing time required to produce one output sample.
Observation 1:
The bit-slice computation performed by the original architecture (as shown in Fig. 3) has a complexity of Θ(log m · log n), where m and n are the word-length and the FIR tap number, respectively. Here the factor of Θ(log m) is due to the carry-propagation delay inside a fast binary adder, while Θ(log n) is the factor corresponding to the levels of the tree. One sample output requires m rounds of bit-slice computation; hence, the overall sample period has a complexity of Θ(m · log m · log n).
Observation 2:
The bit-slice computation performed by the new architecture has a lower bound of Θ(d · log n), where the constant d is the propagation delay across a 6T compressor, and Θ(log n) is the factor corresponding to the levels of the tree.
Therefore, the sample period has a lower bound of Θ(m · d · log n + VMA), where VMA denotes the propagation delay for converting a triple-digit number to a normal binary form. Since VMA is also in the order of Θ(log m), we conclude that the asymptotic complexity of the new architecture is Θ(md · log n + log m).
Observation 3:
The hardware of the new architecture consists of a number of carry-free 6T adders, one shifter, one vector merging adder, and a number of registers. The total number of the 6T adders in the tree structure is roughly Θ(n) in terms of the total number of coefficients n.
Fully Parallel Architecture
The architecture we have discussed so far does not fully exploit the parallelism. To reduce the hardware requirement, we have used the same hardware for each bit-slice computation. As a matter of fact, we can also perform every bit-slice computation in parallel. In this regard, an FIR operation is viewed as the summation of n × m binary operands. For instance, if the tap number n is 16 and the word-length m is 16, then we need to sum up 256 binary numbers. In other words, we can unfold 16 rounds of bit-slice computation into a large carry-free tree structure that takes 256 binary numbers as the inputs. Since the number of levels on the critical path in our architecture grows logarithmically, such a parallel architecture can achieve a much better speed with linearly increased hardware. In other words, the time complexity could be further reduced from Θ(md · log n + log m) to Θ(d · log mn + log m).
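A rough count of adder levels illustrates the gain for the n = 16, m = 16 case. The model below is an assumption made for illustration only: every operand enters a 6T adder directly, partially filled adders are padded with zeros, the folded design merges the n selected coefficients with the three accumulator operands in each cycle, and the final vector merging is left out of both counts.

```python
def levels_6t(count):
    """Number of 6T-adder levels needed to reduce `count` operands to three."""
    levels = 0
    while count > 3:
        adders = -(-count // 6)     # ceiling division: adders used at this level
        count = 3 * adders
        levels += 1
    return levels


n, m = 16, 16
folded = m * levels_6t(n + 3)       # one pass through the tree per bit slice
unfolded = levels_6t(n * m)         # a single pass through one large tree
print(folded, unfolded)
```

Under these assumptions the folded design traverses 48 adder levels per output sample and the unfolded design only 7, in line with the move from an m · log n term to a log mn term.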
Local-Carry Adders
One reason why the proposed 6T adders can be plugged into the tree-based architecture for multiple-operand merging without disturbing the regularity is that they have an integer compression ratio of 2 (i.e., six inputs can be merged into three). This is a unique property that is not found in the numerous compressors proposed previously in the literature, such as the (3, 2), (5, 3), or (7, 4) compressors. In general, compressors do not interact with one another so as to avoid long carry propagation.
However, in the following, we show that, surprisingly, allowing local carry propagation can further enhance the compression ratio without increasing the overall carry propagation delay inside a six-input adder. By doing so, the number of layers in the tree structure can be further reduced, leading to an even higher speed. Fig. 9 shows the structure of a six-to-double compressor, or 6D compressor for short. It can be used to build efficient local-carry adders. This compressor takes in four extra carry input bits, along with six input bits, and compresses them into four carry output bits plus two output bits. That is, a total of 10 input bits of the same weight are compressed into 4 × 2 + 2 by this building block: four carry bits of weight 2 and two output bits of weight 1.
It can be seen that the local-carry adder can be constructed by cascading a number of 6D compressors as shown in Fig. 10 .
There are carry signals propagating between stages. However, the carry propagation has a short range that goes as far as only two stages. Therefore, the overall propagation delay still does not depend on the total number of bits of the adder. This property is achieved by careful signal routing among consecutive stages. Careful examination reveals that the longest propagation delay ripples through at most 3 levels of full-adder cells from the top inputs down to the bottom output, which is exactly the same as in the 6T adders discussed previously. The routing that achieves this mandates that the carry output signals generated by a full-adder cell in one stage should ripple down one level when feeding the next stage. For example, $c_{out0}$, $c_{out1}$, $c_{out2}$, $c_{out3}$ are the carry output bits of the $i$-th 6D compressor as shown in Fig. 9. Their levels away from the inputs are 1, 1, 2, and 3, respectively. When they feed to the left into the next higher-order stage, they drive full-adder cells in levels 2, 2, 3, and none, respectively. By enforcing this rule, all carry propagations are restricted to at most three full-adder cells.
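The level rule can be tried out on a bit-level model. The wiring below is only one arrangement of four full-adder cells that is consistent with the stated levels (carry outputs produced at levels 1, 1, 2, 3 and consumed at levels 2, 2, 3, none); it is an assumption and may not match Fig. 9 in detail. Operands are assumed to be unsigned.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry) of three bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_6d(bits, cins):
    """Six input bits plus four carry-ins -> two output bits plus four carry-outs."""
    a, b, c, d, e, f = bits
    cin0, cin1, cin2, cin3 = cins
    s1, cout0 = full_adder(a, b, c)            # level 1
    s2, cout1 = full_adder(d, e, f)            # level 1
    s3, cout2 = full_adder(s1, cin0, cin1)     # level 2
    out0, cout3 = full_adder(s2, s3, cin2)     # level 3
    out1 = cin3                                # passes straight to the output
    return (out0, out1), (cout0, cout1, cout2, cout3)


def adder_6d(operands, width):
    """Cascade 6D compressors over the bit positions; returns two operands."""
    out0 = out1 = 0
    carries = (0, 0, 0, 0)
    for i in range(width + 3):                 # 3 extra positions hold the growth
        bits = tuple((op >> i) & 1 for op in operands)
        (o0, o1), carries = compressor_6d(bits, carries)
        out0 |= o0 << i
        out1 |= o1 << i
    return out0, out1


# Each stage preserves the weighted bit count: 10 bits in, 4 x 2 + 2 out.
for pattern in product((0, 1), repeat=10):
    outs, couts = compressor_6d(pattern[:6], pattern[6:])
    assert sum(pattern) == sum(outs) + 2 * sum(couts)

ops = [13, 7, 255, 1, 0, 100]
assert sum(adder_6d(ops, width=8)) == sum(ops)
```

In this wiring no signal passes through more than three full-adder cells, whether it enters as a data bit or as a carry from the neighbouring stage.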
Observation 4: The improvement of the 6D compressor over the 6T compressor can be characterized by their compression ratios. In the 6T-based adder explained in Section 3.2, one adder reduces 6 operands to 3 operands, giving a compression ratio of 2. In a 6D-based adder, one adder reduces 6 operands to 2 operands, giving a compression ratio of 3. As a result, simply by using the proposed 6D compressors, the compression ratio is improved from 2 to 3, which is a 50% improvement, without incurring any longer propagation delay. Again, it is worth mentioning that the tree-based regular structure for bit-slice computation can be retained because the 6D compressors still have an integer compression ratio.
Experimental Results
We have implemented an FIR generator based on the proposed architecture. The user can use this tool to generate the Verilog code of the fully parallel architecture or the bit-slice architecture. In the following, we compare the results with the traditional fast multiplier-based FIRs using the transposed form [4]. The commercial tool Design Compiler and the TSMC 0.25µm standard cell library are used to derive the area estimation and the clock cycle time. For the power dissipation, we use the relatively accurate PrimePower for the estimation. The test cases range over two parameters. The first parameter, word-length, ranges from 10 to 30, while the second parameter, FIR tap number, ranges from 12 to 30.
In Fig. 11, we compare the round-off errors with those of the transposed form. It shows clearly that the round-off errors of our architecture are much smaller because we do not truncate as often as the architectures that are based on multipliers.
In Table 1, we compare with the transposed form in terms of the area, sample period, and area-delay product. It can be seen that the proposed fully parallel architecture outperforms the transposed form in every category. On average, the delay is comparable, the area is 79%, and the power dissipation is only 56%, implying a 44% power reduction. Note that the power reduction comes mainly from glitch reduction. The proposed architecture has a very balanced structure, meaning that all combinational paths from the inputs to the outputs or between flip-flops are of similar lengths. Such a property helps suppress glitches, thus saving power. It is worth mentioning that our architecture is completely balanced at the bit level. As a result, it does not even suffer from the local glitches induced by word-level adders such as those in the baseline tree-based architecture of Fig. 3.
As mentioned previously, we can easily trade speed for area reduction by using the bit-oriented architecture. As shown in Table 1, one may cut the power dissipation down to 5% and the area down to 25%, at the cost of 5.18 times the processing time.
This is particularly useful for ultra-low-power applications that do not require much processing speed. In terms of the joint metric, the power-area-delay product (PAD product for short), the bit-oriented architecture is only 6.67% of that of the transposed form. Note that in this work we focus on ASIC-based implementation. For FPGA-based implementation, different design considerations that take into account the specific FPGA architecture are necessary in order to produce more area-efficient results [3].
Conclusion
The previous bit-oriented architecture for the FIR filter cannot scale well as the tap number increases. In this paper we present an improvement that performs the summation of multiple operands on the fly. The major contributions of this paper are twofold.
Firstly, we proposed a regular, carry-free, tree-based structure for multiple-operand merging that can be easily automated.
Secondly, we found, quite contrary to the intuition, that a local-carry adder can actually be more efficient than a pure carry-free adder. In terms of the compression ratio when doing multiple-operand merging, it is 50% better. It has been shown that the 
