Advances in nanoelectronic fabrication have enabled integrated circuits to operate at high frequencies, whereas a finite impulse response (FIR) filter needs only to meet its real-time demand. Accordingly, increasing the folding number of an FIR architecture can exploit the high-frequency operation to reduce hardware complexity while still allowing applications to operate in real time. In this work, a folding scheme that integrates input-data folding and tap folding is proposed to develop a hardware-efficient programmable FIR architecture. With the radix-4 Booth algorithm, a 2-bit input subdata approach replaces the conventional 3-bit input subdata approach to reduce the number of latches required to store input subdata. Additionally, a tree accumulation approach with simplified carry-in bit processing is developed to minimize the hardware complexity of the accumulation path. With folding in both input data and taps, and with the reduced hardware complexity of the input subdata latches and the accumulation path, the proposed FIR architecture is demonstrated to have a low hardware complexity. Implemented in the TSMC 0.18 µm CMOS technology, the proposed FIR processor with 10-bit input data and filter coefficients realizes a 128-tap FIR filter, occupies an area of 0.45 mm², and yields a throughput rate of 20 M samples per second at 200 MHz. Compared with conventional FIR processors, the proposed programmable FIR processor not only meets the throughput-rate demand but also occupies the smallest area per tap.
INTRODUCTION
The finite impulse response (FIR) filter is regarded as one of the major operations in digital signal processing; in particular, programmable FIR filters with a high tap number are commonly applied in ghost cancellation and channel equalization. The main operation of an FIR filter is convolution, which is performed using additions and multiplications. The high computational complexity of this operation makes dedicated hardware more suitable for achieving the required computational performance. However, the dedicated hardware used to realize a high-tap-number programmable FIR filter is costly, so minimizing its hardware cost is an important issue.
For an architecture with regular computation, a folding scheme that reuses the same small hardware component to complete a set of computations is frequently applied to reduce hardware complexity [1, 2]. Generally, the folding schemes of an FIR architecture can be classified into input-data folding, coefficient folding, and tap folding [3-11]. Additionally, while advances in nanoelectronic fabrication have enabled integrated circuits to operate at high frequencies, the throughput-rate demand of an FIR filter has not changed significantly. Because of this mismatch, the folding technique must be further improved to design a hardware-efficient FIR architecture. Figure 1 presents the relationship between computational performance, hardware complexity, and circuit speed on different hardware platforms in realizing a high-tap-number FIR filter. With only one or a few multipliers/adders, programmable processors cannot realize a high-tap-number FIR filter in real time. On the other hand, conventional FIR architectures using the application-specific integrated circuit (ASIC) approach can do so, but with fixed folding numbers; as circuit speed increases, they can only slightly decrease their hardware complexities by reducing their pipelined latches. Therefore, in advanced fabrication technologies, conventional FIR architectures with fixed folding numbers cannot realize a hardware-efficient FIR filter. Instead, an FIR architecture that can increase its folding number would cost-effectively meet the real-time performance demand. With the use of high-speed circuitry, the folding number of such an architecture is increased accordingly to effectively decrease the number of computation units required. 
Overall, this FIR architecture can fill the gap between fabrication migration and hardware platform development by meeting the real-time demand with hardware efficiency.
In FIR architecture design, the circuit required for the multiplication operation is a major concern because it accounts for a large part of the hardware complexity. The multiplication operation includes partial-product generation, partial-product shifting, and partial-product summation. Of these, partial-product shifting can be realized by hardwiring, so it incurs no additional hardware complexity. To avoid computation at large word lengths, the folding scheme can be applied to add the partial products at the same precision index from multiple multiplication operations, shift the added results, and then sum these shifted results to complete an FIR filter operation. Based on this arrangement, an FIR architecture employing input-data and tap folding is proposed in this work. With input-data folding, each input datum is partitioned into multiple input subdata with short word lengths. In each clock cycle, multiplication operations are performed on the input subdata at the same precision index and the coefficients corresponding to these subdata. The results are then added, and the shifting and accumulation operations of the multiplications are performed on the summed results to derive an output datum. Because the shifting operation is performed after the tap summation, it does not increase the word length of the intermediate data, which saves hardware cost in the adders of the tap summation. However, with input-data folding alone, the architecture's folding number is limited by the input-data word length and cannot increase along with the use of high-speed circuitry. The proposed architecture therefore also integrates tap folding, which partitions an FIR filter into multiple sections and completes the sections one after another. The folding number of the proposed architecture is the product of the folding numbers of input-data folding and tap folding. 
Increasing the folding number of the tap-folding scheme also increases the folding number of the proposed FIR architecture, so that high-speed circuitry can be used to effectively reduce the hardware complexity. In comparison to the conventional architectures under the same folding number, the proposed architecture demonstrates a lower hardware complexity.
Based on the radix-4 Booth algorithm, two approaches are proposed to reduce the hardware complexity of the FIR architecture: a 2-bit input subdata approach and a tree accumulation approach with simplified carry-in bit processing. In the 2-bit input subdata approach, in addition to the input subdatum currently in use, the Booth decoder also relies on the prior input subdatum and a control signal to perform Booth decoding. This allows the proposed FIR architecture to reduce the number of latches required to store the input subdata. In the tree accumulation approach, full adders are fully utilized to perform the addition operations, so the proposed FIR architecture can omit half adders and thereby achieve a low hardware complexity. In this work, the cell library of the TSMC 0.18 µm CMOS technology is used to implement the proposed FIR processor with 10-bit input data and coefficients to realize 128 taps. Besides satisfying the throughput-rate requirement, the proposed FIR processor is demonstrated to have the smallest hardware area per tap among the processors compared.
CONVENTIONAL BOOTH-ALGORITHM FIR ARCHITECTURES USING FOLDING SCHEMES
The operation of an FIR filter can be written as

$$Y_n = \sum_{i=0}^{N-1} C_i \times X_{n-i}, \tag{1}$$
where $X$, $C$, and $Y$ represent the input data, filter coefficients, and output data, respectively, and $N$ is the number of taps. The Booth algorithm is typically used to implement the multiplication operations of a programmable FIR filter and thus effectively reduce computational time and hardware complexity [12, 13]. A comparison of the radix-2, radix-4, radix-8, and radix-16 Booth algorithms in terms of both computational performance and hardware complexity reveals that the radix-4 Booth algorithm is the most hardware efficient [14]. Therefore, the radix-4 Booth algorithm is applied in the proposed FIR architecture. The radix-4 Booth algorithm takes the multiplier $X_{n-i}$ and the multiplicand $C_i$ with word lengths of $W$ and $L$, respectively. Each input datum $X_{n-i}$ is partitioned into 3-bit groups, each of which has one bit that overlaps with the previous group, which can be written as

$$X_{n-i,l} = \{\, b_{l,1},\ b_{l,0},\ b_{l-1,1} \,\}, \quad b_{-1,1} = 0, \quad l = 0, 1, \ldots, (W/2)-1, \tag{2}$$

$$X_{n-i} = \sum_{l=0}^{(W/2)-1} \left( -2\,b_{l,1} + b_{l,0} + b_{l-1,1} \right) \times 2^{2l}. \tag{3}$$
$C_i$ is multiplied by $X_{n-i}$, and (3) is modified to

$$C_i \times X_{n-i} = \sum_{l=0}^{(W/2)-1} B(X_{n-i,l}, C_i) \times 2^{2l}, \tag{4}$$
where $B(X_{n-i,l}, C_i)$ is the output of Booth decoding, which takes one of five values, $0$, $\pm C_i$, and $\pm 2C_i$, according to $X_{n-i,l}$. According to (1), an FIR architecture can be folded based on input data, coefficients, or taps. First, in the input-data folding scheme, with the radix-4 Booth algorithm used to perform the multiplication operations, each $W$-bit input datum is partitioned into $W/2$ 3-bit input subdata that then undergo Booth decoding in order. From (1) and (4), the operation of an FIR filter can be modified as

$$Y_n = \sum_{l=0}^{(W/2)-1} \left[ \sum_{i=0}^{N-1} B(X_{n-i,l}, C_i) \right] \times 2^{2l}. \tag{5}$$
Like the input-data folding scheme, the coefficient-folding scheme partitions each $L$-bit coefficient into $L/2$ 3-bit sub-coefficients, on which Booth decoding is then performed in sequence. Equation (1) can be modified as

$$Y_n = \sum_{l=0}^{(L/2)-1} \left[ \sum_{i=0}^{N-1} B(C_{i,l}, X_{n-i}) \right] \times 2^{2l}, \tag{6}$$
where $C_{i,l}$ is the $l$th 3-bit sub-coefficient of the coefficient $C_i$, and $B(C_{i,l}, X_{n-i})$ takes one of five values, $0$, $\pm X_{n-i}$, and $\pm 2X_{n-i}$. In the tap-folding scheme, an FIR filter is partitioned into $f$ parts whose operations are completed in turn. This scheme modifies the operation of an FIR filter from (1) as follows:

$$Y_n = \sum_{k=0}^{f-1} \sum_{i=0}^{(N/f)-1} C_{if+k} \times X_{n-(if+k)}. \tag{7}$$
Equations (5), (6), and (7) reveal that FIR architectures equipped with input-data folding, coefficient folding, and tap folding have folding numbers of $W/2$, $L/2$, and $f$, respectively. The three folding schemes based on (5), (6), and (7) are applied to the two commonly used FIR architectures, the direct form and the transposed direct form, to derive the six FIR architectures shown in Figure 2. Among them, the preprocessing units of the architectures in Figures 2(a), 2(b), 2(c), and 2(d) partition input data or coefficients into 3-bit input subdata or 3-bit sub-coefficients, and perform predecoding on them to reduce the hardware complexities of the Booth decoders [3-5, 11]. Input (sub-)data latches and (sub-)coefficient latches store the input (sub-)data and (sub-)coefficients, respectively. $N$ Booth decoders perform Booth decoding, and their results are added in the accumulation path. Pipelined latches are used to reduce the delay and to arrange the data flow in the accumulation computation. Lastly, the post-processing unit performs summation and shifting on the results from the accumulation path to realize the computation of (5) and (6). In the architectures shown in Figures 2(e) and 2(f), $N/f$ multipliers are assigned to perform the multiplication operations. Each multiplier is equipped with $W/2$ or $L/2$ Booth decoders to generate partial products. The partial products from the $N/f$ multipliers are summed in the accumulation path. Finally, the results from the accumulation path are passed to the post-processing unit, which performs the summation operation and thus completes the computation in (7) [6-8].
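As a behavioral illustration of the input-data folding and tap-folding schemes referred to as (5) and (7), the following Python sketch checks that both reorderings give the same output as the direct convolution. It is a minimal model under stated assumptions, not the hardware described in Figure 2: the function names, the sample values, and the use of plain Python integers in place of fixed-width registers are all introduced here for illustration.

```python
def booth_digits(x, width):
    """Radix-4 Booth digits of a width-bit two's-complement integer.

    Digit l equals -2*b[2l+1] + b[2l] + b[2l-1] with b[-1] = 0, so every
    digit lies in {-2, -1, 0, 1, 2}. `width` is assumed to be even.
    """
    bits = [(x >> j) & 1 for j in range(width)]  # LSB first
    digits, prev = [], 0
    for l in range(width // 2):
        digits.append(-2 * bits[2 * l + 1] + bits[2 * l] + prev)
        prev = bits[2 * l + 1]
    return digits


def fir_direct(x_hist, coeffs):
    """Reference convolution; x_hist[i] holds X_{n-i}."""
    return sum(c * x for c, x in zip(coeffs, x_hist))


def fir_input_data_folded(x_hist, coeffs, width):
    """Input-data folding: one precision index l per clock cycle (W/2 cycles)."""
    digits = [booth_digits(x, width) for x in x_hist]
    y = 0
    for l in range(width // 2):
        tap_sum = sum(digits[i][l] * coeffs[i] for i in range(len(coeffs)))
        y += tap_sum << (2 * l)  # shift applied after the tap summation
    return y


def fir_tap_folded(x_hist, coeffs, f):
    """Tap folding: the filter is split into f parts, one part per clock cycle."""
    n = len(coeffs)
    y = 0
    for k in range(f):
        y += sum(coeffs[i * f + k] * x_hist[i * f + k] for i in range(n // f))
    return y


# Hypothetical 10-bit samples and coefficients for an 8-tap filter.
x_hist = [-512, 511, 3, -7, 100, -100, 0, 255]
coeffs = [5, -3, 511, -512, 7, 9, -1, 2]
ref = fir_direct(x_hist, coeffs)
print(ref == fir_input_data_folded(x_hist, coeffs, 10) == fir_tap_folded(x_hist, coeffs, 2))
```

In this model the input-data folded version takes $W/2$ loop iterations and the tap-folded version takes $f$ iterations, which mirrors the folding numbers stated above.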
An FIR architecture with the transposed direct form is able to use the pipelining in the accumulation path to reduce the number of input (sub)-data latches. But, for the transposed direct-form architectures using coefficient folding and tap folding, as shown in Figures 2(d) and 2(f), the operation frequencies of input data paths are lower than those of pipelined latches in the corresponding accumulation paths. Hence, the accumulation path has to use more pipelined latches to store the computation results from its adders, in order to generate the correct output of an FIR filter. Due to this fact, the two architectures in Figures 2(d) and 2(f) cannot achieve low hardware complexities, and thereby are not explored further.
To take a closer look at the architectures in Figures 2(a), 2(b), 2(c), and 2(e), the features of the functional units of these four architectures are listed in Table 1. Under the same folding number, $W/2 = L/2 = f$, these four architectures all have the same number of Booth decoders. However, because the preprocessing unit performs predecoding on subdata and sub-coefficients, the hardware complexities of the Booth decoders in the architectures shown in Figures 2(a), 2(b), and 2(c) are reduced, whereas the Booth decoder in the architecture in Figure 2(e) incurs a higher hardware complexity than those of the other three architectures. As illustrated in Table 1, when $W$ equals $L$, the architectures in Figures 2(a) and 2(c) have the same number of latches to store input (sub-)data and (sub-)coefficients. They also have Booth decoders and accumulation paths with the same hardware complexities. However, because the architecture in Figure 2(c) requires multiplexers to select the sub-coefficients, its hardware complexity is slightly higher than that of the architecture in Figure 2(a). The architecture in Figure 2(a) displays the lowest hardware complexity, but its folding number is limited by the input-data word length. When high-speed circuitry is employed in this architecture, the only way to lower hardware complexity is to reduce the pipelined latches in the accumulation path. In contrast, the architecture in Figure 2(e) can increase its folding number to reduce the numbers of Booth decoders and adders, and thus effectively lower the hardware complexity. However, because the partial-product shifting operation is performed before the accumulation path, the architecture in Figure 2(e) has adders and pipelined latches with larger word lengths than those in the accumulation paths of the architectures in Figures 2(a), 2(b), and 2(c). Hence, an integrated folding scheme combining input-data folding and tap folding is proposed in this work. This integrated folding scheme combines the advantages of the architectures in Figures 2(a) and 2(e): an accumulation path with a low hardware complexity, and the capability of increasing the folding number to reduce hardware complexity.
PROPOSED FIR ARCHITECTURE
By using input-data folding and tap folding, the FIR filter computation in (1) can be modified as

$$Y_n = \sum_{l=0}^{(W/2)-1} \left\{ \sum_{k=0}^{f-1} \left[ \sum_{i=0}^{(N/f)-1} B\!\left(X_{n-(if+k),l},\, C_{if+k}\right) \right] \right\} \times 2^{2l}, \tag{8}$$

where $f$ is the folding number of tap folding and $W/2$ is the folding number of input-data folding. $B(X_{n-(if+k),l}, C_{if+k})$ is computed using $N/f$ Booth decoders, and an accumulation path sums the outputs of the Booth decoders. The summations $\sum_{l=0}^{(W/2)-1} \sum_{k=0}^{f-1}$ and the shift $\times 2^{2l}$ are sequentially computed in the post-processing unit. According to (8), this integrated folding scheme yields an FIR architecture with a high folding number by increasing the folding number of tap folding. Moreover, unlike conventional tap folding, its partial-product shifting operation is processed in the post-processing unit to reduce hardware complexity in the accumulation path. Based on (8), the proposed FIR architecture is presented in Figure 3. While the input-data and tap-folding schemes are employed in the proposed FIR architecture, the 2-bit input subdata approach and the tree accumulation approach with simplified carry-in bit processing are developed to further reduce the hardware complexity. The following subsections describe these two approaches.
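The schedule implied by the integrated folding scheme can be sketched in Python as follows. This is a behavioral model only, assuming that the precision index $l$ is the outer loop and the tap section $k$ the inner loop (consistent with the extra Booth bit being held at zero for $f$ cycles, as described later); the function names and test vectors are hypothetical, and Python integers stand in for the fixed-width datapath.

```python
def booth_digit(x, l):
    """l-th radix-4 Booth digit of a two's-complement integer x (b[-1] = 0)."""
    prev = (x >> (2 * l - 1)) & 1 if l > 0 else 0
    return -2 * ((x >> (2 * l + 1)) & 1) + ((x >> (2 * l)) & 1) + prev


def fir_integrated_folding(x_hist, coeffs, width, f):
    """Integrated input-data and tap folding.

    x_hist[j] holds X_{n-j}. Returns the output and the number of clock
    cycles consumed in this model, which equals (width // 2) * f.
    """
    n = len(coeffs)
    assert n % f == 0 and width % 2 == 0
    y, cycles = 0, 0
    for l in range(width // 2):            # input-data folding
        section_sum = 0
        for k in range(f):                 # tap folding
            # N/f Booth decoders work in parallel; the accumulation path
            # sums their outputs without any shifting.
            acc = sum(
                booth_digit(x_hist[i * f + k], l) * coeffs[i * f + k]
                for i in range(n // f)
            )
            section_sum += acc             # sequential accumulation (post-processing)
            cycles += 1
        y += section_sum << (2 * l)        # shift applied once per precision index
    return y, cycles


# Hypothetical 10-bit data for an 8-tap filter with tap-folding number 2.
x_hist = [-512, 511, 3, -7, 100, -100, 0, 255]
coeffs = [5, -3, 511, -512, 7, 9, -1, 2]
y, cycles = fir_integrated_folding(x_hist, coeffs, width=10, f=2)
print(y == sum(c * x for c, x in zip(coeffs, x_hist)), cycles)
```

With `width=10` and `f=2` the model uses 10 cycles per output, the product of the two folding numbers.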
2-bit input subdata approach
According to (2), the least significant bit of each original 3-bit input subdatum is either zero or the most significant bit of the previous input subdatum [12, 13]. Consequently, 2-bit input subdata rather than 3-bit input subdata can be used to reduce the number of latches on the input data path. As shown in Figure 4, the preprocessing unit comprises an input latch, a multiplexer, and a 1-bit XOR gate. The input latch stores input data. The multiplexer, which is addressed by the control unit, selects the correct sequence of 3-bit input subdata. Meanwhile, the 1-bit XOR gate predecodes the 3-bit input subdata to generate new 2-bit input subdata that slightly reduce the hardware complexities of the Booth decoders. Figure 3 shows that the 2-bit input subdata generated by the preprocessing unit are pipelined to the input subdata latches. Through multiplexers that select from the input subdata and coefficients, each Booth decoder obtains the appropriate input subdatum and coefficient for Booth decoding. In the radix-4 Booth algorithm, the possible results $\pm j \times C_i$ are generated by the Booth decoders, where $j$ is an integer between zero and two. However, in the 2-bit input subdata approach, a 2-bit input subdatum from the input subdata latches cannot represent five choices. The Booth decoder must therefore use one bit from the neighboring input subdata latch ($b_{l-1,1}$) as well as the two bits from its corresponding input subdata latch ($b_{l,1}$ and $b_{l,0}$), as shown in Figure 5. According to (2), when $l$ in (8) equals zero, this extra bit ($b_{l-1,1}$) must be set to zero. To realize the computation of (8), a control signal drives an AND gate so that $b_{l-1,1}$ is reset to zero every $f \times (W/2)$ clock cycles and held at zero for $f$ clock cycles. Accordingly, $b_{l,1}$, $b_{l,0}$, and $b_{l-1,1}$ together with this control signal are employed to generate a partial product and a carry-in bit, which represent the output $0$, $C_i$, $-C_i$, $2C_i$, or $-2C_i$. 
In particular, an inverter is applied to invert the sign bit of the partial product, so that when the outputs of the Booth decoders are summed in the accumulation path, the sign extension operation can be omitted and the hardware complexity of the accumulation path is reduced accordingly [5]. Although the proposed Booth decoder is slightly more complex than the conventional Booth decoder [11], this design allows 2-bit input subdata latches to be used instead of the conventional 3-bit input subdata latches in the input data path.
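The decoding rule described in this subsection can be modeled behaviorally as below. The sketch assumes one common realization of a negative Booth output, namely the bitwise complement of the selected multiple plus a carry-in bit of one; the gate-level circuit of Figure 5, the XOR predecoding, and the sign-bit inversion are not modeled, and all function names are hypothetical.

```python
def split_2bit(x, width):
    """Split a width-bit two's-complement integer into W/2 two-bit subdata.

    Returns (b_l1, b_l0) pairs, least significant subdatum first.
    """
    return [((x >> (2 * l + 1)) & 1, (x >> (2 * l)) & 1) for l in range(width // 2)]


def booth_decode(b1, b0, b_prev, reset, c):
    """Return (partial_product, carry_in) for one Booth decoder.

    `reset` models the control signal: when asserted, the AND gate forces
    the neighbouring bit b_{l-1,1} to zero (the l = 0 case).
    """
    b_prev &= 0 if reset else 1
    digit = -2 * b1 + b0 + b_prev          # one of -2, -1, 0, 1, 2
    multiple = abs(digit) * c              # 0, C or 2C
    if digit < 0:
        return ~multiple, 1                # complement plus carry-in gives -multiple
    return multiple, 0


def multiply_via_subdata(x, c, width):
    """Multiply x by c using only 2-bit subdata and the neighbour's MSB."""
    total = 0
    b_prev = 1                             # deliberately stale; reset must mask it
    for l, (b1, b0) in enumerate(split_2bit(x, width)):
        pp, cin = booth_decode(b1, b0, b_prev, reset=(l == 0), c=c)
        total += (pp + cin) << (2 * l)
        b_prev = b1                        # MSB kept in the neighbouring 2-bit latch
    return total


print(multiply_via_subdata(-300, 511, 10) == -300 * 511)
```

Because the third bit of each Booth group is read from the neighbouring latch, only two bits per subdatum are stored in this model, which is the latch saving the approach targets.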
Tree accumulation approach
In the FIR architecture, each Booth decoder generates a partial product and a carry-in bit. The accumulation path sums all of the partial products and carry-in bits, and the summed results are then passed to the post-processing unit to yield the final result. The carry-save addition technique is applied to minimize the carry propagation delay and increase the computational efficiency of the accumulation path. Its basic building blocks are full adders and half adders. A full adder processes three input bits at the same precision index and generates two output bits at different precision indexes, whereas a half adder processes only a pair of input bits at the same precision index and also produces two output bits at different precision indexes. A half adder therefore cannot reduce the number of bits, because its number of input bits equals its number of output bits. Consequently, making full use of full adders and reducing the use of half adders further decreases the hardware complexity of the accumulation path. The conventional tree accumulation divides the additions in the accumulation path into three parts: the addition of the partial products, the addition of the carry-in bits, and the addition of the outputs of these two parts. The proposed tree accumulation approach hides the summation of the carry-in bits in the partial-product summation in the accumulation path and in the intermediate-result summation in the post-processing unit. Figure 6 uses eight 4-bit partial products and their carry-in bits as an example to demonstrate the proposed and conventional tree accumulation approaches using carry-save adders. Figure 6(a) depicts the conventional tree accumulation, in which partial products and carry-in bits are summed separately, increasing the number of half adders required. Moreover, the summed partial products must then be added to the summed carry-in bits, which takes additional processing time. 
Herein, the conventional tree accumulation requires 28 full adders and five half adders. Figure 6(b) depicts the proposed tree accumulation, in which the partial products and the carry-in bits are summed together. The proposed approach effectively exploits full adders to add the partial products and carry-in bits, and omits half adders. Hence, only 26 full adders are required in the proposed tree accumulation. An accumulation path can be partitioned into pipelined stages to improve computational performance. When each pipelined stage is allowed the delay of one or two carry-save adders, 89 or 38 1-bit latches, respectively, are required in the proposed tree accumulation, and 115 or 52 1-bit latches are required in the conventional tree accumulation. Thus, the proposed tree accumulation also needs fewer latches than the conventional one. As shown in Figure 6(b), a carry-in bit is treated as the least significant bit of the carry value in each layer and is added together with the other sum or carry values. However, as Figure 6(b) also shows, the proposed accumulation yields only six carry values, which implies that it can process the summation of only eight partial products and six carry-in bits. The sum and carry outputs and the two unprocessed carry-in bits are passed to the post-processing unit for addition.
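The idea of placing a carry-in bit in the always-zero least significant bit of a carry vector can be illustrated with a word-level Python model. This is a sketch under assumptions: it treats each carry-save adder as a 3:2 compressor on non-negative integers, uses made-up operand values, and does not reproduce the bit-level adder or latch counts of Figure 6; the function names are hypothetical.

```python
def csa(a, b, c, carry_in=0):
    """Word-level 3:2 carry-save adder on non-negative integers.

    The carry vector is shifted left by one, so its LSB is always zero and a
    single carry-in bit can be placed there without any extra adder.
    """
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry | carry_in


def tree_accumulate(partials, carry_ins):
    """Reduce the partial products to a (sum, carry) pair with a CSA tree.

    Each CSA absorbs at most one carry-in bit; bits that find no free carry
    LSB are returned as `pending` for a later stage.
    """
    operands, pending = list(partials), list(carry_ins)
    while len(operands) > 2:
        next_layer = []
        while len(operands) >= 3:
            a, b, c = operands[:3]
            operands = operands[3:]
            cin = pending.pop(0) if pending else 0
            s, carry = csa(a, b, c, cin)
            next_layer += [s, carry]
        operands = next_layer + operands
    return operands[0], operands[1], pending


# Eight hypothetical 4-bit partial products with their carry-in bits.
partials = [3, 15, 7, 0, 9, 12, 1, 5]
carry_ins = [1, 0, 1, 1, 0, 1, 1, 1]
s, carry, pending = tree_accumulate(partials, carry_ins)
print(s + carry + sum(pending) == sum(partials) + sum(carry_ins), len(pending))
```

Reducing eight operands to two takes six carry-save adders in this model, so six carry-in bits are absorbed and two remain pending, in line with the six carry values and two unprocessed carry-in bits noted above.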
In the post-processing unit, the carry and sum values generated by the accumulation path and the two unprocessed carry-in bits are accumulated and shifted. Figure 7 shows the proposed post-processing unit. Two $(L + 1 + \log_2(N/f))$-bit carry-save adders are employed to perform sequential accumulation, and two $(L + W + \log_2 N)$-bit 2-to-1 multiplexers are applied in shifting. Notably, two $(L + W + \log_2 N)$-bit 2-to-1 multiplexers are used to select a zero value and a correction term in the first clock cycle. The correction term compensates for the omission of the sign extension operation in the accumulation path [3-5]. Additionally, the least significant bits of the two carry values generated by the carry-save adders in the post-processing unit are zero, so the two unprocessed carry-in bits can be treated as the least significant bits of these two carry values, and their addition is performed in the two carry-save adders of the post-processing unit. Finally, the vector merge adder (VMA) sums the carry and sum values to derive the final result.
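The role of the correction term can be illustrated with the standard sign-extension-elimination identity: inverting the sign bit of a $b$-bit two's-complement value and reading the result as unsigned adds $2^{b-1}$ to it, so a constant can undo the total offset after summation. The Python sketch below shows that identity only; the exact correction constant and its timing in the proposed post-processing unit are not given in the text, so the names and the constant here are assumptions.

```python
def invert_sign_bit(p, bits):
    """bits-wide two's-complement pattern of p with its MSB inverted,
    interpreted as an unsigned integer (equals p + 2**(bits - 1))."""
    pattern = p & ((1 << bits) - 1)
    return pattern ^ (1 << (bits - 1))


def sum_without_sign_extension(values, bits):
    """Add sign-bit-inverted patterns as unsigned numbers, then apply a
    single correction term that removes the accumulated offset."""
    unsigned_sum = sum(invert_sign_bit(p, bits) for p in values)
    correction = -len(values) * (1 << (bits - 1))
    return unsigned_sum + correction


# Hypothetical 11-bit Booth decoder outputs.
values = [-1024, 1023, -1, 0, 5, -300]
print(sum_without_sign_extension(values, 11) == sum(values))
```

Since the offset depends only on the number of summed values and their word length, it can be precomputed and injected once, which is consistent with selecting a correction term in the first clock cycle.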
ANALYSES AND COMPARISONS OF PROPOSED AND CONVENTIONAL FIR ARCHITECTURES
In this section, the cell library of the TSMC 0.18 µm CMOS technology is applied to derive the number of transistors required for each functional unit [15], and these numbers are used to analyze and compare the hardware complexities of the proposed and conventional FIR architectures. First, three types of FIR architectures employing input-data and tap folding, types I, II, and III, are defined to analyze the effectiveness of the proposed 2-bit input subdata approach and tree accumulation approach in reducing hardware complexity. All three architectures have the same folding numbers, with the folding numbers of input-data folding and tap folding being $W/2$ and 2, respectively. The type-I FIR architecture uses both the proposed 2-bit input subdata approach and the tree accumulation approach to lower its hardware complexity, while the type-II one uses only the 2-bit input subdata approach and the type-III one adopts only the proposed tree accumulation approach. The numbers of transistors required for these three architectures are shown in Figure 8.
In comparing the type-I and type-II architectures, the type-I architecture requires fewer transistors than the type-II one because it simplifies the processing of $N/2$ carry-in bits. As the tap number ($N$) increases, the number of carry-in bits whose processing can be simplified also increases, allowing the type-I architecture to further reduce the number of transistors required. Additionally, the difference between the numbers of transistors required by the type-I and type-II architectures does not change significantly with the input-data word length ($W$) or the coefficient word length ($L$). In comparison to the type-III architecture, the type-I architecture uses the 2-bit input subdata approach to reduce the number of input subdata latches. The Booth decoder in the type-I architecture demands slightly more logic gates than that of the type-III architecture, but the type-I architecture as a whole still requires fewer transistors. With an increase in the input-data word length ($W$) and tap number ($N$), the type-I architecture continues to require fewer transistors than the type-III one.
As stated in Section 2, under the same folding number, the architecture in Figure 2(a) has a lower hardware complexity than the other architectures in Figure 2. However, whereas the folding number of the architecture in Figure 2(a) is fixed, the folding number of the architecture in Figure 2(e) can be increased to lower hardware complexity. We therefore compare the hardware complexities of the proposed architecture and the architectures in Figures 2(a) and 2(e). For a fair comparison, these three architectures must operate at the same throughput rate. According to [13], the throughput rate can be represented by $n_s/T_{clk}$, where $T_{clk}$ is the period of a clock cycle and $n_s$ is the number of outputs produced in a clock cycle. Additionally, $T_{clk}$ is equivalent to the critical delay. For a folded FIR architecture, the folding number is the number of clock cycles required to generate an output. Accordingly, the throughput rate can be denoted as follows [13]:

$$\text{throughput rate} = \frac{1}{T_{clk} \times \text{folding number}}. \tag{9}$$
With $T_{FA}$ representing the delay of a full adder and the throughput rate fixed at $1/(2 \times T_{FA} \times W)$, the numbers of transistors required for the three architectures are presented in Figure 9, where the word length of the input data is equal to that of the coefficients. In the proposed architecture, the folding numbers for input-data and tap folding are $W/2$ and 2, respectively; hence the folding number of the proposed architecture is $W$. According to (9), the proposed architecture has a critical delay of $2T_{FA}$, which means that the delay of each pipelined stage should be less than or equal to $2T_{FA}$. For the architecture in Figure 2(a), the folding number is $W/2$, which gives a critical delay of $4T_{FA}$ according to (9). Compared with the proposed architecture, the pipelined stages of the architecture in Figure 2(a) can therefore have a much longer delay. As for the architecture in Figure 2(e), since both its folding number and its critical delay can be changed, two modes of Figure 2(e), mode-I and mode-II, are considered. As illustrated in Figure 9, the architecture in mode-I of Figure 2(e) requires the largest number of transistors because it has an accumulation path with a high hardware complexity and a low folding number. In contrast, the architecture in mode-II of Figure 2(e), also using only tap folding, has a high folding number and a low critical delay, and a lower hardware complexity than those of the architectures in Figure 2(a) and mode-I of Figure 2(e). This shows that, under the same throughput rate, increasing the folding number cuts down more hardware complexity than reducing pipelined latches does. On the other hand, the integrated input-data and tap folding in the proposed architecture gives the adders of the accumulation path small word lengths and gives the FIR architecture a high folding number, both of which reduce the hardware complexity. 
Additionally, the proposed 2-bit input subdata approach and tree accumulation approach further lower the hardware complexity. As shown in Figure 9, the comparison results reveal that the proposed architecture requires the fewest transistors among the compared architectures in realizing an FIR filter. 
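The equal-throughput condition used in this comparison is simple arithmetic and can be checked with a few lines of Python. The sketch assumes the relation stated in the text, namely one output every (folding number) clock cycles with the clock period equal to the critical delay; the function name and the unit value chosen for $T_{FA}$ are illustrative only.

```python
import math


def throughput(folding_number, t_clk):
    """Outputs per unit time of a folded architecture: one output is
    produced every `folding_number` clock cycles of period `t_clk`."""
    return 1.0 / (folding_number * t_clk)


W = 10        # input-data word length in bits
T_FA = 1.0    # full-adder delay in arbitrary time units (assumed)

proposed = throughput(W, 2 * T_FA)        # folding number W, critical delay 2*T_FA
fig_2a = throughput(W // 2, 4 * T_FA)     # folding number W/2, critical delay 4*T_FA
print(math.isclose(proposed, fig_2a))     # both equal 1 / (2 * T_FA * W)

# The 200 MHz clock and folding number 10 reported for the processor:
rate = throughput(10, 1 / 200e6)
print(rate)   # approximately 2e7 samples per second
```

Doubling the folding number therefore halves the allowed critical delay at a fixed throughput, which is why the proposed architecture needs pipelined stages of at most $2T_{FA}$.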
PROPOSED 128-TAP FIR PROCESSOR
Based on the proposed architecture, the TSMC 0.18 µm single-poly-six-metal CMOS standard cells are employed to realize a 128-tap programmable FIR processor [15]. The Cadence tool is used to generate the layout of the proposed FIR processor and then extract the netlist. With this netlist, the Nanosim tool is employed to verify the functionality and power consumption using a uniformly distributed input sequence. The processor's specifications are detailed in Table 2, where the input-data and coefficient word lengths are both 10 bits. The folding numbers for input-data and tap folding are 5 (10/2) and 2, respectively, so the folding number of the proposed processor is 10 (5 × 2). With a clock frequency of 200 MHz, the throughput rate is 20 M samples per second (200 M/10), the core area is 0.45 mm², and the layout of the proposed processor is displayed in Figure 10. Table 3 compares the proposed processor with other programmable FIR processors that use conventional folding schemes. From Table 3, the throughput rate of the proposed processor is larger than those of the conventional processors, indicating that the proposed processor meets the computational performance demands of the conventional processors. Because of differences in fabrication technologies and specifications, the following normalization must be completed before the areas are compared [16]:

[...] tion outputs at full word length, its hardware complexity remains higher than that of the proposed processor. As for Edwards et al.'s processor, input-data folding is adopted to lower hardware complexity. Yet input-data folding inevitably limits the folding number of this architecture to the input-data word length, so it cannot be increased to lower hardware complexity. Lastly, Pao et al. propose a processor using the half bit-sequential multiplier structure, so that the folding number is correlated with the input-data and coefficient word lengths. 
Though this processor has a very high folding number, a full word-length multiplication output is still generated in each tap. The multiplication results from the taps are then summed together. Consequently, the addition in Pao et al.'s processor is performed on product results with a large word length, which incurs a high hardware cost for its adders. With the hardware-complexity reduction from the integrated input-data and tap folding, the 2-bit input subdata latches, and the tree accumulation with simplified carry-in bit processing, the proposed FIR processor is demonstrated to have the smallest hardware area per tap among the processors compared.
To fairly compare power consumption of the proposed and conventional FIR processors, the following normalization equation is applied [16, 17] : 
According to Table 3, the proposed FIR processor has the lowest power consumption among the compared processors owing to its low-complexity hardware design. When the product of hardware area and power consumption is considered, the proposed processor still yields the best performance. 
CONCLUSION
Following advances in fabrication technology, circuits can now operate at high frequencies, while the FIR filter performance needs only to meet the real-time demand. Increasing the architecture's folding number can effectively reduce the hardware complexity without violating the conditions demanded by the applications. Hence, a hardware-efficient FIR architecture with a high folding number is developed by integrating input-data folding and tap folding. Additionally, the 2-bit input subdata approach and the tree accumulation approach with simplified carry-in bit processing are proposed to reduce the hardware complexities of the input subdata latches and the accumulation path, respectively. Based on the proposed architecture, the TSMC 0.18 µm CMOS technology is applied to realize a 128-tap programmable FIR processor with 10-bit input data and coefficients. Operating at 200 MHz, the processor has a core area of 0.45 mm² and yields a throughput rate of 20 M samples per second. In comparison to conventional FIR processors, the proposed processor achieves hardware efficiency owing to its low-complexity architecture design.
