Abstract - This paper describes an efficient architecture for FIR filters. By exploiting the reduced complexity made possible by the use of sparse powers-of-two coefficients, an FIR filter tap can be implemented with only 2B full adders and 2B (or 4B) latches, where B is the intermediate wordlength. Word and bit level parallelism allows high sampling rates, limited only by the full adder delay. This novel architecture allows the implementation of high sampling rate filters of significant length on a single field-programmable gate array (FPGA), as well as implementation using more conventional VLSI techniques.
I. INTRODUCTION
In recent years, considerable attention has been paid to the implementation of signal processing algorithms in VLSI, ranging from full custom VLSI to general purpose digital signal processors. A variety of approaches to high speed implementation of FIR filters have been pursued [1, 2, 3, 4, 6, 8, 9, 10, 11, 14, 15, 17, 19]. In order to attain high performance, parallel implementation strategies such as the systolic and wavefront methods have been applied. Word-parallel, bit-parallel processing techniques appear to scale well with improvements in implementation technology and increasing demands for higher performance. This paper presents a new parallel FIR filter building block suited for implementing filters where each of the coefficient values is a sum or difference of several power-of-two terms. It is particularly useful for the case where coefficients are a sum or difference of only two power-of-two terms. This architecture allows high sampling rate FIR filters of substantial length to be implemented on the current generation of field-programmable gate arrays (FPGAs), as well as in more traditional CMOS custom and semi-custom circuitry. The high sampling rates obtained through this architecture are due to extensive pipelining; the implementation efficiency is a result of the use of the highly constrained coefficient values.
II. BACKGROUND
In binary arithmetic, multiplication by a power-of-two is simply a shift operation. Implementation of systems with multiplications may be simplified by using only a limited number of power-of-two terms, so that only a small number of shift and add operations are required.
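The shift-and-add idea can be illustrated with a minimal Python sketch. The function name `shift_add_multiply` and the `(sign, shift)` term encoding are illustrative assumptions, not notation from this paper; the sketch handles integer coefficients with non-negative shifts only.

```python
def shift_add_multiply(x, terms):
    """Multiply integer x by a coefficient given as signed power-of-two terms.

    `terms` is a list of (sign, shift) pairs; the coefficient is
    sum(sign * 2**shift). Only shifts, adds and subtracts are used.
    (Hypothetical helper for illustration.)
    """
    acc = 0
    for sign, shift in terms:
        partial = x << shift          # multiplication by 2**shift
        if sign > 0:
            acc += partial
        else:
            acc -= partial
    return acc


# Coefficient 7 written with two terms: 2**3 - 2**0
product = shift_add_multiply(13, [(+1, 3), (-1, 0)])
```

With two terms per coefficient, each multiplication costs at most two shifted additions, which is the case the proposed architecture targets.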
The improvements in speed and saving in integrated circuit area are, however, achieved at the expense of a deterioration in the frequency response characteristics. The extent of the frequency response deterioration depends on the number of power-of-two terms used in approximating each coefficient value, the architecture of the filter, and the optimization technique used to derive the discrete space coefficient values. It was demonstrated in [13] that an FIR filter with -60 dB of frequency response ripple magnitude can be realized using two power-of-two terms for each coefficient value, given that the filter is in cascade form and the coefficient values are derived using mixed integer linear programming.
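To make the notion of a two-term approximation concrete, the sketch below rounds a single coefficient to the nearest sum or difference of two powers of two by exhaustive search. This is only a per-coefficient rounding, not the mixed integer linear programming approach of [13], which optimizes all coefficients jointly against the frequency response. The function name and the exponent range are assumptions chosen for illustration.

```python
from itertools import product


def quantize_two_terms(c, min_exp=-8, max_exp=0):
    """Return the pair of signed power-of-two terms whose sum is closest to c.

    Each term is 0 or +/- 2**e with min_exp <= e <= max_exp.
    (Hypothetical helper; the exponent range is an assumption.)
    """
    terms = [0.0] + [sign * 2.0 ** e
                     for sign in (1.0, -1.0)
                     for e in range(min_exp, max_exp + 1)]
    return min(product(terms, repeat=2),
               key=lambda pair: abs(c - (pair[0] + pair[1])))


# 0.375 is exactly representable with two terms (e.g. 0.25 + 0.125)
first, second = quantize_two_terms(0.375)
```

Rounding each coefficient independently in this way generally gives a worse frequency response than a joint optimization, which is why the choice of optimization technique matters.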
The basic structure of an FIR filter is illustrated in Figure 1(a).
Using cut-set retiming, the pipelined version shown in Figure 1(b) can be obtained. An inverted form FIR filter, which will be used in FPGA implementations, is depicted in Figure 1(c).
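The equivalence between the basic structure and the inverted form can be checked with a short behavioural model. The sketch below assumes that the inverted form of Figure 1(c) corresponds to what is commonly called the transposed form, in which the input is broadcast to all taps and the partial sums are delayed; the function names are illustrative.

```python
def fir_direct(x, h):
    """Basic form: delay line on the input, then a sum of products."""
    delay = [0] * len(h)
    out = []
    for sample in x:
        delay = [sample] + delay[:-1]
        out.append(sum(c * d for c, d in zip(h, delay)))
    return out


def fir_transposed(x, h):
    """Transposed form: input broadcast to all taps, partial sums delayed."""
    n = len(h)
    state = [0] * (n - 1)             # registers between the tap adders
    out = []
    for sample in x:
        products = [c * sample for c in h]
        out.append(products[0] + (state[0] if state else 0))
        # Each register takes its tap product plus the upstream register.
        state = [products[k + 1] + (state[k + 1] if k + 1 < n - 1 else 0)
                 for k in range(n - 1)]
    return out


x = [1, 2, 3, 4, 5]
h = [1, -2, 4]
y_direct = fir_direct(x, h)
y_transposed = fir_transposed(x, h)
```

In the transposed form every adder is separated from the next by a register, so the critical path is a single tap, which is what makes it attractive for a pipelined array implementation.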
If the coefficient value is an integer power-of-two, or a sum of two powers-of-two, the multipliers can be replaced by shifters, as depicted in Figure 2. Since the coefficients will be fixed for this class of filter, the coefficient values can be realized by appropriately routing the inputs to the full adders in the filter structure. That is.
III. ARCHITECTURE
In order to attain high sampling rates using conventional FPGAs or low cost CMOS processes, bit-level parallelism is exploited. The overall filter architecture is shown in Figure 3 , where the filter taps and final adder stage are shown. The adder is required to resolve the carries that are generated and propagated through the pipeline.
The structure of a filter tap is shown in Figure 4, where the internal pipelining is depicted. The two adders, which are necessary for coefficients that are a sum of two signed powers-of-two, are implemented as two rows of full adders, whose inputs are configured with the appropriate shift for the given coefficients. The sign of the coefficients is controlled by inverters. The sum and carry signals from the full adders are pipelined using carry-save addition (CSA) techniques in order to increase the sampling rate and alleviate potential routing delays in the target implementation technologies. The input data bus passes through the bit-slice array to provide short interconnection distances to the first row of full adders. This bus may be optionally pipelined depending on the particular implementation technology, FPGA or full custom, among other factors. The architecture proposed here is well suited to FPGA implementation; only minor modifications need to be considered to map it to the array. In particular, each of the taps of the FIR filter shown in Figure 3 can be implemented in two array columns of Xilinx XC3100-series FPGAs. Because of the high degree of spatial and temporal locality, signal routing delays are not a major concern, as they are with typical high performance FPGA designs.
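The arithmetic performed by the two rows of full adders and the final adder can be modelled at the word level as follows. This is a sketch under stated assumptions: it models only the carry-save arithmetic, not the pipeline timing; negative terms are formed by inverting the operand and injecting the two's complement "+1" into the unused least significant bit of the carry word, which is one plausible way to complete the negation and is not described in the text; all function names are illustrative.

```python
def csa(a, b, c, width):
    """One row of full adders: compress three words into (sum, carry)."""
    mask = (1 << width) - 1
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s & mask, carry & mask


def tap(sum_in, carry_in, x, terms, width):
    """Add x * (sum of signed power-of-two terms) to a carry-save pair.

    Assumes bit 0 of carry_in is zero, which holds for any carry word
    produced by csa() and for an initial carry of 0.
    """
    mask = (1 << width) - 1
    s, c = sum_in, carry_in
    for sign, shift in terms:
        operand = (x << shift) & mask       # shift realized by wiring
        if sign < 0:
            operand = ~operand & mask       # inverters on the operand
            c |= 1                          # assumed "+1" injection point
        s, c = csa(s, c, operand, width)
    return s, c


def resolve(s, c, width):
    """Final adder: resolve the carry-save pair to a signed integer."""
    v = (s + c) & ((1 << width) - 1)
    return v - (1 << width) if v & (1 << (width - 1)) else v


def dot_product_csa(samples, coeff_terms, width=16):
    """Accumulate sample * coefficient over all taps in carry-save form."""
    s, c = 0, 0
    for x, terms in zip(samples, coeff_terms):
        s, c = tap(s, c, x, terms, width)
    return resolve(s, c, width)


# Coefficients 5 = 4 + 1, -6 = -8 + 2, 1 = 2 - 1
coeffs = [[(+1, 2), (+1, 0)], [(-1, 3), (+1, 1)], [(+1, 1), (-1, 0)]]
total = dot_product_csa([3, -5, 7], coeffs)
```

Because no carry propagates along a row inside a tap, the delay of each pipeline stage is that of one full adder regardless of the word width; carry propagation is deferred to the single final adder.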
The architecture of an FPGA-based FIR filter tap with two powers-of-two coefficients is identical to that shown in Figure 4, based on the inverted form FIR structure. Each of the bit slices for the tap requires two configurable logic blocks (CLBs) in the array for implementation. The extensive local routing capability of typical FPGAs can be used for the majority of signals within and between taps. Figure 5 illustrates the local routing required between CLBs, where column "1" maps to the first set of full adders for a given tap, and column "2" maps to the second set. The globally routed input data signals are distributed using the horizontal and vertical nets running the length and width of the chip between the rows and columns of CLBs.
The primary concern is with routing of the shift lines. In most realizations, the accumulation path will have a wider word width than the input data from the shifter, in order to account for the overflow and round-off problems that are inherent in a design of this type.
For example, if the input data is B_d bits wide, the accumulation path will most likely be B_i ≥ 2B_d bits wide. This implies that the input datapath will be using fewer routing lines in each FPGA column than will the accumulation path. By exploiting these extra, unallocated resources, the low delay vertical routing lines of the FPGA can be used more effectively. The extra resources allow the number of vertical routing lines to be minimized, as illustrated in Figure 6, where the additional datapath leads to lower congestion in the routing channel between the columns. A tap with B_d input datapath bits and B_i accumulation path bits can thus be implemented using 2B_i logic blocks. The final adder required by the filter can be implemented on an FPGA or using an additional chip.
Typical filter characteristics have been implemented on a Xilinx XC3195 FPGA using this architecture. The XC3195 has an array of 22 by 22 (484) CLBs. For example, an eleven tap lowpass FIR filter with the passband cut-off at 0.1 f_s, the stopband beginning at 0.15 f_s, and -18 dB stopband rejection has the discrete-space impulse response shown in Table 1. An input data word size of 10 bits was used; the 22 rows provide sufficient intermediate word width protection against overflow. All of the columns of the array were required for the eleven taps. The final accumulation stage was not performed on the array. The maximum sampling rate for this particular design was 30 MHz. The delay is highly dependent on the input data routing, and so higher sampling rates may be attainable for other filter responses (with careful routing).
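The resource figures quoted above follow from simple counting, sketched here in Python. The sketch assumes one accumulation-path bit slice per array row and two columns per tap, as the text describes; the function names are illustrative and the model ignores the final adder, which was not placed on the array.

```python
def clbs_per_tap(acc_bits):
    """Two CLBs (one per full-adder row) for each accumulation-path bit."""
    return 2 * acc_bits


def max_taps(rows, cols, acc_bits):
    """Taps that fit when each bit slice occupies one row and a tap two columns."""
    if acc_bits > rows:
        return 0                      # the accumulation word does not fit
    return cols // 2


# 22 x 22 array with a 22-bit accumulation path
taps = max_taps(rows=22, cols=22, acc_bits=22)
used_clbs = taps * clbs_per_tap(22)
```

For the 22 by 22 array this gives eleven taps occupying all 484 CLBs, consistent with the eleven tap design reported above.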
V. CUSTOM IMPLEMENTATION
A prototype CMOS implementation has been developed to evaluate this architecture. This implementation was designed in a 2.0 micron, double level metal, single level polysilicon CMOS process, using the MSU/ITD standard cells [7]. The architecture of a tap in the full custom chip is shown in Figure 7. The primary difference between this structure and that used for the FPGA implementation is the application of input data pipelining to reduce global signal distribution delays.
Some details of the implementation are given in Table 2. The custom implementation is substantially smaller (by a factor of approximately two and a half) than the equivalent filter based on full two's complement multipliers and adders. The area/speed performance of this design compares favorably with those in [5, 12, 16, 18, 19] when normalized for the older technology used in the present case. In particular, the architecture in [19], based on the FIRGEN compiler [8], also uses canonic signed-digit (CSD) coefficients. In that architecture, however, four terms are used, as opposed to the two used in this work. The design described in [12] also uses four term CSD coefficients. The present custom design makes more extensive use of pipelining than either of these designs, in particular for the input datapath.
Figure 7. Filter Tap Architecture, Custom Implementation
The area and speed results for the architecture in this 2.0 µm process suggest that 60-70 tap chips with sampling rates exceeding 100 MHz should be feasible in more modern processes, even using the standard cell paradigm. Full custom implementation can most likely yield an additional factor of two in density, due to the relatively poor utilization of area in the standard cell method.
