Abstract-This letter describes a CAD system for automatic implementation of FIR filters on Xilinx field programmable gate arrays (FPGA). Given the frequency specifications, this software package designs an FIR filter, optimizes the filter coefficients in the power of two coefficient space, and implements it on FPGA chips. The FPGA specific mapping techniques used to increase speed are discussed. The performance of the typical filters that were implemented is presented.
I. INTRODUCTION
FINITE impulse response (FIR) filters without full multipliers and their potential high-speed VLSI implementations have received considerable attention over the past decade [1]-[3]. It was demonstrated in [3] that an FIR filter with -60 dB of frequency response ripple magnitude can be realized using two power-of-two terms for each coefficient value. An efficient FIR filter architecture suitable for field programmable gate arrays (FPGA) was discussed in [1]. In this letter, we present an improved filter tap structure and several mapping techniques that have been used to increase the sampling rate. This letter also describes a CAD system that can be used for design of FIR filters, optimization of filter coefficients in the discrete coefficient space, and subsequent implementation on Xilinx XC3100-series FPGA's.
II. ARCHITECTURE
An inverted form FIR filter is used in our implementations. The structure of a portion of a filter tap is shown in Fig. 1, where the internal pipeline is depicted. The two shifted versions of the data corresponding to the two power-of-two components of each coefficient are shown as dotted lines. The sign of the coefficients is controlled by inverters. The sum and carry signals from the full adders are pipelined using a carry-save addition (CSA) technique in order to increase the sampling rate and alleviate potential routing delays.
A graphical user interface (GUI) was designed using the Motif tool kit. As with other Motif GUI's, the interface has the basic menus, namely design (file), edit, and help menus. The design menu has two options for the two main stages in the design process: frequency specification and Xilinx implementation. The output of one stage is used as the input to the subsequent stage. The user can start at either stage, depending on the specification at hand. Thus, a filter can be implemented from the frequency specifications or from coefficients with power-of-two terms.
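As an illustration of the inverted form structure with two power-of-two terms per coefficient, the following Python sketch models the arithmetic only. It assumes that "inverted form" corresponds to what is commonly called the transposed direct form, and the `(sign, shift)` coefficient encoding, the function names, and the `frac_bits` scaling are hypothetical choices made for this example, not details taken from Fig. 1.

```python
from typing import List, Tuple

# Hypothetical encoding: a term (sign, shift) stands for sign * 2**(-shift),
# with sign in {-1, 0, +1}. Each coefficient is a pair of such terms.
Term = Tuple[int, int]
Coeff = Tuple[Term, Term]


def tap_product(x: int, coeff: Coeff, frac_bits: int) -> int:
    """Multiply sample x by a two-term coefficient using shifts and adds only.

    The result is scaled by 2**frac_bits so that no bits are lost;
    every shift must satisfy shift <= frac_bits.
    """
    total = 0
    for sign, shift in coeff:
        if sign != 0:
            total += sign * (x << (frac_bits - shift))
    return total


def fir_transposed(samples: List[int], coeffs: List[Coeff],
                   frac_bits: int = 8) -> List[int]:
    """Transposed (inverted) form FIR: each tap adds its product to the
    partial sum arriving from the next tap, then registers the result."""
    n = len(coeffs)
    regs = [0] * n  # regs[k] feeds tap k; regs[n-1] is always 0
    out = []
    for x in samples:
        prods = [tap_product(x, c, frac_bits) for c in coeffs]
        out.append(prods[0] + regs[0])
        # Ascending k reads the old value of regs[k + 1] before it changes.
        for k in range(n - 1):
            regs[k] = prods[k + 1] + regs[k + 1]
    return out


# Coefficients 0.625, 0.1875, -0.125 expressed as power-of-two terms.
coeffs = [((1, 1), (1, 3)), ((1, 2), (-1, 4)), ((-1, 3), (0, 0))]
print(fir_transposed([1, 0, 0, 0], coeffs))  # [160, 48, -32, 0]
```

The impulse response returned is the coefficient set scaled by 2**8, which is a quick check that the shift-and-add taps reproduce the intended coefficient values.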
A. Filter Design and Optimization
To begin a design, the frequency specification option in the design menu is selected. The specifications of the filter are entered using dialog boxes and radio boxes. The user can select any one of the various optimization control options and certain parameters, such as the maximum coefficient value desired. MILP3, written by Lim [4], is used to obtain a continuous solution (which assumes infinite precision coefficient values) and then to optimize this solution in the discrete power-of-two coefficient space [2].
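To make the discrete power-of-two coefficient space concrete, the sketch below rounds a single continuous coefficient to the sum of two signed power-of-two terms. This is a simple greedy rounding, not the mixed integer optimization performed by MILP3, and it will generally give a worse frequency response than a joint optimization over all coefficients. The function names and the `max_shift` limit are assumptions made for this example.

```python
from typing import Tuple

Term = Tuple[int, int]  # (sign, shift) stands for sign * 2**(-shift)


def term_value(term: Term) -> float:
    """Numerical value of a signed power-of-two term."""
    sign, shift = term
    return sign * 2.0 ** -shift


def nearest_pow2_term(value: float, max_shift: int) -> Term:
    """Return the term sign * 2**(-shift), 0 <= shift <= max_shift,
    closest to value, or (0, 0) when zero is the closest choice."""
    if value == 0:
        return (0, 0)
    sign = 1 if value > 0 else -1
    mag = abs(value)
    best, best_err = (0, 0), mag  # start from the "zero term" option
    for shift in range(max_shift + 1):
        err = abs(mag - 2.0 ** -shift)
        if err < best_err:
            best, best_err = (sign, shift), err
    return best


def quantize_two_terms(h: float, max_shift: int = 8) -> Tuple[Term, Term]:
    """Greedy rounding: pick the nearest term, then the term nearest
    to the remaining residual."""
    t1 = nearest_pow2_term(h, max_shift)
    t2 = nearest_pow2_term(h - term_value(t1), max_shift)
    return t1, t2


t1, t2 = quantize_two_terms(0.625)
print(t1, t2)  # (1, 1) (1, 3), i.e. 0.5 + 0.125
```

A coefficient such as -0.3 has no exact two-term representation; the greedy choice here yields -0.25 - 0.0625 = -0.3125, and it is this kind of residual error that the discrete optimization stage is meant to distribute across the coefficient set.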
B. Xilinx Implementation
Once the optimization is done, the Xilinx implementation option of the design menu is selected. The input of this stage is either a set of user specified filter coefficients or the output of the optimization stage. The input is fed to code that maps the filter onto the FPGA. Due to the limited availability of global and local routing resources, placement of configurable logic blocks (CLB's) and routing of nets are very critical in any FPGA design. When APR was given full freedom of placement for the 22 x 22 array of CLB's of an 11-tap filter, it took 9 h 2 min on a Sun SPARCstation 2 for the completion of placement and routing. Placement is therefore more efficiently done based on the knowledge of the problem at hand.
Each full adder is implemented in a configurable logic block (CLB). The two rows of full adders map to alternate columns of the chip. To reduce congestion, the two shifted versions of the data are distributed among the two sets of full adders, whereas in the previous approach [1], they were routed to the first set of full adders. The present structure makes more efficient use of the local routing resources and is found to achieve an improvement of 5-15% in the sampling rate for several typical filters.
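The role of the two sets of full adders can be illustrated with a bit-level sketch of carry-save addition. Each set is modeled as a row of full adders acting as a 3:2 compressor on the incoming sum word, the incoming carry word, and one shifted version of the data. The function names are hypothetical, the 22-bit default follows the intermediate word length quoted for the XC3195, and the exact wiring and pipeline register placement of Fig. 1 are not reproduced.

```python
from typing import Tuple


def csa_row(s: int, c: int, x: int, width: int) -> Tuple[int, int]:
    """One row of full adders on width-bit words.

    Returns (sum, carry) with sum + carry == s + c + x modulo 2**width.
    No carry ripples along the row, so the delay is one full adder.
    """
    mask = (1 << width) - 1
    s, c, x = s & mask, c & mask, x & mask
    sum_bits = s ^ c ^ x
    carry_bits = ((s & c) | (s & x) | (c & x)) << 1
    return sum_bits & mask, carry_bits & mask


def tap_update(s_in: int, c_in: int, x1: int, x2: int,
               width: int = 22) -> Tuple[int, int]:
    """Add the two shifted data words of one tap, one per row of adders."""
    s, c = csa_row(s_in, c_in, x1, width)
    return csa_row(s, c, x2, width)


def resolve(s: int, c: int, width: int = 22) -> int:
    """Final carry-propagate addition, needed only once after the last tap."""
    return (s + c) & ((1 << width) - 1)


# Negative terms are passed in two's complement form (Python masks them).
s, c = tap_update(100, 6, 40, -5)
print(resolve(s, c))  # 141
```

Because the sum and carry words are kept separate from tap to tap, only the final accumulation requires a carry-propagating adder.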
The input data bus is distributed using horizontal long lines from one end of the chip to the other. By careful assignment of the horizontal long lines so as to reduce the maximum distance between any horizontal long line and a CLB, a 20-30% increase in the sampling rate was obtained. The delay that results from routing the data lines to the long lines was prevented by buffering the data. The assignment of long lines was based on the mean shift (in bits) of the data lines, which was found to give equal or better sampling rates compared to the assignment done by APR, in significantly less computation time.
C. Pin Constraints
As FPGA's are in-system reconfigurable, it is reasonable to impose pin constraints according to the existing PCB layout. With buffering of data lines near the long lines, pin to CLB and CLB to long line routings can be made without sacrificing speed for any pin constraint.
Using this CAD system, one can find the maximum sampling speed using the xdelay option and can also edit the layout using the xact [5] editor option.
IV. PERFORMANCE
With the Xilinx XC3195, which has an array of 22 x 22 CLB's, the maximum intermediate word length is 22 bits. If the intermediate word length Bi is required to exceed twice the input data word length Bd (Bi > 2Bd), then the maximum input data word length is 10 bits. As each tap requires two columns of CLB's, up to 11 taps can be realized per chip.
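The resource figures above follow from simple arithmetic, sketched here in Python. The function names are hypothetical, and the sketch assumes one bit of the intermediate word per CLB row, which is consistent with the 22-bit figure quoted for the 22 x 22 array.

```python
def taps_per_chip(array_cols: int, cols_per_tap: int = 2) -> int:
    """Number of taps that fit when each tap occupies cols_per_tap columns."""
    return array_cols // cols_per_tap


def max_data_bits(intermediate_bits: int) -> int:
    """Largest data word length Bd satisfying Bi > 2 * Bd."""
    return (intermediate_bits - 1) // 2


print(taps_per_chip(22))  # 11
print(max_data_bits(22))  # 10
```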
Typical filter characteristics have been implemented on a Xilinx XC3195 FPGA using this system. Two 11-tap lowpass FIR filters (filter #0 and filter #1) with passband cutoff at 0.1fs and 0.2fs, stopband beginning at 0.15fs and 0.3fs, and -18 dB and -27 dB stopband rejection, respectively, and an 11-tap highpass filter (filter #2) with the cutoff frequency at 0.1fs, the passband beginning at 0.15fs, and -18 dB stopband rejection were implemented. The sampling speeds of these filters attained using various mapping techniques on the present and the previous structures are listed in Table I. If placement is done by APR alone, with no prior placement, then the sampling speed attained for filter #0 is only 25.0 MHz.
The layout of filter #0 on an XC3195 is shown in Fig. 2.
A. Multiple Chips
The CAD system is capable of mapping higher order filters onto multiple chips. Two 21-tap filters, with the frequency specifications of filter #0 and filter #1 but more stringent rejection specifications, were implemented with sampling speeds of 42.0 and 41.2 MHz, respectively. Due to buffering of data lines, the number of taps realized by the first chip is one more than the subsequent chips in a multichip implementation. Thus, to realize a 21-tap filter, two XC3195 chips were used.
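The chip count for a multichip implementation can be estimated as in the sketch below. The default capacities (11 taps on the first chip, 10 on each subsequent chip) are an assumption inferred from the 11 taps per XC3195 and the statement that the first chip realizes one more tap than the others; the function name is hypothetical.

```python
import math


def chips_needed(n_taps: int, first_chip_taps: int = 11,
                 later_chip_taps: int = 10) -> int:
    """Estimate the number of chips for an n_taps filter when the first
    chip holds first_chip_taps and every later chip holds later_chip_taps."""
    if n_taps <= 0:
        raise ValueError("n_taps must be positive")
    if n_taps <= first_chip_taps:
        return 1
    return 1 + math.ceil((n_taps - first_chip_taps) / later_chip_taps)


print(chips_needed(21))  # 2
```

Under these assumed capacities, a 21-tap filter fits exactly on two chips, in agreement with the implementation reported here.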
The final accumulation stage might be implemented on the chip, if sufficient resources are available, or in a dedicated parallel adder. However, it is possible to efficiently implement the accumulation stage in the XC4000 series of FPGA's by virtue of the fast carry logic supported by these devices.
V. CONCLUSION
A CAD system for design and efficient implementation of FIR filters on Xilinx field programmable gate arrays was presented. Several generalized techniques that were used to reduce delay have been described, and their effects on perfor-
