 Abstract -This work presents an FPGA implementation of FIR filter based on 4:2 compressor and CSA multiplier unit. The hardware realizations presented in this paper are based on the technology-dependent optimization of these individual units. The aim is to achieve an efficient mapping of these isolated units on Xilinx FPGAs. Conventional filter implementations consider only technology-independent optimizations and rely on Xilinx CAD tools to map the logic onto FPGA fabric. Very often this results in inefficient mapping. In this paper, we consider the traditional CSA-4:2 compressor based FIR filters and restructure these units to achieve improved integration levels. The technology optimized Boolean networks are then coded using instantiation based coding strategies. The Xilinx tool then uses its own optimization strategies to further optimize the networks and generate circuits with high logic densities and reduced depths. Experimental results indicate a significant improvement in performance over traditional realizations. An important property of technologydependent optimizations is the simultaneous improvement in all the performance parameters. This is in contrast to the technology-independent optimizations where there is always an application driven trade-off between different performance parameters.
Roohie Naaz is with the Faculty of Computer Science and Engineering, National Institute of Technology Srinagar, Jammu and Kashmir, India, 190006 (e-mail: naaz310@nitsri.net).
terminating nature of DSP algorithms can be exploited to design efficient systems by exploiting the concurrencies both within iteration and among multiple iterations [4] . This has provided designers with sufficient impetus to look beyond the traditional software oriented solutions and consider some hardware platforms where the underlying resources can be utilized to develop a complete System on Chip (SoC) solution that best matches the algorithmic complexity by developing the right type of architecture.
Application Specific Integrated Circuits (ASIC) have long been used to develop custom architectures for realizing FIR filters. However, with ASICs the design flow is complicated and time consuming resulting in huge non-recurring engineering (NRE) costs [5] . This has typically reserved ASICs for high volume markets and for some specialized domains. FPGAs provide for reduced design time due to a simplified design flow and pre-fabricated nature which eliminates the requirement for verification of deep sub-micron effects. Some other advantages include reconfigurable design approach [6] , [7] , large scale integration [6] , [8] , availability of several intellectual property (IP) cores [9] , reduced NRE costs [6] , [7] etc. The design cycle in FPGAs has a strong computer aided design (CAD) support. The software handles the time consuming mapping, routing, placement and floor planning phases. The effectiveness of technology-independent optimizations that are generally well suited for ASICs [10] is thus limited in FPGAs. Thus, in order to get maximum performance from the target FPGA device, optimizations that are specific to the underlying technology have to be considered. This requires a complete knowledge about the target device. Also, the choice of the target device will have a prominent effect on the end performance of the system [11] . In this work, we carry out the hardware realization of 4:2 compressor and carry save adder (CSA) based multiplier units that are optimized for FPGAs with 6-input look-up tables (LUT). Since all modern FPGAs from Xilinx support 6-input LUTs [12] , [13] , [14] , the filter realizations based on these individually optimized units should provide an improved performance.
FIR filters in general tend to have high arithmetic complexity due to the required number of multipliers, adders and delay elements. However, the main computational bottleneck is the multiply operation that requires a large [15] . A variety of approaches have been used to speed-up the multiplier operation in a filter structure. These approaches either completely eliminate the existing multiplier unit or reduce its architectural complexity. A widely used multiplier-less approach is the one where the multiplier unit is replaced by some sort of memory. Two frequently used memory-based techniques are the direct ROM based implementation and distributed arithmetic (DA) based implementation. ROM based implementations replace the multipliers with LUTs resulting in faster output compared to multiply and accumulate design [16] . DA based techniques have high throughput processing capabilities and regularity resulting in cost-optimal structures [17] , [18] , [19] . In DA based multipliers the pre-computed partial products are stored in the memory elements and are later read out and accumulated to obtain the desired results. To minimize the logic requirements the authors in [20] present a new kind of FPGA implementation algorithm which is based on the Remainder theorem. Another approach uses the divided LUT method to decrease the computational complexity [21] . The problem with memory-based techniques is the increased onchip memory requirements as the operand word-length increases. Low capacity FPGAs often switch to bit-serial arithmetic for realizing multiplier circuits [22] . Special purpose bit-serial implementations include power-of-two sum or difference approaches. This allows multiplication to be replaced with faster shift and addition operations [23] , [24] , [25] . Linear systolic structures have also been used as bitserial architectures [4] . In these approaches the conventional 2-D bit-parallel architectures are transformed into linear 1-D bit-serial systolic structures. The drawback with bit-serial structure is their reduced speed. Apart from the multiplier-less approaches several techniques have been used to reduce the complexity of the multiplication operation. In [26] the authors present an improved multiplier design based on Canonic Signed Digit (CSD) representation and Horner's scheme. Two multiplier structures have been proposed: a cascaded adder structure and an accumulator structure. Both take advantage of the fixed nature of filter coefficients and reduce the number of partial products resulting in an area efficient realization. Similarly constant coefficient multipliers have been reported in [27] , [28] , [29] . Residue Number System (RNS) is yet another approach that has been used for designing efficient high-speed multipliers [30] . The limited inter-moduli carry propagation and parallel computations make RNS desirable for add/multiply intensive applications. Similarly, the authors in [31] propose a novel design scheme based on the combination of sum of power of two (SPT) coefficients and carry-save addition for implementing fast multiplier blocks.
At system level, one of the frequently used modifications is to develop systolic architectures for the filter structures. Systolic designs posses a high potential to yield a high throughput rate with features like simplicity, regularity and modularity of structure [32] . The authors in [33] and [34] take a systolic approach for filter design by using multipliers based on direct ROM and DA approaches. Similarly, the work in [35] uses the systolic structures but focuses on replacing the original adder unit by using a parallel prefix adder (PPA) with minimal depth algorithm. Poly-phase decomposition has been used to design high-speed and low-power parallel filters [36] , [7] , [37] , [38] , [39] , [40] . A modification of poly-phase decomposition is the Fast FIR algorithm (FFA). Filters based on FFA are area efficient utilizing fewer multiplier units [41] , [42] .
All the above mentioned approaches use technologyindependent optimizations to enhance the performance of the filtering structures. In this paper, we take an alternate lowlevel design approach and propose realizations that are based on technology-dependent optimizations.
The rest of the paper is organized as follows. Section II discusses the basic FIR structure. Section III discusses some of the preliminary terminologies used in this paper. Section IV discusses the technology-dependent optimization of the 4:2 compressor and CSA multiplier unit. Synthesis and implementation is carried out in section V. Conclusions are drawn in section VI and references are listed at the end.
II. BASIC FIR FILTER
An N-tap FIR filter is defined as: (1) Where, h k is the filter coefficient that determines the frequency behavior of the filter and x [n-k] 
III. PRELIMINARY TERMINOLOGIES
Logic synthesis is concerned with realizing a desired functionality with minimum possible cost. In the context of digital design the cost of a circuit is a measure of its speed, area, power or any combination of these. For graphical representations a combinational function may be represented as a directed acyclic graph (DAG) called the Boolean network. Nodes within this network represent logic gates, primary inputs (PI) and primary outputs (PO). Each node implements a local function and together with its predecessor nodes implements a global function. A cone of a node v, C v , is a subgraph that includes the node v and some of its non-PI predecessor nodes. Any node, u within this cone has a path to the root node v, which lies entirely in C v . The level of the node v is the length of the longest path from any PI node to v. If node v is a PO node then the level will give the depth of the network. Thus network depth is the largest level of a node in the network. The critical path and area of a mapped Boolean network is measured by the depth and number of LUTs utilized by the network. A network is said to be k-bounded if the fan-in of every node does not exceed k.
IV. TECHNOLOGY-DEPENDENT OPTIMIZATION
Technology-dependent optimization transforms the initial Boolean network into a circuit netlist that utilizes the target logic elements efficiently. The aim is to distribute the logic among the targeted elements with minimum possible depth and minimum resource utilization. The target element in majority of FPGAs is a k-LUT [43] , [44] . An efficient utilization of this circuit element could lead to increased logic densities and reduced circuit depths.
Technology-dependent optimization using LUTs is a two step process. In the first step, the entire network is partitioned into suitable sub-networks. The individual nodes within each sub-network are then covered with suitable cones. The redundancies within each sub-network are exploited during the covering phase. The logic implemented by each cone is then mapped onto a separate LUT. In the second step, the netlist for the entire network is constructed by assembling the individually optimized sub-networks. The overall aim is to have a circuit implementation that uses minimum possible LUTs and has minimum possible depth.
The FIR structure considered in this paper is based on a multiplier unit and a 4:2 compressor unit. The multiplier is based on the CSA logic and generates two partial vectors. The 4:2 compressor then combines these partial vectors with those generated from the previous stage and generates two new partial vectors. A final adder stage at the output then combines these partial vectors and generates the final result. The CSA multiplier and 4:2 compressor ensure that there is no rippling of carry and the critical path is kept to a minimal. Direct and Transposed FIR structures based on these units are shown in figure 3 and figure 4 respectively. In each case, the input x[n] and the filter coefficients h k are assumed to be in fixed-point 2's complement representation.
A. Technology-Dependent Optimization of the Multiplier unit
The multiplier unit is an array multiplier based on the carrysave logic. In carry-save logic the carry outputs are saved and used in the adder in the next row. This ensures that the additions at different bit-positions within a row are independent of each other. The details of a CSA multiplier are given in [4] . The schematic for 4-bit operands is shown in figure 5 . The vector merging adder which is the main computational bottleneck in a CSA multiplier need not to be included as the partial sum and carry outputs are directly fed to the 4:2 compressor unit. Figure 6 shows the Boolean network for the basic operating cell used in the CSA multiplier. The network is partitioned into two sub-networks corresponding to the sum (S) and carry (C) outputs. Each sub-network is separately mapped onto a circuit of LUTs by covering the individual nodes with suitable cones. A straight forward approach would be to cover each node within a sub-network with a separate cone. The subnetwork is then traversed in a post-order depth-first fashion and the local function implemented by each cone is mapped onto a separate LUT. This is shown in figure 7 (a). The overall depth at PO nodes S and C is three and the LUT count is seven. The number of LUTs may be reduced by decomposing the 3-input OR gate in the carry sub-network and duplicating the AND gate in each of the sum and carry sub-networks. The decomposed node is included in two separate cones and the sub-network is again traversed in a post-order depth-first fashion resulting in the realization of figure 7(b). The shaded nodes represent the duplicated logic. The circuit depth is now two and the LUT count is three. However, an optimal implementation may be obtained by exploiting the reconvergent PI nodes in the carry sub-network. Reconvergent nodes share the same inputs and can be exploited to reduce the number of PIs to a sub-network by realizing such paths within the LUT. This is shown in figure 7 (c). The depth of the circuit is now reduced to one and the total LUT count is reduced to one as both sum and carry sub-networks are mapped onto a single 6-LUT with dual outputs. An n×n multiplier implemented using the realization of figure 7(c) will require n 2 LUTs and will have an overall depth of n.
It was mentioned in the introduction that the design cycle in FPGAs has a strong CAD support that handles the majority of the technology-dependent steps like mapping and placement and route (PAR). Technology-dependent optimizations mainly focus on improving the mapping of Boolean networks onto target LUTs. However, with modern CAD tools, both technology mapping and PAR are automated and the optimization process is not transparent to the user [14] . Thus any optimization done prior to the design entry may get overrid during the mapping and PAR phases. To counter this issue, we re-define the coding strategy at the design entry phase. Instead of writing conventional behavioral codes, which are inferential in nature, we adopt an instantiation based coding strategy, wherein a target element is directly called upon and the desired functionality is assigned to it. This ensures a controlled mapping. The following instantiations were used to map the circuit in figure 7.c. LUT_1: LUT6_2 generic map (INIT => X"96660000E8880000") port map (C, S, c, s, b, a, '1', '1');
B. Technology-Dependent Optimization of the 4:2
Compressor unit A 4:2 compressor unit takes four inputs and produces two output values, sum (S) and carry (C) For technology-dependent optimization, we first consider the covering process at a higher level. The adder units are covered in an inclined fashion as shown in figure 9 . Each covered portion implements a two-stage ripple carry chain, with the difference that the sum output is being rippled and the carry output is retained. The critical path of such a realization will consist of the delay associated with each covered portion. Pipelining the circuit would require placing registers at the output node of each shaded portion. This is shown as small dots in figure 9 . The entire structure would require approximately n flip-flops for pipelining.
Let us now consider the covering process at a lower level. Figure 10 (a) and 10(b) shows the block diagram and the corresponding parent network for a single shaded portion. For technology-dependent optimization we again directly restructure the parent network. This is done by duplicating the logic at node Z. Node duplication enables the parent network to be partitioned into three separate networks corresponding to the outputs Out 1 , Out 2 and Out 3 . Each network is covered with a suitable cone and the logic function implemented by each cone is mapped onto a separate LUT. The overall realization is shown in figure 11 . LUT 1 is used to realize a single threeinput function. LUT 2 is used in the dual mode to realize two five-input functions corresponding to the outputs Out 1 and Out 3 . The overall depth of the circuit is one. Since the critical path of the 4:2 compressor unit is limited by the depth of each shaded portion, the overall critical path is one. Also note that the LUT cost is two. The 4:2 compressor will thus have the same hardware utilization as a binary adder.
The following instantiations were used to map the circuit in figure 18 (c). 
V. SYNTHESIS, IMPLEMENTATION AND RESULTS
The implementation in this work targets 6-LUT FPGAs. In particular we have considered devices from Virtex-5 and Virtex-6 FPGA families from Xilinx. The implementation is carried for different filter orders and an operand word-length of 16 bits. The parameters considered are area, timing and power dissipation. Area is measured in terms of different FPGA resources utilized. Timing analysis may be static or dynamic. Static timing analysis gives information about the critical path delay and operating frequency of the design. Static timing analysis is done post synthesis and post PAR. However, the metrics obtained after synthesis are often not accurate enough due to the programmability of the FPGA which allows for interconnect delays to change significantly between iterations. Therefore, the metrics presented in this paper are post PAR and have been recorded for a high optimization level with area and speed as optimization goals. Dynamic timing analysis verifies the functionality of the design by applying test vectors and checking for correct output vectors. Dynamic timing analysis gives information about the switching activity of the design, which is captured in the value charge dump (VCD) file. Power dissipation is given by the sum of static power dissipation and dynamic power dissipation. Static power dissipation is device specific and is mainly determined by the specific FPGA family. Dynamic power dissipation is related to the charging and discharging of capacitances along different logic nodes and interconnects. Dynamic power dissipation mainly consists of the logic power, clock power and signal power [45] . Logic power depends on the amount of on-chip resources being utilized by the design. Clock power is proportional to the operating frequency. Signal power depends on the switching activity and the density of the interconnects. For simulation and metrics generation similar test benches have been used and are typically designed to represent the worst case scenario (in terms of switching activity) for data entering into the filter. Design entry is done using VHDL. As mentioned earlier instantiation based coding strategy is used. The constraints relating to synthesis and implementation are duly provided and a complete timing closure is ensured. Synthesis and implementation is carried out in Xilinx ISE 12.1 [46] . Power analysis is done using the Xpower analyzer tool. Table 1 provides a comparison of the different FPGA resources utilized by the technology optimized FIR filter against the traditional implementation and the one based on the Xilinx multiply-adder IP v 2.0. The operand length in each case is 16 bits and the filter order is 16. Target device is xc5vlx50-2ff324 from Virtex-5. Both technology optimized and traditional implementations rely on the optimization strategies of the Xilinx CAD tool, however, the initial restructuring ensures that the end performance is better in technology-dependent optimizations. Note that the multiplyadder IP v 2.0 is used with an optimum latency value (-1). Next we compare our implementation against the various realizations reported in [47] and [10] . In [47] the authors have considered the Direct, Transposed and Hybrid realizations of the FIR filter. Two sets of results have been reported; one for symmetric coefficients and the other for asymmetric coefficients. For symmetric coefficients Direct and Transposed forms have been considered for a filter order of 120. Each form is implemented using the architecture based on generic multiplier (GM) and shift-add (SA) operation. For asymmetric coefficients Direct, Transposed and three different Hybrid forms (Hybrid-I-3, Hybrid-II-15, and Hybrid-III-15) have been considered for a filter order of 75. Table 2 provides  the comparison for symmetric coefficients and table 3 provides the comparison for asymmetric coefficients. The devices considered are xc5vlx50-2ff324 and xc6vlx75t-2ff484 from Virtex-5 and Virtex-6 respectively. It is observed that in each case the structures based on the technology optimized multiplier and 4:2 compressor units use the underlying fabric efficiently. It should be noted that the authors in [47] have used Xilinx ISE 13.1 design suite as the synthesis tool while our implementation is carried out in Xilinx ISE 12.1 design suite. Using a higher version will only enhance the performance of the designs.
A. Area Analysis
In [10] the authors present an algorithm that achieves performance speed up by enabling an efficient use of the embedded arithmetic blocks and custom compression trees. Different filter realizations have been considered that include filters based on canonic signed digit (CSD) arithmetic utilizing 6:3 compression trees, systolic architectures for transposed realizations and filters designed using IP multiply-accumulate (MAC) units, unfolded MAC architecture and one generated through MATLAB Filter design and analysis (FDA) tool. Table 4 provides the area comparisons in terms of number of slices utilized. The target device is xc5vlx50-2ff324 from Virtex-5. Again the technology optimized structures show an efficient utilization of the underlying fabric. Apart from that only general logic elements are utilized and no special primitives or macro-support is consumed. Table 5 provides a comparison of the critical path delay and maximum clock frequency for the technology optimized FIR filter against the traditional implementation and the one based on the Xilinx multiply-adder IP v 2.0. The realization considered is transposed. The operand length is 16 bits and the filter order is 16. Target device is xc5vlx50-2ff324 from Virtex-5. Tables 6 and 7 provide a comparison of the critical path delay for the technology optimized realization and those reported in [47] . Table 6 is for symmetric coefficients with a filter order of 120 and table 7 is for asymmetric coefficients with a filter order of 75. The devices considered are xc5vlx50-2ff324 and xc6vlx75t-2ff484 from Virtex-5 and Virtex-6 respectively. The input operand length in each case is 16 bits.
B. Timing Analysis
Technology optimized structures are implemented with minimum possible depth; therefore, the critical path delays are quite low. Since clock frequency is also a strong function of the propagation and routing delays associated with the critical path, a minimum depth circuit also ensures high operating frequencies. This is indicated in table 8 where maximum clock frequency is compared against the various designs implemented in [10] . 
C. Power Analysis
Technology-dependent optimization reduces the power dissipation in two ways. First, the high activity switching nodes within a network are hid within the LUTs in the final circuit netlist. This reduces the overall switching activity associated with the logic nodes [48] . Second, technologydependent optimization results in a minimal depth circuit with a high logic density. This reduces the length of interconnects. Since interconnects in FPGAs are reconfigurable switches, there is a further reduction in the switching activity and thus the power dissipated. The analysis is done for a constant supply voltage and maximum operating frequency in each case. Test benches were designed for worst-case switching activity and the filter functionality was verified for more than 1000 input signals. The design node activity from the simulator database along with the power constraint file (PCF) was used for power analysis in the Xpower analyzer tool. Table 9 gives the detailed power dissipation for the technology optimized FIR filter against the traditional implementation and the one based on the Xilinx multiply-adder IP v 2.0. The target device is Virtex-5 and the filter order and input bit-width is 16. The power dissipated in clocking resources varies with the clock frequency. Since technology optimized design operates at slightly higher frequency, the power dissipated by clocking resources is more. Power dissipated by on-chip resources (logic + DSP) is lesser for technology optimized design because of the efficient utilization of the underlying resources. A reduction in switching activity due to hiding of nodes and reduction of interconnects results in lower power dissipation in the signals. Tables 10 and 11 provide comparison of the power dissipation by the technology optimized realizations and those reported in [47] . Table 10 is for symmetric coefficients with a filter order of 120 and table 11 is for asymmetric coefficients with a filter order of 75. The devices considered are xc5vlx50-2ff324 and xc6vlx75t-2ff484 from Virtex-5 and Virtex-6 respectively. The input operand length in each case is 16 bits.
For high throughput DSP systems it is more appropriate to quantify the power efficiency through energy analysis. In [49] the authors define three energy related parameters for FIR systems. These include Energy per operation (EOP), which is the average amount of energy required to compute one operation; Energy throughput (ET) which is the energy dissipated for every output bit processed and Energy density (ED) which is the energy dissipated per FPGA slice. Tables 12  and 13 provide these metrics for the technology optimized realizations and those reported in [47] . The devices considered are xc5vlx50-2ff324 and xc6vlx75t-2ff484 from Virtex-5 and Virtex-6 respectively. The input operand length in 16 bits. 
XC5VLX50-2FF324

XC6VLX75T-2FF484
XC5VLX50-2FF324
XC6VLX75T-2FF484
XC5VLX50-2FF324
XC6VLX75T-2FF484
XC5VLX50-2FF324
XC6VLX75T-2FF484
XC5VLX50-2FF324
XC6VLX75T-2FF484
XC5VLX50-2FF324
XC6VLX75T-2FF484
VI. CONCLUSIONS
This paper focused on the realization of FIR filters using technology optimized multiplier and 4:2 compressor unit. The results presented in this paper showed that technologydependent optimizations have a direct impact on area, delay and power dissipation of the design. Different filter forms (Direct, Transposed and Hybrid) were implemented and it was shown that for a particular form, the technology optimized realizations will always have an improved performance. Another key feature of technology-dependent optimization is that the same optimization results in the improvement of all the performance parameters (area, speed and power). This is generally not the case with technology-independent optimization where there is always an application driven trade-off that drives the design process. However, performance speed-up through technology-dependent optimization strongly relies on the amount of control the designer has over the mapping process. In this paper, we tackled this issue by modifying the coding strategy and writing instantiation based codes to map the behavior of the optimized Boolean networks. This complicates the design entry and although an efficient mapping is achieved, a complete control over the mapping process still remains a bottleneck in technology-dependent optimizations.
