Abstract-Generalized Parallel Counters (GPCs) are frequently used in constructing high-speed compressor trees. Previous work has focused on mapping GPCs efficiently onto FPGAs by combining the general look-up table (LUT) fabric with specialized fast carry chains. The resulting structures are purely combinational and cannot be efficiently pipelined to achieve the potential FPGA performance. In this paper, we take an alternate approach and try to eliminate the fast carry chain from the GPC structure. We present a heuristic that maps GPCs onto FPGAs using only the general LUT fabric. The resultant GPCs are then easily re-timed by placing registers at the fan-out nodes of each LUT. We have applied our heuristic to various GPCs reported in prior work. In most cases it eliminates the carry chain from the GPC structure without increasing the LUT count. Experimental results on Xilinx Kintex-7 FPGAs show a considerable reduction in critical path delay and dynamic power dissipation, with the same area utilization in most cases.
I. INTRODUCTION
Multi-operand addition is an important operation in many arithmetic circuits. It is frequently used in applications such as filtering [1], motion estimation [2] and array multiplication [3, 4, 5, 6, 7]. Compressor trees form the basic elements in multi-operand addition. Compressor trees based on carry save adders (CSA) typically provide higher speeds because they avoid long carry chains. Wallace [3] and Dadda [7] trees are CSA-based compressor trees that are frequently used in application specific integrated circuit (ASIC) design. However, the introduction of fast carry chains in FPGAs has made ripple carry addition faster than carry save addition. Consequently, CSA-based compressor trees are not well suited to FPGA implementation [8].
Prior work on compressor tree synthesis for FPGAs has used GPCs as the basic constituent elements. It has been demonstrated that GPCs can considerably reduce the critical path delay with comparable resource utilization [8, 9, 10, 11, 12, 13, 14]. Initial attempts in this regard were made by Parandeh-Afshar et al. [8, 9, 10, 11]. In [9] they claim to report the first method that synthesizes compressor trees on FPGAs. The proposed heuristic constructs compressor trees from a library of GPCs that can be efficiently implemented on FPGAs. Their later work [11] focuses on further reducing the combinational delay and limiting any increase in area by formulating the mapping of GPCs as an integer linear programming (ILP) problem. They reported an average reduction in delay of 32% and in area of 3% when compared to an adder tree. In [10] they focus on reducing the combinational delay by using embedded fast carry chains. This concept was further extended in [8], where delay reductions of 33% and 45% were achieved on Xilinx Virtex-5 and Altera Stratix-III FPGAs respectively.
Matsunaga et al. [12, 14] also formulated the mapping of GPCs as an ILP with speed and power as optimization goals. Their results show a 28% reduction in GPC count when compared to [9] . A reduction in GPC count results in reduction of compression stages thereby reducing the delay and power consumption.
Recent attempts from Kumm and Zipf [15, 16] focus on exploiting the low-level structure of Xilinx FPGAs to develop novel GPCs with high compression ratios and efficient resource utilization. Both general purpose LUT fabric and specialized carry chains have been used for synthesizing resource-efficient delay-optimal GPCs.
All the above-mentioned approaches (except [9]) focus on exploiting the fast carry chain embedded in modern FPGAs. The idea is to use the fast carry chain to connect adjacent logic cells and bypass the programmable routing network to reduce delay [10]. In this paper, however, we avoid the embedded carry chains and propose a heuristic that implements GPCs using only the general LUT fabric. The heuristic tries to minimize the number of LUTs in a GPC. The area-optimized GPCs are then easily retimed by inserting the registers that are available in each logic cell. Thus, instead of an LUT-carry chain combination, we use an LUT-register combination to map the GPCs. Our approach is motivated by the following reasons:
i. GPCs based on LUTs and carry chains are purely combinational in nature. FPGAs are synchronous devices and it is better to adhere to synchronous practices while using them as implementation platforms. Our approach provides this synchronous description by including registers in the synthesis process.
iii. Finally, retiming GPC structures by placing registers at the input of nodes with large capacitances reduces the switching activities at these nodes [17]. This results in reduced dynamic power dissipation.
The rest of the paper is organized as follows. Section II presents the basic preliminaries about the GPCs and the terminology used in this paper. Section III discusses the heuristic that is used to synthesize different GPCs. Synthesis and implementation is carried out in section IV. Conclusions are drawn in section V and references are listed at the end.
II. PRELIMINARIES AND TERMINOLOGY
A compressor tree is a circuit that takes k n-bit unsigned operands A_{k-1}, A_{k-2}, …, A_1, A_0 and generates two output values, Sum (S) and Carry (C), such that:
$$S + C = \sum_{i=0}^{k-1} A_i$$
A generalized parallel counter computes the sum of bits having different weights. A GPC is traditionally represented as a tuple (K_{m-1}, K_{m-2}, …, K_1, K_0; n), where K_i denotes the number of input bits of weight i (that is, of binary significance 2^i), m is the number of input columns and n is the number of output bits. The upper limit M on the output value of the GPC is given by:
$$M = \sum_{i=0}^{m-1} K_i \, 2^i$$
As an example, a (1, 4, 1, 5; 5) GPC has five input bits of weight 0; one input bit of weight 1; four input bits of weight 2 and one input bit of weight 3. The upper limit on the output value is 31 and five output bits are required to represent the output.
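The arithmetic in this example can be checked with a short sketch. The function names below are illustrative and do not come from the paper; the tuple is assumed to list the column with the highest weight first, as in the GPC notation above.

```python
def gpc_max_value(counts):
    """Largest value a GPC can output.

    `counts` lists the number of input bits per column, highest weight
    first, e.g. (1, 4, 1, 5) for the (1, 4, 1, 5; 5) GPC.
    """
    columns = len(counts)
    # The entry at position `pos` has weight index (columns - 1 - pos).
    return sum(k << (columns - 1 - pos) for pos, k in enumerate(counts))


def gpc_output_bits(counts):
    """Number of output bits needed to represent the largest output value."""
    return gpc_max_value(counts).bit_length()


print(gpc_max_value((1, 4, 1, 5)))    # 8 + 16 + 2 + 5 = 31
print(gpc_output_bits((1, 4, 1, 5)))  # 5
```

The same helpers give 6 and 3 for a single-column (6; 3) counter.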
Logic synthesis is concerned with hardware realization of a desired functionality with minimum possible cost. The cost of a circuit is a measure of its speed, resource utilization, power consumption or any combination of these. A Boolean network is a directed acyclic graph (DAG) that represents a combinational function. Logic gates, primary inputs (PIs) and primary outputs (POs) within this network are represented by nodes. Each node implements a local function. A global function is implemented by connecting the logic implemented by individual nodes. The transformation of a Boolean network into targeted logic elements gives the circuit-netlist. For FPGAs the targeted element is a k-LUT.
A cone of node v, C_v, is a sub-network that includes the node v and some of its non-PI predecessor nodes. Any node u within this cone has a path to the root node v, u→v, which lies entirely in C_v. The level of the node v is the length of the longest path from any PI node to v. Network depth is defined as the largest level of a node in the network. The critical path delay and area of a circuit are measured by the depth and the number of LUTs respectively. A node may have zero or more predecessor nodes known as fan-in nodes. Similarly, a node may drive zero or more successor nodes known as fan-out nodes. A network is said to be k-bounded if the fan-in of every node does not exceed k.
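The definitions of level, depth and k-boundedness can be illustrated with a small sketch. The network below is a made-up toy example, not one of the GPC networks from this paper, and the function names are assumptions chosen for illustration.

```python
def node_levels(fanins, primary_inputs):
    """Level of every node: length of the longest path from any PI.

    `fanins` maps each non-PI node to the list of its fan-in nodes.
    Primary inputs have level 0.
    """
    levels = {}

    def level(v):
        if v not in levels:
            if v in primary_inputs:
                levels[v] = 0
            else:
                levels[v] = 1 + max(level(u) for u in fanins[v])
        return levels[v]

    for v in list(primary_inputs) + list(fanins):
        level(v)
    return levels


def is_k_bounded(fanins, k):
    """True if no node has more than k fan-in nodes."""
    return all(len(inputs) <= k for inputs in fanins.values())


# Toy network: three primary inputs and four internal nodes.
pis = {"a", "b", "c"}
fanins = {"x": ["a", "b"], "g": ["a", "b"], "s": ["x", "c"], "co": ["g", "x", "c"]}

levels = node_levels(fanins, pis)
depth = max(levels.values())
print(levels["s"], depth)        # 2 2
print(is_k_bounded(fanins, 2))   # False, node "co" has three fan-ins
print(is_k_bounded(fanins, 6))   # True, the network fits 6-input LUTs
```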
III. GPC MAPPING HEURISTIC
This section describes the heuristic for efficiently mapping GPCs onto LUTs. The primary goal of the heuristic is to eliminate the fast carry chain and map the GPCs onto the minimum possible number of LUTs. Eliminating the carry chain makes the GPCs amenable to pipelining. The resulting structures are easily pipelined by placing registers along the feed-forward cut-sets. We explain the different steps involved in the heuristic by considering the mapping of GPC (1, 4, 1, 5; 5). The conventional implementation requires four LUTs and a CARRY4 primitive, with a total delay of T_L + 4T_CC, where T_L is the delay associated with a single LUT and T_CC is the delay of a single carry stage.
Recognition and Prioritization: After the individual networks have been obtained, the heuristic searches for redundant nodes in each of the networks. Redundant nodes are nodes that exist in more than one network. These are shown as shaded portions in figure 2. The network for the redundant nodes is then drawn separately, as shown in figure 3. Each redundant network is assigned a priority based on the number of times it appears in the original networks of figure 2. For example, the network in figure 3(a) is assigned a priority of 5 because it appears in five different networks. Similarly, the network in figure 3(b) is assigned a priority of 4 because it appears in four different networks, and so on. Note that the entire parent network can be constructed by interconnecting these redundant networks.
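The recognition and prioritization step can be sketched as a simple counting pass. This is only an illustration under simplifying assumptions: each network is reduced to a set of node labels, the labels and network names are hypothetical, and the heuristic in this paper operates on shared sub-networks rather than on isolated labels.

```python
from collections import Counter


def prioritize_shared_nodes(networks):
    """Rank nodes that occur in more than one network.

    `networks` maps a network name to the set of node labels it contains.
    Returns (node, priority) pairs, highest priority first, where the
    priority is the number of networks the node appears in.
    """
    appearances = Counter()
    for nodes in networks.values():
        appearances.update(set(nodes))  # count each node once per network
    shared = [(node, count) for node, count in appearances.items() if count > 1]
    # Sort by descending priority, then by name for a stable order.
    return sorted(shared, key=lambda item: (-item[1], item[0]))


# Hypothetical per-output networks with made-up node labels.
networks = {
    "out0": {"p", "q"},
    "out1": {"p", "q", "r"},
    "out2": {"p", "r", "s"},
}
ranking = prioritize_shared_nodes(networks)
print(ranking)  # [('p', 3), ('q', 2), ('r', 2)]
```

Node "s" is dropped because it occurs in a single network and is therefore not redundant in the sense used above.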
Covering and Re-structuring: Next, the heuristic tries to map these redundant networks optimally onto LUTs. Mapping is done in order of priority, as this results in the maximum logic density. For example, the network in figure 3(a) has a priority of 5 and, if mapped optimally, will improve the logic density in all the networks it is a part of. In this paper, we target FPGAs with 6-input LUTs as basic logic elements, so the mapping should ensure proper utilization of this basic element. For efficient mapping, each network in figure 3 is divided into sub-networks. This is again done by traversing the network and dividing it at the output nodes. Thus the network of figure 3(a) is divided into three sub-networks corresponding to outputs X_0, X_1 and Z_0. Similarly, the networks in figures 3(b), 3(c) and 3(d) are divided into different sub-networks as per their fan-out. This is shown in figure 4. A straightforward approach to mapping would be to assign the logic implemented by each sub-network to a separate LUT. This, however, leads to under-utilization of the resources. For efficient mapping, therefore, the entire assembly of sub-networks is re-structured. This requires transferring some sub-networks from their original networks to sub-networks that belong to different networks. For example, sub-network X_0, which originally belonged to figure 4(a), is transferred to figure 4(b) and included with sub-networks X_2 and Z_1. This re-structuring of sub-networks ensures proper utilization of the LUT fabric. Note that the 6-input LUTs in Xilinx FPGAs can implement a single 6-input function or two 5-input functions with shared inputs. The re-structured sub-networks are shown in figure 5. They are then mapped onto 6-input LUTs by directly mapping their functionalities onto these target elements.
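The packing constraint mentioned above (one 6-input function, or two 5-input functions with shared inputs, per LUT) can be expressed as a small feasibility check. The sketch below is a simplified greedy pairing, not the re-structuring procedure of this paper; the function names and input labels are hypothetical.

```python
def can_share_lut6(support_a, support_b):
    """Two functions fit one LUT if together they use at most 5 inputs."""
    return len(set(support_a) | set(support_b)) <= 5


def pack_greedy(functions):
    """Greedily pair functions into 6-input LUTs.

    `functions` maps a function name to the set of inputs it depends on.
    Returns a list of tuples, one tuple of function names per LUT.
    """
    for name, support in functions.items():
        if len(support) > 6:
            raise ValueError(f"{name} needs more than one 6-input LUT")
    remaining = dict(functions)
    luts = []
    while remaining:
        name, support = next(iter(remaining.items()))
        del remaining[name]
        # Look for the first remaining function that can share this LUT.
        partner = next(
            (other for other, s in remaining.items() if can_share_lut6(support, s)),
            None,
        )
        if partner is None:
            luts.append((name,))
        else:
            del remaining[partner]
            luts.append((name, partner))
    return luts


# Hypothetical sub-network functions and their input supports.
functions = {
    "f0": {"a0", "a1", "a2", "a3", "a4"},
    "f1": {"a0", "a1", "a2", "a3", "a4"},
    "f2": {"b0", "b1", "b2", "b3", "b4", "b5"},
}
packing = pack_greedy(functions)
print(packing)  # [('f0', 'f1'), ('f2',)]
```

Here f0 and f1 share all five inputs and occupy one LUT, while the 6-input function f2 needs a LUT of its own, so three functions are covered by two LUTs.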
Re-construction and Re-timing: The parent network is then constructed by connecting the mapped networks from the covering and re-structuring step. The overall structure is a simple feed-forward structure with unidirectional dataflow. This feed-forward nature lends itself to efficient pipelining by simply placing registers along the feed-forward cut-sets. The final mapped and re-timed structure is shown in figure 6.
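A register placement along a feed-forward cut-set has the property that every path from a primary input to a given node crosses the same number of registers. The sketch below checks this property on a toy two-LUT network; the edge representation and names are assumptions made for illustration and do not describe the circuit of figure 6.

```python
def register_counts(edges, primary_inputs):
    """For every node, the set of register counts seen over all PI paths.

    `edges` maps (source, destination) to the number of registers placed
    on that connection.
    """
    fanins = {}
    for (src, dst), regs in edges.items():
        fanins.setdefault(dst, []).append((src, regs))
    memo = {}

    def counts(v):
        if v not in memo:
            if v in primary_inputs:
                memo[v] = {0}
            else:
                memo[v] = {c + regs for src, regs in fanins[v] for c in counts(src)}
        return memo[v]

    nodes = set(primary_inputs) | {dst for _, dst in edges}
    return {v: counts(v) for v in nodes}


def is_valid_pipelining(edges, primary_inputs):
    """True if all paths into each node carry the same number of registers."""
    return all(len(c) == 1 for c in register_counts(edges, primary_inputs).values())


pis = {"a", "b"}
# Registers on every connection crossing the cut between lut1 and lut2.
cut_set = {("a", "lut1"): 0, ("b", "lut1"): 0, ("lut1", "lut2"): 1, ("b", "lut2"): 1}
# Register missing on the direct b -> lut2 connection.
broken = {("a", "lut1"): 0, ("b", "lut1"): 0, ("lut1", "lut2"): 1, ("b", "lut2"): 0}

print(is_valid_pipelining(cut_set, pis))  # True
print(is_valid_pipelining(broken, pis))   # False
```

In the second placement, lut2 would combine a delayed value of lut1 with an undelayed value of b, which changes the function being computed.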
The circuit implementation of figure 6 requires four LUTs and three registers and has a critical path that includes only the delay of a single LUT (T_L). The carry chain has been eliminated and there is no increase in the delay associated with the GPC. Different GPCs proposed in prior work were implemented using this heuristic. The carry chain was successfully eliminated in all of the GPCs with no extra hardware cost, except in a few cases where the column length of the GPCs exceeded five. The circuits for different GPCs are shown in figures 7, 8, 9 and 10. A theoretical evaluation of different GPCs is listed in table 1. With respect to table 1, it should be noted that previous implementations using carry chains count only LUTs as the hardware resource. However, for each bit in a carry chain there is a carry multiplexer (MUXCY) and a dedicated XOR gate for adding/subtracting the operands with a selected carry bit. Thus the increase in LUT count that is observed for some GPCs using our heuristic may be compensated by the elimination of the resources included in the carry chain.
IV. SYNTHESIS, IMPLEMENTATION AND RESULTS
Synthesis and implementation is done using the xc7k70t-2fbg676 device from the Xilinx Kintex-7 family. The parameters considered are resources utilized, critical path delay and dynamic power dissipation. Constraints relating to synthesis and implementation are duly provided and complete timing closure is ensured in each case. Synthesis and implementation is carried out in Xilinx Vivado 2016.3 [18] with speed as the optimization goal. Power analysis is done using the XPower Analyzer tool. For power analysis, switching activity is captured in a value change dump (VCD) file by applying test vectors and checking for correct output. Similar test benches have been used to ensure a fair comparison. Table 2 provides a comparison of different performance metrics for different GPCs.
From table 2 it is observed that the GPC mappings based on the proposed heuristic show an average increase in speed of almost 65% and an average reduction in dynamic power dissipation of 10%. The carry chain is eliminated in each GPC at the cost of pipelining registers and, in a few cases, additional LUTs. Each slice in Kintex-7 supports four registers which normally remain unutilized. Our experimentation with different arithmetic circuits on Kintex-7 devices reveals that each carry chain utilizes resources that are equivalent to 1 to 1.5 6-input LUTs. Thus any increase in LUT count is justified by the elimination of the carry chain.
V. CONCLUSIONS
In this paper we took an alternate approach to GPC synthesis on FPGAs. Unlike prior work on GPC synthesis that used a combination of LUTs and carry chains, we used a combination of LUTs and registers and eliminated the carry chain completely from the GPC structure. Our approach works in two steps: first, a heuristic is used to eliminate the carry chain and map the GPC logic efficiently onto the underlying LUT fabric. The mapped GPC is then retimed by placing registers along the feed-forward cut-sets. Retiming breaks the critical path, resulting in higher operating frequencies. Our implementation targeting Xilinx FPGAs shows an increase in speed and a reduction in power dissipation for almost the same resource utilization.
