Early compressor trees based on carry-save adders and single-column parallel counters show good performance in ASIC design, but do not adapt well to modern field-programmable gate arrays (FPGAs). Recently, compressor trees built from generalized parallel counters (GPCs) were synthesized on FPGAs to address this issue. Although GPC-based compressor trees improve timing performance, their area reduction is less significant than their delay reduction and can be further optimized. In this paper, we propose improved GPC mappings as well as new approaches for GPC cascading and binding for Xilinx FPGAs. With these improvements, we develop an integer linear programming (ILP) method for FPGA synthesis of GPC-based compressor trees that supports cascading and binding between GPCs. Experimental results show that the single-cycle compressor trees produced by the proposed ILP reduce the average area by 42.40% compared with those generated by the existing heuristic method, but are 13.16% slower; the pipelined compressor trees produced by the proposed ILP reduce the average area by 33.43% at the cost of an average 14.35% decrease in maximum clock frequency compared with those obtained by the existing heuristic method.
I. INTRODUCTION
Multi-operand addition is a common arithmetic operation, and is widely used in digital signal processing, fast multipliers, and multimedia applications [1]-[4]. To realize this functionality, compressor trees were first introduced by Wallace [5] and then improved by Dadda [6]. In their works, the basic blocks of compressor trees were based on carry-save adders (CSAs) and single-column parallel counters instead of carry-propagate adders (CPAs) in order to avoid long carry propagation delay.
Such compressor trees are preferred in ASIC design because they have better performance than traditional adder trees. However, the situation is different in field-programmable gate array (FPGA) implementations. First, modern FPGAs contain carry chains, which are dedicated architectures for fast carry propagation. Without their help, the critical path delays of these compressor trees might be even worse than that of CPA-based compressor trees. Second, the basic reconfigurable cells in FPGAs cannot be fully utilized by those compressor trees.
The associate editor coordinating the review of this manuscript and approving it for publication was Gian Domenico Licciardo.
In this context, compressor trees built from generalized parallel counters (GPCs) were first synthesized on look-up table (LUT)-based FPGAs in 2008 [7]. Since then, the GPC-based compressor tree has become an attractive structure in FPGA design because it is capable of utilizing the carry chain and can be well mapped to FPGA logic cells [8]. In general, a GPC-based compressor tree includes a final adder and a GPC network. The GPC network is the key component and has a great influence on the performance of the compressor tree; thus, designing its architecture is the main task in the synthesis of a GPC-based compressor tree. The synthesis flow is implemented in two main phases. In the first phase, appropriate GPCs are built and then a GPC library is formed. In the second phase, the number of each type of GPC and the interconnections between them are decided by a specific method according to the compressor tree specification and the GPC library generated in the first phase.
The superiority of GPC-based compressor trees is mainly reflected in the improved timing performance compared to other structures in FPGA design, but the area cost still has much untapped potential for further optimization. To address this issue, we propose an area-efficient integer linear programming (ILP) method for GPC-based compressor tree synthesis. In our method, the optimized GPC mappings are introduced and used to construct the GPC library, and two advanced methods for GPC packing are also proposed to form a compact FPGA mapping. The major contributions of this paper are summarized as follows:
1) Improved GPC mappings are proposed by exploring the architecture of the carry chain and six-input LUTs on Xilinx FPGAs, among which GPC (6, 0, 7; 5) has the highest area efficiency (E = 2) reported so far.
2) Two advanced methods for GPC packing are introduced, namely GPC cascading and GPC binding, for the synthesis of single-cycle compressor trees and pipelined compressor trees, respectively, which can help to form a more compact FPGA mapping.
3) An area-efficient ILP method that supports GPC cascading and binding is also proposed for the improvement of compressor tree synthesis.
VOLUME 7, 2019. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
The rest of this paper is organized as follows. Section II presents related works. Section III briefly introduces the basic preliminaries. Section IV presents the improved GPC mappings and the efficient approaches toward GPC packing after analyzing the logical structure of Xilinx FPGAs. The proposed area-efficient ILP method is illustrated in Section V. Section VI shows our implementation flow and experimental results, and Section VII summarizes the paper.
II. RELATED WORKS
Numerous studies have introduced GPC-based compressor tree synthesis as well as the GPC mappings on FPGAs. The heuristic method and ILP are two main methods that are used for the synthesis of GPC-based compressor trees. There are both merits and demerits for these methods. The heuristic method can complete the task in a short time, but it may not get the optimal solution; the ILP method may achieve a better result at the cost of increased runtime.
Parandeh-Afshar et al. introduced the synthesis of GPC-based compressor trees by using heuristic methods [7], [10]. They also built an ILP formulation for minimizing the number of logic levels in compressor trees [9]. The optimizations of their works mainly focused on the combinational delay of the critical path in the compressor tree. Compared to ternary adder trees, the combinational delays of compressor trees built by the heuristic method [10] could be decreased by 33% and 45% in Xilinx Virtex-5 and Altera Stratix-III FPGAs, respectively.
Matsunaga et al. [11] formulated an ILP model to optimize the delay and power of GPC-based compressor trees. They found that the total dynamic power of the compressor tree on an FPGA was related to the total number of GPCs as well as the GPC levels. In this case, they achieved the goal by minimizing the number of GPCs and the GPC levels. Their results showed that the delay was reduced by up to 20% with a slight increase in total power dissipation compared to previous heuristic methods.
Brunie et al. [12] proposed a novel data structure called a bit heap that could hold a set of bits in different columns, and introduced timing-driven compression under this structure. After analyzing the area efficiency and combinational delays of various GPCs as well as the ternary adder, they found that fully utilized small ternary adders might lead to better synthesis results. Thus, they considered only LUT-based GPCs and small ternary adders in their heuristic method for compressor tree synthesis, unlike the approach of [10]. Khurshid and Mir [13] proposed a heuristic method that aimed to improve the area efficiency of GPC mappings. However, their method neglected the interconnect constraints between the LUTs and the fast carry chain within a slice, and thus additional LUTs listed as "route-thru LUTs" in the Xilinx reports had to be used in some cases [14].
Preußer [15] introduced several evaluation criteria for GPCs and a greedy algorithm for the construction of matrix summation. In addition, a novel GPC (2, 5; 1, 2, 1) was proposed, which could achieve the function of three full adders within a single logic level.
Kumm and Zipf [16] proposed new GPC mappings and a 4:2 compressor based on a ternary tree for modern Xilinx FPGAs. They also proposed an ILP formulation for compressor tree synthesis, and then further improved it to support row adders [17], [18]. Their experimental results showed average LUT reductions of about 40% compared to trees of 2-input adders, but the resulting circuits were about 12% to 20% slower.
III. PRELIMINARIES
A. DOT DIAGRAM
For better analysis of the computing process, a multi-operand addition can be abstracted into a dot diagram. Fig.1 shows a dot diagram for an addition that contains four four-bit addends. Each dot represents a binary bit of each operand that can be a 0 or 1. A set of dots in one column has the same binary weight, and the binary weights are ordered from lower on the right to higher on the left. The dots above the line describe the inputs to be added, and the lower dots represent the outputs.
B. GENERALIZED PARALLEL COUNTER
Assume $A = [a_{n-1}, a_{n-2}, \ldots, a_0]$ is an $n$-bit positive binary number. The rank of bit $a_i$ is given by its subscript $i$, which stands for its binary weight. For example, in the binary value 00010, the bit set to 1 has rank 1 and thus contributes a value of $1 \times 2^1$. For counters, all of the input bits that have the same rank can be referred to as a column.
The single-column parallel counter counts the input bits that are set to 1 and outputs the count as an unsigned binary number. In contrast, a GPC can accept input bits from different columns. A GPC can be described as $(m_{k-1}, m_{k-2}, \ldots, m_0; n)$, where $m_i$ stands for the number of input bits in column $i$, and $n$ represents the bit width of the output. The GPC value and output bit width are given by equations (1) and (2), respectively. The area efficiency of a GPC is evaluated by $E$, which is defined in equation (3) as the number of removed bits $\lambda$ divided by the number of occupied LUTs $L$.
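Equations (1) to (3) are not reproduced in this text. A reconstruction that agrees with the definitions above and with the worked (6, 0, 7; 5) example that follows is sketched below. The exact form of (1), and the use of $M_{\max}$ for the largest value a GPC can produce, are assumptions rather than the authors' typeset equations.

```latex
\begin{align}
M_{\max} &= \sum_{i=0}^{k-1} m_i \, 2^{i} \tag{1} \\
n &= \left\lceil \log_2 \left( M_{\max} + 1 \right) \right\rceil \tag{2} \\
E &= \frac{\lambda}{L}, \qquad \lambda = \sum_{i=0}^{k-1} m_i - n \tag{3}
\end{align}
```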
For example, a (6, 0, 7; 5) GPC has six input bits in column 2, seven input bits in column 0, and five output bits. When all of the input bits are set to 1, the (6, 0, 7; 5) GPC reaches its maximum value $M_{\max} = 6 \times 2^2 + 7 \times 2^0 = 31$, and its output bit width is $n = \log_2(31 + 1) = 5$. Moreover, its hardware cost is four six-input LUTs and the number of removed bits is eight, measured by the difference between input bits and output bits. Thus, the efficiency of the (6, 0, 7; 5) GPC is $E = 8/4 = 2$. The functionalities of GPCs can also be represented by using dot diagrams. As an example, Fig. 2 depicts dot diagrams for GPC (6, 0, 7; 5), (1, 5; 3), and (6; 3).
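The arithmetic in this example can be checked with a short script. The sketch below is illustrative only: the function name `gpc_properties` is hypothetical, and the LUT counts passed for (1, 3, 5; 4) and (1, 4, 0, 7; 5) are inferred from the efficiencies quoted in Section IV-B rather than stated directly in the text.

```python
def gpc_properties(columns, luts):
    """Return (max value, output bits, removed bits, efficiency E) of a GPC.

    columns: input-bit counts per column, most significant column first,
             in the same order as the (m_{k-1}, ..., m_0; n) notation.
    luts:    number of six-input LUTs occupied by the mapping.
    """
    k = len(columns)
    # Each bit in column i contributes 2**i when it is set to 1.
    max_value = sum(m << (k - 1 - idx) for idx, m in enumerate(columns))
    # bit_length() equals ceil(log2(max_value + 1)) for non-negative integers.
    out_bits = max_value.bit_length()
    removed = sum(columns) - out_bits
    return max_value, out_bits, removed, removed / luts


print(gpc_properties((6, 0, 7), luts=4))     # the worked example above
print(gpc_properties((7,), luts=2))          # (7; 3)
print(gpc_properties((1, 3, 5), luts=3))     # LUT count inferred, see lead-in
print(gpc_properties((1, 4, 0, 7), luts=4))  # LUT count inferred, see lead-in
```

Passing the LUT count as an argument keeps the sketch independent of any particular mapping, since the same GPC shape can cost a different number of LUTs depending on how it is implemented.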
C. COMPRESSOR TREE
A compressor tree sums the inputs, which are a series of binary numbers, and outputs a corresponding result. It often consists of two parts: a GPC network and a vector merge adder (VMA). An example of a compressor tree is shown in Fig.3 where the GPC network is denoted by the dotted box.
A GPC network is often divided into several stages. For instance, the compressor tree in Fig. 3 has two stages. In principle, each stage contains different GPCs to compute input data on the basis of the specific strategy. The inputs of the first stage are fed by the original inputs of the compressor tree, and the inputs of the following stages include GPC outputs and unused original inputs from the previous stages. The compression process continues stage by stage until there are only two or three bits left in every column. Finally, these remaining bits are summed by a VMA.
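The stage-by-stage process can be illustrated by tracking only the number of bits in each column. The sketch below is a deliberately simplified stand-in: it applies (3; 2) full adders greedily in every column, which is not the GPC selection strategy used in this paper, and the names `compress_stage` and `compress` are hypothetical. It starts from the four four-bit addends of Fig. 1 and stops when at most two bits remain in every column.

```python
def compress_stage(heights):
    """Apply (3; 2) full adders greedily to one stage.

    heights[c] is the number of bits in column c (column 0 is least significant).
    Returns the column heights seen by the next stage.
    """
    nxt = [0] * (len(heights) + 1)
    for c, h in enumerate(heights):
        adders, leftover = divmod(h, 3)
        nxt[c] += adders + leftover  # one sum bit per adder plus the unused bits
        nxt[c + 1] += adders         # one carry bit per adder, one column higher
    while len(nxt) > 1 and nxt[-1] == 0:
        nxt.pop()                    # drop empty top columns
    return nxt


def compress(heights, target=2):
    """Repeat stages until no column holds more than `target` bits."""
    stages = 0
    while max(heights) > target:
        heights = compress_stage(heights)
        stages += 1
    return heights, stages


final_heights, num_stages = compress([4, 4, 4, 4])
print(final_heights, num_stages)
```

Every full adder removes exactly one bit from the heap, so the loop terminates, and the weighted bit count (the sum of `heights[c] * 2**c`) is the same before and after each stage.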
IV. PROPOSED GPC MAPPINGS AND EFFICIENT APPROACHES TOWARD GPC PACKING
A. THE XILINX VIRTEX-6 SLICE ARCHITECTURE
A slice is the basic reconfigurable cell of Xilinx FPGAs, and it contains four six-input LUTs (LUT6s), eight registers, wide-function multiplexers, and a four-bit carry chain block [19]. Each LUT6 can be configured to implement a single six-input function, two five-input functions with shared inputs, or a six-input function and a five-input function with shared inputs and shared values. The carry chain is a dedicated architecture that is used to implement fast addition or subtraction and can be cascaded to form a larger functional unit. Fig. 4 shows four LUT6s and a corresponding four-bit carry chain in a slice.
This carry logic contains 10 inputs, including four S inputs (S0 to S3), four DI inputs (DI0 to DI3), and the inputs CIN and CYINIT, and eight outputs, including four O outputs (O0 to O3) and four CO outputs (CO0 to CO3). The S inputs are sourced only from the O6 outputs of the corresponding LUT6s; the DI inputs are fed by the O5 outputs of the corresponding LUT6s or by the bypass inputs (AX to DX); the CYINIT input can accept the bypass input AX, 0, or 1; and the CIN input can be connected only to the COUT output of the previous slice. The O ports output the sum of the addition or subtraction, and the CO ports output the carry-out of the corresponding bit.
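A bit-level model makes the role of the S, DI, O, and CO ports concrete. The sketch below follows the usual description of the Xilinx four-bit carry primitive, in which each block contains an XOR gate and a carry multiplexer selected by S; the function name `carry4` and the list-based interface are illustrative choices, not part of any vendor library.

```python
def carry4(s, di, cin):
    """Bit-level model of a four-bit carry chain.

    s, di: lists of four bits, index 0 being the least significant block.
    cin:   first carry-in (from CIN or CYINIT).
    Returns (o, co): the four O outputs and the four CO outputs.
    """
    o, co = [], []
    carry = cin
    for s_i, di_i in zip(s, di):
        o.append(s_i ^ carry)            # XOR gate: O_i = S_i xor incoming carry
        carry = carry if s_i else di_i   # multiplexer: propagate carry or take DI_i
        co.append(carry)
    return o, co


def add4(a, b, cin=0):
    """Add two 4-bit integers with the carry chain: S_i = a_i xor b_i, DI_i = a_i."""
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    s = [x ^ y for x, y in zip(a_bits, b_bits)]
    o, co = carry4(s, a_bits, cin)
    return sum(bit << i for i, bit in enumerate(o)) + (co[3] << 4)


print(add4(9, 7))
```

In `add4`, the S values play the role of the LUT6 O6 outputs and the DI values the role of the O5 outputs or bypass inputs, which is the arrangement described above for fast addition.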
B. PROPOSED GPC MAPPINGS FOR XILINX FPGAS
Recent work on GPC mapping [8] has proved that a GPC built by combining carry chain and LUTs has a better area efficiency than one constructed by using only LUTs. In this case, effective use of each bit of the carry chain and every LUT can help improve the area efficiency.
The first carry-in (CYINIT or CIN) of a four-bit carry chain [19] can source from (type 1) carry-output from the previous slice, (type 2) one-bit binary number (0 or 1), or (type 3) bypass input AX. Among these different input types, type 1 cannot source from the input within the slice because of the routing constraints; type 2 offers no contribution to GPC logic, because it can connect only to GND or VCC; and type 3 is the best choice for GPC construction, because one of the GPC inputs can be mapped onto bypass input AX and a part of the GPC function can be implemented by using the first XOR gate in the carry chain.
However, when six inputs of a GPC are connected to LUTA, previous works [8], [16], [17] configured LUTA to implement a single six-input function and used the bypass input AX to feed the DI0 port. In this situation, the first carry-in of a four-bit carry chain can be connected only to CIN, GND, or VCC, and cannot be used as a GPC input within the slice. For example, the GPC (6, 0, 6; 5) FPGA mapping from [17] is shown in Fig. 5, where a0 to a5 and b0 to b5 denote the six inputs associated with column 0 and column 2, respectively, and z0 to z4 represent the five outputs. In this case, the AX input is assigned to GPC input a5, and therefore the first carry-in cannot be used as an additional input.
To address this issue, we propose an improved method for GPC mappings, which is inspired by [20]. As mentioned previously, the first bypass input AX in a slice must be available if the first carry-in of a four-bit carry chain is to be connected to a GPC input. For this purpose, LUTA is configured to implement a six-input function and a five-input function with shared inputs and shared values, just as Walters III has done in his array multipliers [20]. More specifically, assume that a0 to a5 are the six GPC inputs in the same column and are connected to LUTA, while O_DI and O_S represent the computation results and are assigned to the O5 and O6 outputs, respectively, as shown in Fig. 6. In this case, O_DI outputs the sum of a5 to a1 calculated by using the five-input function, and O_S outputs the sum of O_DI and a0 computed by using the six-input function; then, O_DI and O_S are used to feed the DI0 and S0 inputs in the carry chain, respectively. Thus, the data input DI0 is fed by the LUTA output (O_DI), and an additional GPC input can be connected to the first carry-in port through the bypass input AX. By doing so, a six-input LUT with the corresponding carry logic can be used to compute the summation of seven one-bit variables, and thus the number of input bits in the lowest column of a GPC that is mapped to such a structure can be increased from six to seven without any extra LUT6 cost.
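The effect on the least significant output bit can be checked exhaustively. The sketch below rests on an assumption: it reads "sum" in the description of O_DI and O_S as a modulo-2 sum (XOR), which is what the least significant bit of a counter requires. The names `lut_a` and `first_block_o0`, and the label a6 for the seventh input routed through AX, are hypothetical.

```python
from itertools import product


def lut_a(a):
    """LUTA in dual-output mode; a = (a0, ..., a5).

    Returns (O_DI, O_S), assuming both functions are modulo-2 sums.
    """
    o_di = a[1] ^ a[2] ^ a[3] ^ a[4] ^ a[5]  # five-input function on O5, feeds DI0
    o_s = o_di ^ a[0]                        # six-input function on O6, feeds S0
    return o_di, o_s


def first_block_o0(a, a6):
    """O0 of the first carry-chain block, with a6 entering as carry-in via AX."""
    _, o_s = lut_a(a)
    return o_s ^ a6                          # first XOR gate of the carry chain


# O0 should equal the least significant bit of the count of ones among 7 inputs.
mismatches = [
    bits for bits in product((0, 1), repeat=7)
    if first_block_o0(bits[:6], bits[6]) != sum(bits) % 2
]
print(len(mismatches))
```

Only the O0 output is modelled here; the higher output bits of a (7; 3) GPC depend on the remaining LUTs and carry blocks shown in the paper's figures and are outside the scope of this sketch.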
In general, the proposed GPC mapping procedure using a combination of LUT6 and carry-chain involves the following phases: 1) clustering related GPC inputs and forming a Boolean network for each output; 2) removing selected inputs connected to the carry chain and reconstructing the GPC logic associated with each output; 3) packing the chosen GPC inputs to the carry chain; and 4) mapping the remaining GPC logic onto six-input LUT.
We take the proposed (1, 3, 5; 4) GPC as an example to further illustrate the details of each phase. Assume that a0 to a4 are the five GPC inputs associated with column 0, b0 to b2 are the three inputs associated with column 1, c0 is an input associated with column 2, and z0 to z3 are the four-bit GPC outputs. In the first phase, related GPC inputs are clustered and a Boolean network is formed according to the functional description of each output. For a (1, 3, 5; 4) GPC, the Boolean networks for the four-bit outputs are shown in Fig. 7. In the second phase, the inputs connected to the carry chain are selected and the related logic is excluded from the Boolean network. As mentioned above, no more than two inputs can be directly assigned to the first block of a four-bit carry chain, and only one input can be connected to each of the remaining blocks. For a (1, 3, 5; 4) GPC, inputs a0 and a4 of column 0 are selected to connect to the carry chain, which has two effects on the original Boolean networks. On the one hand, the full adder FA1 is replaced by an XOR gate, as shown in Fig. 8(d), because the rest of the logic is implemented using the carry chain. On the other hand, the inputs a0 and a4 as well as the full adder FA1 are removed from the network of Z1, as shown in Fig. 8(c), because the carry output of FA1 is propagated through the carry chain. The same process goes on from column 0 to column 2, and the networks are rebuilt as shown in Fig. 8. In the third phase, the chosen GPC inputs are packed to the carry chain. For a (1, 3, 5; 4) GPC, the inputs a0 and a4 from column 0, b2 from column 1, and c0 from column 2 are selected and connected to the carry chain, as shown in Fig. 9(a). In the last phase, the remaining GPC logic is mapped to LUT6s. For a (1, 3, 5; 4) GPC, its FPGA mapping is shown in Fig. 9(b). Please note that the remaining networks of Z3 and Z2 share the same inputs and can be mapped into a single LUT6.
As a result, the improved mappings for GPC (7; 3), (6, 0, 7; 5), (1, 4, 0, 7; 5), and (1, 3, 5; 4) can be optimized or extended from the existing GPC (7; 3), (6, 0, 6; 5), (1, 4, 0, 6; 5), and (3, 5; 4), respectively. A new GPC mapping (2, 1, 1, 7; 5) can also be created. In terms of timing performance, the critical-path delays of the proposed GPCs contain only LUT and carry propagation delay. In terms of design area, the proposed GPC (7; 3) and (6, 0, 7; 5) have the highest area efficiency (E = 2) reported so far. Although the area efficiencies of the proposed GPC (1, 4, 0, 7; 5), (1, 3, 5; 4), and (2, 1, 1, 7; 5) are 1.75, 1.67, and 1.5, respectively, they may be preferred in some cases because of their different input shapes. The schematics and gate-level mappings of GPC (7; 3), (6, 0, 7; 5), (1, 3, 5; 4), and (2, 1, 1, 7; 5) in a Xilinx FPGA are shown in Fig. 10, where $a_n$, $b_n$, $c_n$, and $d_n$ denote the GPC inputs associated with columns 0, 1, 2, and 3, respectively, and $Z_n$ represents the GPC output. The properties of these GPCs are listed in Table 1, where $D_L$ stands for the delay of a LUT6, and $D_c$ stands for the delay of one-bit carry propagation. Here, the Xilinx Virtex-6 device is taken as an example, and the proposed GPC mappings can also be applied to other recent Xilinx FPGAs based on six-input LUTs.
C. EFFICIENT APPROACHES TOWARD GPC PACKING
When a GPC built from a combination of the carry chain and fewer than four LUTs is mapped onto a Xilinx slice, the rest of the carry chain logic within the same slice cannot be efficiently utilized in the process of rebuilding the GPC logic, which requires explicit guidance on how to connect different GPCs along the carry chain. To address this issue, GPC cascading and binding methods are introduced, which aim to pack GPCs densely via the carry chain.
GPC1 can cascade to GPC2 along the carry chain if two conditions are met: 1) one of the least significant input bits of GPC2 can be fed by the most significant output bit of GPC1; 2) the stage of GPC2 is equal to or greater than that of GPC1.
Here, GPC1 and GPC2 are denoted as the "donor" and the "acceptor", respectively, in cascading. Interestingly, from this point of view, a large GPC can also be implemented by cascading two smaller ones; for instance, GPC (6, 0, 7; 5) can be formed when one GPC (7; 3) is cascaded to another. Although GPC cascading cannot improve the average area efficiency of two GPCs, it can help to make efficient use of each LUT as well as the corresponding carry chain logic within a slice. The cascading between GPCs from the same stage can be achieved easily by adding cascaded GPCs to the GPC library. The properties of the cascaded GPCs used in our library are listed in Table 2.
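The shape of a cascaded GPC follows from the first cascading condition alone. The sketch below is shape-level bookkeeping under that assumption: it ignores the stage condition and the LUT mapping, and the function names `output_width` and `cascade` are hypothetical.

```python
def output_width(columns):
    """Output bit width of a GPC; columns are listed most significant first."""
    k = len(columns)
    max_value = sum(m << (k - 1 - i) for i, m in enumerate(columns))
    return max_value.bit_length()


def cascade(donor, acceptor):
    """Shape (columns, outputs) obtained when `donor` cascades to `acceptor`.

    The donor's most significant output bit feeds one input in the
    acceptor's least significant column.
    """
    if acceptor[-1] < 1:
        raise ValueError("acceptor needs an input in its lowest column")
    shift = output_width(donor) - 1        # rank of the donor's top output bit
    width = max(len(donor), shift + len(acceptor))
    heights = [0] * width                  # index = column, 0 = least significant
    for col, m in enumerate(reversed(donor)):
        heights[col] += m
    for col, m in enumerate(reversed(acceptor)):
        heights[shift + col] += m
    heights[shift] -= 1                    # this input is now driven by the donor
    outputs = shift + output_width(acceptor)
    return tuple(reversed(heights)), outputs


print(cascade((7,), (7,)))    # one (7; 3) cascaded to another
print(cascade((7,), (1, 5)))  # (7; 3) cascaded to (1, 5; 3)
```

The first call reproduces the (6, 0, 7; 5) example from the text. The second call yields a shape that coincides with (1, 4, 0, 7; 5) from Section IV-B, although the text does not state that this GPC is constructed by cascading.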
An early GPC binding method was reported by Parandeh-Afshar et al. [10]. They developed an approach that could pack the logic of adjacent GPCs efficiently along the carry chain. Their main idea is to map GPCs onto slices via the cascaded carry chains. However, their method works only for GPC (7; 3) and (2, 3; 3) on the Xilinx Virtex-5 FPGA in [10], and may still leave unused carry logic in a slice.
To address this issue, we propose a novel approach toward GPC binding, which can be applied to various kinds of GPCs and makes efficient use of carry logic. Different from [10], our main idea is to pack the logic of two GPCs onto a pair of slices, one using the carry chain and one not using it, when the following conditions are met: the logic of the GPC mapped first (GPCA) can be implemented by using fewer than three LUT6s with carry logic, and the GPC packed second (GPCB) has no more than six inputs in the least significant column. Here, the assumed target device is a Xilinx FPGA with four LUT6s in a slice.
In our method, the logic of GPCA and GPCB is rebuilt: the whole logic of GPCA and the logic of GPCB for the top two output bits are mapped to the first slice by using LUT6s with carry logic, whereas the remaining logic of GPCB is mapped to the second slice by utilizing only LUT6s. On the one hand, the remaining LUTs in the second slice are idle and can be utilized for other purposes, and thus the logic cells in these two slices can be fully utilized; on the other hand, the logic of the two GPCs can be executed in parallel, and thus there is no side effect on the critical-path delay. Fig. 11 depicts our proposed GPC binding between (7; 3) and (1, 5; 3). In slice A, LUTA and LUTB with carry logic are utilized to implement the logic of GPC (7; 3), LUTC functions as a routing channel and keeps the two GPCs logically disjoint, and LUTD with the corresponding carry logic is used to implement the logic of GPC (1, 5; 3) for the top two output bits. In slice B, LUTA is used to implement the logic of GPC (1, 5; 3) for the least significant output bit, and the remaining LUTs can be applied to other uses. Cascading and binding between different GPCs have the same goal but different merits: GPC cascading can help to form a more compact FPGA mapping, as it does not need the LUT that isolates the logic of two GPCs and works as a routing channel; GPC binding has a larger application scope in compressor tree construction, because it does not require a specific data dependence between the two GPCs.
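The two binding conditions and the LUT roles described for Fig. 11 can be written down compactly. The sketch below is illustrative only: `can_bind`, `FIG11_LAYOUT`, and `free_luts` are hypothetical names, the layout is a transcription of the prose description rather than of the figure itself, and how the roles change when GPCA uses a single LUT6 is not covered because the text does not describe it.

```python
LUTS_PER_SLICE = 4  # assumed target: a Xilinx slice with four LUT6s


def can_bind(gpc_a_luts, gpc_b_columns):
    """Check the two binding conditions stated in the text.

    gpc_a_luts:    LUT6s (with carry logic) needed by the GPC mapped first.
    gpc_b_columns: input counts of the second GPC, most significant column first.
    """
    return gpc_a_luts < 3 and gpc_b_columns[-1] <= 6


# LUT roles (LUTA to LUTD) for the binding of (7; 3) and (1, 5; 3).
FIG11_LAYOUT = {
    "slice A": [
        "GPC (7; 3)",
        "GPC (7; 3)",
        "isolation / routing",
        "GPC (1, 5; 3) top two outputs",
    ],
    "slice B": ["GPC (1, 5; 3) lowest output", "free", "free", "free"],
}


def free_luts(layout):
    """Count the LUT6s left available for other logic."""
    return sum(role == "free" for roles in layout.values() for role in roles)


print(can_bind(2, (1, 5)), free_luts(FIG11_LAYOUT))
```

The isolation LUT in slice A is the cost that binding pays and cascading avoids, which is the trade-off summarized at the end of the paragraph above.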
D. MAPPING TO INTEL FPGA
Unfortunately, the proposed method cannot be directly applied to Intel FPGAs whose low-level structures are quite different from those of Xilinx FPGAs. In Intel FPGAs, each logic array block (LAB) contains carry chains, shared arithmetic chains, local interconnect, and several adaptive logic modules (ALMs). Each ALM includes two adaptive LUTs (ALUTs) and two registers [21] . By exploring this structure, we found two main differences compared to Xilinx FPGAs, which make it difficult to migrate our proposed method onto Intel FPGAs directly. First, the carry-in sources only from the previous ALM or LAB and cannot be fed by a bypass input as in Xilinx FPGAs, and the carry chains can begin only in the first or the fifth ALM in the LAB. Second, when the ALM is configured to arithmetic mode or shared arithmetic mode in order to use carry chains, each ALUT can accept only four inputs and cannot be used to implement a six-input logic function and a five-input logic function with shared inputs and logic values.
V. PROPOSED COMPRESSOR TREE SYNTHESIS
A. FLOW OF COMPRESSOR TREE SYNTHESIS BASED ON INTEGER LINEAR PROGRAMMING (ILP)
The first step is to create a GPC library that contains various kinds of GPCs. The GPC library is the foundation of the compressor tree synthesis and has a great influence on the performance of the compressor tree; in it, each kind of GPC is denoted by its inputs, outputs, and area cost. In our method, dedicated GPC types are created for the GPCs that have the potential for cascading or binding, because the area of such GPCs strongly depends on how they are implemented. For instance, GPC (6; 3) can be implemented by using two LUT6s with carry logic when it is cascaded to other GPCs, whereas it occupies three LUT6s when it is used for GPC binding, because an additional LUT6 is required to keep the logic of the two GPCs disjoint. The second step is to build an ILP model according to the compressor tree specification and the GPC library generated in step 1. The third step is to solve the ILP model and generate a description file of the GPC network. Finally, from this description file, the HDL specification of the compressor tree is produced and then synthesized by the FPGA design suite. Table 3 lists the variables that are used in our ILP formulation.
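One way to picture such a library is as a table of entries keyed by usage, each carrying its input shape, output width, and LUT6 cost. The sketch below is an assumption about representation, not the authors' data format: the class `GpcType`, the dictionary keys, and the instance counts in the usage line are hypothetical. The two costs for (6; 3) come from the paragraph above, and the two-LUT cost for (7; 3) comes from the Fig. 11 description in Section IV-C.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpcType:
    name: str       # human-readable label for the library entry
    columns: tuple  # input bits per column, most significant column first
    outputs: int    # output bit width
    luts: int       # area cost in LUT6s


LIBRARY = {
    # The same (6; 3) function appears twice because its area depends on usage.
    "gpc63_cascade": GpcType("(6; 3), cascaded", (6,), 3, 2),
    "gpc63_bind": GpcType("(6; 3), bound", (6,), 3, 3),
    "gpc73": GpcType("(7; 3)", (7,), 3, 2),
}


def total_luts(counts):
    """LUT6 cost of a solution; counts maps a library key to an instance count."""
    return sum(LIBRARY[key].luts * n for key, n in counts.items())


# Hypothetical solution: instance counts chosen only to exercise the function.
print(total_luts({"gpc63_cascade": 2, "gpc63_bind": 1, "gpc73": 3}))
```

Keeping separate entries for the cascaded and bound variants lets a cost function of this kind distinguish them without any special cases.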
B. ILP VARIABLES

C. ILP CONSTRAINTS
In this subsection, we introduce the ILP constraints for compressor tree synthesis. C1: C5:
The first constraint (C1) sums up all kinds of the compressors, including GPCs and the row adders used in each column and stage. The second constraint (C2) ensures that all bits in each column and stage, except the output stage, are used as inputs of compressors. The third constraint (C3) is utilized to calculate the output bits generated from compressors, which are the inputs to the following stage. The fourth constraint (C4) is used to support row adders, and ensures that the components of the row adder are connected properly according to the internal carry propagation. The fifth constraint (C5) limits the number of bits in the output stage.
The second to fifth constraints (C2 to C5) follow the same ideas as the ILP formulation in [18], where more detailed information can be found. On the basis of this previous work, we extend the ILP model to support GPC cascading and binding by using constraints C6, C7, and C8.
C6: C7:
C8:
The sixth and seventh constraints (C6 and C7) are designed for cascading across different stages. These constraints are preferred in single-cycle compressor trees because additional registers are required to store the intermediate results in pipelined circuits. More specifically, the sixth constraint (C6) ensures that a GPC can be used only once in GPC cascading; namely, a GPC can either cascade to another one in the subsequent stage (s + 1 in C6), or be cascaded by a GPC from the previous stage (s − 1 in C6). The seventh constraint (C7) makes the existence of a suitable GPC at the next stage a prerequisite for GPC cascading. In the proposed method, the logic of these suitable GPCs can be implemented using a combination of no more than four LUTs and the carry chain, such as GPC (6; 3), (7; 3), and (1, 5; 3). In contrast to GPC cascading, the GPCs used in GPC binding are logically disjoint, and a more relaxed constraint (C8) is introduced for GPC binding. This constraint ensures that a dedicated GPC for resource binding can be utilized only when there is another GPC that has no more than six inputs in the least significant column.
The objective function is used to minimize the hardware resources represented by the equivalent number of six-input LUTs and formulated as equation (12) .
Min : 
D. COMPRESSOR TREE CONSTRUCTION
The construction of the compressor tree can be achieved according to the solution to the proposed ILP model. First, the compressors used are instantiated, and then the interconnections between them are built according to data dependencies. For instance, GPC (7; 3), (5; 3), (1; 1), and (2, 5; 1, 2, 1) as well as the 4:2 row adder are utilized in a seven-bit, seven-input addition, as listed in Table 4, among which GPC (1; 1) represents either a simple wire in a single-cycle compressor tree or a register in a pipelined compressor tree. The compression of the seven-bit, seven-input addition is represented by the dot diagram shown in Fig. 12, where dots denote binary bits and boxes of different shapes stand for various compressors. The inputs of the compressors in the first stage (stage 0) are the original input operands, and those in the following stages are sourced from the outputs produced by the previous stages. For example, the inputs of the 4:2 compressor in stage 1 and column 1 are fed by the results of stage 0, which are generated from GPC (7; 3) in column 0, and GPC (2, 5; 1, 2, 1) and GPC (1; 1) in column 1. In a single-cycle compressor tree, such connections can be implemented by routing. In a pipelined compressor tree, registers are first inserted after each GPC to store its results, and then their outputs are taken as inputs to the next stage. The interconnections of a compressor tree are constructed stage by stage until there are only two bits left in each column. Finally, these remaining bits are summed by a carry-propagate adder.
VI. IMPLEMENTATION AND RESULTS
A. OVERVIEW
Universal GPCs including GPC (3; 2), (6; 3), (5; 3), (1, 5; 3), (2, 3; 3), (1, 4; 3), (1, 4, 1, 5; 5), (1, 3, 2, 5; 5), (2, 5; 1, 2, 1), (6, 1, 5; 5), (6, 2, 3; 5), (1; 1) and the 4:2 row adder are contained in our GPC library. In addition to those components, the proposed improved GPC (7; 3), (1, 3, 5; 4), (6, 0, 7; 5), (1, 4, 0, 7; 5), (1, 3, 4, 3; 5), (2, 1, 3, 5; 5), and (2, 1, 1, 7; 5) are also covered. In our library, the dedicated GPC types that derive from GPC (6; 3), (1, 5; 3), and (7; 3) are mapped first during GPC binding and act as donors in cascading. Meanwhile, GPC (5; 3), (2, 3; 3), and (1, 4; 3) utilize the carry chain only if they are used for cascading or binding. Specifically, GPC (3; 2) functions as a full adder, and GPC (1; 1) directly transfers the input bit to a register or a GPC in the next stage. Following the experimental setting in [18], the design of each benchmark was synthesized and implemented for a Virtex-6 (xc6vlx760-ff1760-2) FPGA using Xilinx ISE 13.4. The circuit area was reported after implementation, which meant these designs could be mapped to a Xilinx FPGA device successfully, and the delays were obtained by inserting registers at the inputs and outputs. Furthermore, we used the maximum compression option (−c 1) during the map process as in [18] to realize a compact FPGA mapping.
B. SINGLE-CYCLE COMPRESSOR TREES
To evaluate the performance of the proposed method, we first selected a series of multi-input additions that have different numbers of input operands as well as input bit widths. We set the number of operands $m$ equal to the input bit width for each benchmark circuit, resulting in a total of $m^2$ input bits, where $m$ varied from 7 to 16. For instance, when $m$ is set to 7, a seven-input addition with a bit width of 7 is used as a test case, which leads to a total of 49 input bits. All of these circuits could be implemented by using only compressor trees that consist of a GPC network and a VMA, where the GPC network was structured using the heuristic or ILP method and a two-input ripple carry adder was chosen as the VMA in these designs.
As GPC binding needs an additional LUT to isolate the logic of two GPCs, we conducted the experiments by integrating the GPC cascading method into the ILP model for single-cycle compressor tree synthesis, and expected to further optimize the design area. We also reproduced the heuristic method [12] and the ILP method [18] for comparison. The synthesis results of the single-cycle compressor trees for the multi-input addition with varying number of input bits ranging from 49 to 256 are given in Fig.13 . Fig. 13 (a) depicts the area measured by the number of slices, and Fig. 13(b) shows the maximum clock frequencies of the single-cycle compressor trees. As shown in Fig. 13(a) , the proposed ILP always produces the smallest compressor trees compared with the heuristic method [12] and the ILP method [18] . The reason is that our ILP method supports GPC cascading that assembles small GPCs into large ones, which can help to make efficient use of four-bit carry chain logic within a slice. Otherwise, the largest compressor tree is always generated by the heuristic method [12] , mainly because the heuristic method [12] only utilizes LUTbased GPCs without any carry logic. Although the use of the carry chain can efficiently reduce the area of the compressor tree, carry propagation delays are introduced at the same time leading to the degradation of timing performance. That is why the compressor tree produced by the heuristic method [12] has the highest clock frequency, as shown in Fig. 13(b) . The average values of slices and the maximum clock frequencies for different methods as well as slice reductions and the improvements of the maximum clock frequencies over the heuristic method [12] are listed in Table 5 . Taken together, the ILP [18] can achieve an average slice reduction of 35.96% compared with the heuristic method [12] , but is 15.10% slower. 
The proposed ILP method goes further, reducing the average slice count by 42.40% at the cost of an average 13.16% decrease in maximum clock frequency, compared with the heuristic method [12]. Moreover, a detailed comparison of the two ILP methods is given in Table 6. Both ILP models were solved by ILOG CPLEX on an Intel Core i5-4670, 3.40 GHz machine with 8.00 GB of RAM. We set the execution time limit for the ILP solver to 600 seconds, and a feasible solution could always be found within this time frame. When the optimal solution was obtained, the run time is reported; otherwise, a "−" indicates a timeout. Because new constraints are added, the proposed ILP model takes more time to find optimal solutions in most cases, but within the fixed time it always obtains solutions with equal or lower slice cost than the ILP [18], as shown in Table 6.
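The relative figures quoted above can be read as percentage changes with respect to the heuristic baseline. The following is a minimal sketch of that arithmetic; it assumes the percentages are computed from the averaged slice counts and averaged maximum clock frequencies, which the text does not state explicitly. The function names and the numbers in the usage block are hypothetical and are not taken from Table 5.

```python
def percent_change(baseline, value):
    """Signed change of value relative to baseline, in percent."""
    return (value - baseline) * 100.0 / baseline


def slice_reduction(baseline_slices, slices):
    """Positive when the design uses fewer slices than the baseline."""
    return -percent_change(baseline_slices, slices)


if __name__ == "__main__":
    # Hypothetical averages, for illustration only.
    heuristic_slices, ilp_slices = 200.0, 120.0
    heuristic_fmax, ilp_fmax = 250.0, 200.0  # MHz
    print(f"slice reduction: {slice_reduction(heuristic_slices, ilp_slices):.2f}%")
    print(f"Fmax change:     {percent_change(heuristic_fmax, ilp_fmax):.2f}%")
```

Under this convention, a negative maximum-frequency change corresponds to a slower design, which is how the "slower" percentages in the text are interpreted here.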
TABLE 7. Comparison of average values of slices and maximum clock frequency (Fmax) as well as the slice reduction and the improvement of the maximum clock frequency (compared to Heuristic [12]) for pipelined multi-input addition.
C. PIPELINED COMPRESSOR TREES
The same multi-input additions were taken as benchmarks for pipelined compressor trees. In these designs, a register was placed after each GPC output. Because extra registers are required to balance the pipeline when one GPC cascades into another in a different stage, we performed these experiments by integrating only the GPC binding method into the proposed ILP model for generating the pipelined compressor trees. Each benchmark was implemented using three different approaches: the proposed ILP, the ILP [18], and the heuristic method [12]. The synthesis results of the pipelined compressor trees for multi-input additions with 49 to 256 input bits are given in Fig. 14. Fig. 14(a) and Fig. 14(b) depict the number of slices and the maximum clock frequencies of the pipelined compressor trees, respectively, which show a trend similar to the corresponding curves of the single-cycle compressor trees. Fig. 14 shows that the fastest compressor trees are always produced by the heuristic method [12], while the smallest compressor trees are generated by the proposed ILP algorithm in nine out of ten test cases.
The average numbers of slices and the average maximum clock frequencies for the different methods, as well as the slice reductions and the changes in maximum clock frequency relative to the heuristic method [12], are listed in Table 7. As Table 7 shows, the ILP [18] decreases the average slice count by 28.65% compared with the heuristic method [12], but is 17.61% slower. The proposed ILP method goes further, reducing the average slice count by 33.43% at the cost of an average 14.35% decrease in maximum clock frequency, compared with the heuristic method [12]. A detailed comparison of the two ILP methods is also given in Table 8. From Table 8, it can be observed that the proposed ILP method finds solutions with equal or lower slice cost than the ILP [18] in all cases except the one with 64 input bits.
D. FAST MULTIPLIERS
Fast multipliers were used to evaluate the performance of the proposed compressor trees more comprehensively. In our experiments, we set the bit width n of the multiplier identical to that of the multiplicand, where n varied from 10 to 24. The proposed compressor trees and two-input adder trees were each used to sum the same partial products, which were generated using the Booth Algorithm. The implementation results for the fast multipliers are listed in Table 9, where the multipliers built using a combination of the Booth Algorithm and the proposed compressor trees are denoted by "Booth + prop.ILP," and those constructed using a combination of the Booth Algorithm and two-input adder trees are denoted by "Booth + Adder Tree." From Table 9, it can be observed that the multipliers using the proposed compressor trees reduce the average slice count by 27.86%, with a tiny increase in average maximum clock frequency (0.14%), compared with those using adder trees.
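To illustrate what the compressor tree or adder tree has to sum in such a multiplier, the sketch below generates Booth partial products in software. It is not the authors' hardware implementation: the text says only "Booth Algorithm", so the radix-4 recoding and the signed (two's-complement) operands used here are assumptions, and the function names are hypothetical.

```python
def booth_radix4_digits(y, n):
    """Recode an n-bit two's-complement multiplier y into radix-4 Booth digits.

    Each digit lies in {-2, -1, 0, 1, 2}; digit i has weight 4**i.
    """
    if not -(1 << (n - 1)) <= y < (1 << (n - 1)):
        raise ValueError("y does not fit in n signed bits")
    width = n + (n % 2)  # sign-extend odd widths to an even number of bits

    def bit(k):
        # Python's >> on negative ints is arithmetic, so bits above n - 1
        # read back as copies of the sign bit.
        return 0 if k < 0 else (y >> k) & 1

    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1)
            for i in range(width // 2)]


def booth_partial_products(x, y, n):
    """Return the shifted partial products whose sum equals x * y."""
    return [(d * x) << (2 * i)
            for i, d in enumerate(booth_radix4_digits(y, n))]


if __name__ == "__main__":
    x, y, n = 37, -45, 10
    rows = booth_partial_products(x, y, n)
    print(len(rows), "partial products, sum =", sum(rows), "expected", x * y)
```

Under these assumptions, an n-bit multiplier yields about n/2 partial products (5 for n = 10, 12 for n = 24), and it is this set of rows that is reduced either by a compressor tree or by a tree of two-input adders.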
VII. CONCLUSION
In this paper, we proposed improved GPC mappings by efficiently utilizing the carry input of the carry chain as well as each LUT for Xilinx FPGAs. We also developed new approaches for GPC cascading and binding that can help reduce the design area and pack GPCs more densely. With these improvements, we proposed a novel area-optimized ILP method that supports GPC cascading and binding for the synthesis of single-cycle compressor trees and pipelined compressor trees, respectively.
To evaluate the performance of the proposed ILP method, the compressor trees it generated were synthesized onto Xilinx FPGAs, and the implementation results were compared with those of the heuristic method [12] and the ILP method [18]. The proposed ILP method significantly reduces the area overhead compared with the heuristic method [12] at the cost of reduced maximum clock frequency. On average, the single-cycle and pipelined compressor trees generated by the proposed ILP are 42.40% and 33.43% smaller than those obtained by the heuristic method [12], but are 13.16% and 14.35% slower, respectively. Moreover, within a specified time, the proposed ILP method finds solutions with equal or lower slice cost than the ILP [18] in all cases except one. In addition, the fast multipliers using the proposed compressor trees show a substantial average slice reduction of 27.86% compared with those using adder trees, which further demonstrates the advantage of the proposed compressor trees in implementing large-scale designs with other circuits.
