Abstract: Integer addition is a pervasive operation in FPGA designs. The need for fast wide adders grows with the demand for large precisions as, for example, required for the implementation of IEEE-754 quadruple precision and elliptic-curve cryptography. The FPGA realization of fast and compact binary adders relies on hardware carry chains. These provide a natural implementation environment for the ripple-carry addition (RCA) scheme. As its latency grows linearly with operand width, wide additions call for acceleration, which can be achieved by addition schemes built from parallel RCA blocks. This study presents FPGA-specific arithmetic optimizations for the mapping of carry-select and carry-increment adders targeting the hardware carry chains of modern FPGAs. Different trade-offs between latency and area are explored. The proposed architectures can be used in latency-critical systems or as attractive alternatives to deeply pipelined RCA schemes.
I. INTRODUCTION
One of the most prevalent operations in digital arithmetic is the addition. It is part of virtually all implementations of more complex operators including the rather fundamental multiplication, the computation of scalar products or the calculation of vector magnitudes. It is present in unrolled formulations as well as in many iterative computation approaches.
FloPoCo [1] is a tool capable of generating VHDL code for a wide variety of arithmetic operators. It provides a vast library of built-in operators, which may be used by themselves or combined to form complex custom data flows. The generator is able to tune the constructed implementation to a desired target operating frequency and draws on a large pool of knowledge to optimize the pipeline depth and the implementation size according to the user specification.
This paper describes a new implementation option for wide binary adders as implemented in FloPoCo. This implementation builds on the carry-select addition approach to accelerate the addition in comparison to the ripple-carry implementation, which is standard on FPGA devices. However, it features quite a few measures that optimize the mapping of the carry-select addition onto contemporary FPGA devices. These include: (a) an optimized computation of the inter-block carries, (b) the use of shorter comparators to compute the speculative block carries when the associated sum is not needed, and (c) the elimination of the high-fanout signal controlling the multiplexer for the final result selection.
After the description of the envisioned architectures, the generator strategies for frequency-optimized block splitting will be detailed. The resulting complexities in terms of LUT counts and the achievable timings will be derived and verified experimentally. The proposed architectures are also compared against pipelined ripple-carry adder (RCA) schemes in terms of LUT-count complexity. Pipelining options are discussed for the case where high frequencies, unreachable by the combinatorial versions, are required. The final implementation of the generator will be integrated in the open-source framework offered by FloPoCo.
A. Background
1) FPGA: A simplified view of the logic blocks present in Xilinx Virtex4 [2] and Virtex5 [3] FPGAs is presented in Figure 1. Among other components, it contains:
• a function generator (look-up table, LUT):
On Virtex4 the LUT has a capacity of 16 bits, being able to implement any 4-input logic function. The Virtex5 LUT with a capacity of 64 bits may either implement an arbitrary 6-input function output on O6, or it may be sliced into two 5-input functions sharing the same set of inputs. Then both outputs O6 and O5 are used.
• the fast carry-chain logic:
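FPGAs typically implement binary word addition as an RCA in which one logic block assumes the operation of one full adder. The carries between these full adders are forwarded across the dedicated carry chains. The mapping of the full adder onto the logic blocks is performed as follows:
$$s_i = p_i \oplus c_i \qquad (1)$$
$$c_{i+1} = \overline{p_i}\,a_i + p_i\,c_i$$
where $p_i = a_i \oplus b_i$ is the propagate signal computed by the LUT from the operand bits $a_i$ and $b_i$.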
The general routing between logic blocks (inputs on the left and outputs on the right of Figure 1) is about 3 times slower than the LUT delay. The carry-propagation chain (running vertically from $c_{in}$ to $c_{out}$ in Figure 1) is typically 10-15 times faster than the general routing. Therefore, it is desirable to map computations to this carry chain whenever possible.
Fig. 2. Classic Carry-Select Architecture
2) Classic Carry-Select Adder: The classic carry-select adder [4] block consists of two RCAs and one multiplexer. Each pair of adders computes the two possible block results, one speculating on a carry-in of 0 and one on a carry-in of 1. The carry-in then feeds the select line of the multiplexer to choose the correct sub-sum and carry-out bit.
Large additions can be split into k carry-select adder blocks, whose speculative sub-sums are all computed in parallel. Figure 2 presents the architecture of such an addition that is split into multiple carry-select blocks. For clarity, the block carry-out multiplexers have been separated from the block result multiplexers.
The multiplexer network is generally fast. However, if greater performance is needed, a costly but faster carry lookahead structure can be used for carry-bit computation.
Unfortunately, both the multiplexer network and carry-lookahead adders map poorly onto FPGAs. This is because, in FPGAs, the routing delay between logic elements exceeds the delay of a logic element by a factor of 3 to 4. Despite this major drawback, this naive mapping outperforms the highly FPGA-optimized RCA in latency for large operands.
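The classic scheme of this subsection can be sketched as a small bit-level model. This is only an illustration: the function names and the `block_sizes` parameter are hypothetical, and the full-adder cell assumes the usual propagate/multiplexer form of a carry-chain stage, with the sum bit following Eq. 1.

```python
def full_adder(a, b, c):
    """One carry-chain stage: the LUT computes p, the chain does the rest."""
    p = a ^ b                  # propagate signal (LUT output)
    s = p ^ c                  # sum bit, as in Eq. 1
    c_out = c if p else a      # carry multiplexer on the chain (assumed form)
    return s, c_out


def rca(x, y, width, c_in=0):
    """Ripple-carry addition of two width-bit operands: (sum, carry_out)."""
    s, c = 0, c_in
    for i in range(width):
        bit, c = full_adder((x >> i) & 1, (y >> i) & 1, c)
        s |= bit << i
    return s, c


def carry_select_add(x, y, block_sizes, c_in=0):
    """Classic carry-select addition; block_sizes lists widths from LSB to MSB."""
    result, shift, carry = 0, 0, c_in
    for l in block_sizes:
        mask = (1 << l) - 1
        xb, yb = (x >> shift) & mask, (y >> shift) & mask
        s0, c0 = rca(xb, yb, l, 0)   # speculate on a carry-in of 0
        s1, c1 = rca(xb, yb, l, 1)   # speculate on a carry-in of 1
        # The incoming block carry drives the sum and carry-out multiplexers.
        s, carry = (s1, c1) if carry else (s0, c0)
        result |= s << shift
        shift += l
    return result, carry
```

In this model the loop over blocks plays the role of the multiplexer network: each iteration adds one multiplexer stage to the inter-block carry path, which is the part that maps poorly onto general FPGA routing.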
B. Related Work
An initial study evaluating the performance of fast addition schemes on FPGAs is presented in [5] . The study concludes that among the numerous fast addition schemes, the only ones mapping relatively well to the FPGA structure are carry-skip and the carry-select, the latter providing the best performances. The optimizations applied to the classical carry-select architectures are structural, speculative carry-bit computations being addressed by carry-skip structures. The carry-in computation for each carry-select block is done using the classical multiplexer network, which is slow in FPGAs.
A discussion on the synthesis of carry-select adders in modern FPGAs is presented in [6]. The study proposes bitwise computation of the speculative sums using XOR gates and inverters. The impact of these optimizations on modern FPGAs is small, if any, as presented in Section IV.
Another variation of the carry-select architecture is presented in [7]. It is based on the idea of time-multiplexing the same adder resource for computing the two speculative sums and carry bits. The design manages to reduce the area at the expense of latency. However, the implementation requires low-level directives for mapping the circuit to hardware, thus lacking portability. The largest adders addressed by the authors are only 32 bits wide, which makes a comparison with the wide adders considered here impossible.
A better mapping of the carry-select architecture to the FPGA logic was introduced by some of us in [8] . There, the k-level multiplexer network is mapped to a 2k-bit RCA, significantly improving the adder timings. Unfortunately, for a given width, the 2k carry-propagation size restricts the unpipelined version of these adders to lower frequencies. For a fixed frequency, 2k carry-propagation restricts the maximum adder width.
The current study presents a novel mapping of the multiplexer network to the carry chain based on the work of Preußer et al. [9] on mapping general prefix computations to the carry chain. It presents several improvements over [8]: the multiplexer network is mapped to one k-bit RCA and a carry-recovery (CR) circuit, which most of the time can be fused with other computations on modern FPGAs. In addition, this study provides structural improvements of the carry-select scheme that exploit an FPGA-specific feature: faster and smaller comparator structures are used instead of adders for the speculative carry-bit computations.
II. FPGA-SPECIFIC MAPPING OF CARRY-SELECT ADDERS
A. Acceleration of Inter-Block Carries
The inter-block carries of the carry-select adder take a shortcut through the multiplexer network skipping a complete block with a single multiplexer stage. This advantage is mostly given away if the multiplexers are implemented using standard LUTs connected through the general-purpose routing network. To compete with the fast carry propagation within a block, the inter-block carry propagation must also exploit the available carry-chain structures. This will be achieved using the technique described by Preußer and Spallek [9] .
As shown in Table I , the different cases of the propagation of the inter-block carries can be easily distinguished by the values of the speculative block carry outputs. As c 
. . . tapping of the carry signals is, indeed, possible on the Virtex architectures, such a solution is not portable and would require the use of device-specific, low-level component primitives. A better alternative, which is also portable, is offered through Equation 1. This equation allows inferring the incoming carry from the other two full-adder inputs and the obtained sum bit $s_k$ so that a standard addition operator suffices to realize the core carry-chain implementation:
and hence (see also Table I ):
The carry computation circuit with the resulting recovery of the carries from the sum bits is depicted in Figure 3 . Note that the recovery computation can often be merged into the further processing of the recovered carry signal.
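The carry computation and recovery can be illustrated with a short model. It is a sketch under assumptions, not the circuit of Figure 3: the two speculative carry vectors are added with a plain integer addition, and each carry is recovered from the sum bit by solving Eq. 1 for the carry input. The function names are hypothetical, and the check relies on the property that a block producing a carry for carry-in 0 also produces one for carry-in 1.

```python
from itertools import product


def mux_chain(c0, c1, c_in=0):
    """Reference model: classic multiplexer chain, one stage per block."""
    carries = [c_in]
    for zero_case, one_case in zip(c0, c1):
        carries.append(one_case if carries[-1] else zero_case)
    return carries


def ccc_with_recovery(c0, c1, c_in=0):
    """c0[i], c1[i]: speculative carry-outs of block i for carry-in 0 and 1.

    Returns the carries entering blocks 0..k-1 followed by the final carry-out.
    """
    k = len(c0)
    a = sum(bit << i for i, bit in enumerate(c0))
    b = sum(bit << i for i, bit in enumerate(c1))
    total = a + b + c_in                      # standard addition operator
    carries = []
    for i in range(k + 1):
        s_i = (total >> i) & 1
        a_i = c0[i] if i < k else 0
        b_i = c1[i] if i < k else 0
        carries.append(s_i ^ a_i ^ b_i)       # Eq. 1 solved for the carry
    return carries


def check(k=4):
    """Exhaustively compare both models over all valid speculative carries."""
    valid = [(0, 0), (0, 1), (1, 1)]          # c0 = 1 implies c1 = 1
    for pairs in product(valid, repeat=k):
        c0 = [p[0] for p in pairs]
        c1 = [p[1] for p in pairs]
        for c_in in (0, 1):
            if mux_chain(c0, c1, c_in) != ccc_with_recovery(c0, c1, c_in):
                return False
    return True
```

The three valid input pairs correspond to the kill, propagate and generate cases of a full adder, which is why the carry chain of an ordinary adder reproduces the multiplexer chain.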
B. The AAM Carry-Select Architecture
The Add-Add-Multiplex (AAM) architecture derives directly from the classic carry-select architecture. The multiplexer chain computing the carry bits is replaced with the much faster carry-computation circuit (CCC) and carry-recovery (CR) circuit. Figure 4 highlights the three stages of the AAM carry-select architecture:
1) For each block, two sums are computed, one for each possible value of the block carry-in. Both of these additions are extended to compute the block carry-out.
2) The two bit vectors formed by the block carries speculating on a carry-in of 0 and 1 are added in the CCC using a fast short ripple-carry adder. The output sum bits and their two respective speculative input carries are fed to the CR circuit, which recovers the proper block carry outputs.
3) The computed block carries are used to select the proper speculative block sum for the adder output.
The AAM architecture uses a multiplexer to select among the two block sums. The multiplexer is a 3-input function, the two sum-bits and the carry-bit generated by the CR. For FPGAs with 5-input LUTs, the CR can be merged with the multiplexing. This is the case for modern FPGAs like Virtex5 and Virtex6 having 6-input LUTs. Having only 4-input LUTs available such as on Virtex4 devices, the CR introduces an extra LUT and a supplementary wire delay. On these architectures, adders with a low block count and, thus, a short CCC should prefer the carry-add-cell architecture described by de Dinechin et al. [8]. It uses extra intermediate propagating stages (p = 1), which provide direct access to the inverted propagated carry through Equation 1. As soon as the combined delay of these extra stages exceeds the delay of a CR, the AAM will become the superior choice also on these architectures.
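The three AAM stages can be sketched as a behavioral model. This is a simplification under stated assumptions: every block, including block 0, takes part in the carry computation, whereas the actual architecture may treat the first and last blocks specially. The name `aam_add` and its parameters are hypothetical.

```python
def aam_add(x, y, block_sizes, c_in=0):
    """Behavioral sketch of Add-Add-Multiplex; block_sizes runs LSB to MSB."""
    k = len(block_sizes)

    # Stage 1: two speculative sums per block, each extended by its carry-out.
    s0, s1, c0, c1 = [], [], [], []
    shift = 0
    for l in block_sizes:
        mask = (1 << l) - 1
        xb, yb = (x >> shift) & mask, (y >> shift) & mask
        t0, t1 = xb + yb, xb + yb + 1
        s0.append(t0 & mask)
        c0.append(t0 >> l)
        s1.append(t1 & mask)
        c1.append(t1 >> l)
        shift += l

    # Stage 2: CCC adds the two speculative carry vectors, CR recovers carries.
    vec0 = sum(bit << i for i, bit in enumerate(c0))
    vec1 = sum(bit << i for i, bit in enumerate(c1))
    total = vec0 + vec1 + c_in
    carries = [((total >> i) & 1) ^ c0[i] ^ c1[i] for i in range(k)]
    c_out = (total >> k) & 1

    # Stage 3: the recovered carries select the proper speculative block sum.
    result, shift = 0, 0
    for i, l in enumerate(block_sizes):
        result |= (s1[i] if carries[i] else s0[i]) << shift
        shift += l
    return result, c_out
```

The recovery XOR and the final selection depend on five signals per output bit (two sum bits, the CCC sum bit and the two speculative carries), which is consistent with merging both into one LUT on devices with at least 5-input LUTs.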
C. The CAI Carry-Increment Architecture
The Compare-Add-Increment (CAI) architecture adopts some features from the carry-increment adder, a widely adopted structural simplification of the carry-select scheme. In particular, the CAI only uses the block sums produced for the case of no incoming block carry. The final multiplexer stage is replaced by another adder, which adds the actual incoming carry and, thus, corrects the produced sum if necessary. Note that the choice of this incrementer instead of a multiplexer does not increase the number of occupied LUTs.
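As the CAI does not need the sum speculating on an incoming block carry, the corresponding adder only serves the purpose of computing the associated carry-out of the speculative block sum $X_k + Y_k + 1$. This can, however, be obtained by the simple comparison:
$$c^1_k = \left(X_k + Y_k + 1 \ge 2^{l_k}\right) = \left(X_k \ge \overline{Y_k}\right) \qquad (6)$$
where $\overline{Y_k}$ denotes the bitwise complement of the $l_k$-bit block operand $Y_k$.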
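That a comparison can replace the speculative addition is easy to check exhaustively for small block widths. The sketch below assumes the comparison takes the form "X is at least the bitwise complement of Y" for the carry-in-1 case and the strict variant for the carry-in-0 case; both forms follow from the definition of the carry-out, and the function names are hypothetical.

```python
def carry_by_addition(x, y, width, c_in):
    """Carry-out of the width-bit addition x + y + c_in."""
    return (x + y + c_in) >> width


def carry_by_comparison(x, y, width, c_in):
    """Same carry-out obtained from a comparison against the complement of y."""
    not_y = ~y & ((1 << width) - 1)       # bitwise complement within the block
    return int(x >= not_y) if c_in else int(x > not_y)


def equivalent(width):
    """Exhaustively compare both computations for one block width."""
    for x in range(1 << width):
        for y in range(1 << width):
            for c_in in (0, 1):
                if carry_by_addition(x, y, width, c_in) != carry_by_comparison(x, y, width, c_in):
                    return False
    return True
```

The equivalence holds because the complement of an l-bit value Y equals 2^l - 1 - Y, so X + Y + 1 reaches 2^l exactly when X is at least the complement of Y.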
All in all, the CAI offers the following improvements:
1) The use of a comparator for computing $c^1_k$ is, at most, as complex as the replaced addition. On Virtex5 and Virtex6 devices, the number of required LUTs is even cut in half as every stage on the carry chain processes two adjacent input positions rather than just one. This is possible as the sum bits are not needed.
2) The number of registers required in a pipelined implementation is almost cut in half as only one of the two speculative block sums must be stored. 
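A behavioral sketch of the CAI scheme makes the difference to a multiplexer-based selection visible. As before, this is an illustration under assumptions rather than the generated design: all blocks take part in the carry computation, the carry for an incoming carry of 1 is obtained by comparing against the complemented operand, and the name `cai_add` is hypothetical.

```python
def cai_add(x, y, block_sizes, c_in=0):
    """Behavioral sketch of Compare-Add-Increment; block_sizes runs LSB to MSB."""
    k = len(block_sizes)

    # Stage 1: one block sum (carry-in 0) and a comparator for the other carry.
    s0, c0, c1 = [], [], []
    shift = 0
    for l in block_sizes:
        mask = (1 << l) - 1
        xb, yb = (x >> shift) & mask, (y >> shift) & mask
        t0 = xb + yb
        s0.append(t0 & mask)
        c0.append(t0 >> l)
        c1.append(int(xb >= (~yb & mask)))    # carry-out of xb + yb + 1
        shift += l

    # Stage 2: carry computation and recovery, as for the AAM.
    vec0 = sum(bit << i for i, bit in enumerate(c0))
    vec1 = sum(bit << i for i, bit in enumerate(c1))
    total = vec0 + vec1 + c_in
    c_out = (total >> k) & 1

    # Stage 3: an incrementer adds the recovered carry to the block sum.
    result, shift = 0, 0
    for i, l in enumerate(block_sizes):
        carry = ((total >> i) & 1) ^ c0[i] ^ c1[i]
        result |= ((s0[i] + carry) & ((1 << l) - 1)) << shift
        shift += l
    return result, c_out
```

Only one speculative sum per block is kept between the first and the last stage, which is the source of the register savings in a pipelined implementation.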
The final step is turned from an incrementer into a complete adder computing
The greatest benefit of this implementation is achieved on FPGAs with 5-input LUTs. Not only can the CR be merged into the LSB computation of the final addition but the whole critical path is shortened as the computation of both speculative block carries is only half as wide as a true adder. The architecture is outlined in Figure 6 .
III. FREQUENCY-DIRECTED ADDER DESIGN
Let L denote the adder width and let f denote the target frequency. Our objective is to find a length-k vector of block sizes $(l_{k-1}, \ldots, l_0)$ with $L = \sum_{i=0}^{k-1} l_i$ such that the delay on all adder outputs is less than $T = 1/f$.
For our architectures, the components and their delays are:
• ripple-carry adder; the worst-case delay is that of the sum's MSB:
$$\delta_{RCA(j+1)} = \delta_{s_j} = \delta_{LUT} + j\,\delta_{carry} + \delta_{XOR} \qquad (8)$$
sum bits are produced in sequence, as the carry bit takes time to propagate:
therefore, for sum bit j, inputs may arrive as late as:
• comparator; the output delay is either $\delta_{cmp(j)} = \delta_{RCA(j)}$ or, for Virtex5/6:
• multiplexer; usually implemented in LUTs: $\delta_{MUX} = \delta_{LUT}$
• wires; the delay is denoted by $\delta_w$
The data dependences between stages of the proposed architectures together with the FPGA-specific component timings yield different block-splitting strategies for maximizing adder size for a frequency f.
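The component delay model can be captured in a few lines. The sketch below implements Eq. 8 directly; the input-slack function is an assumption derived from Eq. 8 (each position further up the chain gains one carry delay), and all numeric delay values are illustrative placeholders rather than datasheet figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delays:
    """Component delays in ns. All default values are illustrative placeholders."""
    lut: float = 0.10
    carry: float = 0.025
    xor: float = 0.10
    wire: float = 0.30


def sum_bit_delay(j, d):
    """Delay of sum bit j of an RCA (Eq. 8)."""
    return d.lut + j * d.carry + d.xor


def rca_delay(width, d):
    """Worst-case delay of a width-bit RCA: its MSB is sum bit width - 1."""
    return sum_bit_delay(width - 1, d)


def latest_input_arrival(j, d):
    """Assumed slack: inputs of position j may lag those of position 0 by
    j carry delays, since the incoming carry is not available earlier."""
    return j * d.carry


def max_rca_width(period, d):
    """Largest RCA width whose MSB settles within the period (wires ignored)."""
    width = 0
    while rca_delay(width + 1, d) <= period:
        width += 1
    return width
```

With the placeholder values, `max_rca_width(1.01, Delays())` evaluates the widest single RCA that fits a 1.01 ns budget; a real generator would use the target device's timing data instead.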
A. Block-splitting strategies
1) The AAM Carry-Select Architecture: The constraints given by the timing model of this architecture allow us to determine the optimal block sizes. A visual indication of a tight computation scheduling which optimizes the AAM block sizes is given in Figure 7(a). The length of the segments is proportional to the computation delay of the components (adders and multiplexers for AAM). The length of the RCA delays (first stage) is proportional to the block size.
Considering the timing and architectural constraints, the CCC is a $(k-2)$-bit RCA having the delay of the MSB sum bit $\delta_{s_{k-2}}$ (Eq. 8). The MSB sum bit drives the select line of the $(k-1)$-th block multiplexer (Figure 4), which has a delay $\delta_{MUX}$. On the other hand, as the CCC is implemented as an RCA, it allows the inputs to be delayed at most as specified by Eq. 10. As the speculative carries ($c^1_i$ and $c^0_i$) are also computed using RCAs, this allows the size of successive blocks to increase by exactly one bit.
We therefore choose to fix the size of the 2nd block, $l_1 = 1$ bit. For a given frequency f, this sets the maximum value of k as:
As the size of successive blocks increases by exactly one bit, $l_i = i$ for $1 \le i \le k-2$.
The block sizes $l_{k-1}$ and $l_0$ are the solutions of the equations:
The maximal addition size for frequency f is $l_0 + (k-2)(k-1)/2 + l_{k-1}$.
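The resulting block vector is simple to enumerate once k and the two boundary blocks are known. In the sketch below, the sizes of the first and last block are taken as plain inputs because they result from timing equations that are not evaluated here; the function names are hypothetical.

```python
def aam_block_sizes(k, l_first, l_last):
    """Block sizes (l_0, ..., l_{k-1}): l_1 = 1 and each inner block grows by one bit."""
    if k < 3:
        raise ValueError("the scheme needs at least three blocks")
    inner = list(range(1, k - 1))          # l_1 .. l_{k-2} = 1 .. k-2
    return [l_first] + inner + [l_last]


def aam_max_width(k, l_first, l_last):
    """Closed form of the total width: l_0 + (k-2)(k-1)/2 + l_{k-1}."""
    return l_first + (k - 2) * (k - 1) // 2 + l_last
```

For example, `aam_block_sizes(6, 8, 4)` yields `[8, 1, 2, 3, 4, 4]`, whose sum of 22 bits matches `aam_max_width(6, 8, 4)`.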
2) The CAI Carry-Increment Architecture: The CAI architecture computes the speculative $c^1_i$ bit using Eq. 6. On Virtex5 devices this comparison takes half the resources needed to obtain $c^1_i$ using an RCA. The latency improvement is roughly equal to $(j/2)\,\delta_{carry}$ but is lost by using an RCA for $c^0_i$. The successive sum bits of the CCC are available at $\delta_{carry}$ increments (Eq. 8). These are used as carry-in bits for the final-stage adder. Enforcing that the result block outputs are synchronized (Figure 7(b)) leads to successive block sizes decreasing by one bit.
We choose to fix the size of the $(k-1)$-th block, $l_{k-1} = 1$ bit, which leads to $l_2 = k-2$. Moreover, the difference in input delay between the speculative carry bits of $l_2$ and of $l_1$ for the CCC is $\delta_{carry}$. This leads to
Given the constraint that the carry-out of block 0 is the carry-in of CCC, the size of this block is the solution of the equation:
The maximal adder size for this architecture for frequency
3) The CCA Carry-Select Architecture: The CCA architecture uses comparators for computing the two speculative carries, $c^0_i$ and $c^1_i$ (Eqs. 7 and 6). Compared to the CAI architecture, the latency of the first stage is reduced.
However, the block-splitting strategy remains the same. The block size $l_1$ is the solution of:
The number of blocks k is now the solution of the equation:
The size of $l_0$ is the solution of:
B. Area complexity of the designs
Once the block-splitting procedure is finished, we can closely approximate the area of the circuit on the FPGA. The value is further used in the FloPoCo addition generator to choose among the proposed architectures and the pipelined architectures presented in [8]. Table II presents LUT-count formulas for the proposed architectures for Virtex5/6 devices (similar formulas can be derived for Virtex4). The formulas are deduced based on the resources occupied by the basic blocks:
• 2:1 n-bit multiplexer: n LUTs.
• n-bit RCA: n LUTs.
• n-bit comparator: n/2 LUTs on Virtex5/6 and n LUTs on Virtex4.
1) Comparison with pipelined-RCA schemes: The immediate advantage of the proposed addition architectures over pipelined RCA architectures is the reduced latency. To quantify the area traded for this advantage, we compared the area of our architectures to that of pipelined RCA architectures [8]. Table II summarizes resource estimation formulas for Virtex5 FPGAs. Note that L remains constant while k and $(l_0, \ldots, l_{k-1})$ are architecture specific (Fig. 7 offers an indication of their magnitudes). The proposed addition architectures are attractive alternatives to the pipelined RCA schemes, especially when these require more than two pipeline levels, for which the CCA occupies roughly the same amount of resources for a 1-cycle latency. For a larger number of pipeline levels, the proposed architectures outperform the pipelined RCA scheme, provided that they can match the frequency.
2) Pipelining options: The proposed architectures have three stages, and therefore their critical path contains two wire delays. Due to these implicit delays, the architectures are unable to target very high frequencies. For such situations, they can be pipelined (one or two stages usually suffice in practice).
For the AAM and CAI architectures, the pipeline register level can be inserted after the first adder stage. The registers are combined with the LUTs (Figure 1), so they come for free. The exception is the inputs of the last CAI block, for which a better solution is to perform the final block addition for no incoming carry and register only the adder's output. For CCA, a good pipelining would regroup the first two levels in order to balance the size of the adders at the last level. In any case, pipelining this architecture requires buffering the inputs, which is expensive ($2L - l_0$ LUT/flip-flop pairs). Pipelined implementations should be explored if no combinatorial version can reach the required frequency. Among these architectures, CAI requires the fewest resources to pipeline.
IV. REALITY-CHECK
The presented architectures have been integrated in the FloPoCo core generator next to the pipelined schemes. For a given context, consisting of the adder width L, the target FPGA (most Xilinx and Altera FPGAs are supported) and the target frequency f, the cost of all possible adder architectures is evaluated and the best-suited one is generated as a portable, human-readable VHDL file.
The largest theoretical adder width for the unpipelined implementations for a given f is plotted in Figure 8 for Virtex5 FPGAs. The 200-300 MHz frequency range is a reasonable upper limit for chip-filling designs.
Table III compares our proposed architectures against a pipelined RCA implementation for addition sizes ranging from 128 to 512 bits, targeting f = 250 MHz.
The results show two points: 1) the routing delays penalize all architectures, including the RCA; 2) for sufficiently large L, the proposed architectures take fewer resources and reduce the cycle count.
Table IV presents a comparison of the AAM architecture against [6], the Altera lpm_add_sub megafunction [10] and an optimized pipelining of the RCA scheme presented in [8]. Compared to [6], our critical path delay is much shorter. When compared to the pipelined approaches, the AAM has a 1-cycle latency for a competitive area. When compared to the Kogge-Stone adder [11] with sparsity 32 for carry-bit generation (the delay is estimated: L = 128 → 7 levels of logic with delay $7(\delta_{LUT} + \delta_w) + \delta_{RCA(32)}$), our solution is clearly superior in the FPGA context due to the large values of $\delta_w$.
V. CONCLUSION
This paper presents three efficient mappings of carry-select/increment adders onto modern FPGAs. Their integration in the open-source FloPoCo framework widens the user's design space and offers a needed latency trade-off for wide additions, such as those needed to implement quadruple-precision floating-point. When integrating such low-latency operators in coarser data paths, a natural question arises: could the use of such operators, with an area larger than that of their pipelined counterparts, actually reduce the overall data-path area due to fewer synchronizations? Generally, this is a global optimization problem which we have yet to solve.
