Driven by the attractive properties of FPGA's and the need for high-performance, flexible computing machines, interest in FPGA-based computing machines has increased dramatically. Fixed-point adders are essential building blocks of any computing system. In this work, various high-speed addition algorithms are implemented in FPGA devices, and their performance is evaluated with the objective of finding and developing the addition algorithms most appropriate for implementation in FPGA's, and of laying the groundwork for evaluating and constructing FPGA-based computing machines. The results demonstrate that the performance of adders built with the FPGA's dedicated carry logic, combined with some other addition algorithms, is greatly improved, especially for larger adders.
INTRODUCTION
Recent studies [1-9] have demonstrated that reconfigurable computing systems can feasibly improve the performance of a system by modifying its hardware or architecture through software, in real time, to match the computational characteristics of the individual application. As the densities and speeds of SRAM-based FPGA's (Field Programmable Gate Arrays) continue to increase, FPGA-based reconfigurable and custom computing systems have become one of the hottest research topics in computer science and computer engineering. Fixed-point adders are essential building blocks of any arithmetic unit in a computing system. Their speed of operation depends on the carry propagation delay. In order to reduce the worst-case carry propagation time, various high-speed fixed-point addition algorithms have been studied extensively in the design of fixed VLSI processors [11-23]. However, an addition algorithm that is optimal in one technology is not necessarily optimal in a different technology. The elemental building blocks are gates in fixed VLSI technology and CLB's (Configurable Logic Blocks) in FPGA devices. The implementation techniques of addition algorithms in the two technologies differ, and so do their performance and cost parameters. Therefore, this work undertakes the performance evaluation of various available addition algorithms implemented in Xilinx 4000 series devices in an effort to determine their suitability for FPGA's. The paper aims to lay some groundwork for evaluating and constructing FPGA-based reconfigurable computing systems.
BASIS OF PERFORMANCE EVALUATION
In fixed VLSI technology, the assessment of different addition algorithms is usually based on the conventional analytic approach: the gate-count model is used for area/cost evaluation and the gate-delay-unit model for operational time evaluation. With FPGA technology, however, gate counts and gate-delay units serve no useful purpose in performance and cost evaluation, because the basic functional units in FPGA's are CLB's rather than the basic logic gates of fixed VLSI.
In order to evaluate FPGA implementations of different addition techniques, the evaluation should more appropriately be based on the features of FPGA's. Each FPGA device [10] includes three major reconfigurable elements: configurable logic blocks (CLB's), I/O blocks (IOB's), and interconnections. These elements all contribute to the propagation delay of a processing unit implemented in the FPGA device. The IOB's provide the interface between the internal and external signals. The programmable interconnect resources connect the inputs and outputs of the CLB's and IOB's into an appropriate network. Any XC4000 series CLB is capable of implementing up to two four-variable logic functions or one nine-variable logic function. Logic functions in a CLB are realized by table look-up. Any logic function of more than nine variables requires two or more CLB's. Generous on-chip buffering makes block delays insensitive to loading by the interconnect structure, though all interconnect delays are layout-dependent. The total propagation delay depends on the amount of resources used.
To evaluate the performance of the different addition algorithms, three parameters are examined: operational time (T), cost/area (C), and performance:cost ratio (η). The cost/area (C) of a design implemented in FPGA's is calculated in terms of the number of required CLB's. The operational time (T) is obtained from timing simulation with Xilinx software, which uses the actual block and routing delay times from the routed design. The simulation results allow a much more accurate assessment of the behavior of the implementations under worst-case conditions. The performance:cost ratio (η) is defined in this work as the reciprocal of the product of the cost and the operational time, η = 1/(C·T). The comparison of FPGA implementations of different addition techniques can therefore be based on the value of η: a technique with a larger value of η is considered better than one with a smaller value.
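As an illustration of how the three parameters relate, the following minimal sketch computes the performance:cost ratio as 1/(C·T) and ranks designs by it. The class name, the function name, the nanosecond unit, and the two sample designs are assumptions made for the example; the numbers are placeholders, not measurements from this work.

```python
from dataclasses import dataclass


@dataclass
class AdderResult:
    name: str
    clbs: int        # cost/area C, counted in CLB's
    time_ns: float   # operational time T (unit assumed to be ns)

    @property
    def ratio(self) -> float:
        # performance:cost ratio = 1 / (C * T)
        return 1.0 / (self.clbs * self.time_ns)


def rank(results):
    """Return the designs ordered from best to worst ratio."""
    return sorted(results, key=lambda r: r.ratio, reverse=True)


# Placeholder values for illustration only (not taken from the paper's tables).
designs = [
    AdderResult("design_a", clbs=10, time_ns=20.0),
    AdderResult("design_b", clbs=25, time_ns=12.0),
]

for d in rank(designs):
    print(f"{d.name}: C={d.clbs} T={d.time_ns} ratio={d.ratio:.5f}")
```

Under this definition a faster design can still rank lower if its CLB count grows faster than its delay shrinks, which is the trade-off the ratio is meant to expose.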
EXISTING ADDITION ALGORITHMS AND THEIR FPGA IMPLEMENTATIONS
Based on the way carries are propagated, the classical high-speed fixed-point addition algorithms mainly include the carry-ripple, carry-completion, carry-skip, carry-lookahead, and carry-select addition algorithms [11-14]. In order to reduce the cost/area, the carry-propagation time, or both, numerous variations of these classical approaches have been studied [15-23], and newer ones are still being developed and implemented in fixed VLSI technology.
The different addition algorithms have been implemented with different part-types of the widely used Xilinx 4000 series devices. This paper reports the performance evaluation for the carry-ripple, carry-completion, carry-lookahead, carry-skip and carry-select adders.
Carry-ripple adder
The carry-ripple adder is one of the oldest and simplest adder designs. The n-bit carry-ripple adder is easily implemented using the dedicated carry logic (Figure 1), which is one of the excellent features of Xilinx 4000 series devices. The carry logic circuit is independent of the function generators, but shares some of the same inputs with them. Each CLB can implement approximately 40 different functions and carry modes. Table 1 shows the performance parameters of carry-ripple adders of sizes from 8 to 80 bits, implemented in different Xilinx 4000 series part-types.
SPIE Vol. 2914 /27
Figure 1 legend: configuration memory bit.
Table 1 note: "-" indicates that the adder cannot be implemented in one device.
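The carry-ripple principle can be sketched at the bit level as follows. This is a behavioral model only, assuming unsigned operands; the function name is chosen for the example, and the model says nothing about how the XC4000 dedicated carry logic is actually wired.

```python
def ripple_add(a, b, n, carry_in=0):
    """Add two n-bit unsigned integers one bit position at a time.

    Returns (sum modulo 2**n, carry out of the top bit).
    """
    total = 0
    carry = carry_in
    for i in range(n):
        x = (a >> i) & 1
        y = (b >> i) & 1
        s = x ^ y ^ carry                    # sum bit of a full adder
        carry = (x & y) | (carry & (x ^ y))  # carry passed to the next stage
        total |= s << i
    return total, carry


print(ripple_add(0b1011, 0b0110, 4))  # 11 + 6 = 17, i.e. sum 1 with carry out 1
```

Each stage must wait for the carry of the stage below it, which is why the worst-case delay of this structure grows linearly with n.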
Carry-completion adder
The carry-completion adder is obtained by modifying the carry-ripple adder to include carry-completion detection logic. Because this adder is asynchronous, its operational time varies with the operands, although its worst-case operational time is still linearly proportional to the length n of the adder. Table 2 shows the performance parameters for carry-completion adders. In order to compare this algorithm with the others, the average operational times of the adders are taken rather than the worst-case operational times.
Table 2 note: "-" indicates that the adder cannot be implemented in one device.
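The operand dependence of the operational time can be illustrated by measuring the longest carry chain for a given operand pair, since a completion signal can only be raised once the last carry has settled. The sketch below is a simplified behavioral model under that assumption; the function names, the uniform random operands, and the fixed seed are choices made for the example, and no timing from Table 2 is reproduced.

```python
import random


def longest_carry_chain(a, b, n):
    """Longest run of consecutive bit positions that emit a carry of 1.

    A run starts at a position that generates a carry (both bits 1) and
    continues through positions that propagate it (exactly one bit 1).
    """
    longest = 0
    run = 0
    carry = 0
    for i in range(n):
        x = (a >> i) & 1
        y = (b >> i) & 1
        if x & y:                 # generate: a new chain starts here
            run, carry = 1, 1
        elif (x ^ y) and carry:   # propagate an incoming carry
            run += 1
        else:                     # no carry leaves this position
            run, carry = 0, 0
        longest = max(longest, run)
    return longest


def average_chain(n, trials=1000, seed=0):
    """Average longest chain over uniformly random n-bit operand pairs."""
    rng = random.Random(seed)
    total = sum(
        longest_carry_chain(rng.getrandbits(n), rng.getrandbits(n), n)
        for _ in range(trials)
    )
    return total / trials


print(longest_carry_chain(0b1111, 0b0001, 4))  # worst case: chain spans all 4 bits
print(average_chain(32))                       # typically far shorter than 32
```

The gap between the worst-case chain and the average chain is the reason average rather than worst-case times are the meaningful figure for this adder.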
Carry-lookahead (CLA) adder
Theoretically, a fundamental CLA adder always yields a constant addition time, independent of the width of the adder, provided the CLA unit can be freely expanded. However, because the fan-out and fan-in required to implement the carry generation and carry propagation functions increase rapidly with the adder size, such designs are practical only for the smallest adders. Large adders are therefore generally implemented by modifying the fundamental approach with a multilevel structure, or by combining the CLA algorithm with other algorithms, to reduce the fan-in and fan-out difficulties. These approaches usually result in additional delay, and the operational time of the adder is no longer constant. The cost and performance data of FPGA implementations of multilevel CLA adders are shown in Table 3.
Table 3 note: "-" indicates that the adder cannot be implemented in one device.
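The fan-in growth of the fundamental (single-level) CLA scheme can be seen in a small behavioral sketch in which every carry is formed directly from the generate and propagate signals and the input carry. The function name and the loop-based expansion are assumptions made for illustration; this is not the multilevel structure evaluated in Table 3.

```python
def cla_add(a, b, n, c0=0):
    """Single-level carry-lookahead addition of two n-bit unsigned integers.

    Every carry c_i is computed as a two-level AND-OR function of the
    generate bits g, the propagate bits p, and the input carry c0, without
    using any other carry as an intermediate result.
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]    # generate: both bits 1
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # propagate: exactly one bit 1
    carries = [c0]
    for i in range(1, n + 1):
        # Term carrying c0 through positions 0..i-1 (an AND with i+1 inputs).
        c = c0
        for k in range(i):
            c &= p[k]
        # Terms carrying a carry generated at position j through j+1..i-1.
        for j in range(i):
            term = g[j]
            for k in range(j + 1, i):
                term &= p[k]
            c |= term
        carries.append(c)
    total = 0
    for i in range(n):
        total |= (p[i] ^ carries[i]) << i
    return total, carries[n]


print(cla_add(0b1011, 0b0110, 4))  # 11 + 6 = 17, i.e. sum 1 with carry out 1
```

In this form the carry into position i needs an AND term with i+1 inputs and an OR over i+1 terms, so the gate fan-in grows with the adder width, which is the difficulty described above.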
Carry-skip adder
The carry-skip adder is built from the carry-ripple adder. An n-bit ripple adder is partitioned into blocks and carry-skip logic is added to each block. The worst-case carry propagation delay in carry-skip adders depends strongly on the configuration of such adders. In this work, only one-level carry-skip adders are implemented. This is because the dedicated carry logic is so efficient that there is likely to be little value beyond two or more skip levels for the adders as the carry-skip [...]
The carry-skip adder is the next cheapest one in cost and the next best in performance:cost ratio. However, the operational time of this adder compares less favorably to that of the carry-ripple adder. This makes the carry-skip adder not the best choice of addition algorithm to be implemented in FPGA's. At the time of writing, this is not yet conclusive, because the configuration optimization of this adder has yet to be examined. When the adder has been optimally designed, it could be a candidate for the best performing addition algorithm to be implemented in FPGA's.
From the tables and graphs above, the carry-select adder appears to be the most appropriate choice for FPGA implementations. This adder has the best operational time when the adder width is larger than 56 bits, at a medium cost. Although the cost does not appear to be very good, the algorithm does have the advantage of a regular structure and almost the same performance:cost ratio as that of the carry-skip adder. Moreover, other algorithms can be easily applied within this adder. When it is combined with other algorithms, and after further examination of how the adder is partitioned, the performance parameters of this adder could still be significantly improved.
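A one-level carry-skip adder of the kind described above can be modeled behaviorally as a ripple adder split into blocks, where a block whose bits all propagate forwards its input carry directly to the next block. The function name, the fixed block size, and the skip counter are assumptions for this sketch; the block sizes used in the actual implementations are not stated here.

```python
def carry_skip_add(a, b, n, block=4, carry_in=0):
    """One-level carry-skip addition of two n-bit unsigned integers.

    Returns (sum modulo 2**n, carry out, number of blocks that skipped).
    """
    result = 0
    carry = carry_in
    skipped = 0
    for lo in range(0, n, block):
        width = min(block, n - lo)
        block_in = carry
        all_propagate = 1
        for i in range(lo, lo + width):   # ripple inside the block
            x = (a >> i) & 1
            y = (b >> i) & 1
            p = x ^ y
            result |= (p ^ carry) << i
            carry = (x & y) | (p & carry)
            all_propagate &= p
        if all_propagate:
            # Skip path: the block's input carry bypasses the ripple chain.
            carry = block_in
            skipped += 1
    return result, carry, skipped


print(carry_skip_add(0b01010101, 0b10101010, 8, block=4, carry_in=1))
```

The skip path does not change the result; in hardware it only shortens the time a carry needs to cross a fully propagating block, so the benefit depends on how the block sizes are chosen.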
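The regular structure of the carry-select adder can be seen in the following behavioral sketch: each block computes its sum twice, once assuming a carry-in of 0 and once assuming 1, and the real carry only selects between the two precomputed results. The function names and the uniform block size are assumptions for the example and do not reflect the partitioning used in the implementations reported here.

```python
def block_add(a, b, width, carry_in):
    """Add two width-bit values; stands in for any small block adder."""
    total = a + b + carry_in
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a, b, n, block=4, carry_in=0):
    """Carry-select addition of two n-bit unsigned integers."""
    result = 0
    carry = carry_in
    for lo in range(0, n, block):
        width = min(block, n - lo)
        mask = (1 << width) - 1
        xa = (a >> lo) & mask
        xb = (b >> lo) & mask
        # Both candidate results are computed without waiting for the carry.
        sum0, c0 = block_add(xa, xb, width, 0)
        sum1, c1 = block_add(xa, xb, width, 1)
        # The incoming carry acts only as a multiplexer select signal.
        s, carry = (sum1, c1) if carry else (sum0, c0)
        result |= s << lo
    return result, carry


print(carry_select_add(200, 100, 8))  # 300 = 256 + 44, i.e. sum 44 with carry out 1
```

Because `block_add` is a black box in this sketch, any other algorithm could be substituted for the block adders, which mirrors the remark above that other algorithms can be applied within this adder.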
CONCLUSIONS
In this work, we have implemented the five classical fixed-point addition algorithms in the widely used XC4000 devices. An attempt has been made to model the performance of the adders with empirical formulas derived from the data resulting from their implementations. The following conclusions can be drawn:
• For fast applications, the carry-skip adder and the carry-select adder appear to be the most appropriate solutions due to their excellent performance:cost ratio and reasonable cost.
• For low-cost applications, the carry-ripple adder is the most appropriate solution. Although its operational time is not as good as that of some others for larger adders, it has a very simple and regular structure and the highest performance:cost ratio, which make it attractive for FPGA applications and especially for parallel applications.
• The CLA and carry-completion adders seem to be the worst performers because of their high cost and low performance:cost ratio.
In general, those algorithms which have regular structures and can take advantage of the dedicated carry logic feature are suitable for FPGA implementations.
LIMITATIONS OF THE PRESENT EVALUATION
In practice, a successful evaluation is affected by a large number of factors. In order to make a more accurate evaluation, the following factors should be considered.
• The two-operand addition is the most fundamental operation in computers. Therefore, the adders can be easily evaluated by isolating the hardware associated with two-operand adders from the rest of [...] components of the computer such as the ALU, memory, control circuitry, etc., should be carefully considered. Consequently, more work has to be done to link these results to the rest of a computing system.
• The results are obtained by implementing adders in a single chip. When all components of a computer are considered and multiple chips are used, the I/O resources should be taken into account in the evaluation. Moreover, the present evaluation is based on the XC4000 family; therefore, it is difficult to say that the results are equally valid for other FPGA devices.
• The regularity of an adder should be taken into account in the evaluation because it is important to both the cost and the design. However, it is very difficult to quantify regularity in the evaluation.

[...] than is expected. The implementation results show that it has the highest cost and the worst performance:cost ratio. The major reasons for this are its irregular structure and its inability to take advantage of the dedicated carry logic. Consequently, the pure CLA algorithm is impractical for FPGA applications unless it is combined with some other algorithms which would give reasonable performance.
