Abstract-This paper presents improved architectures for a fused floating-point add-subtract unit. The fused floating-point add-subtract unit is useful for digital signal processing (DSP) applications such as fast Fourier transform (FFT) and discrete cosine transform (DCT) butterfly operations. To improve the performance of the fused floating-point add-subtract unit, a dual-path algorithm and pipelining are employed. The proposed designs are implemented for both single and double precision and synthesized with a 45-nm standard-cell library. The fused floating-point add-subtract unit saves 40% of the area and power consumption compared to a discrete floating-point add-subtract unit. The proposed dual-path design reduces the latency by 30% compared to the discrete design with area and power consumption between that of the discrete and fused designs. Based on a data flow analysis, the proposed fused dual-path floating-point add-subtract unit can be split into two pipeline stages. Since the latencies of two pipeline stages are fairly well balanced, the throughput is increased by 80% compared to the nonpipelined dual-path design.
I. INTRODUCTION

CURRENT digital signal processing (DSP) systems are making the transition from fixed-point arithmetic (used initially because of its simplicity) to floating-point arithmetic. The latter has several advantages including the freedom from overflow and underflow and ease of interfacing to the rest of the system (which generally will use IEEE-754 Standard floating-point arithmetic [1]). To improve the performance of floating-point arithmetic, several fused floating-point operations have been introduced: Fused Multiply-Add (FMA) [2]-[4], Fused Add-Subtract [5], and Fused Two-Term Dot-Product [6]. The fused floating-point operations not only improve the performance, but also reduce the area and power consumption compared to discrete floating-point implementations.
This paper presents improved architecture designs and implementations for a fused floating-point add-subtract unit. Many DSP applications such as fast Fourier transform (FFT) and discrete cosine transform (DCT) butterfly operations can benefit from the fused floating-point add-subtract unit [7], [8]. Therefore, the improved fused floating-point add-subtract unit will contribute to the next generation floating-point arithmetic and DSP application development.
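To illustrate why butterfly operations benefit from such a unit, the following minimal Python sketch computes a radix-2 butterfly. The function name `butterfly` and its argument names are illustrative assumptions and are not taken from [7] or [8]; the sketch only shows that one operand pair feeds both an addition and a subtraction.

```python
def butterfly(a, b, w):
    """Radix-2 butterfly: returns (a + w*b, a - w*b).

    a, b are the two inputs and w is the twiddle factor
    (all names are hypothetical, chosen for this sketch).
    """
    t = w * b            # product shared by both outputs
    return a + t, a - t  # the same operand pair yields a sum and a difference
```

For complex inputs, the real and imaginary parts each require one such sum and difference pair, which is the operation a fused add-subtract unit produces in one pass.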
The proposed fused floating-point add-subtract unit takes two normalized floating-point operands and generates their sum and difference simultaneously. It supports all five rounding modes specified in IEEE-754 Standard [1]. Several techniques are applied to achieve low area, low power consumption, and high speed:
1) Instead of executing two identical floating-point adders, the fused floating-point add-subtract unit shares the common logic to generate the sum and difference simultaneously. Therefore, it saves much of the area and power consumption compared to a discrete floating-point add-subtract unit. Also, it reduces the latency by simplifying the control signals.
2) A dual-path algorithm can be applied to increase speed. The dual-path logic consists of a far path and a close path. In the far path, the addition, subtraction and rounding logic are performed in parallel. By aligning the significands to the minimal number of bits, the addition, subtraction and rounding logic are simplified. There are three cases for the close path depending on the difference of the exponents. For each case, addition, subtraction and leading zero anticipation (LZA) are performed in parallel and rounding is not required. Therefore, the dual-path design reduces the latency of the critical path.
3) To increase the throughput, pipelining can be applied. Based on data flow analysis, the proposed dual-path design is split into two pipeline stages. By properly arranging the components, latencies of the two pipeline stages are balanced so that the throughput of the entire design is increased.
Section II describes the traditional discrete floating-point add-subtract unit with two identical floating-point adders. The next three sections present improved architectures for a fused floating-point add-subtract unit design. In Section III, the fundamental concepts of the fused floating-point add-subtract unit and its implementation are presented. Improved architectures for applying the dual-path algorithm and implementation details are described in Section IV. Based on the data flow analysis, pipelining is applied to the dual-path design in Section V. To evaluate the performance of the designs, the designs are implemented for both single and double precision. The implementation for double precision can be done by extending the single precision implementation. For simplicity, only the single precision designs are described in Sections II to V. Then, the evaluation results of various designs for both the single and double precision are discussed in Section VI.
1549-8328/$31.00 © 2012 IEEE 
II. TRADITIONAL FLOATING-POINT ADD-SUBTRACT UNIT
A direct way to implement the floating-point add-subtract operation is to use two identical floating-point adders in parallel as shown in Fig. 1. One of the adders performs an addition and the other performs a subtraction to produce the sum and difference simultaneously. A traditional floating-point adder [9], [10] such as that of Fig. 2 can be used for each operation. The steps to execute the floating-point addition are as follows:
1) Exponent compare logic compares the exponents of the two operands A and B to determine which exponent is greater and calculates their difference.
2) The exponent comparison results are used for the significand swap logic. When the exponents are equal, the significands are compared to identify the smaller significand. The significand of the smaller operand is shifted by the amount of the exponent difference (if any) for the alignment, and the guard, round and sticky bits are attached to the LSB.
3) Since some of the rounding modes specified in IEEE-754 Standard [1] require knowing the sign (i.e., round toward positive and negative infinity), the sign logic must be performed prior to the round logic. The sign logic provides the sign of the sum and the operation decision bit to the round logic and significand adders, respectively.
4) The two significands are passed to the significand add-subtract unit and LZA simultaneously. The add-subtract unit performs the addition or subtraction of the two significands depending on the operation. It produces rounded and unrounded results and the round logic selects one of them for a fast rounding. The LZA generates the amount of cancellation during the subtraction in a constant time so that the subtraction result is immediately normalized [11]. The overflow of the significand adder and the shift amount from the LZA are passed to the exponent adjust logic.
5) Using the shift amount, the exponent adjust logic generates the exponent of the sum. In this step, inexact, overflow and underflow of the exponent (if any) are detected for setting the exception flags.
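The steps above can be sketched as a small behavioral model in Python. This is a minimal sketch under stated assumptions and not the hardware of Fig. 2: it handles only normalized single precision operands and normal results, supports only round-to-nearest-even, omits the exception flags, and replaces the constant-time LZA with a simple shift loop. All function names (`unpack`, `pack`, `to_f32`, `fp_add`) are hypothetical.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest single precision value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def unpack(x):
    """Split a normalized single precision value into sign, exponent, 24-bit significand."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, (bits & 0x7FFFFF) | (1 << 23)

def pack(sign, exp, sig):
    """Reassemble the fields; the hidden leading 1 of sig is dropped."""
    bits = (sign << 31) | (exp << 23) | (sig & 0x7FFFFF)
    return struct.unpack('<f', struct.pack('<I', bits))[0]

def fp_add(a, b):
    sa, ea, ma = unpack(a)
    sb, eb, mb = unpack(b)
    # Steps 1-2: exponent compare and significand swap (larger operand first).
    if (ea, ma) < (eb, mb):
        sa, ea, ma, sb, eb, mb = sb, eb, mb, sa, ea, ma
    d = ea - eb
    big, small = ma << 3, mb << 3          # append guard, round, sticky positions
    shifted = small >> d
    if small & ((1 << d) - 1):             # any bit shifted out sets the sticky bit
        shifted |= 1
    # Step 3: sign and operation decision.
    subtract = sa ^ sb
    # Step 4: significand add or subtract, then normalize.
    total = big - shifted if subtract else big + shifted
    if total == 0:
        return 0.0
    exp = ea
    if total >> 27:                        # carry-out of the addition
        total = (total >> 1) | (total & 1)
        exp += 1
    while not (total >> 26):               # stand-in for the LZA shift amount
        total <<= 1
        exp -= 1
    grs, sig = total & 7, total >> 3
    if grs > 4 or (grs == 4 and (sig & 1)):  # round to nearest, ties to even
        sig += 1
        if sig >> 24:                      # rounding overflow
            sig >>= 1
            exp += 1
    # Step 5: exponent adjust (range checks omitted in this sketch).
    return pack(sa, exp, sig)
```

A discrete add-subtract unit in the sense of Fig. 1 would then correspond to evaluating `fp_add(a, b)` and `fp_add(a, -b)` independently, which repeats steps 1 and 2 for the same operand pair.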
III. FUSED FLOATING-POINT ADD-SUBTRACT UNIT
The discrete floating-point add-subtract unit produces the sum and difference simultaneously by executing two identical floating-point additions. However, much of the logic such as exponent comparison, significand swap and alignment in the two floating-point adders is nearly the same for the two operations.
In order to reduce the overhead, a fused floating-point add-subtract unit shares the common logic for the two operations [5], [7]. Fig. 3 shows the design of a fused floating-point add-subtract unit. The fused floating-point add-subtract unit produces the sum and difference results simultaneously by executing the shared logic such as the exponent comparison, significand swap and alignment. Also, the fused floating-point add-subtract unit performs only one significand addition and one significand subtraction. Table I shows the sign decision table based on the signs of the two operands and comparison of the exponents and significands. Since two operations are explicitly performed for sum and difference results (e.g., if the addition is used for the sum, the subtraction is used for the difference), the addition and subtraction are separately placed and only one LZA and normalization (for the subtraction) is required. Assuming both sign bits are positive, the addition and subtraction are performed separately. Then, two multiplexers select the sum and difference with the operation decision bit, which is the XOR of the two sign bits. More details of the logic are described in the next section. This approach simplifies the addition and subtraction operations. It also reduces the control signals for distinguishing the signs and final results. Thus, the fused floating-point add-subtract unit achieves low area, low power consumption and high speed.
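The multiplexer selection described above can be sketched on magnitudes alone. This is a minimal sketch, assuming the operands are given as sign bits and non-negative magnitudes; the function name `fused_select` and its parameters are hypothetical, and signs, rounding and normalization are deliberately left out.

```python
def fused_select(sign_a, mag_a, sign_b, mag_b):
    """Return (|A + B|, |A - B|) using one addition and one subtraction."""
    add_result = mag_a + mag_b        # computed as if both signs were positive
    sub_result = abs(mag_a - mag_b)   # likewise; abs stands in for the swap logic
    op = sign_a ^ sign_b              # operation decision bit (XOR of the sign bits)
    # When the signs differ, the sum is an effective subtraction and the
    # difference is an effective addition, so the two results trade places.
    sum_mag = sub_result if op else add_result
    diff_mag = add_result if op else sub_result
    return sum_mag, diff_mag
```

For example, with A = 5 and B = -3 the operation decision bit is 1, so the subtractor output 2 becomes the sum and the adder output 8 becomes the difference.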
IV. DUAL PATH FUSED FLOATING-POINT ADD-SUBTRACT UNIT
To achieve a high-performance fused floating-point add-subtract unit, this paper proposes a dual-path approach. Most high-speed floating-point adders employ the dual-path algorithm [12], [13]. Fig. 4 shows the dual-path fused floating-point add-subtract unit. The dual-path algorithm skips the normalization step depending on the exponent difference. Since the normalization after the subtraction is one of the bottlenecks in the fused floating-point add-subtract unit, the dual-path approach improves the performance.
The dual-path approach consists of far path and close path logic. The far path logic takes the significands if the exponent difference (the absolute difference of the two operand exponents) is greater than 1. In this case, massive cancellation does not occur during the subtraction, so the LZA is unnecessary. The far-path logic is implemented similarly to the front end of the traditional floating-point adder, as shown in Fig. 5.
The two significands are aligned with a 1 attached to the MSB end to make 24-bit normalized significands. By aligning the two significands to 24 bits, significand addition and subtraction are simplified, resulting in a reduction in the logic area and delay. The significand of the smaller operand is right shifted by the amount of the exponent difference and aligned to 24 bits. The sticky bit is set if at least one bit of the 22 LSBs is a 1, and the 23rd and the 24th LSBs become the round and guard bits, respectively, as shown in Fig. 6. Since the significand of the larger operand is not shifted, the 24-bit significand is kept as it is, without guard, round and sticky bits.
The greater and smaller significands are passed to the addition and subtraction units. For fast integer addition and subtraction, the Kogge-Stone parallel prefix approach is used [14]. The addition and subtraction each produce rounded and unrounded results, and the round logic selects one of them: for each operation, the rounded result is selected if the result is to be rounded up, and the unrounded result otherwise. The round logic takes the LSBs, guard, round and sticky bits of the two significands and performs 4-bit addition and subtraction to determine if the result is rounded up or not for each operation. Also, it requires the sign bits of the addition and subtraction to support all five round modes specified in IEEE-754 Standard [1], as shown in Table II. Since the far path requires at most a 1-bit normalization shift for both addition and subtraction, it avoids a large normalization procedure.
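The alignment and the rounding decision of the far path can be sketched as follows. This is a minimal sketch and not a transcription of Fig. 6 or Table II: `far_path_align` models the 48-bit field whose upper 24 bits are the aligned significand, and `round_up` uses the standard decision rules for the five IEEE-754 rounding attributes on an already normalized result, where `guard` is the first discarded bit and `rest` is the OR of all lower discarded bits. The function names and the mode strings are hypothetical.

```python
def far_path_align(sig_small, exp_diff):
    """Right shift the smaller 24-bit significand; return (aligned, guard, round, sticky)."""
    wide = sig_small << 24                  # 48-bit field, significand in the upper half
    shifted = wide >> exp_diff
    lost = wide & ((1 << exp_diff) - 1)     # bits pushed below the 48-bit field
    aligned = shifted >> 24                 # upper 24 bits
    guard = (shifted >> 23) & 1             # 24th LSB of the field
    rnd = (shifted >> 22) & 1               # 23rd LSB of the field
    sticky = int((shifted & ((1 << 22) - 1)) != 0 or lost != 0)  # OR of the 22 LSBs
    return aligned, guard, rnd, sticky

def round_up(mode, sign, lsb, guard, rest):
    """Decide whether the magnitude is incremented (standard rules, hypothetical mode names)."""
    inexact = bool(guard or rest)
    if mode == "nearest_even":
        return bool(guard and (rest or lsb))
    if mode == "nearest_away":
        return bool(guard)
    if mode == "toward_positive":
        return inexact and sign == 0
    if mode == "toward_negative":
        return inexact and sign == 1
    if mode == "toward_zero":
        return False
    raise ValueError("unknown rounding mode")
```

The two directed modes are the reason the round logic needs the sign bits: the same discarded bits lead to opposite decisions for positive and negative results.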
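The close path takes the significands if the difference of the two exponents is 0 or 1. Fig. 7 shows the close path logic. There are three cases for the close path depending on the difference of the exponents: the two exponents are equal, the exponent of A is greater than that of B by 1, or the exponent of B is greater than that of A by 1.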
For each case, addition, subtraction and LZA are performed simultaneously. LZA with concurrent correction is used for a fast normalization [15], [16]. One of the three results is selected based on the small exponent comparison, which compares the two LSBs of the exponents. In contrast to the far path, the significands are not swapped, to avoid a large significand comparison. When the subtraction result is negative, a two's complement operation is performed to convert the result to a positive value. The carry-out of the subtraction indicates the result of the significand comparison, which is passed to the sign logic to determine the sign bits when the two exponents are equal. Since the significands in the close path are misaligned by at most 1 bit, rounding is not required [17]. The addition result is normalized by a 1-bit overflow, while the subtraction result is normalized by up to 23 bits using the shift amount from the LZA.
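The case selection of the close path can be sketched as below. This is a minimal sketch under assumptions: the three alignments are computed for every input and one is picked from the exponent difference modulo 4, which is one reading of "compares the two LSBs of the exponents". The function name `close_path` is hypothetical, the LZA and normalization are omitted, and the returned magnitudes are exact integers in units of the smaller operand's LSB.

```python
def close_path(sig_a, exp_a, sig_b, exp_b):
    """Return (add magnitude, subtract magnitude, a_ge_b) for |exp_a - exp_b| <= 1."""
    # Three candidate alignments, keyed by (exp_a - exp_b) mod 4.
    candidates = {
        0: (sig_a, sig_b),        # equal exponents
        1: (sig_a << 1, sig_b),   # exponent of A greater by 1
        3: (sig_a, sig_b << 1),   # exponent of B greater by 1 (-1 mod 4 == 3)
    }
    results = {}
    for key, (x, y) in candidates.items():
        diff = x - y
        # abs() stands in for the two's complement of a negative result;
        # diff >= 0 plays the role of the comparison bit sent to the sign logic.
        results[key] = (x + y, abs(diff), diff >= 0)
    select = (exp_a - exp_b) & 3  # needs only the two LSBs of each exponent
    return results[select]
```

Because no significand swap is done, the comparison bit is what later tells the sign logic which operand was larger.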
The remaining logic for the dual-path fused floating-point add-subtract unit is shown in Fig. 8. The exponent compare logic shown in Fig. 8(a) calculates the difference of the two exponents and determines which is greater; these are the same functions required for the traditional logic. In addition, it produces the path decision flag, which is passed to the two multiplexers for selecting the addition and subtraction results between the far and close paths.
The exponent adjust logic shown in Fig. 8(b) performs addition and subtraction to adjust the exponents by the amount that the significands are shifted. The exponent adjust logic produces two exponent results simultaneously. In the case of addition, one of the increment values, each being the overflow from the significand addition of one path, is added depending on the path decision. In the case of subtraction, if the far path is selected, the decrement value, which is the underflow from the significand subtraction, is subtracted. If the close path is selected, the normalization shift value, which is the shift amount of the massive cancellation that occurred during the subtraction, is subtracted. The two adjusted exponents are passed to the exception logic. Since underflow does not occur in default exception handling, the exception logic supports abruptUnderflow, an alternate exception handling specified in IEEE-754 Standard [1], to detect three exception cases: overflow, underflow and inexact. These flags are derived from the adjusted exponents and from round up, the rounding decision of the significand result. The overflow flag is set if the exponent exceeds the maximum value that can be represented, as with positive and negative infinity. The underflow flag is set if the exponent is too small to be represented, as with zero and subnormal values. Overflow only occurs in addition and underflow only occurs in subtraction [18]. The inexact flag is set if the rounded significand result is not exact, which is the case if any of the rounding bit, overflow flag or underflow flag is set.
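The exponent adjustment can be sketched as follows. This is a minimal sketch, assuming single precision biased exponents (255 reserved for infinity, 0 for zero and subnormal values); the function name `adjust_exponents` and all parameter names are hypothetical, and the inexact flag is omitted because it depends on the rounding bits.

```python
def adjust_exponents(exp_large, add_overflow, far_path, sub_underflow, lza_shift):
    """Return (sum exponent, difference exponent, overflow flag, underflow flag)."""
    # Addition: increment by the carry-out of the selected significand adder.
    add_exp = exp_large + (1 if add_overflow else 0)
    # Subtraction: a 1-bit decrement in the far path, the LZA shift in the close path.
    decrement = (1 if sub_underflow else 0) if far_path else lza_shift
    sub_exp = exp_large - decrement
    overflow = add_exp >= 255    # exponent field reserved for infinity
    underflow = sub_exp <= 0     # zero or subnormal range, flagged as underflow
    return add_exp, sub_exp, overflow, underflow
```

The overflow check is applied only to the addition exponent and the underflow check only to the subtraction exponent, following the statement that each exception can occur in only one of the two operations.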
The sign logic consists of two parts. The first part generates the two sign bits of the addition and subtraction for the rounding in the far path, and the second part generates the sign bits of the sum and difference and an operation decision bit. In the far path case, the exponent difference is large enough to determine the sign bits with the exponent comparison. Since the round logic in the far path requires the sign bits, the sign bits are passed to the far path logic. The close path, however, requires significand comparison for the case of equal exponents. Therefore, the sign bits of the sum and difference are generated after the significand comparison bit is provided by the close path. The sign bits and the operation decision bit are determined from the sign bits of the two operands and the comparison results of the exponents and significands. Once the operation decision bit is generated, it is passed to the two multiplexers for selecting the sum and difference.
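The sign decision can be sketched from ordinary signed arithmetic. This is a minimal sketch and not a transcription of Table I: it assumes the two magnitudes are unequal (an exactly zero result is not handled), and the function name `sign_logic` and the flag `a_ge_b` (the combined exponent and significand comparison) are hypothetical.

```python
def sign_logic(sign_a, sign_b, a_ge_b):
    """Return (sign of A + B, sign of A - B, operation decision bit)."""
    op = sign_a ^ sign_b                     # 1 when the sum is an effective subtraction
    # Sum: equal signs keep that sign; otherwise the larger magnitude wins.
    sum_sign = sign_a if (not op or a_ge_b) else sign_b
    # Difference: A - B is treated as A + (-B), so the sign of B is flipped first.
    neg_b = sign_b ^ 1
    diff_sign = sign_a if (sign_a == neg_b or a_ge_b) else neg_b
    return sum_sign, diff_sign, op
```

For A = -3 and B = 5 this yields a positive sum, a negative difference and an operation decision bit of 1.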
V. PIPELINED FUSED FLOATING-POINT ADD-SUBTRACT UNIT
As is well known, proper pipelining increases the throughput of floating-point adders [12], [13], [19]. In order to achieve a properly pipelined fused floating-point add-subtract unit, the latencies of the components in the proposed design are investigated. Each component is implemented in Verilog-HDL and synthesized with the Nangate 45-nm technology standard-cell library. The latencies of the various elements of the single precision floating-point add-subtract unit are listed in Table III. Since several components are executed in parallel, they are combined into a stage and the sum of the component delays determines the latency of the stage. Considering the latencies of components and their parallel execution, the proposed design is split into two pipeline stages. Each pipeline stage is executed every cycle so that the largest latency determines the throughput of the design. Fig. 9 shows the data flow, the latency of each component, and the critical path of each of the two pipeline stages.
The first pipeline stage consists of unpacking logic and the two data paths: the far path and the close path. The two data paths are the first half of the dual path, which is described in Figs. 5 and 7. The far path in the first pipeline stage contains the exponent compare, sign logic 1, significand swap, align and sticky logic. The close path in the first pipeline stage contains the small exponent compare, small significand align, three additions, subtractions and LZAs, and 3:1 select logic. Of the two data paths, the close path has the larger latency, so it becomes the critical path. The series of components in the close path determines the latency of the first pipeline stage, which is 0.52 ns.
The second half of the dual path and the remaining logic comprise the second pipeline stage. The far path in the second pipeline stage contains the addition, subtraction, round logic, and round select logic. The close path in the second pipeline stage contains the sign logic 2, complement and normalization logic. Among the two data paths, the far path takes the larger latency so that the second half of the far path logic and the remaining logic (path select, exponent adjust, and operation select logic) comprise the critical path, which adds up to 0.48 ns. The latencies of the two pipeline stages are fairly well balanced so that the throughput of the design is increased. Since the latency of the first pipeline stage is slightly larger than that of the second pipeline stage, it determines the throughput of the entire design.
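The relation between stage latency and throughput can be made concrete with a short calculation. This is a minimal sketch that uses only the two stage latencies quoted in the text (0.52 ns and 0.48 ns) and assumes any register overhead is already included in them; the function names are hypothetical.

```python
def pipeline_throughput(stage_latencies_ns):
    """Results per nanosecond: one result per cycle, cycle set by the slowest stage."""
    return 1.0 / max(stage_latencies_ns)

def stage_balance(stage_latencies_ns):
    """Ratio of the fastest to the slowest stage; 1.0 means perfectly balanced."""
    return min(stage_latencies_ns) / max(stage_latencies_ns)

stages = [0.52, 0.48]  # first and second pipeline stage latencies from the text
throughput = pipeline_throughput(stages)
balance = stage_balance(stages)
```

With these inputs the cycle time is 0.52 ns, so the throughput is 1/0.52 results per nanosecond, and the balance ratio 0.48/0.52 shows that the second stage sits idle for only a small part of each cycle.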
VI. RESULTS
The previous sections have introduced various designs for the fused floating-point add-subtract unit. Each design is implemented in Verilog-HDL and synthesized with the Nangate 45-nm technology standard-cell library. The functionality of the implementations is verified by performing a simulation with 1000 random input vectors. In order to evaluate the designs, the area, critical path latency, throughput, and power consumption are compared. Table IV shows the results for the four designs in single and double precision implementations. Since the fused floating-point add-subtract unit shares much of the logic, it saves more than 40% of the area and power over the traditional discrete floating-point add-subtract unit. Also, the fused floating-point add-subtract unit performs only one sign and operation decision at the end of the entire logic, while the traditional floating-point adder requires sign and operation decision logic for each addition, subtraction and exponent adjustment. As a result, the fused floating-point add-subtract unit shows 8% less latency than the traditional discrete floating-point add-subtract unit. The dual-path fused floating-point add-subtract unit requires more area and power consumption due to the three parallel additions, subtractions and LZAs for the close path. However, the dual-path design reduces the latency by 30%. The benefits of the proposed design are shown in both the single and double precision implementations. The double precision implementation requires about twice the area and power consumption of the single precision implementation due to the larger addition and subtraction logic. Since the latency of the addition and subtraction logic using the parallel prefix form [14] grows logarithmically with the operand width, the latency for double precision increases by approximately 25%.
TABLE IV. FUSED FLOATING-POINT ADD-SUBTRACT DESIGN COMPARISON
TABLE V. PIPELINE STAGE COMPARISON
The proposed pipelined fused floating-point add-subtract unit contains two stages. Each stage requires latches, since many data and control signals are passed from the first stage to the next. The comparison of the area, latency, throughput and power consumption of each pipeline stage is given in Table V. Although the latches and control signals in the pipeline stages increase the total area, latency and power consumption, the throughput is increased by more than 80% compared to the non-pipelined dual-path implementation.
VII. CONCLUSION
Improved architectures for the design and implementation of a fused floating-point add-subtract unit are presented. The floating-point add-subtract unit is useful for digital signal processing applications such as FFT and DCT butterfly operations. This paper presents improved architectures which apply the dual-path algorithm and pipelining to the fused floating-point add-subtract unit and compares the area, latency, throughput, and power consumption with the traditional parallel implementation.
The fused floating-point add-subtract unit saves more than 40% of the area and power consumption compared to the traditional discrete floating-point add-subtract unit by sharing the common logic. Also, the fused floating-point add-subtract unit reduces the latency due to its simplified control logic. The dual-path fused floating-point add-subtract unit reduces the latency by 30% compared to the discrete design by performing several add-subtract operations for each case in parallel. Additionally, a pipelined implementation to increase the throughput of the dual-path fused floating-point add-subtract unit is described. It uses two pipeline stages and the latencies are well balanced so that the throughput is increased by about 80% compared to the non-pipelined dual-path design.
