Abstract: This paper extends the consideration of fused floating-point arithmetic to operations that are frequently encountered in DSP. The Fast Fourier Transform is a case in point: it uses a complex butterfly operation. For a radix-2 implementation, the butterfly consists of a complex multiply followed by the complex addition and subtraction of the same pair of data. These butterfly operations can be implemented with two fused primitives, a fused two-term inner product and a fused add-subtract unit. A floating-point fused FFT butterfly unit is presented that performs a single-precision floating-point butterfly operation in only 87% of the time required for a conventional floating-point butterfly. When placed and routed in a 45 nm process, the fused FFT butterfly unit occupied about 72% of the area needed to implement a floating-point butterfly using conventional floating-point adders and multipliers. The numerical result of the fused butterfly unit is more accurate because fewer rounding operations are needed.
I. INTRODUCTION
Floating-point data representations provide a wide dynamic range, freeing DSP designers from scaling and overflow/underflow concerns that arise with fixed-point representations.
Much research has been done on the floating-point fused multiply-add (FMA) unit [1]. It has several advantages over discrete floating-point adders and multipliers in a floating-point unit design. Not only can a fused multiply-add unit reduce the latency of an application that executes a multiplication followed by an addition, but the unit may entirely replace a processor's floating-point adder and floating-point multiplier. Many DSP algorithms have been rewritten to take advantage of the presence of FMA units. For example, in [2] a radix-16 FFT algorithm is presented that speeds up FFTs in systems with FMA units. High-throughput digital filter implementations are possible with the use of FMA units [3]. FMA units are utilized in embedded signal processing and graphics applications [4], and are used to perform division [5] and argument reduction [6]. This is why the FMA has become an integral unit of many commercial processors, such as those of IBM [7], HP [8], and Intel [9].
Similar to the operations performed by an FMA, in many DSP algorithms and in other fields, both the sum and the difference of a pair of operands are needed for subsequent processing. Another frequently used DSP operation is calculating the sum of the products of two pairs of operands (dot product). For example, these operations are required in the computation of the FFT and DCT butterfly operations.
In traditional floating-point hardware, these operations may be performed in a serial fashion, which limits the throughput. Alternatively, they may be performed in parallel, which is expensive (in silicon area and in power consumption).
Two new floating-point primitives have been introduced by the authors: a floating-point fused add-subtract unit (FAS) [12], and a floating-point fused dot-product unit (FDP) [13]. The FAS unit performs the following operation: $x = a + b$ and $y = a - b$
A block diagram of the FAS unit is shown in Figure 1. Compared to the conventional parallel approach used to realize the simultaneous add-subtract, the FAS unit executes at the same speed and provides substantial savings in area and power consumption.
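The FAS operation can be sketched behaviorally as follows. This is a minimal software model, not the hardware design: the helper `to_f32` and the function name `fused_add_subtract` are illustrative names introduced here, and the logic sharing that gives the FAS unit its area and power advantage is not modeled.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def fused_add_subtract(a: float, b: float) -> tuple[float, float]:
    """Behavioral model of the FAS operation: (a + b, a - b) from one operand pair.

    Each result is rounded once to single precision. Only the numerical
    behavior is modeled, not the shared alignment and exponent logic.
    """
    return to_f32(a + b), to_f32(a - b)


x, y = fused_add_subtract(1.5, 0.25)
print(x, y)  # 1.75 1.25
```

Both results come from the same operand pair, which is what allows a hardware implementation to share work between the add path and the subtract path.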
The FDP unit performs the following operation: $x = ab \pm cd$
A block diagram of the FDP unit is shown in Figure 2 . 
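The numerical effect of fusing the dot product can be illustrated with a small software model. The sketch below assumes the FDP computes the sum or difference of two products with a single rounding at the end, which is consistent with the dot-product description earlier; the function names, the `subtract` flag, and the emulation of single rounding via double-precision intermediates are all assumptions made for illustration.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def fused_dot_product(a, b, c, d, subtract=False):
    """a*b +/- c*d with one rounding at the end (emulated in double precision).

    Products of two single-precision values are exact in double precision,
    so only the final result is rounded to single precision.
    """
    p, q = a * b, c * d
    return to_f32(p - q if subtract else p + q)


def discrete_dot_product(a, b, c, d, subtract=False):
    """Same operation with separate units: each product is rounded, then the sum."""
    p, q = to_f32(a * b), to_f32(c * d)
    return to_f32(p - q if subtract else p + q)


# Operands chosen so that rounding the first product discards a low-order bit.
a = 1.0 + 2.0 ** -12
fused = fused_dot_product(a, a, -1.0, 1.0 + 2.0 ** -11)
discrete = discrete_dot_product(a, a, -1.0, 1.0 + 2.0 ** -11)
print(fused == 2.0 ** -24)  # True
print(discrete)  # 0.0
```

In this example the discrete version loses the entire result to cancellation after the intermediate rounding, while the fused version retains it.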
II. APPROACH
There are two approaches that can be taken with conventional floating-point adders and multipliers to realize the FFT radix-2 decimation-in-frequency butterfly unit. The parallel implementation shown in Figure 4 uses four multipliers operating in parallel and six adders. The parallel approach is appropriate for applications where maximizing the throughput is more important than minimizing the area or the power consumption. This paper introduces a third approach where the FFT radix-2 decimation-in-frequency butterfly unit is realized using the FAS and FDP primitives, as shown in Figure 6. This approach replaces four of the floating-point adders with two FAS units, and replaces the remaining two floating-point adders and four floating-point multipliers with two FDP units. Using the FAS and FDP units should provide substantial savings in area and power consumption, faster execution, and more accurate results compared to the conventional parallel approach.
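The composition of the fused butterfly from two FAS and two FDP units can be sketched as below. The wiring follows the standard radix-2 decimation-in-frequency butterfly, x = a + b and y = (a - b)w, and is an assumption, since Figure 6 is not reproduced here. The names `fas`, `fdp`, and `fused_butterfly` are illustrative, and single rounding is emulated with double-precision intermediates.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def fas(a: float, b: float) -> tuple[float, float]:
    """Fused add-subtract model: (a + b, a - b), each rounded once."""
    return to_f32(a + b), to_f32(a - b)


def fdp(a: float, b: float, c: float, d: float, subtract: bool = False) -> float:
    """Fused dot-product model: a*b +/- c*d with a single final rounding."""
    p, q = a * b, c * d
    return to_f32(p - q if subtract else p + q)


def fused_butterfly(a: complex, b: complex, w: complex) -> tuple[complex, complex]:
    """Radix-2 DIF butterfly: x = a + b, y = (a - b) * w, using 2 FAS + 2 FDP."""
    x_re, d_re = fas(a.real, b.real)  # FAS 1: real parts
    x_im, d_im = fas(a.imag, b.imag)  # FAS 2: imaginary parts
    y_re = fdp(d_re, w.real, d_im, w.imag, subtract=True)  # FDP 1
    y_im = fdp(d_re, w.imag, d_im, w.real)  # FDP 2
    return complex(x_re, x_im), complex(y_re, y_im)


x, y = fused_butterfly(1 + 2j, 3 - 1j, -1j)
print(x, y)  # (4+1j) (3+2j)
```

The two FAS calls account for four real additions and subtractions, and the two FDP calls account for four multiplications and two additions, matching the operation count of the conventional parallel butterfly.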
III. ERROR ANALYSIS
Floating-point computation suffers from two types of errors: propagation error, which is determined by the errors of input data and the operation type only, and rounding error, which is caused by the rounding of the operation result [11] .
The propagation error is derived as follows:
Here, k is the amplification factor, which is determined by the operation type and the data. For floating-point multiplication, the propagation error amplification factors are:
While for floating-point addition, the amplification factors are:
The second component of the overall error of a floating-point operation is the rounding error. A formula for it is derived in the following equations, where the precise value of a floating-point significand is:
The arithmetic model for the error of any floating-point add or multiply operation is the sum of the two errors given in Equations (5) and (12).
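For orientation, a common textbook form of this two-part error model is sketched below. It is an assumption offered for reference, not necessarily the exact form of Equations (5) and (12); the symbols $\varepsilon_x$, $\varepsilon_y$, $k_x$, and $k_y$ are introduced here for illustration, with $\varepsilon$ denoting relative error.

```latex
% Assumed first-order relative-error model; symbols are illustrative.
\begin{align*}
\varepsilon_{\mathrm{prop}} &= k_x\,\varepsilon_x + k_y\,\varepsilon_y \\
\text{multiplication } z = xy &: \quad k_x = k_y = 1 \\
\text{addition } z = x + y &: \quad k_x = \frac{x}{x+y}, \quad k_y = \frac{y}{x+y} \\
|\varepsilon_{\mathrm{round}}| &\le 2^{-24} \quad \text{(single precision, round to nearest)} \\
\varepsilon_{\mathrm{total}} &= \varepsilon_{\mathrm{prop}} + \varepsilon_{\mathrm{round}}
\end{align*}
```

Under this model, each rounding step on the path to a result contributes one $\varepsilon_{\mathrm{round}}$ term, so removing intermediate roundings directly reduces the accumulated error.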
In the fused butterfly unit, the worst-case error occurs in the computation of $y_{re}$ and $y_{im}$, which is:
[Equation (13): combination of the prop, prop, and round error terms]
In the case of the conventional parallel implementation, the overall error is:
The above analysis shows that the fused butterfly unit has 60% of the rounding error of the discrete execution.
The FDP unit and the discrete floating-point adder and multiplier were modeled in Matlab. The FFT butterfly operation was simulated using the single-precision FAS and FDP units and then using discrete single-precision floating-point operations. Both were compared to the Matlab implementation using the built-in double-precision operations. Figure 7 shows the error plots of the two approaches. The error values using the fused butterfly unit were in the range of $-1.7 \times 10^{-5}$ to $1.6 \times 10^{-5}$, while the error values using the discrete floating-point operations were in the range of $-2.4 \times 10^{-5}$ to $2.3 \times 10^{-5}$ (about 40% higher).
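A comparable experiment can be sketched in Python. This is not the authors' Matlab model: the input distribution, the sample count, the random twiddle factors, and the function name `mean_abs_errors` are assumptions chosen for illustration, and the fused path is emulated by evaluating in double precision and rounding once. The absolute error magnitudes it produces should therefore not be expected to match the ranges reported above.

```python
import math
import random
import struct


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mean_abs_errors(n: int, seed: int = 0) -> tuple[float, float]:
    """Return (fused, discrete) mean |error| of Re(y) for y = (a - b) * w."""
    rng = random.Random(seed)
    fused_sum = discrete_sum = 0.0
    for _ in range(n):
        a_re, a_im, b_re, b_im = (to_f32(rng.uniform(-1, 1)) for _ in range(4))
        theta = rng.uniform(0, 2 * math.pi)
        w_re, w_im = to_f32(math.cos(theta)), to_f32(math.sin(theta))
        # Double-precision reference on the same single-precision inputs.
        ref = (a_re - b_re) * w_re - (a_im - b_im) * w_im
        # Both variants round the two differences to single precision.
        d_re, d_im = to_f32(a_re - b_re), to_f32(a_im - b_im)
        # Fused: one rounding after the two-term product difference.
        fused = to_f32(d_re * w_re - d_im * w_im)
        # Discrete: each product is rounded before the subtraction.
        discrete = to_f32(to_f32(d_re * w_re) - to_f32(d_im * w_im))
        fused_sum += abs(fused - ref)
        discrete_sum += abs(discrete - ref)
    return fused_sum / n, discrete_sum / n


fused_err, discrete_err = mean_abs_errors(20000)
print(f"fused: {fused_err:.3e}  discrete: {discrete_err:.3e}")
```

The mean absolute error is used here instead of the error range because it is less sensitive to individual samples.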
IV. VERILOG MODELING AND SYNTHESIS
To confirm the benefits of the fused butterfly unit, the following floating-point units were implemented in synthesizable Verilog-RTL:
The Verilog models were synthesized using 45 nm libraries. The area and the critical timing paths were evaluated. All the units were designed to operate on single-precision IEEE Std-754 operands [10].
V. PLACE AND ROUTE
The floating-point adder, multiplier, and fused dot-product unit were implemented using an automatic synthesize, place, and route approach. A high-performance 45 nm process was used for the implementation, with a standard cell library designed for high-speed applications. Figure 8 shows the floorplan and the critical timing path of the conventional butterfly unit, while Figure 9 shows the floorplan and the critical timing path of the fused butterfly unit. Table 1 shows the implementation data. The conventional floating-point butterfly unit occupies an area of 247 μm by 248 μm, while the fused floating-point butterfly unit occupies an area of 218 μm by 218 μm. The timing of the placed, routed, and tapeout-ready fused and conventional parallel butterfly units was analyzed using industry-standard STA tools with an extracted and back-annotated netlist. The fused butterfly unit performed a butterfly operation in 4.0 ns, while the conventional butterfly needed 4.6 ns.
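The ratios implied by these figures can be checked with a few lines of arithmetic. The sketch below uses only the latencies and floorplan dimensions quoted above, and the variable names are illustrative. The footprint ratio is computed from the bounding-box dimensions, so it need not equal an area ratio derived from the implementation data in Table 1.

```python
# Latencies and floorplan dimensions as quoted in the text.
fused_ns, conventional_ns = 4.0, 4.6
fused_area_um2 = 218 * 218          # fused butterfly footprint
conventional_area_um2 = 247 * 248   # conventional butterfly footprint

latency_ratio = fused_ns / conventional_ns
footprint_ratio = fused_area_um2 / conventional_area_um2

print(f"latency ratio: {latency_ratio:.1%}")      # latency ratio: 87.0%
print(f"footprint ratio: {footprint_ratio:.1%}")  # footprint ratio: 77.6%
```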
VII. CONCLUSIONS
The area and latency of the two conventional approaches (ignoring the multiplexers and register) and the FDP unit are compared in Table 2 and plotted in Figure 10 . The fused dot product is intermediate in area between the conventional serial and the conventional parallel approaches. Its latency is about 80% of that of the conventional parallel approach and about half that of the conventional serial approach. 
