Abstract - Two new floating-point fused multiply-add architectures for the single-instruction execution of (A x B) + C are presented. The three-path architecture uses parallel hardware paths similar to those in dual-path floating-point adders. The new bridge architecture re-uses common floating-point components to add a fused multiply-add instruction.
I. INTRODUCTION
The fused multiply-add (FMA) operation was introduced in 1990 on the IBM RS/6000 for the single-instruction execution of (A x B) + C with single- and double-precision floating-point operands [1], [2]. This hardware unit was designed to reduce the latency of dot product calculations and to provide greater floating-point arithmetic precision, since an FMA performs only a single rounding on the combined full-precision product and sum.
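The effect of the single rounding can be illustrated in software. The following is a minimal sketch, not the hardware algorithm: it emulates a fused result with Python's exact rational arithmetic and compares it with a separately rounded multiply and add. The helper name `fused_multiply_add` is hypothetical, and the sketch assumes finite inputs and does not model signed zeros.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate (a * b) + c with a single rounding (finite inputs only)."""
    # Fraction holds the product and the sum exactly; float() rounds once.
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Round after the multiply, then round again after the add."""
    return a * b + c


a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

# The exact product is 1 - 2**-54, a tie between two doubles.
# Rounding the product first yields 1.0, so the small term is lost.
unfused = unfused_multiply_add(a, b, c)   # 0.0
fused = fused_multiply_add(a, b, c)       # -(2**-54)
```

With these operands the unfused sequence returns zero, while the single rounding keeps the full-precision difference.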
Since 1990, many algorithms that use the single-instruction (A x B) + C operation have been introduced, with applications in DSP and graphics processing [3], [4], FFTs [5], FIR filters [3], division [6], argument reduction [7], etc. To accommodate the increased use of the FMA instruction, several commercial processors have implemented embedded FMA units. These chips include designs by IBM [1], [8]-[10], HP [11], [12], MIPS [13], ARM (MACC/un-fused) [3], and Intel [14], [15]. Some chips entirely replace the floating-point adder (FADD) and floating-point multiplier (FMUL) with an FMA unit by using constants to perform single floating-point operations, e.g., (A x B) + 0.0 for multiplies and (A x 1.0) + C for adds. All these realizations use a serial implementation based on a modified IBM RS/6000 architecture.
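The constant-operand substitution can be sketched in a few lines. This is an illustrative software model only: `fma_model`, `fmul_via_fma`, and `fadd_via_fma` are hypothetical names, the model uses exact rational arithmetic in place of a hardware FMA, and it assumes finite inputs and does not reproduce signed-zero behavior.

```python
from fractions import Fraction


def fma_model(a: float, b: float, c: float) -> float:
    """Software stand-in for an FMA unit: exact (a * b) + c, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def fmul_via_fma(a: float, b: float) -> float:
    """Multiply issued as (A x B) + 0.0."""
    return fma_model(a, b, 0.0)


def fadd_via_fma(a: float, c: float) -> float:
    """Add issued as (A x 1.0) + C."""
    return fma_model(a, 1.0, c)
```

For finite, nonzero results each wrapper rounds the same exact value that a dedicated multiplier or adder would round, so the results agree with the native `*` and `+` operators.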
This combination of commercial implementation and increased algorithmic activity has pushed the IEEE P754 committee to consider including the FMA instruction in the proposed IEEE standard for floating-point arithmetic [16]. Most recently, AMD introduced the SSE5 extensions to the x86 instruction set [17], which are centered on the inclusion of the FMA instruction and its derivatives in modern x86 processors.
Previous FMA works
The hardware implementation of an FMA unit is a difficult undertaking. The latency and power problems presented by the massive internal precision of an FMA datapath, together with the overhaul of the supporting floating-point co-processor needed for a three-operand system, demand a large amount of resources and bring several undesired trade-offs, among them a noticeable reduction in FADD and FMUL single-instruction performance [1], [9].
In efforts to reduce the costs of the original FMA architectural design, several proposals for the improvement of FMA execution units have been made. A reduced-latency FMA has been proposed that combines the addition and rounding stages to reduce the number of serial stages required by a traditional FMA unit [18]. The design was later updated to allow higher-performance FADD instructions by introducing a dual-path system similar to that found in a dual-path FADD architecture [19]. A multi-path FMA presented in [20] identifies the five possible major arithmetic cases of an FMA and processes these cases in two parallel hardware paths. An FMA presented in [21] reduces the power consumption of the massive execution unit by selectively shutting down two parallel datapaths, at the cost of increased area and lower precision.
Most new FMA proposals support a dual-path FADD concept as applied to FMAs [19]-[21]. However, forcing an FMA unit into a strictly dual-path far/close system presents new implementation challenges never faced by a dual-path FADD unit. A dual-path FMA results in either a reduction of precision or a massive far-path latency due to the large aligner bit-widths and case-specific carry/sticky hardware. The dependency on whether the addend or the product is the greater term greatly complicates the path and increases the hardware requirements to the point where the far path itself becomes the limiting feature of the block. Such issues in dual-path FMA applications worsen as precisions increase and technologies scale down, as a result of an increasingly wire-dominated datapath [22]-[24].
Proposed FMAs
Two new FMA designs are presented with differing goals. The first of the designs, the three-path FMA, is intended to reduce the power and latency requirements of the FMA unit by using three parallel hardware paths. Three implemented hardware paths are necessary in an FMA to achieve the benefits of a multi-path system and avoid the costs of a dual-path FMA. The classic concept of a far path is itself further divided in the presented FMA to keep full precision while providing latency and power reductions.
The second FMA design is intended to provide an FMA solution to current floating-point units without requiring either a massive co-processor architectural overhaul or a third independent execution unit. Instead, the design presents a new architecture that adds FMA functionality by including hardware between an FADD and FMUL unit, creating a "bridge" that connects the two. The bridge FMA architecture is designed to re-use as many components as possible from both the FADD and FMUL to minimize the area and the power consumption of the enhancement.
To provide details on the design costs of the bridge and three-path FMAs [1], [2]. There are localized improvements and varying bit-width changes from one design to another, but the basic microarchitecture has remained invariant. The traditional architecture (denoted the "classic" FMA unit in this paper) is shown in Figure 1.
There are several serial steps to complete the execution of an FMA instruction in double-precision format on the traditional architecture:
1.) Multiply the 53-bit input significands A x B to produce a 106-bit carry-save format result. Align the addend C to the appropriate point in the range of the 106-bit product, including up to 55 bits above the product (53-bit significand length and a double carry buffer of 2 bits), based on the exponent difference. This requires a 161-bit aligner.
2.) Combine the lower 106 bits of the addend and the product with a 3:2 CSA to produce a 161-bit string of data in carry-save format.
The three-path FMA architecture shown in Figure 2 is designed to follow these guidelines. The global design splits the datapath following the multiplier tree into three case-specific blocks. Only one of the three paths is ever selected for use in any given instruction.
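The addend alignment in step 1 of the classic FMA can be modeled with arbitrary-precision integers. The sketch below is an illustration under stated assumptions rather than the circuit itself: the function name `align_addend`, the exponent convention (significands as 53-bit integers, `exp_diff` equal to the addend exponent minus the sum of the multiplicand exponents), and the resulting anchor offset of 56 are assumptions not given in the text. Only the 53-, 106-, 2-, and 161-bit widths come from the description above, and only magnitudes are handled.

```python
from fractions import Fraction

SIG_BITS = 53                                     # double-precision significand
PRODUCT_BITS = 2 * SIG_BITS                       # 106-bit product
BUFFER_BITS = 2                                   # double carry buffer
FIELD_BITS = PRODUCT_BITS + BUFFER_BITS + SIG_BITS  # 161-bit aligner field
TOP_SHIFT = PRODUCT_BITS + BUFFER_BITS            # addend LSB when held at the top
ANCHOR_DIFF = TOP_SHIFT - (SIG_BITS - 1)          # assumed convention: 56


def align_addend(sig_c: int, exp_diff: int) -> tuple[int, bool]:
    """Place a 53-bit addend significand in the 161-bit field.

    Returns the aligned addend and a sticky flag that records whether
    any nonzero bits were shifted out below the product's LSB.
    """
    shift = min(max(ANCHOR_DIFF - exp_diff, 0), FIELD_BITS)
    wide = sig_c << TOP_SHIFT            # addend at the top of the field
    aligned = wide >> shift
    sticky = (wide & ((1 << shift) - 1)) != 0
    return aligned, sticky


# Worked example (magnitudes only): 1.5 * 1.0 + 0.25 = 1.75
sig_a = 3 << 51          # 1.5  as a 53-bit integer significand, exponent 0
sig_b = 1 << 52          # 1.0, exponent 0
sig_c = 1 << 52          # 0.25 is 1.0 * 2**-2, so exp_diff = -2
aligned, sticky = align_addend(sig_c, -2)
total = sig_a * sig_b + aligned
value = Fraction(total, 1 << 104)   # product LSB weight is 2**-104 here
```

In this model the addend never moves left of the field; exponent differences above the assumed anchor offset are simply clamped, which is where a hardware design would treat the product as the term falling into the sticky range.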
Each parallel path independently processes and prepares the numerical data for a combined add/round stage. As in many modern arithmetic unit designs, a combined add/round stage removes the requirement for a massive adder followed by another addition/increment unit for the purpose of IEEE-754 compliant rounding.
The specifics of each path, as well as an explanation of the selected add/round scheme, are described briefly in the following sections.
A. Three-Path FMA Split Far Paths
The three-path FMA unit uses two separate far paths with opposite operand "anchoring" for data-dependent processing. The term "anchor" refers to locking the position of the greater operand in a classic far path. As seen in Figure 2, the two blocks are named the adder far path and the product far path. Figure 3 shows the adder far path in detail. This path is selected when the exponent difference indicates that the addend is larger. For this case, the addend is anchored. The later-arriving product is then aligned over a 57-bit range and inverted for subtracts. Following the inversion stages, all three operands are combined in 3:2 carry-save adders (CSAs) or half adders (HAs) to produce two 163-bit numbers. The most significant bits of both results are used for corner-case correction, and the lower 55 to 58 bits are sent to a carry/sticky tree, as the least significant bits will never be selected in the final result.
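The role of the carry/sticky tree can be sketched as follows: when the low bits of a carry-save pair are discarded, they can still contribute one carry into the retained bits and a sticky flag for rounding. This is a behavioral model for effective additions only. The name `carry_sticky` and the `cut` parameter are hypothetical, and an ordinary integer addition stands in for the tree structure used in hardware.

```python
def carry_sticky(sum_word: int, carry_word: int, cut: int) -> tuple[int, bool]:
    """Summarize the `cut` discarded low bits of a carry-save pair.

    Returns (carry_in, sticky): the carry propagated into the retained
    upper bits, and whether the discarded part of the true sum is nonzero.
    """
    mask = (1 << cut) - 1
    low = (sum_word & mask) + (carry_word & mask)
    carry_in = low >> cut            # 0 or 1
    sticky = (low & mask) != 0
    return carry_in, sticky


# Example with 8-bit words and the lower 4 bits discarded.
s, c, cut = 0b10110110, 0b01011011, 4
carry_in, sticky = carry_sticky(s, c, cut)
upper = (s >> cut) + (c >> cut) + carry_in   # equals (s + c) >> cut
```

The retained upper bits plus the carry-in reproduce the upper part of the full sum, so the wide adder does not need to span the discarded positions.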
The adder far path finalizes with two 106-bit operands ready for addition and rounding, as well as an input carry and a sticky bit generated by the discarded lower bits. The adder far path unit is not a timing bottleneck, as a 57-bit normalization not loaded with parallel carry/sticky logic is easily optimized compared to the FMA close path.
Figure 4 shows the product far path. This path is the complement of the adder far path and is enabled when the exponent difference determines that the product is larger than the addend. Much as in the classic FMA, the aligner in this path is not only the largest in the system but is also loaded with sticky fallout. Therefore this path, if selected, runs in parallel with the multiplier tree. When the product arrives from the multiplier, all the data are combined in a 3:2 CSA, adjusted, and sent in 106-bit sum/carry form to the add/round stage.
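The 3:2 carry-save combination used in both far paths can be written as a short bitwise model. This is the textbook carry-save identity rather than the specific cell arrangement of the design; the function name `csa_3to2` is hypothetical.

```python
def csa_3to2(x: int, y: int, z: int) -> tuple[int, int]:
    """Reduce three operands to a (sum, carry) pair with no carry propagation.

    Each bit position is an independent full adder: the sum word holds the
    XOR of the three inputs and the carry word holds the majority, shifted
    left by one position.
    """
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word


# The pair always adds up to the same value as the three inputs.
s, c = csa_3to2(0b1011, 0b0110, 0b1101)
```

Because no carry ripples across bit positions, the delay of this step is independent of operand width; the single carry-propagating addition is deferred to the add/round stage.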
B. Three-Path FMA Close Path
For cases when the exponents of the addend and the product are too close to easily determine the larger operand, all data are passed to the close path. This path only handles FMA subtraction operations and is geared specifically to deal with massive cancellation.
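The selection among the three paths can be summarized as a small decision function. This is a sketch under explicit assumptions: the text does not give the exponent-difference threshold for the close path, so `close_range` is a hypothetical parameter with an arbitrary default, and routing effective additions with small exponent differences to a far path by the sign of the difference is also an assumption. The names `Path` and `select_path` are illustrative.

```python
from enum import Enum


class Path(Enum):
    ADDER_FAR = "adder far path"
    PRODUCT_FAR = "product far path"
    CLOSE = "close path"


def select_path(exp_diff: int, effective_subtract: bool,
                close_range: int = 1) -> Path:
    """Pick one of the three paths for an FMA instruction.

    exp_diff is the addend exponent minus the product exponent.
    close_range is a hypothetical threshold, not a value from the paper.
    """
    if effective_subtract and abs(exp_diff) <= close_range:
        return Path.CLOSE            # possible massive cancellation
    if exp_diff > 0:
        return Path.ADDER_FAR        # addend is the anchored operand
    return Path.PRODUCT_FAR          # product is the anchored operand
```

Exactly one path is returned for every input, mirroring the statement that only one of the three paths is used in any given instruction.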
To match the two far paths, the close path is designed without a complementation stage. As shown in Figure 5, the close path accomplishes this via significand swapping. First, the close path uses 3:2 CSAs and HAs to combine a complemented, aligned addend with the product. Likewise, the logically opposite term is also created with a complemented product and an un-complemented addend. The first 3:2 combination is passed to a 57-bit comparator to determine which of the operands is larger. A width of 57 bits is selected because all bits after position 57 in the aligned addend are always '0's; these bits are handled by the "no round" path. The comparator signals the swap multiplexers to choose the correct inversion combination, and the results are normalized in preparation for addition and rounding. The LZA that controls this normalization is passed only one combination of inversion inputs, as its functionality is not affected by which operand is larger. Depending on the addition/rounding scheme selected, the one-bit LZA correction shift may be handled in the add/round block.
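The idea behind significand swapping can be shown with a simplified integer model: both subtraction orders are formed up front, and a comparison selects the non-negative one, so no complementation of the result is needed afterwards. The sketch below is not the circuit in Figure 5. The name `close_path_difference` is hypothetical, a plain integer comparison stands in for the 57-bit comparator, and an exact leading-zero count stands in for the LZA, which in hardware is an anticipator that may be off by one bit.

```python
def close_path_difference(product: int, addend: int,
                          width: int = 57) -> tuple[int, int]:
    """Return (magnitude, normalization_shift) for |product - addend|.

    Both two's-complement differences are formed in parallel and the
    comparator result selects the one that is non-negative.
    """
    mask = (1 << width) - 1
    p_minus_c = (product + (~addend & mask) + 1) & mask
    c_minus_p = (addend + (~product & mask) + 1) & mask
    product_is_larger = product >= addend          # comparator
    magnitude = p_minus_c if product_is_larger else c_minus_p
    # Left shift that would move the leading one to the top bit.
    shift = width - magnitude.bit_length()
    return magnitude, shift


# Massive cancellation in a 7-bit example: 1000000 - 0111111 = 1
magnitude, shift = close_path_difference(0b1000000, 0b0111111, width=7)
```

Swapping the two inputs yields the same magnitude and shift, which is the property that lets the close path avoid a post-subtraction complement.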
C. Three-Path FMA Add/Round Stage
All three middle-stage paths in the three-path FMA design prepare the data for the 106-bit add/round stage. The combined addition and rounding algorithm draws on various suggestions for the add/round stages of a floating-point multiplier [25], [26], with modifications to the control logic, signals, and multiplexer sizes to account for the FMA functionality. Finally, a "no round" path block has also been added in parallel to the scheme to handle the extra output case from the close path.
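The principle of a combined add/round stage is that the truncated sum and the truncated sum plus one are both made available, and the rounding decision only selects between them. The following is a minimal behavioral sketch of that principle, not the scheme of [25], [26]: the name `add_round` is hypothetical, round-to-nearest-even is assumed as the rounding mode, and only positive magnitudes are handled.

```python
def add_round(sum_word: int, carry_word: int, sticky_in: bool = False,
              out_bits: int = 53) -> tuple[int, int]:
    """Add a carry-save pair and round to `out_bits` (nearest, ties to even).

    Returns (significand, dropped_bits), where dropped_bits is the number
    of low-order positions removed from the full sum.
    """
    total = sum_word + carry_word
    if total == 0:
        return 0, 0
    drop = max(total.bit_length() - out_bits, 0)
    truncated = total >> drop
    incremented = truncated + 1          # candidate computed alongside
    guard = (total >> (drop - 1)) & 1 if drop > 0 else 0
    below = total & ((1 << (drop - 1)) - 1) if drop > 1 else 0
    sticky = sticky_in or below != 0
    round_up = guard == 1 and (sticky or (truncated & 1) == 1)
    result = incremented if round_up else truncated   # final select
    if result.bit_length() > out_bits:   # rounding overflowed to 2**out_bits
        result >>= 1
        drop += 1
    return result, drop
```

A tie with an even truncated value is left unchanged, while the same input with an incoming sticky bit (for example from discarded far-path bits) is rounded up.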
In the case of a close path selection with the no round signal assertion, the no round data inputs are added and normalized in a path separate from and parallel to the add/round stage. The result from this "no round" path is forwarded to the add/round result multiplexer, postnormalized, and latched.
When either the no round path result or add/round result is latched, the FMA instruction is complete and the data exit the unit, finalizing the three-path FMA datapath.
IV. THE BRIDGE FMA ARCHITECTURE
The bridge fused multiply-add unit is a design intended to add FMA functionality to existing floating-point co-processor units by including specialized hardware that re-uses FADD and FMUL components. Additionally, the bridge architecture is intended to provide a realistic study of the implementation costs involved in adding the FMA feature to popular FADD and FMUL co-processor units. Figure 6 shows a high-level block diagram of the bridge FMA architecture. The design begins with common FMUL and FADD units capable of independent and parallel execution. Several blocks are added between the two execution units, creating a "bridge" capable of carrying and processing data from the FMUL unit to the FADD unit to perform an FMA instruction.
The bridge FMA architecture does not require a fully independent FMA hardware implementation. Pieces from both the FMUL and FADD units are modified and reused for dual functionality. Specifically, the FADD add/round stage is used for both additions and FMAs, while the FMUL re-uses the single largest component block of any arithmetic unit, the multiplier array. The remaining hardware requirements for a complete FMA instruction are implemented in the bridge unit itself, which is only powered on via clock-gating during an FMA instruction.
A. The Bridge FMA Unit
The bridge unit is shown in Figure 7. The unit is essentially the classic FMA architecture described in Section II without the multiplier array, rounding, or post-normalization block. Instead, the bridge unit accepts the product from the floating-point multiplier and combines it with a pre-aligned 161-bit addend taken from an operand originally provided to the FADD. The unit then proceeds with steps 2-6 of the FMA algorithm described in Section II.
B. The Bridge Add/Round Unit
The addition and rounding unit is designed to perform several roles. When a stand-alone FADD instruction is required, the add/round unit acts as a common FADD dual-path merge stage, selecting between the far and close path operands as inputs to the addition and rounding units. For FMA instructions, the same multiplexer used for the merge in the FADD path is expanded to select the unrounded result of the FMA unit. The second operand input to the addition and rounding units is passed a null string, as another operand is not needed to complete the FMA rounding.
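The expanded merge multiplexer can be sketched as a simple selection function. This is a behavioral illustration only: the names `Op` and `select_add_round_inputs`, the representation of operands as integers, and the use of `0` for the null string are assumptions made for the example.

```python
from enum import Enum


class Op(Enum):
    FADD = "fadd"
    FMA = "fma"


def select_add_round_inputs(op: Op, close_selected: bool,
                            far_pair: tuple[int, int],
                            close_pair: tuple[int, int],
                            fma_unrounded: int) -> tuple[int, int]:
    """Choose the two operands fed to the shared add/round stage.

    For FADD the stage merges the dual-path adder's far and close paths.
    For FMA the bridge's unrounded result takes the first operand slot and
    the second operand is a null string, so the stage only rounds.
    """
    if op is Op.FMA:
        return fma_unrounded, 0
    return close_pair if close_selected else far_pair
```

Modeling the FMA case as an addition with zero reflects the description above: the add/round hardware is re-used unchanged, and only the input multiplexer grows by one source.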
The add/round unit used in the implementation of the bridge FMA is a combined add/round scheme based on those suggested in [27] and [28]. The combined FADD/FMA add/round stage finalizes the bridge FMA datapath.
RESULTS
Each unit's Verilog RTL was custom and hand-synthesized into AMD's 65 nm silicon-on-insulator technology using a standard-cell library. The standard-cell library is used so that the results give insight into the logical build of each unit, rather than into a more complex and ambiguous custom transistor-level construction.
To fully understand the units as a whole, each design's internal pipeline latches were removed so that a total latency and power result could be obtained. For a normalized result, each power simulation frequency was set to that of the slowest block in the group being compared (without latches). Table I compares the classic FMA and the three-path FMA designs in the categories of latency (both logic-level count and physical timing), area, and maximum observed power consumption. The three-path FMA shows an approximately 25% decrease in required logic levels, a 12% decrease in latency, and a 15% decrease in maximum power consumption. These gains require an approximately 40% increase in area.
