Abstract: Decimal floating-point operations are important for applications that cannot tolerate errors from conversions between binary and decimal formats, for instance scientific, commercial, and financial applications. In this paper we present an IEEE 754-2008 compliant parallel decimal floating-point multiplier designed to exploit the features of Virtex-5 FPGAs. It is an extension of a previously published decimal fixed-point multiplier. The decimal floating-point multiplier implements early estimation of the shift-left amount and efficient decimal rounding. Additionally, it provides all required rounding modes, exception handling, overflow, and gradual underflow. Several pipeline stages can be added to increase throughput. Furthermore, different modifications are analyzed, including shifting by means of hard-wired multipliers and delayed carry propagation adders.
I. INTRODUCTION
Numerical problems are usually formulated in decimal notation. They should therefore be solved with decimal floating-point arithmetic in order to avoid conversion errors during input and output. Because of its increasing importance, specifications for decimal floating-point arithmetic have been added to the recently approved IEEE 754-2008 Standard for Floating-Point Arithmetic, which offers a more thorough specification than the former IEEE standard for Radix-Independent Floating-Point Arithmetic. For this reason, new efficient algorithms, in particular for multiplication, have to be investigated, and hardware support for decimal arithmetic is becoming an increasingly important topic.
In this paper our previously published parallel fixed-point multiplier [1] is extended to support decimal floating-point multiplication. It is fully parallel and several configurable pipeline stages can be inserted. A fast leading zeros counter is presented and different new implementations are traded off, including delayed carry propagation adders and shifting by means of multiplication. It is IEEE 754-2008 compliant, i.e., it supports all rounding modes, exception handling, overflow, and gradual underflow. Particularly, the latter requires a significant overhead and is therefore often omitted.
The paper is organized as follows: Section II briefly describes the fixed-point multiplier. Section III introduces decimal floating-point multiplication, followed by a detailed description of the proposed architecture. Post-place-and-route results are presented in Section IV, and Section V summarizes the main contributions of this paper.
II. DECIMAL FIXED-POINT MULTIPLIER
The Decimal Fixed-Point Multiplier (DFixMul) computes the product A · B of the unsigned decimal multiplicand A and multiplier B, both with the same precision p. It is fully combinational and can be pipelined. In particular, it is based on BCD recoding schemes, fast partial product generation, and a BCD-4221 Carry Save Adder (CSA) reduction tree. It is optimized for use on Xilinx Virtex-5 FPGAs.
A decimal number Z is called BCD-$X_1X_2X_3X_4$ coded when Z can be expressed by (1).
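To make the notion of a weighted BCD code concrete, the following minimal sketch assumes the usual interpretation in which each decimal digit equals the weighted sum of its four bits. The helper names (`decode_digit`, `encodings`) are hypothetical and are not taken from the paper.

```python
from itertools import product

# Weight tuples (X1, X2, X3, X4) of the codes mentioned in the text.
BCD_8421 = (8, 4, 2, 1)
BCD_5421 = (5, 4, 2, 1)
BCD_4221 = (4, 2, 2, 1)


def decode_digit(bits, weights):
    """Return the decimal value of a 4-bit tuple under the given weights."""
    return sum(b * w for b, w in zip(bits, weights))


def encodings(digit, weights):
    """List every 4-bit pattern that represents `digit` under `weights`."""
    return [bits for bits in product((0, 1), repeat=4)
            if decode_digit(bits, weights) == digit]


if __name__ == "__main__":
    print(decode_digit((1, 0, 0, 1), BCD_8421))   # weights 8 and 1 are set
    print(encodings(2, BCD_4221))                 # BCD-4221 is redundant
```

Under this interpretation BCD-4221 is redundant (some digits have more than one bit pattern), and since its weights sum to 9, every 4-bit pattern is a valid decimal digit.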
Time-critical components are the BCD-8421 Carry Propagation Adders (CPA) that are used in the generation of the multiplicand multiples as well as for the final addition. The adders proposed in [3] are designed and placed at slice level, aiming at a minimum carry chain length and the least possible propagation delays. The DFixMul presented in this paper is designed for Virtex-5 devices and is based on a previously published work [1], which was optimized for Virtex-II Pro FPGAs. The architectures are very similar; however, the low-level components differ, i.e., the CSA, the CPA, and the Multiplicand Multiples Generator (MMGen).

The fixed-point multiplier consists of six functional blocks, as depicted in Fig. 1. The basic idea is to generate p + 1 partial products and to sum them up, which is performed by the parallel Carry Save Adder Tree (CSAT) and the final BCD-8421 CPA. The CSAT is based on (3:2) CSA blocks for the BCD-4221 format. The partial products are

The MMGen exploits the correlation between shift operations and constant value multiplication [2]. Thus, the multiples A × 2, A × 4, and A × 5 can be computed by digit recoding and constant shift operations. A recoding is very fast and consumes two (6:2) LUTs per digit, whereas a constant shift operation costs nothing because it is just a renaming of signals. Only A × 3 requires an additional CPA, as depicted in Fig. 2.
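The carry-free generation of the multiples can be illustrated at the digit level. The sketch below is a behavioral model only, not the LUT-level recoding of the paper: it assumes that doubling a digit produces a transfer that depends solely on the neighboring lower digit, and the function names are hypothetical. Digit lists are stored least significant digit first.

```python
def to_digits(n, p):
    """Split a non-negative integer into p decimal digits, LSD first."""
    return [(n // 10**i) % 10 for i in range(p)]


def to_int(digits):
    """Inverse of to_digits."""
    return sum(d * 10**i for i, d in enumerate(digits))


def times2(digits):
    """Double a decimal number; each output digit needs only one neighbor."""
    out, transfer = [], 0
    for d in digits:
        out.append(2 * (d % 5) + transfer)   # at most 2*4 + 1 = 9
        transfer = 1 if d >= 5 else 0        # depends on this digit only
    out.append(transfer)
    return out


def times5(digits):
    """Multiply by five; again no carry ripples through the word."""
    out, half = [], 0
    for d in digits:
        out.append(5 * (d % 2) + half)       # at most 5 + 4 = 9
        half = d // 2
    out.append(half)
    return out


def multiples(a, p):
    """Return A*1 .. A*5; only A*3 needs a real addition (the extra CPA)."""
    d = to_digits(a, p)
    x2 = times2(d)
    return {1: a, 2: to_int(x2), 3: a + to_int(x2),
            4: to_int(times2(x2)), 5: to_int(times5(d))}


if __name__ == "__main__":
    print(multiples(9876543210987654, 16))
```

In this model A × 4 is obtained by doubling twice, and A × 3 is the only multiple that needs a carry-propagating addition, which matches the role of the additional CPA described above.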
The Decimal Recoding Unit (DRec) shown in Fig. 1 transforms each multiplier digit $B_k$ from the digit set {0, ..., 9} into the signed digit set {−5, ..., 5}. This recoding increases the number of partial products by one (to p + 1) but does not require any ripple carry. Hence, it is a very fast operation.
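A behavioral sketch of such a signed-digit recoding is given below. It assumes the common scheme in which a digit of 5 or more emits a transfer of one into the next higher position; the exact scheme used in the DRec is not spelled out here, so this is an illustration rather than the paper's circuit. The function name `recode_signed` is hypothetical.

```python
def recode_signed(digits):
    """Recode p digits in 0..9 (LSD first) into p + 1 digits in -5..5.

    Each output digit depends only on its own input digit and on the
    digit one position below, so no carry ripples through the word.
    """
    out, transfer_in = [], 0
    for d in digits:
        transfer_out = 1 if d >= 5 else 0
        out.append(d - 10 * transfer_out + transfer_in)
        transfer_in = transfer_out
    out.append(transfer_in)          # the additional (p + 1)-th digit
    return out


def value(digits):
    """Numeric value of an LSD-first digit list (signed digits allowed)."""
    return sum(d * 10**i for i, d in enumerate(digits))


if __name__ == "__main__":
    b = [7, 4, 9, 5]                 # the number 5947, LSD first
    print(recode_signed(b), value(recode_signed(b)))
```

The extra most significant digit is what raises the number of partial products from p to p + 1.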
Since the multiplier's output is of length 2p but a single partial product is of length p, each partial product has to be extended for 10's complement generation and, if required, padded with nines. To keep the number of CSAT inputs small, the Negative Digits Correction Unit (NegDC) combines the paddings of all partial products into a single word and passes it to the CSAT. This is feasible because adding several words composed of leading nines followed by zeros always yields a decimal word composed only of the digits 0, 8, and 9. Moreover, the positions of the nines and eights can be calculated very fast by using the FPGA's fast carry chain, see [1].
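The claim about the digits 0, 8, and 9 can be checked numerically. The sketch below assumes that every negative partial product contributes one padding word whose nines start at a distinct digit position and extend to the most significant digit, and that the sum is taken modulo 10^width; the function name is hypothetical.

```python
def combined_padding(positions, width):
    """Add padding words 99..900..0 modulo 10**width.

    `positions` lists, for each negative partial product, the index of
    the lowest digit that is padded with a nine (assumed distinct).
    """
    total = 0
    for pos in positions:
        total += 10**width - 10**pos     # nines from `pos` up to the MSD
    return total % 10**width


if __name__ == "__main__":
    word = combined_padding([3, 5, 6, 9], 12)
    print(str(word).zfill(12))
```

Each padding word equals −10^pos modulo 10^width, so the sum is the 10's complement of a number whose digits are only 0 and 1, which explains why only 0, 8, and 9 can appear.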
The reduction of the partial products is performed by an (n:2) CSA tree. The tree is composed of parallel and consecutively wired BCD-4221 (3:2) CSAs, each of which adds three BCD-4221 digits and yields a sum digit and a carry digit, both BCD-4221 coded. The n = p + 2 decimal words comprise the p + 1 partial products and one summand that accounts for the sign paddings, as described previously. The redundant carry-save output of the CSAT can be further reduced by a carry propagation adder of length 2p to obtain a unique result.
III. DECIMAL FLOATING-POINT MULTIPLICATION
Decimal floating-point (DFP) numbers are encoded in three fields: a sign field s, a combination field G, and a trailing significand field T. Two fixed-width basic formats for DFP numbers (decimal64 and decimal128) are specified in IEEE 754-2008 and are provided in Table I. The combination field G encodes the biased exponent, the most significant digit (MSD), and information about infinity and Not a Number (NaN). The trailing significand is encoded either via the Densely Packed Decimal (DPD) algorithm or as an unsigned binary integer. The floating-point multiplier presented in this paper supports the decimal64 interchange format with DPD encoding.

In contrast to decimal fixed-point multipliers, there are only a few papers presenting designs for decimal floating-point multiplication that are fully or partly in compliance with IEEE 754-2008. Erle et al. [4] describe an iterative as well as a parallel DFP multiplier that include early estimation of the Shift-Left Amount (SLA) and efficient decimal rounding. Their proposed iterative DFP multiplier can be extended to support gradual underflow, but their parallel DFP multiplier cannot. However, the underflow feature increases the iterative multiplier's maximum latency significantly. In [5], a DFP multiplier implemented on an FPGA is presented, but it does not
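The three-field layout of decimal64 described above can be sketched as follows. The field widths (1, 13, and 50 bits) and the interpretation of the combination field follow the IEEE 754-2008 decimal encoding as commonly documented, not this paper's Table I; the function names are hypothetical, and the DPD decoding of the trailing significand is deliberately left out.

```python
def split_decimal64(word):
    """Split a 64-bit decimal64 word into (sign, G, T)."""
    sign = (word >> 63) & 0x1
    comb = (word >> 50) & 0x1FFF            # 13-bit combination field G
    trailing = word & ((1 << 50) - 1)       # 50-bit trailing significand T
    return sign, comb, trailing


def decode_combination(comb):
    """Classify the operand and extract biased exponent and MSD from G."""
    top5 = comb >> 8                        # G0..G4
    continuation = comb & 0xFF              # 8 exponent continuation bits
    if top5 == 0b11111:
        signaling = (comb >> 7) & 1         # G5 separates sNaN from qNaN
        return {"kind": "sNaN" if signaling else "qNaN"}
    if top5 == 0b11110:
        return {"kind": "infinity"}
    if top5 >> 3 == 0b11:                   # large MSD: 8 or 9
        exp_msbs = (top5 >> 1) & 0b11
        msd = 8 + (top5 & 1)
    else:                                   # small MSD: 0..7
        exp_msbs = top5 >> 3
        msd = top5 & 0b111
    biased = (exp_msbs << 8) | continuation
    return {"kind": "finite", "biased_exponent": biased, "msd": msd}


if __name__ == "__main__":
    s, g, t = split_decimal64(0x2238000000000001)   # the value 1
    print(s, decode_combination(g), t)
```

With the decimal64 exponent bias of 398, the biased exponent decoded for the example word corresponds to an exponent of zero.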
A. Proposed Parallel Floating-Point Multiplier
The Decimal Floating-Point Multiplier (DFlMul) proposed in this paper is an extension of the DFixMul. A block diagram is depicted in Fig. 3. A DFP multiplication begins with the decoding of the input operands and the extraction of the signs, the significands, and the exponents. If one of the operands is a signaling NaN (sNaN) or a quiet NaN (qNaN), then the result is a quiet NaN with the payload of the original NaN. Hence, in order to preserve the NaN payload, the exponent and significand reset units set the non-NaN operand's significand to one and its exponent to zero, see Fig. 4. Additionally, the exponents of operands that are equal to zero are reset in order to prevent possible exceptions due to exponent overflow.

In the next step the decimal fixed-point multiplication is calculated. The result consists of two 2p-digit BCD-4221 coded words. Initially, we do not use a 2p-digit CPA to eliminate this redundancy, because summing up the result after left shifting (so-called delayed CPA) requires a CPA of length p + 2 only. Later, we also consider an implementation that uses the fixed-point multiplier's unique CPA output and compare its implementation results with those of the delayed CPA design.

In parallel to the multiplication, the significands are examined and their leading zeros ($LZ_A$, $LZ_B$) are counted to determine the SLA. Moreover, the Intermediate Exponent of the Shifted Intermediate Result ($IE_{SIP}$) is calculated by the exponent computation unit, and the signs are XORed to determine the product's sign ($sign_P$). According to the SLA, the multiplier's outputs $IP_s$ and $IP_w$ are left-shifted and afterwards summed up. Furthermore, the p-digit upper word of the shifted intermediate product ($SIP_U$), the guard digit (G), the round digit (R), the sticky bit (sb), and a carry signal are determined. Depending on the current $IE_{SIP}$, the overflow and underflow correction unit can insert a corrective right shift when gradual underflow occurs.
Likewise, a corrective left shift is necessary when $IE_{SIP}$ exceeds the maximum exponent while $SIP_U$ still has leading zeros. If the number of essential digits in the intermediate product is greater than the precision p or if gradual underflow occurs, then rounding is required. The Rounding Unit (RU) computes the rounded product (RP), the exponent of the rounded product ($E_{RP}$), and an inexact signal that is asserted when accuracy is lost due to rounding. Finally, the result is encoded again considering $sign_P$, RP, $E_{RP}$, infinity, and NaN. The Exception Unit (ExU) might assert additional exception signals, i.e., invalid operation (result is NaN), inexact (a rounding has been performed), overflow, and underflow. In the following, the SLA computation and exponent computation, shifting, rounding, overflow and underflow correction, and exception generation are explained in detail and an example is given.
The proposed architecture uses the concept of early estimation of the SLA [4], which is calculated in parallel with the fixed-point multiplication and reduces the maximum number of left-shift positions during the subsequent rounding to one. The SLA ranges from 0 to p and is computed from $LZ_A$ and $LZ_B$:
$$\mathrm{SLA} = \min(LZ_A + LZ_B,\; p)$$
The SLA is used to shift the p most significant digits of the intermediate products ($IP_s$ and $IP_w$) into the upper word and to perform rounding according to the least significant digits. Two solutions for parallel shift registers are analyzed. The first uses multiplexers and the second applies the hard-wired multipliers of the DSP48E slices. The latter is possible because a left shift by k positions corresponds to a multiplication by $2^k$. The multiplier-based shift register has the advantage that it saves LUTs, and the number of leading zeros in power representation ($LZ_A^{pow}$, $LZ_B^{pow}$) can be calculated very fast by means of the FPGA's carry logic, see
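The equivalence between shifting and multiplying by a power of two can be sketched on packed BCD words. The example below counts the shift amount in decimal digits of four bits each, which is an assumption made for illustration, and it models the arithmetic only, not the mapping onto DSP48E slices; the function names are hypothetical.

```python
def pack_bcd(digits_msd_first):
    """Pack decimal digits into one integer, four bits per digit."""
    word = 0
    for d in digits_msd_first:
        word = (word << 4) | d
    return word


def shift_left_by_multiply(word, digit_shift, width_digits):
    """Left-shift a packed BCD word by multiplying with a power of two."""
    factor = 1 << (4 * digit_shift)          # shift amount as a power of two
    mask = (1 << (4 * width_digits)) - 1     # keep the register width
    return (word * factor) & mask


if __name__ == "__main__":
    w = pack_bcd([0, 0, 1, 2])
    print(hex(shift_left_by_multiply(w, 2, 4)))
```

Because the factor is itself a power of two, supplying the shift amount already in power form removes the need for a separate decoder in front of the multiplier.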
Prior to rounding, the fixed-point multiplier's MSDs are shifted into the upper word of length p and the lower word is used for round-up detection. Thus, $IE_{SIP}$ is calculated by means of PE and SLA [4]:
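As a hedged illustration of the early estimate, the following sketch assumes SLA = min(LZ_A + LZ_B, p) and IE_SIP = PE + p − SLA, where PE is taken to be the unbiased sum of the operand exponents. Both relations are assumptions in the spirit of [4] rather than formulas quoted from this paper, and the function names are hypothetical.

```python
def leading_zeros(digits_msd_first):
    """Count the leading zero digits of a significand."""
    count = 0
    for d in digits_msd_first:
        if d != 0:
            break
        count += 1
    return count


def estimate_shift(a_digits, b_digits, exp_a, exp_b, p=16):
    """Return (SLA, IE_SIP) under the assumptions stated above."""
    sla = min(leading_zeros(a_digits) + leading_zeros(b_digits), p)
    pe = exp_a + exp_b            # assumed meaning of PE (unbiased)
    ie_sip = pe + p - sla         # exponent of the upper p-digit word
    return sla, ie_sip


if __name__ == "__main__":
    # p = 4: 0012 * 0345 = 00004140; shifting by 3 gives 0414|0000.
    print(estimate_shift([0, 0, 1, 2], [0, 3, 4, 5], 0, 0, p=4))
```

In the small example the true product has four leading zeros while the estimate is three, so one leading zero remains in the upper word, in line with the statement that at most one further left shift is needed during rounding.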
For the DFixMul output, two solutions are analyzed. The first uses the redundant carry-save output and applies a CPA after shifting. The second directly uses the reduced carry-propagation output of the fixed-point multiplier. The drawback of the carry-save solution is the doubled resource usage for the left shift register. On the other hand, applying the CPA after the operands are left-shifted reduces the maximum CPA length to p + 2 instead of 2p, i.e., the addition of length 2p can be subdivided into two parallel additions of the lower and upper parts. The lower part is only required to calculate the sticky and carry bits, where the sticky bit indicates whether any of the digits beyond the round digit is nonzero. Fig. 6
RoundTowardZero: $ru_{cls0} = (G = 9) \cdot (R = 9) \cdot c$, $ru_{cls1} = (R = 9) \cdot c$
Legend: G = guard digit, R = round digit, c = carry bit, sb = sticky bit, l = LSB of $TP^{+0}$, lg = LSB of G, '+' = logical OR, '·' = logical AND

In this design no rounding overflow can occur, which has been proven by Erle et al. [4]. A rounding overflow might arise when $SIP_U$ is incremented due to rounding and a carry out of the MSD position occurs. The rounding algorithm is described explicitly in [4]. However, the algorithm presented in this paper differs slightly due to an additional carry signal. Fig. 7 summarizes the calculations of the Rounded Product (RP), the Exponent of the Rounded Product ($E_{RP}$), and the inexact flag that is asserted whenever the RP is inexact due to rounding. The algorithm simply chooses between $TP^{+0}$, $TP^{+1}$, or these values left-shifted by one digit with either G or G+1 concatenated.
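For orientation, a plain reference model of the round-up decision for the five IEEE 754-2008 rounding attributes is sketched below. It is not the paper's ru_cls0/ru_cls1 logic: it assumes that the carry from the lower word has already been absorbed, that `guard` is the first discarded digit, and that `sticky` is true when any later discarded digit is nonzero. All names are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    TIES_TO_EVEN = "roundTiesToEven"
    TIES_TO_AWAY = "roundTiesToAway"
    TOWARD_POSITIVE = "roundTowardPositive"
    TOWARD_NEGATIVE = "roundTowardNegative"
    TOWARD_ZERO = "roundTowardZero"


def round_decision(mode, negative, lsd, guard, sticky):
    """Return (round_up, inexact) for a truncated decimal significand.

    negative: sign of the product, lsd: least significant kept digit,
    guard: first discarded digit, sticky: any further nonzero digit.
    """
    inexact = guard != 0 or sticky
    if mode is Mode.TIES_TO_EVEN:
        up = guard > 5 or (guard == 5 and (sticky or lsd % 2 == 1))
    elif mode is Mode.TIES_TO_AWAY:
        up = guard >= 5
    elif mode is Mode.TOWARD_POSITIVE:
        up = inexact and not negative
    elif mode is Mode.TOWARD_NEGATIVE:
        up = inexact and negative
    else:                                   # roundTowardZero truncates
        up = False
    return up, inexact


if __name__ == "__main__":
    print(round_decision(Mode.TIES_TO_EVEN, False, 4, 5, False))
```

In the delayed-CPA design of the paper the round-up conditions additionally depend on the carry bit c, which is why even RoundTowardZero has nontrivial conditions in the table row above.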
The Overflow/Underflow Correction Unit (OUC) handles overflow and gradual underflow. When $(q_{max} + p) > IE_{SIP} > q_{max}$ and SIP has sufficient leading zeros, $IE_{SIP}$ is set to the maximum exponent and the significand is corrected by a left shift. On the other hand, when $(q_{min} - p) < IE_{SIP} < q_{min}$, the significand is right-shifted and the exponent is increased up to $q_{min}$. This is also called gradual underflow. The OUC has to be performed before rounding; otherwise, in the case of gradual underflow, the result might be computed incorrectly, which is illustrated in the following simple example with rounding mode RoundTiesToEven, A = 3461713317126677 · 10 . If the rounding is performed before gradual underflow correction (GUC), it leads to the wrong result (5). On the other hand, if the GUC is accomplished before rounding, the correct result
[Fig. 7: Computation of the rounded product (RP), the exponent of the rounded product ($E_{RP}$), and the inexact flag; the cases are distinguished by the MSDs of $TP^{+0}$ and $TP^{+1}$.]
The OUC algorithm is described in Fig. 8 . An example that illustrates the algorithm of the DFP multiplier is depicted in Fig. 9 .
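Why the underflow correction has to precede rounding can be illustrated with a small double-rounding experiment using Python's `decimal` module. The numbers are chosen for illustration only and are not the operands of the paper's example; the function names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_to(value, exponent):
    """Round `value` to a multiple of 10**exponent with RoundTiesToEven."""
    quantum = Decimal(1).scaleb(exponent)
    return value.quantize(quantum, rounding=ROUND_HALF_EVEN)


def round_then_shift(exact, q_round, q_min):
    """Wrong order: round to full precision first, then denormalize."""
    return round_to(round_to(exact, q_round), q_min)


def shift_then_round(exact, q_min):
    """Correct order: denormalize first, then round exactly once."""
    return round_to(exact, q_min)


if __name__ == "__main__":
    exact = Decimal("0.2451")
    print(round_then_shift(exact, -3, -2))   # rounded twice
    print(shift_then_round(exact, -2))       # rounded once
```

In the wrong order the first rounding produces an exact tie that the second rounding resolves toward the even digit, although the exact value lies above the tie; rounding once after the right shift avoids this.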
The ExU generates the four exception signals invalid operation, inexact, overflow, and underflow according to IEEE 754-2008. An invalid operation is signaled when one of the operands is a signaling NaN or when one operand is zero and the other one is infinity. The inexact signal is asserted whenever the result must be rounded. The overflow exception is signaled when a result's magnitude exceeds the largest finite number, and the underflow exception is signaled when a result is both tiny and inexact. A result is tiny when its magnitude lies strictly between zero and the smallest normal number.
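These four signals can be observed in software with Python's `decimal` module, configured here with the decimal64 parameters (16 digits, exponent range −383 to 384). This is a software reference model under those assumed parameters, not the ExU hardware, and the helper name is hypothetical.

```python
from decimal import (Context, Decimal, ROUND_HALF_EVEN,
                     InvalidOperation, Inexact, Overflow, Underflow)

REPORTED = (InvalidOperation, Inexact, Overflow, Underflow)


def multiply_with_flags(a, b):
    """Multiply two decimal strings and report the raised signals."""
    ctx = Context(prec=16, Emin=-383, Emax=384,
                  rounding=ROUND_HALF_EVEN, clamp=1, traps=[])
    result = ctx.multiply(Decimal(a), Decimal(b))
    raised = {sig.__name__ for sig in REPORTED if ctx.flags[sig]}
    return result, raised


if __name__ == "__main__":
    for a, b in [("0", "Infinity"), ("sNaN", "1"), ("9E384", "10"),
                 ("1E-398", "0.1"), ("1E-390", "1")]:
        print(a, "*", b, "->", multiply_with_flags(a, b))
```

The last case yields a tiny but exact result, so no underflow is signaled, which matches the definition given above.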
The decimal floating-point multiplier presented in this paper has been implemented in three different variants. Both the first one (type 1) and the second one (type 2) use the fixed-point multiplier's redundant carry-save output. However, type 1 applies DSP48E slices as shift registers, whereas the shift registers of type 2 are multiplexer-based. Types 1 and 2 require an additional CPA of length p + 2 after left shifting the intermediate result. The third variant (type 3) uses the fixed-point multiplier's unique carry-propagation output of length 2p and multiplexer-based shift registers. The implementation results for these three variants are presented in Section IV.
The floating-point multipliers are designed to support pipelining which can be controlled via VHDL generic switches. The type 1 multiplier can be subdivided into 24 stages, the type 2 and type 3 multiplier into 22 stages.
IV. IMPLEMENTATION RESULTS
All circuits are modeled in VHDL. Xilinx ISE 10.1 has been used for synthesis and implementation. The 16-digit DFixMul and the decimal64 DFlMul have been implemented for Xilinx Virtex-5 devices with speed grade -2. The fixed-point multiplier with CPA output has been implemented for several pipeline configurations, see Table IV. The results show that the minimum overall latency of about 17.5 ns is achieved without any pipeline registers and the best operating frequency of 229 MHz is obtained with 10 pipeline registers. However, using 6 or more pipeline registers does not reduce the longest path delay significantly but increases the overall latency instead.

However, a comparison is limited because it is based on a Virtex-4 device and does not implement all rounding modes, exception handling, and gradual underflow. To compare our design with multiplier designs implemented for the same FPGA chip, we have analyzed a binary 64-bit floating-point multiplier on a Virtex-5 provided by the Xilinx Core Generator. The binary floating-point multiplier can be implemented using 0, 9, 10, or 11 DSP48E slices. Furthermore, the number of implemented pipeline registers can be adjusted. The implementation results are summarized in Table VI. Compared to the binary floating-point multiplier without DSP48E usage, the DFlMul proposed in this paper uses 3.0-3.5 times more LUTs and has a 1.8-2.2 times higher latency. However, it must be taken into account that decimal floating-point multiplication has a much greater overhead than binary floating-point multiplication.
Considering a medium-sized Xilinx Virtex-5 FPGA device such as the XC5VLX110T and an average usage of 8000-9000 LUTs, 11.5%-13% of the area is occupied, which leaves enough area for implementations of decimal floating-point addition, subtraction, and division. Hence, a complete IEEE 754-2008 compliant decimal floating-point co-processor for decimal64 data types might fit into a single FPGA chip.
V. CONCLUSION
In this paper we extended a previously published decimal fixed-point multiplier to a decimal floating-point multiplier that maps onto FPGA architectures (in particular Virtex-5 devices) and allows the implementation of an IEEE 754-2008 compliant co-processor. The design is fully parallel and can be pipelined by means of configurable pipeline stages. We compared different implementations, including multiplier-based shift registers and delayed CPAs. Finally, we analyzed the performance with respect to the number of pipeline stages and presented implementation results that are useful for trading off overall latency against longest path delay. In summary, we achieved a decimal floating-point multiplication within 35 ns and obtained a maximum operating frequency of 192 MHz using 13 pipeline stages.
