Abstract-The IEEE Standard 754-1985 for Binary Floating-Point Arithmetic [19] was revised [20] , and an important addition is the definition of decimal floating-point arithmetic [8], [24]. This is intended mainly to provide a robust reliable framework for financial applications that are often subject to legal requirements concerning rounding and precision of the results, because the binary floating-point arithmetic may introduce small but unacceptable errors. Using binary floating-point calculations to emulate decimal calculations in order to correct this issue has led to the existence of numerous proprietary software packages, each with its own characteristics and capabilities. The IEEE 754R decimal arithmetic should unify the ways decimal floating-point calculations are carried out on various platforms. New algorithms and properties are presented in this paper, which are used in a software implementation of the IEEE 754R decimal floating-point arithmetic, with emphasis on using binary operations efficiently. The focus is on rounding techniques for decimal values stored in binary format, but algorithms are outlined for the more important or interesting operations of addition, multiplication, and division, including the case of nonhomogeneous operands, as well as conversions between binary and decimal floating-point formats. Performance results are included for a wider range of operations, showing promise that our approach is viable for applications that require decimal floating-point calculations. This paper extends an earlier publication [6] .
INTRODUCTION
T HERE is increased interest in decimal floating-point arithmetic in both industry and academia as the IEEE 754R 1 [20] revision becomes the new standard for floatingpoint arithmetic. The 754R Standard describes two different possibilities for encoding decimal floating-point values: the binary encoding, based on using a Binary Integer [24] to represent the significand (BID, or Binary Integer Decimal), and the decimal encoding, which uses the Densely Packed Decimal (DPD) [7] method to represent groups of up to three decimal digits from the significand as 10-bit declets. In this paper, we present results from our work toward a 754R decimal floating-point software implementation based on the BID encoding. We include a discussion of our motivation, selected algorithms, performance results, and future work. The most important or typical operations will be discussed: primarily decimal rounding but also addition, multiplication, division, and conversions between binary and decimal floating-point formats.
Motivation and Previous Work
An inherent problem of binary floating-point arithmetic used in financial calculations is that most decimal floatingpoint numbers cannot be represented exactly in binary floating-point formats, and errors that are not acceptable may occur in the course of the computation. Decimal floating-point arithmetic addresses this problem, but a degradation in performance will occur compared to binary floating-point operations implemented in hardware. Despite its performance disadvantage, decimal floating-point arithmetic is required by certain applications that need results identical to those calculated by hand [3] . This is true for currency conversion [13] , banking, billing, and other financial applications. Sometimes, these requirements are mandated by law [13] ; other times, they are necessary to avoid large accounting discrepancies [18] .
Because of the importance of this problem a number of decimal solutions exist, both hardware and software. Software solutions include C# [22] , COBOL [21] , and XML [25] , which provide decimal operations and datatypes. Also, Java and C/C++ both have packages, called BigDecimal [23] and decNumber [9] , respectively. Hardware solutions were more prominent earlier in the computer age with the ENIAC [27] and UNIVAC [16] . However, more recent examples include the CADAC [5] , IBM's z900 [4] and z9 [10] architectures, and numerous other proposed hardware implementations [11] , [12] , [1] . More hardware examples can be found in [8] , and a more in-depth discussion is found in Wang's work [26] .
The implementation of nonhomogeneous decimal floating-point operations, whose operands and results have mixed formats, is also discussed below. The method used is based on replacing certain nonhomogeneous operations by a similar homogeneous operation and then doing a floatingpoint conversion to the destination format. Double rounding errors that may occur in such cases are corrected using simple logical equations. Double rounding errors have been discussed in other sources. A case of double rounding error in conversions from binary to decimal (for printing binary values) is presented in [15] (in the proof of Theorem 15; a solution of how to avoid this is given based on status flags and only for this particular case). A paper by Figueroa [14] discusses cases when double rounding errors will not occur but does not offer a solution for cases when they do occur. A paper by Boldo and Melquiond [2] offers a solution for correcting double rounding errors when rounding to nearest but only if a non-IEEE special rounding mode is implemented first, named in the paper as rounding to odd (which is not related to rounding to nearest-even). However, the solution presented here for correcting double rounding errors seems to be new.
Estimations have been made that hardware approaches to decimal floating-point arithmetic will have average speedups of 100-1,000 times over software [18] . However, the results from our implementation show that this is unlikely, as the maximum clock cycle counts for decimal operations implemented in software are in the range of tens or hundreds on a variety of platforms. Hardware implementations would undoubtedly yield a significant speedup but not as dramatic, and that will make a difference only if applications spend a large percentage of their time in decimal floating-point computations.
DECIMAL ROUNDING
A decimal floating-point number n is encoded using three fields: sign s, exponent e, and significand with at most p decimal digits, where p is the precision (p is 7, 16, or 34 in IEEE 754R, but p ¼ 7 for the 32-bit format is for storage only). The significand can be scaled up to an integer C, referred to as the coefficient (and the exponent is decreased accordingly): n ¼ ðÀ1Þ s Á 10 e Á ¼ ðÀ1Þ s Á 10 e 0 Á C. The need to round an exact result to the precision p of the destination format occurs frequently for the most common decimal floating-point operations: addition, subtraction, multiplication, fused multiply-add, and several conversion operations. For division and square root, this happens only in certain corner cases. If the decimal floating-point operands are encoded using the IEEE 754R binary format, the rounding operation can be reduced to the rounding of an integer binary value C to p decimal digits. Performing this operation efficiently on decimal numbers stored in binary format is very important, as it enables good software implementations of decimal floating-point arithmetic on machines with binary hardware. For example, assume that the exact result of a decimal floating-point operation has a coefficient C ¼ 1234567890123456789 with q ¼ 19 decimal digits that is too large to fit in the destination format and needs to be rounded to the destination precision of p ¼ 16 digits.
As mentioned, C is available in binary format. To round C to 16 decimal digits, one has to remove the lower x ¼ 3 decimal digits ðx ¼ q À p ¼ 19 À 16Þ and possibly to add one unit to the next decimal place, depending on the rounding mode and on the value of the quantity that has been removed. If C is rounded to nearest, the result will be C ¼ 1234567890123457 Á 10 3 . If C is rounded toward zero, the result will be C ¼ 1234567890123456 Á 10 3 . The straightforward method to carry out this operation is to divide C by 1,000, to calculate and subtract the remainder, and possibly to add 1,000 to the result at the end, depending on the rounding mode and on the value of the remainder.
A better method is to multiply C by 10 À3 and to truncate the result so as to obtain the value 1234567890123456. However, negative powers of 10 cannot be represented exactly in binary, so an approximation will have to be used. Let k 3 % 10 À3 be an approximation of 10 À3 , and thus, C Á k 3 % 1234567890123456. If k 3 overestimates the value of 10 À3 , then C Á k 3 > 1234567890123456. Actually, if k 3 is calculated with sufficient accuracy, it will be shown that bC Á k 3 c ¼ 1234567890123456 with certainty.
(Note that the floorðxÞ, ceilingðxÞ, and fractionðxÞ functions are denoted here by bxc, dxe, and fxg, respectively.)
Let us separate the rounding operation of a value C to fewer decimal digits into two steps: first calculate the result rounded to zero and then apply a correction if needed, by adding one unit to its least significant digit. Rounding overflow (when after applying the correction, the result requires p þ 1 decimal digits) is not considered here.
The first step in which the result rounded toward zero is calculated can be carried out by any of the following four methods.
À3 as a y-bit approximation of 10
À3
rounded up (where y will be determined later). Then, bC Á k 3 c ¼ 1234567890123456, which is exactly the same value as bC=10 3 c.
It can be noticed, however, that 10 À3 ¼ 5 À3 Á 2 À3 and that multiplication by 2 À3 is simply a shift to the right by 3 bits. Three more methods to calculate the result rounded toward zero are then possible:
À3 as a y-bit approximation of 5
rounded up (where y will be determined later). Then, bðC Á h 3 Þ Á 2 À3 c ¼ 1234567890123456, which is the same as bC=10 3 c. However, this method is no different from Method 1, so the approximation of 5 À3 has to be identical in the number of significant bits to the approximation of 10 À3 from Method 1 (but it will be scaled up by a factor of 2 3 ).
The third method is obtained if instead of multiplying by 2 À3 at the end, C is shifted to the right and truncated before multiplication by h 3 .
Method 2. Calculate h 3 % 5
À3
rounded up (where y will be determined later). Then, bbC Á 2 À3 c Á h 3 c ¼ 1234567890123456, which is identical to bC=10 3 c. It will be shown that h 3 can be calculated to fewer bits than k 3 from Method 1.
The final and fourth method is obtained by multiplying C by h 3 and then truncating, followed by a multiplication by 2 À3 and a second truncation.
Method 2a. Calculate h 3 % 5 À3 as a y-bit approximation of 5
rounded up (where y can be determined later). Then, bbC Á h 3 c Á 2 À3 c ¼ 1234567890123456. However, it can be shown that in this last case, h 3 has to be calculated to the same number of bits as k 3 in Method 1, which does not represent an improvement over Method 1.
Only Method 1 and Method 2 shall be examined next. We will solve next the problem of rounding correctly a number with q digits to p ¼ q À x digits, using approximations to negative powers of 10 or 5. We will also solve the problem of determining the shortest such approximations, i.e., with the least number of bits.
Method 1 for Rounding a Decimal Coefficient
For rounding a decimal coefficient represented in binary using Method 1, the following property specifies the minimum accuracy y for the approximation k x % 10
Àx , which ensures that a value C with q decimal digits represented in binary can be truncated without error to p ¼ q À x digits by multiplying C by k x and truncating. Property 1 states that if y ! dflog 2 10 Á xg þ log 2 10 Á qe and k x is a y-bit approximation of 10
Àx rounded up, then we can calculate bC Á k x c in the binary domain, and we will obtain the same result as for bC=10
x c calculated in decimal.
C 10 q À 1, x 2 f1; 2; 3; . . . ; q À 1g, and ¼ log 2 10. If y 2 IN satisfies y ! df Á xg þ Á qe and k x is a y-bit approximation of 10 Àx rounded up (the subscript RP , y indicates rounding up to y b i t s ) , i . e . , k x ¼ ð10 Àx Þ RP ;y ¼ 10 Àx Á ð1 þ "Þ, w h e r e 0 < " < 2 Àyþ1 , then bC Á k x c ¼ bC=10
Proof outline. The proof can be carried out starting with the representation of C in decimal,
. . . ; 9g, and d 0 6 ¼ 0. The value of C can be expressed as C ¼ 10
x Á H þL, where H ¼bC=10
Then, the minimum value of y can be determined such that bC Á k x c ¼ H , H ðH þ10
Àx Á LÞ Á ð1 þ "Þ <H þ1 , 10 Àx Á " < 10
Àx Þ.
But 10
Àx Á " is the absolute error in k x ¼ 10 Àx Á ð1 þ "Þ, and it is less than 1 unit in the last place (ulp) of k x . It can be shown that for all x of interest in the context of the IEEE 754R floating-point formats, x 2 f1; 2; 3; . . . ; 34g, k x is in the same binade as 10
Àx , because one cannot find a power of two to separate 10
Àx and k x ¼ 10 Àx Á ð1 þ "Þ (which would place the two values in different binades) with k x represented with a reasonable number of bits y. This can be checked exhaustively for x 2 f1; 2; 3; . . . ; 34g. s . The conclusion is that k x is in the same binade as 10
Àx , which will let us determine ulpðk x Þ ¼ 2
ÀbÁxcÀy . In order to satisfy inequality (1), it is sufficient to have the following (increasing the left-hand side and decreasing the right-hand side of the inequality): ulpðk x Þ 10 À2x ð10 qÀx À10 Àx Þ . This leads eventually to y ! df Á xg þ Á qe, as required.
t u
In our example from Method 1 above,
À3 needs to be rounded up to 65 bits). The values k x for all x of interest are precalculated and are stored as pairs ðK x ; e x Þ, with K x and e x positive integers, and k x ¼ K x Á 2 Àe x . This allows for efficient implementations, exclusively in the integer domain, of several decimal floating-point operations, particularly addition, subtraction, multiplication, fused multiply-add, and certain conversions.
The second step in rounding correctly a value C to fewer decimal digits consists of applying a correction to the value rounded to zero, if necessary. The inexact status flag has to be set correctly as well. For this, we have to know whether the original value that we rounded was exact (did it have x trailing zeros in decimal?) or if it was a midpoint when rounding to nearest (were its x least significant decimal digits a 5 followed by x À 1 zeros?).
Once we have the result truncated to zero, the straightforward method to check for exactness and for midpoints could be to calculate the remainder C percent 10
x and to compare it with 0 and with 50 . . . 0 (x decimal digits), which is very costly. However, this is not necessary because the information about exact values and midpoints is contained in the fractional part f of the product C Á k x .
The following property states that if C Á 10 Àx covers an interval [H, H þ 1) of length 1, where H is an integer, then C Á k x belongs to [H, H þ 1) as well, and the fraction f discarded by the truncation operation belongs to (0, 1) (so f cannot be zero). More importantly, we can identify the exact cases C Á 10
Àx ¼ H (then, rounding C to p ¼ q À x digits is an exact operation) and the midpoint cases
Àyþ1 . Then, the following are true:
(The proof is not included but is straightforward to obtain.) This property is useful for determining the exact and the midpoint cases. For example if 0 < f < 10
Àx , then we can be sure that the result of the rounding from q decimal digits to p ¼ q À x digits is exact.
Another important result of Property 2 was that it helped refine the results of Property 1. Although the accuracy y of k x determined in Property 1 is very good, it is not the best possible. For a given pair q and x ¼ q À p, starting with the value y ¼ df Á xg þ Á qe, one can try to reduce it 1 bit at a time, while checking that the corresponding k x will still yield the correct results for rounding in all cases, and it will also allow for exactness and midpoint detection, as shown above. To verify that a new (and smaller) value of y still works, just four inequalities have to be verified for boundary conditions related to exact cases and midpoints: H Á 10
Àx , and ðH þ 1 À 10 Àx Þ Á 10 x Á k x < H þ 1. Because the functions of H from these inequalities can be reduced to monotonic and increasing ones, it is sufficient to verify the inequalities just for the maximum value
For example, this method applied to the constant k 3 used to illustrate Method 1 (for truncation of a 19-digit number to 16 digits) allowed for a reduction of the number of bits from y ¼ 65 to y ¼ 62 (importantly, because k 3 fits now in a 64-bit integer).
Method 2 for Rounding a Decimal Coefficient
This method has the advantage that the number of bits in the approximation of 5
Àx is less by x than that for 10 Àx in Method 1, as stated in the following property.
However, determining exact cases and midpoints is slightly more complicated than with Method 1.
Two fractions are removed by truncation: the first in bC Á 2 Àx c and the second in bbC Á 2 Àx c Á h x c. To determine whether the rounding was exact, test first whether the lower x bits of C are zero (if they are not, C could not have been divisible by 10 x ). If C was divisible by 2 x , we just need to test bC Á 2 Àx c for divisibility by 5 x . This is done by examining the fraction removed by truncation in bbC Á 2
Àx c Á h x c, using a property similar to Property 2 from Method 1. Actually, exact and midpoint cases can be determined together if the first shift is by x À 1 bits only (and not by x bits). If C had the x lower bits equal to zero, the rounding might be exact. If it had only x À 1 bits equal to zero, the rounding might be for a midpoint. In both cases, the answer is positive if the fraction f removed by the last truncation in bbC Á 2
Àxþ1 c Á h x c satisfies conditions similar to those from Property 2. The least significant bit in bbC Á 2
Àxþ1 c Á h x c and the values of the fractions removed in the two truncations will tell whether the rounding was exact, for a midpoint, or for a value to the left or to the right of a midpoint.
ADDITION AND MULTIPLICATION
Addition and multiplication were implemented using straightforward algorithms, with the rounding step carried out as explained in the previous section. The following pseudocode fragment describes the algorithm for adding two decimal floating-point numbers encoded using the binary format, n1 ¼ C1 Á 10 e1 and n2 ¼ C2 Á 10 e2 , whose significands can be represented with p decimal digits (C1, C2 2 Z Z, 0 < C1 < 10 p , and 0 < C2 < 10 p ). For simplicity, it is assumed that n1 ! 0 and e1 ! e2 and that the rounding mode is set to rounding to nearest: n ¼ ðn1 þ n2Þ RN;p ¼ C Á 10 e . In order to make rounding easier when removing x decimal digits from the lower part of the exact sum, 1=2 Á 10
x is added to it, and rounding to nearest is reduced to truncation except possibly when the exact sum is a midpoint. The notation digitsðXÞ is used for the minimum number of decimal digits needed to represent the integer value X.
Àp the result is exact else the result is inexact endif For multiplication, the algorithm to calculate n ¼ ðn1 Á n2Þ RN;p ¼ C Á 10 e for rounding to nearest is very similar. There are only two differences with respect to addition: the multiplication algorithm begins with / table lookup and at the end, before checking for rounding overflow and inexactness, the final result is n ¼ C Á 10 e1þe2þx instead of n ¼ C Á 10 e2þx . The most interesting aspects are related to the rounding step. For addition and multiplication, the length of the exact result is at most 2 Á p decimal digits (if this length is larger for addition, then the smaller operand is just a rounding error compared to the larger one, and rounding is trivial). This also means that the number x of decimal digits to be removed from a result of length q 2 ½p þ 1; 2 Á p is between 1 and p. It is not difficult to prove that the comparisons for midpoint and exact case detection can use the constant 10 . The values t Ã are calculated such that they always align perfectly with the lower bits of the fraction f Ã , which makes the tests for midpoints and exactness relatively simple in practice.
The IEEE status flags are set correctly in all rounding modes. The results in rounding modes other than to nearest are obtained by applying a correction (if needed) to the result rounded to nearest, using information on whether the precise result was an exact result in the IEEE sense, a midpoint less than an even floating-point number, a midpoint greater than an even number, an inexact result less than a midpoint, or an inexact result greater than a midpoint.
DIVISION AND SQUARE ROOT
The sign and exponent fields are easily computed for these two operations. The approach used for calculating the correctly rounded significand of the result was to scale the significands of the operands to the integer domain and to use a combination of integer and floating-point operations.
The division algorithm is summarized below, where p is the maximum digit size of the decimal coefficient, as specified by the following format: p ¼ 16 for a 64-bit decimal and p ¼ 34 for a 128-bit decimal format. EMIN is the minimum decimal exponent allowed by the format. Overflow situations are not explicitly treated in the algorithm description below; they can be handled in the return sequence, when the result is encoded in BID format.
eliminate trailing zeros from Q: The square root algorithm (not presented here) is somewhat similar, in the sense that it requires computing an integer square root of 2 Á p digits (while division required an integer division of 2 Á p by p digits). Rounding of the square root result is based on a test involving long integer multiplication.
CONVERSIONS BETWEEN BINARY AND DECIMAL FORMATS
Conversions between IEEE 754 binary formats and the new decimal formats may sometimes be needed in user applications, and the revised standard specifies that these should be correctly rounded. We will first consider the theoretical analysis determining the accuracy required in intermediate calculations and then describe more details of the algorithms for some typical source and target formats.
Required Accuracy
We analyze here how close, in relative terms, a number x in one format can be to a rounding boundary y (either a floating-point number or the midpoint between two floating-point numbers) in another. If we can establish a lower bound for all cases such that either x ¼ y or jx=y À 1j > , then we will be able to make a rounding decision based on an approximation good to a relative error of order , as spelled out in more detail below. Consider converting binary floating-point numbers to a decimal format; essentially, similar reasoning works the other way round. We are interested in how close, in relative terms, the answer can be to a decimal floating-point number (for directed rounding modes) or a midpoint between them (for rounding to nearest), i.e., either of
where integers m and n are constrained by the precision of the formats, and d and e are constrained by their exponent ranges. We can encompass both together by considering problems of the form j 2 eÀ1 m 10 d n À 1j, where n is bounded by twice the usual bound on the decimal coefficient.
Naive Analysis
Based on the most straightforward analytical reasoning, we can deduce that either the two numbers are exactly equal or else
because all quantities are integers. However, this bound is somewhat discouraging, because the exponent d may be very large, e.g., d ¼ 6;111 for Decimal128.
On the other hand, it seems plausible on statistical grounds that the actual worst case is much less bad. For example, imagine distributing N input numbers (say, N ¼ 2 64 for the two 64-bit formats) randomly over a format with M possible significands. The fractional part of the resulting significand will probably be distributed more or less randomly in the range 0 x < 1, so the probability of its being closer than to a floating-point number will be about . Therefore, the probability that none of them will be closer than is about
A relative difference of typically corresponds to an absolute error in the integer significand of about ¼ M, and so, the probability of relative closeness is about 1 À e ÀMN . Thus, we expect the minimum to be of order % 1=ðMNÞ.
Diophantine Approximation Analysis
The above argument is really just statistical hand-waving and may fail for particular cases. However, we can establish such bounds rigorously by rearranging our problem. We are interested in
We can therefore consider all the permissible pairs of exponent values e and d, finding the n=m with numerator and denominator within appropriate bounds that gives the best (unequal) approximation. In fact, if we expand the exponent range a little, it is easy to see that we can assume both m and n to be normalized. This gives strong correlations between the exponents, meaning that we only have to examine a few possible e for each d and vice versa.
For each e and d, we can find lower bounds on this quantity subject to the size constraints on m and n, using a variant of the usual algorithms in Diophantine approximation based on computing a sequence of convergents using mediants or continued fractions. (For those interested, we spell out the exact Diophantine approximation algorithm that we use in the Appendix.) As expected from the naive statistical argument, the bounds determined are much better, typically a few bits extra beyond the sum of the precisions of the input and output formats. For example, the most difficult case for converting from Binary64 to Decimal64 is the following, with a relative distance of 
Implementations
We will now show the general pattern of the actual algorithms for conversions in both directions; in fact, both directions have a lot of stylistic similarities. Generally, we consider the two problems "parametrically," using a general p for the precision of the binary format and d for the number of decimal digits in the decimal format. However, some structural differences arise depending on the parameters, two of which we mention now. First, for conversions between certain pairs of formats (e.g., from any binary format considered into the Decimal128 format), no overflow or underflow can occur, so certain tests or special methods noted below can be omitted and are not present in the actual implementations.
Second, many of the algorithms involve scaled prestored approximations to values such as 10 e =2 f . In the presentation below, we assume that such values are stored for all possible input exponents (the output exponent is determined consequentially). For several of our implementations, this is indeed the case; doing so ensures high speed and allows us to build appropriate underflow of the output into the tables without requiring special program logic. However, for some conversions, in particular those into Decimal128, the storage requirement seems excessive, so the necessary values are instead derived internally from a bipartite table.
Decimal-Binary Conversion Algorithm
Consider converting from a decimal format with d decimal digits to a binary format with precision p. The basic steps of the algorithm are the following:
. Handle NaN, infinity and zero, and extremely large and small inputs. . Strip off the sign, so the input is now 10 e m, where 0 < m < 10 d . . Round depending on the sign and rounding mode to give binary significand n. If n ¼ 2 p , set n ¼ 2 pÀ1 and add 1 to the provisional exponent. . Subtract k from the provisional exponent, check for overflow, underflow, and inexactness, and combine with the sign. We now consider some of the central steps in more detail, starting with the selection of the provisional binary exponent. If we set f ¼ dðe þ dÞ log 2 10e À p, then we have 2 fþpÀ1 < 10 eþd 2 fþp . Therefore, for any floating-point number 10 e m with exponent e and significand 10 d =2 m < 10 d , we have
That is, x lies strictly between the smallest normalized binary floating-point number with exponent f À 1 and the next beyond the largest with exponent f. We therefore have only two possible choices for the "provisional exponent" f. (We say provisional because rounding may still bump it up to the next binade, but this is easy to check at the end.) And that is without any knowledge of m, except the assumption that it is normalized. Moreover, we can pick the exact m min that acts as the breakpoint between these possibilities:
Now, we aim to compute n with 2 f n % 10 e m, i.e., the left-hand side should be a correctly rounded approximation of the right-hand side. Staying with the provisional exponent f, this entails finding a correctly rounded integer n % 10 e 2 f m. We do this by multiplying by an approximation to 10 e =2 f scaled by 2 q for some suitably chosen q and rounded up, i.e., r ¼ d2 q 10 e =2 f e;
and truncating by shifting to the right by q bits. How do we choose q? From the results above, given the input and output formats, we know an > 0 such that if n 6 ¼ We want to make a rounding decision for 10 e 2 f m based on mr 2 q . First, since the latter is an overestimate, it is immediate that if the true result is ! a rounding boundary, so is the estimate. So we just need to check that if the true answer is < a rounding boundary, so is the estimate. We know that if the true answer is < a rounding boundary, then so is 2 22 2 À54:81 , i.e., q ! 57. In fact, we almost always make q slightly larger than necessary so that it is a multiple of 32 or 64. This makes the shift by q more efficient since it just amounts to picking words of a multiword result.
When q is chosen satisfying these constraints, we just need to consider the product mr to make our rounding decisions. The product with the lowest q bits removed is the correctly truncated significand. Bit q À 1 is the rounding bit, and the bottom q À 1 bits (0 to q À 2) can be considered as a sticky bit: test if it is at least 10 d =2 q . Using this information, we can correctly round in all rounding modes. Where the input is smaller in magnitude than the smallest possible number in the output format, we artificially limit the effective shift so that the simple interpretation of "sticky" bits does not need changing.
The mechanics of rounding uses a table of "boundaries" against which the lowest q bits of the reciprocal product (i.e., the "round" and "sticky" data) is compared. If this value is strictly greater than the boundary value, the truncated product is incremented. (In the case where it overflows the available significands, the output exponent is one greater than the provisional exponent.) To avoid lengthy tests, these boundaries are indexed by a composition of bitfields 4r þ 2s þ l, where r is the rounding mode, s is the sign of the input, and l is the least significant bit of the truncated exponent.
Binary-Decimal Conversion Algorithm
Consider converting from a binary format with precision p to a decimal format with d decimal digits. An important difference from decimal-binary conversion is that the exponent in the output is supposed to be chosen as close to zero as possible for exact results and as small as possible for inexact results. The core algorithm is given as follows:
. Handle NaN, infinity and zero, and extremely large and small inputs. . Strip off the sign, so the input is 2 e m with 0 < m < 2 p . . Normalize the input so that 2 pÀ1 m < 2 p (this will usually be the case already). . Filter out and handle exact results where the main path would give the wrong choice of output exponent (albeit numerically the same). . Pick provisional decimal exponent f based on m and e. . Form scaled product and add 1 to the provisional exponent. . Check for overflow, underflow, and inexactness and combine with the sign. We consider some of the main steps in more detail, starting with the filtering out of exact cases where we cannot rely on the main path of the algorithm to correctly force the exponent toward zero. We let through either inexact cases or those where the main path will behave correctly, e.g., large integers beyond the output coefficient range such as 2 200 . To maximize code sharing in this part of the algorithm, we always treat the input as a quad precision number, so 2 112 m < 2 113 , and the exponent is adjusted accordingly. We can pass through any inputs with e > 0 since in this case, the input is ! 2 113 , which is too large for the significand of any of the decimal formats. So we can assume that e 0; let us write t for the number of trailing zeros in m. Now, there are two cases:
. If e þ t ! 0, the input is an integer, so we treat it specially iff it fits in the coefficient range. That is, x lies strictly between the smallest normalized decimal number with exponent f À 1 and the next beyond the largest with exponent f. We therefore have only two possible choices for the "provisional exponent" f. (We say provisional because rounding may still bump it up to the next decade, but this is easy to check at the end.) And that is without any knowledge of m, except the assumption that it is normalized. Moreover, we can pick the exact m min that acts as the breakpoint between these possibilities:
so 2 e m min < 10 fþdÀ1 2 e ðm min þ 1Þ. Thus, we can pick the provisional exponent to be f À 1 for all m m min and f for all m > m min , and we are sure that 10 f 10 dÀ1 x < 10 f 10 d . In the future, we will just use f for this correctly chosen provisional exponent. Now, we aim to compute n with 10 f n % 2 e m, i.e., the lefthand side should be a correctly rounded approximation of the right-hand side. Staying with the provisional exponent f, this entails finding a correctly rounded n % 10 6 2 À51:32 , i.e., q ! 57. When q is chosen satisfying these constraints, we just need to consider the product mr to make our rounding decisions. The product with the lowest q bits removed is the correctly truncated significand. Bit q À 1 is the rounding bit, and the bottom q À 1 bits (0 to q À 2) can be considered as a sticky bit: zero out the lowest p bits and test whether the result is zero or nonzero or, in other words, test whether the bitfield from bit p to bit q À 2 is nonzero. Using this information, we can correctly round in all rounding modes. The approach uses exactly the same boundary tables as in the decimal-binary case discussed previously.
NONHOMOGENEOUS DECIMAL FLOATING-POINT OPERATIONS
The IEEE 754R Standard mandates that the binary and the decimal floating-point operations of addition, subtraction, multiplication, division, square root, and fused multiplyadd have to be capable of rounding results correctly to any supported floating-point format, for operands in any supported floating-point format in the same base (2 or 10). The more interesting cases are those where at least some of the operands are represented in a wider precision than that of the result (such a requirement did not exist in IEEE . There are thus a relatively large number of operations that need to be implemented in order to comply with this requirement. Part of these are redundant (because of the commutative property), and others are trivial, but several where at least some of the operands have a larger precision N 0 than the precision N of the result require a more careful implementation. One solution is to implement each operation separately, for example, for Decimal128 þ Decimal64 ¼ Decimal64, Decimal128 þ Decimal128 ¼ Decimal64, etc. However, since these nonhomogeneous operations will occur quite rarely in practice, an alternate solution was chosen. If a floating-point operation has to be performed on operands of precision N 0 and the result has to be rounded to precision N < N 0 , the implementation is based on 1. an existing operation of the same type that accepts operands of precision N 0 and rounds the result to precision N 0 and 2. an existing floating-point conversion operation from precision N 0 to precision N. This is a more economical solution than implementing an independent operation with operands of precision N 0 and result of precision N, especially since the number of pairs ðN 0 ; NÞ can be quite large for practical purposes when several floating-point formats are supported (the same method is applicable to binary floating-point operations, where the precisions may include formats with 24, 53, 64, and 113 bits in the significand). However, this two-step method works correctly only with directed IEEE rounding modes: rounding down, rounding up, and rounding toward zero. If the rounding method is to nearest-even or to nearest-away, then in rare cases, double rounding errors (of 1 ulp) may occur (rounding to nearest-away is similar to the rounding to nearest-even, with one difference: midpoints between two consecutive floating-point numbers are rounded to the floating point of larger magnitude of the two by the former and to the floating-point with an even significand by the latter). To correct against the possibility of such errors, a third component is added to the implementation:
3. correction logic against double rounding errors. The conditions that cause double rounding errors and the logical equations for their correction will be presented next-first for rounding to nearest-even and then for rounding to nearest-away.
In rounding-to-nearest-even mode, double rounding errors may occur when the exact result res Ã 0 of a floatingpoint operation with operands of precision N 0 is rounded correctly (in the IEEE 754R sense) first to a result res 0 of precision N 0 , and then, res 0 is rounded again to a narrower precision N. Sometimes, the result res does not represent the IEEE-correct result res 0 that would have been obtained were the original exact result res Ã 0 rounded directly to precision N. In such cases, res differs from res 0 by 1 ulp, and the error that occurs is called a double rounding error. Only positive results will be considered in this context, as the treatment of negative results is similar because rounding to nearest is symmetric with respect to zero. A double rounding error for rounding to nearest-even can be upward (when the result res is too large by 1 ulp) or downward (when the result res is too small by 1 ulp).
For example, consider the addition operation of two decimal floating-point values in 128-bit format ðN 0 ¼ 34Þ, with the result rounded correctly to 128-bit format ðN 0 ¼ 34Þ, as specified by the IEEE 754R Standard:
(The notation x RN;34 indicates the rounding of x to nearest-even to 34 decimal digits.) Next, the result res 0 is rounded to 64-bit format ðN ¼ 16Þ:
This result is different from what would have been obtained by rounding a + b directly to precision N ¼ 16; a double rounding error of 1 ulp, upward, has occurred:
However, not all nonhomogeneous operations can lead to double rounding errors. For example, of the eight possible variations of addition (each argument and the result can be in any of the three formats), only three required the presence of correction logic, where some operands are in Decimal128 format, and the result is in Decimal64. In general, correction of double rounding errors can be achieved by using minimal additional information from the floating-point operation applied to operands of precision N 0 with the result rounded to precision N 0 and the conversion of the result of precision N 0 to precision N, where N 0 > N. This extra information consists of four logical indicators for each of the two rounding operations (N 0 to N 0 and then N 0 to N), which can be used for correcting the result when a double rounding occurs (C notation is used; MP denotes a midpoint between two consecutive floating-point numbers): inexact lt MP; inexact gt MP; MP lt even; MP gt even:
A fifth indicator can be derived from the first four:
This information is used to either add 1 or subtract 1 to/ from the significand of the second result, in case a double rounding error occurred. This can be an efficient solution for implementing correctly rounded operations with operands of precision N 0 and results of precision N ðN 0 > NÞ, instead of having complete implementations for these less frequently used but mandatory operations defined in IEEE 754R. (It can be noted, however, that for certain floating-point operations, under special circumstances, double rounding errors will not occur; in most cases, however, when performing two consecutive rounding operations as explained above, double rounding errors can occur. For example, if a binary floating-point division operation or square root operation is performed on an operand(s) of precision N 0 and the result of precision N 0 is then rounded a second time to a smaller precision N, then double rounding errors are not possible if
The logic for correcting double rounding errors when rounding results to nearest-even is based on the observation that such errors occur when 1. the first rounding pulls up the result to a value that is a midpoint situated to the left of an even floatingpoint number (as seen on the real axis) for the second rounding (double rounding error upward) or 2. the first rounding pulls down the result to a value that is a midpoint situated to the right of an even floatingpoint number (on the real axis) for the second rounding (double rounding error downward). In C-like pseudocode, the correction method can be expressed as follows (0 identifies The correction logic modifies the result res of the second rounding only if a double rounding error has occurred; otherwise, res 0 ¼ res is already correct. Note that after correction, the significand of res 0 can be smaller by 1 ulp than the smallest N-digit significand in base b. The correction logic has to take care of this corner case too. For the example considered above, l 0 ¼ 9, r 0 ¼ 9, s 0 ¼ 0, l ¼ 1, r ¼ 5, and s ¼ 0. It follows that add1 ¼ 0 and sub1 ¼ 1, and so, the double rounding error upward will be corrected by subtracting 1 from the significand of the result. (But in practice, for the BID library implementation, it was not necessary to determine l, r, and s in order to calculate the values of these logical indicators.)
The logic shown above remedies a double rounding error that might occur when rounding to nearest-even and generates the correct value of the result in all cases.
However, in certain situations, it is useful to know also the correct rounding indicators from the second rounding. For example, they can be used for correct denormalization of the result res 0 if its exponent falls below the minimum exponent in the destination format. In this case, the complete logic is expressed by the following pseudocode, where in addition to correcting the result, the four rounding indicators from the second rounding are also corrected if double rounding errors occur: In rounding-to-nearest-away mode, double rounding errors may also occur just as they do in rounding-tonearest-even mode: when the exact result res Ã 0 of a floatingpoint operation with operands of precision N 0 is rounded correctly (in the IEEE 754R sense) first to a result res 0 of precision N 0 , and then, res 0 is rounded again to a narrower precision N. Sometimes, the result res does not represent the IEEE-correct result res 0 that would have been obtained were the original exact result res Ã 0 rounded directly to precision N. In this case too, only positive results will be considered, because rounding to nearest-away is symmetric with respect to zero. A double rounding error for rounding to nearest-away can only be upward.
The same example considered above for rounding to nearest-even can be used for rounding to nearest-away. In this case, we have 
Again, correction of double rounding errors can be achieved by using minimal additional information from the floating-point operation applied to operands of precision N 0 and result rounded to precision N 0 and the conversion of the result of precision N 0 to precision N, where N 0 > N.
The additional information consists in this case of just three logical indicators for each of the two rounding operations (N 0 to N 0 and then N 0 to N): inexact lt MP; inexact gt MP; MP:
A fourth indicator can be derived from the first three:
The method for correction of double rounding errors when rounding results to nearest-away is based on a similar observation that such errors occur when the first rounding pulls up the result to a value that is a midpoint between two consecutive floating-point numbers, and therefore, the second rounding causes an error of 1 ulp upward.
The correction method can be expressed as follows (0 identifies again the first rounding): The correction is applied to the result res of the second rounding based on the value of sub1 but only if a double rounding error has occurred. Otherwise, res 0 ¼ res is already correct. Note that after correction, the significand of res 0 can be smaller by 1 ulp than the smallest N-digit significand in base b. Again, the correction logic has to take care of this corner case too. For the example shown above, the correction signal sub1 ¼ 1 is easily calculated given r 0 ¼ 9, s 0 ¼ 0, r ¼ 5, and s ¼ 0. The double rounding error is corrected by subtracting 1 from the significand of the result.
Again, it may be useful to know also the correct rounding indicators from the second rounding. The complete correction logic is similar to that obtained for rounding to nearest-even.
This method was successfully applied for implementing various nonhomogeneous decimal floating-point operations, with minimal overhead. It is worth mentioning again that the same method can be applied for binary floatingpoint operations in hardware, software, or a combination of the two.
PERFORMANCE DATA
In this section, we present performance data in terms of clock cycle counts needed to execute several library functions. The median and maximum values are shown. The minimum values were not considered very interesting (usually, a few clock cycles) because they reflect just a quick exit from the function call for some special-case operands. To obtain the results shown here, each function was run on a set of tests covering corner cases and ordinary cases. The mix of data has been chosen to exercise the library (including special cases such as treatment of NaNs and infinities) rather than to be a representative of a specific decimal floating-point workload.
These preliminary results give a reasonable estimate of worst case behavior, with the median information being a good measure of the performance of the library.
Test runs were carried out on four different platforms to get a wide sample of performance data. Each system was running Linux. The best performance was on the Xeon 5100 Intel164 platform, where the numbers are within one to two orders of magnitude from those for possible hardware solutions. Table 1 contains clock cycle counts for a sample of arithmetic functions. These are preliminary results as the library is in the pre-beta development stage. A small number of optimizations were applied, but significant improvements may still be possible through exploitation of specific hardware features or careful analysis and reorganization of the code. Table 2 contains clock cycle counts for conversion functions. It is worth noting that the BID library performs well even for conversion routines to and from string format. Table 3 contains clock cycle counts for other miscellaneous IEEE 754R functions, which can be found in [20] under slightly different names. For example, rnd_integral_ away128 and q_cmp_lt_unord128 are our implementations of, respectively, the IEEE 754R operations roundToIntegralTiesAway and compareQuietLess Unordered on the Decimal128 type.
It is interesting to compare these latencies with those for other approaches. For example, the decNumber package [9] run on the same Xeon 5100 system has a maximum latency of 684 clock cycles and a median latency of 486 clock cycles for the add64 function. As a comparison, the maximum and median latencies of a 64-bit addition on the 3.0-GHz Xeon 5100 are 133 and 71 clocks cycles, respectively. Another example is that based on its median latency (71 clock cycles for add64 in Table 1 ), the 64-bit BID add is less than four times slower than a four-clock-cycle single-precision binary floating-point add operation in hardware on a 600-MHz Ultra Sparc III CPU of just a few years ago.
CONCLUSION
In this paper, we presented some mathematical properties and algorithms used in the first implementation in software of the IEEE 754R decimal floating-point arithmetic, based on the BID encoding. We concentrated on the problem of rounding correctly decimal values that are stored in binary format while using binary operations efficiently and also presented briefly other important or interesting algorithms used in the library implementation. Finally, we provided a wide sample of performance numbers that demonstrate that the possible speedup of hardware implementations over software may not be as dramatic as previously estimated. The implementation was validated through testing against reference functions implemented independently, which used, in some cases, existing multiprecision arithmetic functions.
As we look toward the future, we expect further improvements in performance through algorithm and code optimizations, as well as enhanced functionality, for example, through addition of transcendental function support. Furthermore, we believe that hardware support can be added incrementally to improve decimal floatingpoint performance as demand for it grows.
APPENDIX ALGORITHMS FOR BEST RATIONAL APPROXIMATIONS
The standard problem in Diophantine approximation is to investigate how well a number x can be approximated by rationals subject to an upper bound on the size of the denominator. That is, what is the minimal value of jx À p=qj next mediant exceeds the denominator limit. In practice, one can sometimes reach pathological situations where an infeasible number of mediant steps are to be taken. So rather than make these steps one at a time, we can condense multiple "go left" and "go right" operations into one. (This is effectively like a traditional continued fraction algorithm, except that we are more delicate about the limits on the numerator and the denominator.) The following defines the number of left and right steps we can take when finding left approximations before we either need a step of the other kind or exceed the denominator bound: Of course, to find the best approximation to a number, we see which of its best left and right approximations is closer and pick either if they are equally close. (If the number is itself rational, we may, depending on context, also include the number if its denominator is within range.)
Another advantage of distinguishing left and right approximations is that we can very easily generate the closest k left or right approximations. For example, let p 1 =q 1 be the best left approximation to x with denominator bound N. The next-best left approximation to x with denominator bound N must in fact be the best left approximation to p 1 =q 1 (since if it were > p 1 =q 1 , it would contradict the best approximation status of p 1 =q 1 ), and so forth. In this way, we can also enumerate all rationals subject to the denominator bound that are within some given > 0 of x. Of course, depending on x, N, and , the number of such approximations may mean that they are not feasibly enumerable.
