Abstract-In this work, we present a radix-10 division unit that is based on the digit-recurrence algorithm. The previous decimal division designs do not include recent developments in the theory and practice of this type of algorithm, which were developed for radix-2 k dividers. In addition to the adaptation of these features, the radix-10 quotient digit is decomposed into a radix-2 digit and a radix-5 digit in such a way that only five and two times the divisor are required in the recurrence. Moreover, the most significant slice of the recurrence, which includes the selection function, is implemented in radix-2, avoiding the additional delay introduced by the radix-10 carry-save additions and allowing the balancing of the paths to reduce the cycle delay. The results of the implementation of the proposed radix-10 division unit show that its latency is close to that of radix-16 division units (comparable dynamic range of significands) and it has a shorter latency than a radix-10 unit based on the Newton-Raphson approximation.
INTRODUCTION
T HE need for floating-point decimal arithmetic in hardware has been reported in several publications (some recent references being [1] , [2] , [3] , [4] ). Moreover, radix-10 units are included in some processors (although not floating-point) [5] and some recent designs for adders, multipliers, dividers, and square-root units have been described in [3] , [4] , [5] , [6] , [7] .
In this work, we present a radix-10 division unit that is based on the digit-recurrence algorithm. Digit-recurrence algorithms for division and square root give probably the best trade-off in delay area, do not require the use of a multiplier, and easily provide exact rounding. Previous designs of decimal dividers are reported in [4] , [5] and in several patents [8] , [9] , [10] , [11] , [12] , [13] . The previous digit-recurrence designs are relatively old and do not include recent developments in the theory and practice of this type of algorithm. In particular, the recent IBM z900 processor includes a decimal divider using the restoring algorithm [5] . On the other hand, [4] describes a recent implementation based on the Newton-Raphson approximation of the reciprocal. We compare with this implementation in Section 5 and show that the digit-recurrence approach might produce a lower latency.
The design described in this paper includes the following novelties, with respect to previous designs:
. Adaptation of the techniques developed for radix-2 k dividers such as 1. signed-digit redundant quotient, 2. carry-save representation of the residual, 3. decomposition of the quotient digit, 4. realization of the quotient-digit selection by preloading the selection constants and then comparing them with an estimate of the residual, 5. overlapped implementation of the selection of both components of the quotient digit, 6. retiming and separate implementation of the most significant digits (slice), and 7. on-the-fly conversion, normalization, and rounding. . More specifically, decomposition of the quotient digit into one radix-5 component and one radix-2 component, with the radix-5 component having values fÀ2; À1; 0; 1; 2g. This results in a redundancy factor of 7/9 and requires only multiples of the divisor d, 2d, and 5d. . Radix-2 algorithm and implementation of the most significant slice to reduce the critical path delay. The retiming ensures that the critical path is in this slice so that an efficient binary implementation reduces the cycle time. With respect to the floating-point format, several recent papers discuss the issue [1] , [2] . Of particular interest is the specification that has been added to the revision of the IEEE 754 Standard [14] . In this paper, we assume normalized and fractional significands for operands and result. Moreover, we use a binary coded decimal (BCD) coding for the decimal digits. For other representations and/or coding, it is possible to use conversions or to adapt the algorithm and implementation. Specifically, [4] describes an implementation of the modules required for compliance with the IEEE 754 Standard.
We show the proposed algorithm and architecture for the divider unit. We then implement the design by using a 90-nm complementary metal-oxide semiconductor (CMOS) standard cells library and Synopsys tools. Moreover, we use the same design tools and cells to implement two radix-16 units, namely, the basic retimed radix-16 and a faster radix-16 unit, as presented in [15] , and show that the radix-10 divider has a latency similar to that of the basic retimed radix-16 unit.
Finally, we compare with the design based on NewtonRaphson iterations [4] and conclude that the digit-recurrence approach that we propose offers the shortest latency for decimal division among the units presented so far.
OPERATION AND ALGORITHM
We perform the division operation Q ¼ X=D in which the operands and result have a floating-point representation such that X ¼ ðÀ1Þ
and Q ¼ ðÀ1Þ SÁ 10 E q . The corresponding structure of the operation is shown in Fig. 1 . We now concentrate on the digit-recurrence algorithm to obtain the significand of the quotient. To be specific, we perform decimal fractional division, with x and d normalized such that 0:1 x; d < 1 (see Table 1 for the notation used). We also produce a normalized quotient. The modification of the algorithm and the implementation for other cases are straightforward.
Background
We now summarize the theory and development of the radix-r digit-recurrence algorithm, as described in more detail, for example, in [16] . This will be the basis for the radix-10 case that we propose.
Given the significands of the dividend x and of the divisor d, we produce the significand of the quotient q such that
If the quotient has a maximum relative error of 1 ulp, then rem d Á ulp. Moreover, rounding (round-half-even mode) produces a quotient with a maximum relative error of 0.5 ulp.
In the digit-recurrence algorithm, the quotient is produced one digit per iteration of the following radix-r recurrence on the residual:
with initial condition w½0 ¼ x and with the quotient digit obtained by the selection function
where d rw½j and b d are estimates of rw½j and d, respectively. 
TABLE 1 Symbols and Notation
To allow for the use of estimates in the selection function, it is necessary to use a redundant digit set for the quotient digit. For a symmetrical (and continuous) digit set Àa q j a, the redundancy factor is defined as ¼ a=ðr À 1Þ. To assure convergence, the selection function should satisfy the range of the residual, namely, jw½jj d. Moreover, to speed up the subtraction, a redundant adder is used.
For the selection function, the divisor is partitioned into subranges ½d i ; d iþ1 Þ such that, for subrange i, the selection function is defined by the selection constants m k ðiÞ so that the quotient digit has value k if
The selection constants are selected so that the range of the residual is bounded. This is achieved by the condition
where ½L k ; U k is the selection interval for
Moreover, e is the maximum error introduced by using c rw instead of a truncated rw½j. This expression is specific to the case in which the estimate is obtained by truncating the carry-save representation of rw½j. In this case, 0 truncðrw½jÞ À d rw½j e.
Proposed Algorithm
We now describe the algorithm that we propose. As is usually done for algorithms using signed values and an addition or subtraction-based recurrence, we use a rangecomplement (equivalent to two's complement in radix 2) representation of these signed values. For the radix-10 case, to represent a signed value with n decimal digits, the representation has n decimal digits and one additional digit (bit) with a value of 0 or 1 so that a negative fraction x is represented by 2 À jxj.
To reduce the delay of the subtraction in the recurrence, we use a radix-10 carry-save representation of the residual, as shown in Fig. 2 , with w sðiÞ 2 ½0; 9 and w cðiÞ 2 ½0; 1. As a consequence of this representation, the subtraction is done using a radix-10 carry-save adder, which adds a carry-save operand and a conventional operand to produce a carrysave result, and the delay of the addition corresponds to the delay of one radix-10 digit.
The direct implementation of the radix-10 recurrence has two complications: 1) The selection function is complex and has a large delay and 2) the divisor multiples required for the recurrence are complicated to generate. To overcome these issues, we decompose the quotient digit as
with digit sets q Hj 2 fÀ1; 0; 1g and q Lj 2 fÀa L ; . . . ; a L g so that the recurrence is performed as follows:
We choose a L ¼ 2, which produces À7 q j 7, and a redundancy factor ¼ 7=9. This choice of the digit set of q L simplifies the generation of the multiples q Lj d.
Since, for convergence, jw½jj ð7=9Þd, we initialize w½0 ¼ x=100, which satisfies the range for the worst case, where x ¼ 1:0 and d ¼ 0:1.
Therefore, the recurrence is performed in the following steps (we later show how we can implement this faster):
. Compute
where d 10w½j and b d are estimates of 10w½j and d, respectively, as discussed further later. . Select 5q Hjþ1 d ¼ q Hjþ1 Á ð5dÞ from the precomputed À5d, 0, and þ5d and compute
with v½j also in radix-10 carry-save format.
and þ2d and compute
The above-mentioned recurrence requires the following:
. selection functions for q Hjþ1 and q Ljþ1 (described in Section 2.2.1), . precomputation of 5d and 2d, and . radix-10 carry-save subtractions. The multiples 2d and 5d are easy to generate. Specifically, the generation of 2d requires a carry propagation of only one digit and the generation of 5d can be performed by multiplying by 10 (a wired left shift) and dividing by 2, which requires a carry to the adjacent less significant position.
Although, in general, the range-complement representation of w½j would require an additional bit for the sign, in this case, since the most significant digit of the magnitude of w½j is 7, the most significant bit of the most significant digit acts as the sign bit.
Selection Function
We now determine the selection functions, namely,
The notation and process follow [16] for the general radix-r case.
The estimates d 10w and b v are obtained by using a limited number of digits of the carry-save representation. Although, in principle, this number could be different for 10w than for v, we use the same number because, as described later, we use a radix-2 representation for this slice of digits. Let us call t the number of fractional digits of the estimates. Moreover, as we will see in the implementation, we do not include in the estimate the carry bit of digit t. Then, the error due to the estimate is bounded by 1:12 Â 10
Àt , as shown in Fig. 3 . Then, the conditions for the selection constants m k are
where ½L k ; U k are the corresponding selection intervals.
To obtain the selection intervals, we need the ranges of 10w½j and v½j. The range of 10w½j is ½À10d; þ10d, which, for the selected digit set, becomes ½Àð70=9Þd; þð70=9Þd.
Note that, to simplify the notation, in the development that follows, we use the same notation, ½L k ; U k , for both q H and q L although they have different values and the index k corresponds to different sets.
Selection intervals for q H . We now obtain the selection intervals ½L k ; U k for q H ¼ k, with k 2 fÀ1; 0; 1g. From the recurrence,
and, since v½j ð25=9Þd, by replacing 10w½j by the upper limit of the selection interval, we obtain
Similarly, since v½j ! Àð25=9Þd, and L k is the lower limit of the interval,
Selection intervals for q L . Similarly, we obtain the interval ½L k ; U k for q L ¼ k, with k 2 fÀ2; À1; 0; 1; 2g. From the recurrence,
Introducing w½j þ 1 ð7=9Þd and the upper limit of the interval U k , we obtain
and, with w½j þ 1 ! Àð7=9Þd and the lower limit
Bound for t. To obtain a bound on t, the number of fractional digits of d 10w, we use the minimum overlap condition:
Àt ;
which occurs for the first subrange of d and the maximum k.
If this subrange has a width of 10 À , then we obtain, for q H ,
so that, for integer t and for ! 2, we obtain t ! 2.
Similarly, for q L ,
again resulting in t ! 2.
Since the delay of the selection function directly depends on t, we choose the minimum value t ¼ 2. Moreover, as explained next, we adjust the interval width Ád ¼ 10 À to reduce the number of intervals of d.
Selection constants. We now determine suitable selection constants. The usual method is to divide the interval of d into subintervals of width Ád and then obtain selection constants for each interval. However, in this design, we proceed as follows:
. Define selection constants that are integers. That is,
we define the selection function by the integer selection constants such that
. Obtain the selection constants from (4) transformed so that the selection constants are integers and for t ¼ 2. Moreover, from the possible values satisfying the condition, select the selection constant that is a multiple of 2 p for the largest possible p. This reduces the number of bits of the binary representation of the selection constant, which is convenient for the radix-2 implementation presented later. . Make m Àkþ1 ¼ Àm k (for both H and L) so that only the constants for positive k have to be determined fromd. This way, for each b d, we need to compute m H ðiÞ ¼ fm H1 g and m L ðiÞ ¼ fm L2 ; m L1 g. . Use the same subintervals for both functions and, beginning from d ¼ 0:1, use the largest possible subinterval. This minimizes the number of subintervals and, therefore, the number of products in the implementation. This procedure produces the constants, as described in Table 2 .
DIVIDER ARCHITECTURE
We now describe the architecture of the divider. Of particular interest are 1) the retiming of the recurrence, 2) the organization of the selection function, and 3) the radix-2 implementation of the most significant slice. We now discuss these aspects.
Retiming of the Recurrence
To be able to reduce the delay of the most significant slice, we start with a retimed version of the iteration in which q j and w½j À 1 are inputs and q jþ1 and w½j are outputs. That is, an iteration is described by the following expressions (see the 16-digit implementation in Fig. 4) :
As explained in Section 3.2, b v½j þ 1 is computed speculatively inside the SEL block (see Fig. 5 ). It is necessary to initialize w and q. As indicated before, to ensure convergence, we initialize w½0 ¼ x=100. Moreover, from the recurrence and q 0 ¼ 0, we obtain q 1 ¼ SELð d 10w½0; mðiÞÞ. Consequently, in Fig. 4 , this initialization is performed in the first cycle by selecting x=100 and initializing q H ¼ q L ¼ w c ¼ 0. The intialization effectively scales the dividend by 10 À2 and therefore produces a quotient in the range ð10 À3 ; 10 À1 Þ, that is, q ¼ q 0 :q 1 q 2 . . . ¼ 0:0xxx . . . or 0:00xxx . . . . Moreover, in the redundant (signed-digit) representation of the quotient, since the range of a digit is À7 to þ7, the first fractional digit needs to be 0 or 1 (the value of 1 is needed when the second fractional digit of the final quotient is larger than 7). During the conversion to nonredundant representation (Section 4), this value 1 is converted to 0. Table 3 shows some iterations of the division algorithm executed in the unit in Fig. 4 .
Organization of the Selection Function
We consider now two issues in the organization of the selection function, namely, 1) the preloading of the selection constants and selection by comparison and 2) the overlapping of a conditional q L with q H . That is, as shown in Fig. 5,   1 . As done in [17] and [15] for the binary case, we implement the selection function by preloading the selection constants, depending on the value of b d, Fig. 4 . Basic implementation of the radix-10 recurrence. Fig. 5 . Implementation of the selection function with overlapping and speculation.
and we perform the selection of q Hjþ1 and q Ljþ1 by comparing the corresponding constants with d 10w½j and b v½j þ 1, respectively. The preloading of the constants takes the corresponding delay out of the delay of the iteration. Moreover, the comparisons are done by a carry-save subtraction and a sign detection. This allows the balancing of the paths for delay reduction. 2. As is usually done for radix-16 dividers, we overlap the computation of a conditional q L with the computation of q H and then select the correct q L . Therefore, the inputs to the SELs q L are and, remembering that m LÀ1 ¼ Àm L2 and m L0 ¼ Àm L1 , set the inputs for each of the SELs q L as (see Fig. 5 ):
Radix-2 Implementation of the Most Significant Digits
Since the estimate used in the selection function is obtained from the three most significant digits of w½j, we separately implement the slice consisting of those digits. This has been done previously for binary division: first, in the 1960s, when the recurrence was using a carry-propagate adder [18] and then recently to be able to reduce the energy consumption of the not-critical part [19] and/or to reduce the delay by balancing the delay paths [15] . Moreover, we implement the most significant slice by using a radix-2 representation. This reduces the delay of the additions (both carry-save and carry-propagate) and reduces the width of the slice.
In the description of the most significant slice, it is convenient to refer to integer values. Specifically, since the most significant slice has three fractional digits, we define where w ð4Þ ½j À 1 is the integer corresponding to the fractional digit 4 of w½j À 1, and v cð3Þ ½j and w cð3Þ ½j are the carries of v½j and w½j into the fractional digit 3.
Combining the two previous expressions, we get 
A block diagram of the implementation of the most significant slice is shown in the lower part of Fig. 6 , also including the selection function. Note that, in this slice, we do not explicitly compute ðv½jÞ m since this value is computed speculatively in the selection function for q L . The critical path of this implementation is large, mainly because of the delay of the radix-10 addition in the most significant slice (most likely implemented by radix-10 carrysave adders: two in the recurrence, labeled "CSA tree" in Fig. 6 , and one in the selection function). To reduce this delay, we implement this slice in radix 2. The disadvantages of this are the multiplication by 10 (which now requires a 4-to-2 adder) and the addition of w ð4Þ ½j À 1 (which now cannot be done by concatenation). However, we will perform a retiming of the recurrence so that these operations are outside the critical path.
The radix-2 recurrence is obtained directly from (6) . To distinguish from the radix-10 case, we use uppercase notation for the radix-2 variables (and eliminate the subscript m). That is, Since W ½j is the integer corresponding to the three most significant digits of the carry-save representation of w½j and jw½jj 7=9, the range of W ½j is ½À779; þ777, so it is represented in radix 2 (two's complement) by 11 bits.
The corresponding radix-2 implementation is shown in Fig. 7 and requires . precomputation of the radix-2 representation of AEd m , AEð2dÞ m , and AEð5dÞ m (that is, AED, AE2D, and AE5D, respectively),
. multiplication by 10 (this is implemented as 10z ¼ 8z þ 2z by using a 4-to-2 (radix-2) carry-save adder), and . additions in the recurrence by using radix-2 carrysave adders. There are two paths that potentially are the critical paths, namely, 1) the path in the radix-2 part corresponding to the sum of the delays in the multiplication by 10, the addition of w ð4Þ , the addition of the multiples of the divisor, and the selection function or 2) the path in the radix-10 part that produces w cð3Þ ½j.
The path in the radix-2 part is reduced by retiming since the multiplication by 10 and the addition of w ð4Þ are outside the q j À q jþ1 loop. Consequently, they can be performed in parallel with the (previous) selection function. For the path in the radix-10 part, we perform the selection without including w cð3Þ ½j and add this value after the multiplication by 10 as 10w cð3Þ ½j. The selection function has already been designed to accommodate this postponement.
The radix-2 part has to be initialized with the corresponding portion of x=100, namely, 1; 000ðtrunc 3 ðx=100Þ. This is the integer corresponding to the first digit of x, that is, x ð1Þ .
The corresponding diagram is shown in Fig. 8 and the resulting implementation in Fig. 9 .
The registers are placed in such a way as to balance the different paths. Consequently, the best placement depends on a specific implementation. Fig. 9 shows the placement obtained for the implementation presented in Section 5. The registers are called q H , q L , Y , w s , and w c . Furthermore, to keep the precomputation of 2d, 5d, and their radix-2 equivalents out of the critical path, we latch those values as well. Note that those precomputed values are not needed in the first iteration, as explained next.
Operation Execution
The computation of the 16-digit normalized and rounded significand of the quotient requires 19 iterations if x ! d or 20 iterations if x < d, as explained in Section 4.2.
To summarize, the architecture shown in Fig. 9 implements the algorithm in Table 4 .
CONVERSION, NORMALIZATION, AND ROUNDING

On-the-Fly Conversion
The on-the-fly-conversion algorithm [16] performs the conversion from the signed-digit representation of the quotient digit q j to b j , with the conventional representation in BCD (radix 10). The conversion is done as the digits are produced and does not require a carry-propagate adder.
The partial quotient is stored in two registers: Q, holding the converted value of the partial quotient Q½j, and QM, holding Q½j À 10 Àj . The registers are updated in each iteration by shift-and-load operations and the final quotient is chosen between those two registers during the rounding.
In the radix-10 case, the implementation of the two registers Q and QM, with a precision of n digits, requires 2 Â ðn Â 4Þ flip-flops. This results in a large area and a highpower dissipation due to the shifting during each iteration.
To reduce both power dissipation and area, we modify the on-the-fly algorithm, as described in [19] . Specifically:
1. We load each digit in its final position. This way, we avoid shifting digits along the registers. To determine the load position, we use an n-bit ring counter C, one bit for each digit to load. 
2.
We reduce the partial-quotient registers from two to one by eliminating register QM and by including in register Q a digit decrementer controlled by the ring counter C (see [19] for more details). With these modifications, we reduce the number of flipflops in the conversion unit from 2 Â ðn Â 4Þ to n þ ðn Â 4Þ.
On-the-Fly Normalization
The load-digit-in-place mechanism can be used to normal- 1. Although normalization is not required in the IEEE standard for decimal representation if the quotient is exact, even in this case, it might be better to provide a normalized quotient to the compliance modules. This is especially true since the normalization has no additional cost.
Because of the initialization, the quotient obtained corresponds to Q ¼ q=100. Consequently, the converted quotient is multiplied by 100 (shifted left, two digits), resulting in Q 0:0yxxx . . . q y:xxxxx . . . ;
where we indicate with y the digit of weight 10 À2 in Q.
Therefore, to obtain a normalized result, the two cases are listed as follows: Table 5 ).
Calling L the number of digits of the final quotient, we produce L þ 1 digits (one additional quotient digit for the rounding). Consequently, the number of cycles ðN þ 1Þ is L þ 3 for case A and L þ 4 for case B. Specifically, for L ¼ 16, we get 19 and 20 cycles, respectively.
Rounding
In the implementation of the unit, we only consider the round-half-even rounding mode, but the other modes can be implemented in a similar manner.
For the normalized quotient, the rounding is always performed in position L þ 1 in register Q (see Fig. 10 ). 
Clearly, a borrow ðborrow RND Þ or a carry ðcarry RND Þ might be generated into position L. Although the propagation of the borrow is eliminated by the on-thefly conversion, the propagation of the carry (which occurs, for instance, when, before conversion, we have . . . 3 0 0 0 À1 q N ! . . . 2 9 9 9 9 b ðLþ1Þ ) has to be addressed separately. This carry propagation can be avoided if the conversion of digit L is performed including the carry from the rounding (in this case, in the conflicting situation above, the converted digit becomes 0). That is, the conversion of digit L has to be postponed one cycle. This is achieved by freezing the update of the marked digits when iteration L is reached. The update of digits in register Q is resumed only after the outcome of (8) is known. An example is shown in Table 6 .
IMPLEMENTATION AND COMPARISONS
In this section, we present the results of the evaluation of the proposed design and a comparison with a doubleprecision radix-16 digit-recurrence division unit which has roughly the same dynamic range for the normalized significand ð2 53 < 10 16 < 2 54 Þ. Furthermore, we compare our results with those obtained in [4] for a divider based on the Newton-Raphson approximation.
We performed a synthesis of the decimal divider, as shown in Fig. 9 , by using the STM 90-nm CMOS standard cells library [20] and Synopsys Design Compiler. From the synthesis, we estimated the critical path (including estimations at the netlist level of wire load) and the area. The critical path is highlighted in Fig. 9 (dotted line) and reported in detail in Table 7 .
The results are compared with those of [15] and reported in Table 8 . The data in Table 8 show that the decimal divider has a latency close to that of a standard radix-16 and its area is close to that of the faster radix-16 unit in [15] .
With respect to the implementation in [4] , it is quite difficult to compare between the two different algorithms. However, the implementation in [4] requires 150 cycles to compute a 16-digit quotient (as in our case) and has a critical path of 0.7 ns in a 0.11 m standard cells library. Assuming a constant field technology scaling 2 [21] , the critical path of 0.7 ns scales approximately to 0:7 Á ð90=110Þ ¼ 0:57 ns. The resulting latency is 150 Â 0:57 ¼ 85:5 ns, which is more than four times the latency of the divider presented in this work. Furthermore, the expression given in [4] for the number of cycles for a generic implementation of the Newton-Raphson algorithm is n: cycles ¼ 10 þ P ð3 þ 2mÞ, where P is the number of cycles needed to perform a decimal multiplication and m is the number of iterations needed for the approximation of 1=d (Newton-Raphson iterations). Because, to our knowledge, decimal multiplication requires several cycles (n þ 4 cycles, where n is number of digits, in [3] , for example), the digitrecurrence approach proposed in this work seems quite advantageous in terms of latency.
CONCLUSIONS
In this work, we presented a radix-10 division unit that is based on the digit-recurrence algorithm. We developed the algorithm and the divider architecture and implemented it in standard cells. The unit implemented (16-digit BCD) has a dynamic range of the significand that is comparable to that of the IEEE double-precision standard.
Since there were no comparable digit-recurrence radix-10 units, we compared with a radix-16 divider because the radices are similar. We concluded that the proposed radix-10 divider has a latency close to that of a basic retimed radix-16 divider, which a floating-point-unit designer might use when considering a fast implementation for medium-range radices. Moreover, we compared the proposed divider with a recent implementation of a decimal 2. In [4] , the supply voltage is V DD ¼ 1:2 V, whereas, in the library that we used, it is V DD ¼ 1:0 V. Therefore, the ratio of the feature lengths 110=90 ¼ 1:2 equals the ratio of the supply voltages. [15] divider based on the Newton-Raphson approximation [4] . Although a comparison between such different approaches depends on too many parameters, considering the timing reported for the two implementations, we are quite confident that the design presented here shows a latency (execution time for a 16-digit division) about four times shorter than that of the unit based on the Newton-Raphson approximation.
