Abstract-A few classes of algorithms have been used over the years to implement division in hardware: division by digit-recurrence, by reciprocal approximation with iterative methods, and by reciprocal approximation with polynomials. Because the algorithms differ, a comparison among their implementations in terms of performance and precision is sometimes hard to make. In this work, we use power dissipation and energy consumption as metrics to compare those classes of algorithms. There are no previous works in the literature presenting such a comparison.
I. INTRODUCTION
The quotient q of the division
$$q = \frac{x}{d}$$
can be computed directly or by multiplication of the reciprocal of d and the dividend x
$$q = \frac{1}{d} \cdot x .$$
The digit-recurrence algorithm [1] is a direct method to compute the quotient q. On the other hand, the reciprocal of d can be computed by iterative approximation (Newton-Raphson) or by polynomial approximation [2]. Those algorithms differ in a number of aspects, as explained later.
Power dissipation has become a major concern in the design of integrated circuits because of its impact on costs (packaging, cooling systems, power bills) and on the battery lifetime of portable devices.
Division is implemented in hardware in all general-purpose CPUs and in most processors used in embedded systems, and it is part of the arithmetic co-processors used in advanced hearing aids. Therefore, low-power division is important to lower the costs of the multicore chips powering servers, to increase their reliability, and to extend the battery lifetime of portable and wearable devices.
In this work, we compare in terms of power dissipation and energy consumption the three main algorithms used to compute division in hardware: division by digit-recurrence, division by approximation of the reciprocal with the Newton-Raphson (NR) method, and division by quadratic polynomial approximation of the reciprocal. We compare both single- and double-precision division units. For the digit-recurrence division, we also present a low-power version of the algorithm based on the methods of [3].
II. DIVISION BY DIGIT-RECURRENCE
The radix-r digit-recurrence division algorithm for double-precision significands is described in detail in [1]. For radix-4, which is a standard implementation of the algorithm, it is implemented by the residual recurrence
$$w[j+1] = 4w[j] - q_{j+1} d, \qquad j = 0, 1, \ldots$$
with the initial value w[0] = x and with the quotient-digit selection
$$q_{j+1} = SEL(d_\delta, y)$$
where $d_\delta$ is d truncated after the δ-th fractional bit (δ = 3 for radix-4) and the estimated residual, $y = (4w_S)_t + (4w_C)_t$, is truncated after t fractional bits (t = 3 for radix-4). The residual w[j] is kept in carry-save format ($w_S$ and $w_C$) to have a shorter cycle time. The divider is completed by an on-the-fly conversion unit, described in [1], which converts the quotient digits $q_{j+1}$ from the signed-digit to the conventional representation and performs the rounding based on the sign of the remainder computed by the sign-zero detect (SZD) block. The conversion is done as the digits are produced and does not require a carry-propagate adder. The scheme of the unit is depicted in Fig. 1 and the results of its implementation are listed as r4-std in Table I.
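The residual recurrence can be illustrated with a minimal Python sketch on exact rationals. It is an illustration under simplifying assumptions, not the unit of Fig. 1: the residual is kept in non-redundant form, the digit in {-2, ..., 2} is selected by rounding the exact shifted residual instead of using truncated carry-save estimates, and the dividend is pre-scaled by 1/4 (an assumption of this sketch) so that the residual stays bounded for x and d in [1, 2). The function name and the parameter n_digits are hypothetical.

```python
from fractions import Fraction


def radix4_divide(x, d, n_digits=28):
    """Radix-4 digit-recurrence division sketch for x, d in [1, 2)."""
    d = Fraction(d)
    w = Fraction(x) / 4      # assumed pre-scaling: keeps |w| <= (2/3) d
    q = Fraction(0)
    digits = []
    for j in range(n_digits):
        shifted = 4 * w
        # Idealized selection: nearest integer, clamped to the digit set.
        digit = max(-2, min(2, round(shifted / d)))
        w = shifted - digit * d          # w[j+1] = 4 w[j] - q_{j+1} d
        q += Fraction(digit, 4 ** (j + 1))
        digits.append(digit)
    return 4 * q, w, digits              # undo the 1/4 pre-scaling
```

With this selection the residual never exceeds (2/3)d in magnitude, so after n digits the quotient error is bounded by (8/3) · 4^(-n).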
A. Low-power implementation
Starting from the scheme implemented in Fig. 1, we applied a number of design techniques [3] to reduce the power dissipation without increasing the cycle time. We consider two main portions: the recurrence, and the conversion and rounding (C&R).
1) Retiming the recurrence:
Retiming is the circuit transformation that consists in repositioning the registers in a sequential circuit without modifying its external behavior [4]. By retiming the recurrence, we limit the cells on the critical path to the most-significant bits of the recurrence. The idea is to create slack in the timing paths so that high-speed (HS) gates can be replaced with slower, low-power (LL) gates. The retiming is done by moving the selection function from the first part of the cycle to the last part of the previous cycle (cf. Fig. 1 and Fig. 2). We have to introduce a new register to store the quotient digit, but the register $q_j$ is quite small (4 bits) and does not compromise the energy saving obtained by retiming. After the retiming, the critical path is limited to the 8 most-significant bits in the recurrence. Since the path through the least-significant bits of the multiple generator and the CSA does not include the selection function, these bits can be redesigned for low power by changing HS cells into LL cells.
With these modifications, a significant reduction in power dissipation (-30% for dynamic and -70% for static power) is obtained. Fig. 3 shows the change in the HS and LL cell mix before and after the retiming.
2) Changing the Redundant Representation:
Since the contribution of flip-flops to both energy dissipation and area is significant, it is useful to change the redundant representation of the residual ($w_S$ and $w_C$) to reduce the number of flip-flops in the registers. By using a radix-4 carry-save representation with two sum bits and one carry bit for each digit (instead of two carry bits), we only need to store one carry bit per digit and thus reduce the number of flip-flops.
The change in the redundant representation requires a redesign of the carry-save adder to propagate the carry inside the digit (Fig. 4). The propagation of the carry increases the delay, so this modification cannot be made for those cells (digits of w) that are on the critical path. After the recurrence has been retimed, the critical path is limited to the 8 MSBs, and in the remaining 2-bit digits we can use radix-4 CSAs.
3) Disabling the SZD unit:
This modification consists in switching off blocks that are inactive for several cycles. This is the case for the sign-zero-detection (SZD) block, which is only used in the rounding step to determine the sign of the final remainder and whether it is zero. The SZD can be switched off by forcing a constant logic value at its inputs during the recurrence steps.
4) On-the-fly conversion algorithm modification:
The on-the-fly convert-and-round (C&R) algorithm [1] performs the conversion from the signed-digit representation to the conventional representation in 2's complement. The partial quotient is stored in two registers: Q, holding the converted value of the partial quotient, and QM, holding Q-1. The registers are updated in each iteration by shift-and-load operations, and the final quotient is chosen between those two registers during the rounding. The large amount of power dissipated in the unit is mainly due to the shifting during each iteration and to the number of flip-flops used to implement the registers. The power dissipation is reduced as follows:
1) We load each digit in its final position, which avoids shifting digits along the registers. To determine the load position we use a 28-bit ring counter C, one bit for each digit to load.
2) We reduce the partial-quotient registers from two to one by eliminating Register QM and by including in Register Q a digit decrementer controlled by the ring counter C (see [3] for more detail).
3) We switch off the clock signal (clock gating) for the flip-flops that do not have to be updated in a given iteration.
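The baseline two-register conversion (Q and QM, before the low-power modifications listed above) can be sketched in Python as follows. This is a minimal illustration that assumes Python integers stand in for the registers and that a shift-and-load is modeled as a multiplication by the radix plus an appended digit; the function name is hypothetical.

```python
def on_the_fly_convert(digits, radix=4):
    """Convert signed digits (most significant first) to an integer.

    Returns (Q, QM) with QM == Q - 1 after every step.
    """
    q, qm = 0, -1
    for digit in digits:
        # New Q: append the digit to Q, or a complemented digit to QM.
        if digit >= 0:
            new_q = radix * q + digit
        else:
            new_q = radix * qm + (radix + digit)
        # New QM: append digit - 1 to Q, or a complemented digit to QM.
        if digit > 0:
            new_qm = radix * q + (digit - 1)
        else:
            new_qm = radix * qm + (radix - 1 + digit)
        q, qm = new_q, new_qm
    return q, qm
```

Every appended digit lies in the range 0 to radix-1, which is why no carry has to propagate along the registers. For the digit sequence [1, -2, 0, 2, -1] the sketch returns Q = 135 and QM = 134.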
The results of this low-power implementation are listed as r4-lp in Table I.
III. DIVISION BY NEWTON-RAPHSON 1/d APPROXIMATION
The division q = x/d can also be implemented by the approximation of the reciprocal R = 1/d, followed by the multiplication q = Rx. By determining R[0] as the first approximation of 1/d, R can be approximated in m steps by the Newton-Raphson approximation
$$R[j+1] = R[j] \, (2 - R[j] \, d), \qquad j = 0, 1, \ldots, m .$$
Each iteration requires two multiplications and one subtraction. The convergence is quadratic and the number of iterations m needed depends on the initial approximation R[0] (implemented by a look-up table in our case). The values of the initial approximation are reported in [8]. More detail on how to determine R[0] and on the approximation accuracy is given in [5]. The implementation of the division by NR is sketched in Fig. 5. To have rounding compliant with the IEEE standard, an extra iteration (cycle) is required to compute the remainder
$$rem = x - q \cdot d$$
and perform the rounding according to the specified mode [2].
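A minimal Python sketch of the Newton-Raphson scheme, ending with the remainder computation, is given here for illustration. It rests on assumptions that are not taken from the unit of Fig. 5: the initial approximation comes from a hypothetical 64-entry table holding the reciprocal of each interval mid-point for d in [1, 2) (not the values reported in [8]), and exact rationals replace the finite-width multipliers. All function and parameter names are hypothetical.

```python
from fractions import Fraction


def initial_reciprocal(d, index_bits=6):
    """Hypothetical look-up: reciprocal of the interval mid-point."""
    step = Fraction(1, 2 ** index_bits)
    k = int((d - 1) / step)                  # table index for d in [1, 2)
    midpoint = 1 + (k + Fraction(1, 2)) * step
    return 1 / midpoint


def nr_divide(x, d, iterations=3):
    """Approximate x / d by Newton-Raphson on the reciprocal of d."""
    x, d = Fraction(x), Fraction(d)
    r = initial_reciprocal(d)
    for _ in range(iterations):
        r = r * (2 - r * d)    # two multiplications, one subtraction
    q = r * x                  # final multiplication by the dividend
    rem = x - q * d            # remainder, used to decide the rounding
    return q, rem
```

Because the relative error is squared at each step, an initial error below 2^-7 falls below 2^-56 after three iterations in this idealized setting.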
IV. DIVISION BY 1/d POLYNOMIAL APPROXIMATION
Alternatively, the reciprocal 1/d can be obtained by polynomial approximation. This approximation is normally applied to operations in single precision (or smaller), as the hardware complexity becomes excessive for larger precisions. An example of a unit implementing the piecewise quadratic polynomial approximation of 1/d is reported in [6].
A look-up table is used to retrieve optimized coefficients and the polynomial is evaluated by a high speed datapath. We followed the method proposed in [7] to generate the coefficients which can result in a smaller table than that of [6] . The function to compute 1/d is approximated as follows:
where $d^*$ is the mid-point in each interval.
The look-up table implements m, $K_y$, $K_m$ and $K_p$ with precision of 6, 12, 17, and 27 bits, respectively, corresponding to a table size of 64 · (2 · 12 + 17 + 27) = 4288 bits. The error to approximate 1/d is smaller than $2^{-24}$. The approximation tables with the values of $K_y$, $K_m$ and $K_p$ are reported in [8].
The approximation unit is depicted in Fig. 6 . A squarer is used to compute
The multipliers recode the second operand into radix-4 digits before generating the partial products. Once the individual terms are ready, they are aligned and a 3-to-2 CSA sums them up. An additional multiplication of x and 1/d is required to obtain the quotient.
The results of this implementation are listed as poly in Table I.
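The general idea of a table-driven piecewise quadratic approximation can be sketched in Python as follows. This is only an illustration: it assumes Taylor coefficients around the mid-point of each of 64 intervals as a stand-in for the optimized coefficients generated with the method of [7], so its error is larger than the bound stated for the actual unit. The function names and the table layout are hypothetical.

```python
def build_table(index_bits=6):
    """Hypothetical table: mid-point and Taylor coefficients of 1/d."""
    n = 2 ** index_bits
    table = []
    for k in range(n):
        mid = 1.0 + (k + 0.5) / n            # mid-point of interval k
        table.append((mid, 1.0 / mid, -1.0 / mid ** 2, 1.0 / mid ** 3))
    return table


def reciprocal_poly(d, table):
    """Approximate 1/d for d in [1, 2) with one quadratic per interval."""
    k = int((d - 1.0) * len(table))          # leading bits select the entry
    mid, c0, c1, c2 = table[k]
    delta = d - mid
    return c0 + delta * (c1 + delta * c2)    # Horner evaluation
```

With these stand-in coefficients the truncation error stays below 2^-20 over [1, 2); optimized coefficients of the kind used in the unit are what bring the error down further.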
V. ENERGY METRICS
Because the algorithms are different and the latency of the operations varies from case to case, it is convenient to have a measure of the energy dissipated to complete an operation. This energy-per-operation is given by
$$E_{op} = \int_{t_{op}} v(t) \, i(t) \, dt$$
where $t_{op}$ is the time elapsed to perform the division. Divisions are usually performed in more than one cycle (in n cycles) of clock period $T_C$, and the expression of $t_{op}$ is typically $t_{op} = T_C \times n$. By dividing the energy-per-operation by the number of cycles we obtain the energy-per-cycle
$$E_{pc} = \frac{E_{op}}{n} .$$
This term is proportional to the average power dissipation, which can be expressed in its equivalent forms:
$$P_{ave} = \frac{E_{pc}}{T_C} = V_{DD} \, I_{ave}$$
where $V_{DD}$ is the unit supply voltage and $I_{ave}$ its average current. By combining the energy-per-cycle and the expression of $t_{op}$ we obtain
$$E_{op} = P_{ave} \cdot t_{op} .$$
The term P ave has an impact on the sizing of the power grid in the chip and on the die temperature gradient, while the term E op impacts the battery lifetime.
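The relations among these metrics can be written as a short Python sketch. It assumes the average power is the energy-per-cycle divided by the clock period; the function names and the numbers in the example are hypothetical and are not taken from Table I.

```python
def energy_per_cycle(e_op, n_cycles):
    """Energy-per-cycle in joules: E_op divided by the number of cycles."""
    return e_op / n_cycles


def average_power(e_op, n_cycles, t_clock):
    """Average power in watts: energy-per-cycle over the clock period."""
    return energy_per_cycle(e_op, n_cycles) / t_clock
```

For instance, a hypothetical unit that spends 100 pJ over 10 cycles of 1 ns dissipates 10 pJ per cycle and 10 mW on average.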
VI. RESULTS OF EXPERIMENTS
Because the polynomial approximation of Section IV can only be implemented for single-precision operands, we first perform a comparison for double-precision operands and then we compare division algorithms for single-precision operands.
The units are implemented in the STM 90 nm library of standard cells [9] and the power dissipation has been computed by Synopsys Power Analyzer based on the annotated switching activity of randomly generated vectors.
A. Double-precision division
The upper part of Table I shows the comparison of the units described in Fig. 1 (r4-std), Fig. 2 (r4-lp) and Fig. 5 (NR) for double-precision operands.
The results of the experiments show that the latency of the whole division (including rounding) is about 30% shorter for the Newton-Raphson approach, at the expense of area.
In terms of energy consumption, Table I shows that the units implementing the digit-recurrence algorithm consume less energy than the unit implementing division by NR: r4-lp consumes about one fourth of the energy per division of the NR unit. As for the average power dissipation, the digit-recurrence units are significantly better as well: the power dissipation of r4-lp is about one third of that of the NR unit.
[Table I, upper part: DOUBLE-PRECISION DIVISION]

B. Single-precision division
For the digit-recurrence and the NR division, we estimated the power dissipation for single-precision operands by using the datapath for double-precision division. We reduced the operand size from double to single precision, reduced the number of iterations, and eliminated the steps required for rounding.
Clearly, parts of the datapath keep switching even if the operands have a reduced bit size, and therefore the NR, r4-std, and r4-lp units can be further optimized for single-precision operands.
The values in the lower part of Table I show that poly, the unit implementing division by polynomial approximation, has the shortest latency (1 clock cycle, corresponding to 3.5 ns), but larger area and power dissipation than the digit-recurrence division units. Quite surprisingly, the energy-per-division ($E_{op}$) for the poly implementation is smaller than that of r4-lp (88.6 pJ vs. 93.6 pJ).
VII. CONCLUSIONS AND FUTURE WORK
The results of this survey on different approaches to the implementation of division in hardware show that methods based on the digit-recurrence algorithm give the lowest power dissipation and energy-per-cycle. The implementation of division by polynomial approximation of the reciprocal has the highest power dissipation but, surprisingly, because of its reduced latency, it also consumes the smallest energy for the whole single-precision division.
The method based on the approximation of the reciprocal by Newton-Raphson is the least favorable in terms of area and power/energy consumption. The division by NR has the shortest latency for double precision, while for single precision the radix-4 digit-recurrence implementation has a shorter latency than the NR.
The lower energy consumption per operation of the poly approach should be further investigated for very low energy devices, possibly by trading off some speed for lower power dissipation.
