Abstract-This paper presents a simple hardware architecture for computing the seed values for reciprocal and square root reciprocal. These seeds are used in the initialization of floatingpoint division and square root software iterations. The proposed solution is based on polynomial approximation with specific coefficients and a table lookup.
INTRODUCTION
In general-purpose processors, floating-point division and square root are often performed using iterative methods such as the Newton-Raphson algorithm [1] , [2] , [3] . In order to reduce the number of iterations, dedicated hardware tables are frequently used to store medium accuracy initial values, or seeds, for the iterations. In this work, we focus on the computation of such seeds in hardware.
The proposed solution is based on a polynomial approximation with specific coefficients and a table lookup. The corresponding architecture is very simple and leads to small and fast circuits. There are several parameters in seed hardware operator: the approximated function (1/x or 1/ √ x), the argument width, internal accuracy requirements and optimization parameters. We have developed a VHDL code generator, called seedgen, to support all the possible parameters of our method. This program generates an optimized and synthesizable VHDL description of the seed operator based on our method. Compared to direct lookup table implementation on FPGAs (with content optimization using the synthesis tools), the proposed method leads to 2.5 reduction factor in size and 40% speed improvement.
I. BACKGROUND

A. Newton-Raphson Iteration
The Newton-Raphson algorithm evaluates a function by iteratively improving an initial approximation. This algorithm is based on a general method to obtain a single zero α of function f (i.e., f (α) = 0 and f (α) = 0). If x 0 is close enough to α, the following iteration converges towards α:
where f (x) denotes the derivate of f with respect to x. The convergence of the method is quadratic, which means that the number of bits of accuracy roughly doubles after each iteration. In order to speed up the overall computation, the first iterations that only provide a very small number of bits of accuracy should be avoided. For instance, if one uses an initial value x 0 with only one bit of accuracy, the computation requires at least 5 iterations (1 → 2 → 4 → 8 → 16 → 32) for 32-bit values. Using an initial seed with 8 bits of accuracy, it only requires 2 iterations (8 → 16 → 32).
In general, by using an initial approximation with a relative error not greater than 2 −n , the number of iterations to obtain the result with a relative error not greater than 2 −m is
Therefore, the choice of the initial approximation is critical for the speed of the algorithm. On the other hand, a seed with a large number of bits is costly to obtain. Consequently, finding a simple method of obtaining the seed value is of great practical interest in implementing the Newton-Raphson method. This speed-up method also applies in the Goldschmidt iteration for division and square root [2] .
Direct lookup table implementations of such seeds are possible for small accuracy. For instance, the IBM 360/91 processor used generated 10-bit seed from a table (implemented as a ROM) addressed by 7 bits of the normalized divisor [4] .
For larger accuracies, the bipartite method is more and more used [5] . For instance, in the AMD-K7 processor an optimized and simplified reciprocal and square root reciprocal estimate interpolation unit is implemented [6] . The reciprocal estimate provides at least 14.94 bits of accuracy while the square root reciprocal is accurate to at least 15.84 bits. This unit requires 69Kb of ROM and 3 cycles of latency.
B. Division Iteration
For the computation of the quotient q = a/d, one can use the following 2-step method:
1) Evaluate t = 1/d using the Newton-Raphson iteration based on the function f (x) = 1 x − d. The corresponding iteration is:
2) Compute the quotient using q = t × a. Each iteration requires two multiplications and one addition. The second step requires an additional full-length multiplication.
A table is often used to get the initial value x 0 based on a few most significant bits of d. As an example, on the Itanium processor, the frcpa instruction gives a seed for reciprocal (1/d) with an accuracy of 8.886 bits (see [3] for details).
C. Square Root Iteration
Directly evaluating √ c using the Newton-Raphson algorithm with the function f (x) = x 2 − c is not a good idea. The corresponding iteration is x i+1 = . The corresponding iteration is:
2) Multiplication by c to get √ c. Each iteration requires three multiplications and one addition.
In this case also, a table is often used to get the initial value x 0 based on a few most significant bits of c. As an example, on the Itanium processor, the frsqrta instruction gives a seed for 1/ √ c with an accuracy of 8.831 bits (see [3] for details).
D. Minimax Polynomial Approximation
The initial approximations used in the following are based on the minimax polynomial approximation as a starting point. The degree-d minimax polynomial approximation to f on [a, b] is the polynomial P * that satisfies:
where P d is the set of polynomials with real coefficients and degree at most d and
Minimax approximations can be computed using a wellknown algorithm due to Remes [7] (available in the Maple numapprox package, for instance).
II. RECIPROCAL SEED
The degree-1 minimax polynomial of f (x) = 1/x with x ∈ [1, 2[ (the range of the mantissa of a floating-point number) computed using Maple is
This polynomial provides an approximation with 4.5 bits of accuracy.
In this work, we use the minimax polynomial as a starting point because it is convenient and well implemented (e.g., Maple minimax function). But for degree-1 polynomial approximation, another method is possible to get the polynomial coefficients exactly, for details see [2, page 373] . Using this method one can get the polynomial (3/4+ √ 2/2)−x/2 which is theoretical polynomial that correspond to the minimax polynomial (7) . Theoretically, this is the same polynomial, the slight difference in the constant coefficient is due to the rounding error during the Remez algorithm numerical execution. So, the two polynomials can be used as a starting point for our method since they provide similar accuracy and coefficients.
In the polynomial (7), the degree-1 coefficient is a power of 2. The constant coefficient is close to the value 1.5. So the following "close" polynomial can be used to approximate 1/x:
This modified polynomial p approximates 1/x on [1, 2[ with 3.5 bits of accuracy. We will use this very simple polynomial as a starting point to compute the seed value for the reciprocal function.
The evaluation cost of this polynomial is very small in practice. It is illustrated in Figure 1 . Using 2's complement the value −x/2 is obtained by complementing all bits of x, shifted one position to the right, and adding a 1 in the LSB position. As x ∈ [1, 2[, the first bit of x is 1 (at position 0). If the binary representation of x is (1.x 1 x 2 x 3 . . . x n ) 2 , then the binary expansion of −x/2 is (1.0x 1 x 2 x 3 . . . x n ) 2 plus 1 LSB. So the computation of 3/2 − x/2 only costs one carry propagation corresponding to the additional 1 LSB. This kind of a very simple polynomial approximation was proposed in [8] . In order to improve the result of this approximation (p only provides 3.5 bits of accuracy), we add a correcting term t(x) to the result of this modified polynomial. Our method is based on this very simple polynomial approximation (8) and a table to store the correcting terms. So the value of the seed is:
where the correcting term t(x) is the difference between the actual function 1/x and the result of p(x) rounded to the output width:
The corresponding architecture is presented in Figure 3 .
The table used to the correcting terms has n-bit addresses (the bits of the operand x) and w-bit words. The final result (seed) s(x) has n + g bits where g is the number of guard bits (g > 1 for the reciprocal function). The alignment of the n bits of p and the w bits of t is illustrated in Figure 2 using an example. The main hardware cost of our solution is one (2 n × w)-bit table and one (n + g)-bit addition with input carry. Optimizations at circuit level are presented in Section IV.
In addition to the VHDL description of the architecture, seedgen generates a plot of the approximation results. The corresponding plot is presented Figure 4 in the case n = 3 bits and g = 1 bit for the reciprocal function. The generated plot is based on a gnuplot script [9] . The accuracy (corresponding to the maximum error) provided by our architecture is n+g+1 bits for the reciprocal. The choice of the correcting terms t(x) can be done to ensure the seed value s(x) is within a 1/2ulp distance to the theoretical value of 1/x. The plot presented in Figure 5 illustrates the approximation error corresponding to the case n = 3 bits and g = 1 bit. This plot is also generated using seedgen. The polynomial p overestimates the actual value of f (see Figure 4 for an illustration). Then the sign of the difference between p and f is always greater than or equal to zero. At the implementation level, the correction can be implemented using two solutions:
• store positive offsets in the table t and perform a subtraction p − t; • store negative offsets in the table t and perform an addition p + t. The best solution depends on the basic cells available at low level. Figures 3 and 2 show that the alignment of p and t operands allow to improve the adder architecture used to perform p + t. Indeed, in some cases one can use incrementer cells instead of adder cells for the w − g LSBs. In practice, this optimization is provided at the synthesis level by most of standard tools.
III. SQUARE ROOT RECIPROCAL SEED
A similar solution is proposed for the square root reciprocal. The degree-1 minimax polynomial of f (x) = 1/ √ x with x ∈ [1, 2[, and 5.7 bits of accuracy is
This polynomial is close to:
which gives an approximation to 1/ √ x on [1, 2[ with 4 bits of accuracy. The evaluation cost of this polynomial is also very small in practice as illustrated on Figure 6 . Using 2's complement the value −x/4 is obtained by complementing all bits of x shifted 2 bits to the right and add 1 LSB. As x ∈ [1, 2[, the binary expansion of −x/4 is (1.10x 1 x 2 x 3 . . . x n ) 2 plus 1 LSB. The computation of 5/4 − x/4 only costs one carry propagation corresponding the additional 1 LSB. So the same kind of architecture presented in Figure 3 can be used in case of the square root reciprocal. As the x is shifted two positions to the right the number of guard bits is g ≥ 2 in the square root reciprocal case. Now the generator has to compute the correct constant bits of −x/4 and the correcting terms 1 t(x) with
The accuracy (corresponding to the maximum error) provided by our architecture is also n + g + 1 bits (1/2 LSB) for the square root reciprocal.
IV. VHDL GENERATOR
We have developed a program that generates optimized and synthesizable VHDL descriptions of seed architectures using our method. This program, called seedgen, is a C program in text mode and distributed under the GPL license. It is available on the Web [10] . All the seedgen parameters are given on the command line. Table I , given in the last page of the paper, reports the main parameters. 1 The correcting terms t here are specific to the square root reciprocal functions.
V. ACCURACY RESULTS
Our solution provides seeds (for reciprocal and square root reciprocal) with an accuracy of at most 1/2 LSB with words on n + g bits. This corresponds to a maximum error of at most 2 −(n+g+1) . In order to verify this accuracy, we present in Table II , given in the last page of the paper, the minimum accuracy (corresponding to the maximum error) and the average accuracy (corresponding to the average error) reported by the generator (using an exhaustive check).
A. Comparison to Direct Lookup Table
In order to compare our method with standard solutions, we implemented seeds based on direct lookup table on FPGAs. The generated direct lookup table architecture is a ROM description with n-bit addresses and n + g + 1-bit words. The stored values provide 1/2 LSB accuracy for all inputs as in operators generated by seedgen. When this ROM description is implemented on FPGAs, the synthesis tool performs a lot of logical optimization based on the content of the ROM. This leads to significantly smaller and faster architecture than using unoptimized table with 2 n × (n + g + 1) bits. 
CONCLUSION
A method for the initial approximation to reciprocal and square root reciprocal functions in hardware was proposed. These initial approximations or seeds are used in the initialization of floating-point division and square root software iterations in order to speedup the computation.
The proposed solution is based on polynomial approximation with specific coefficients and a table lookup generates optimized and synthesizable VHDL operators for various parameters and hardware constraints. Compared to direct lookup table implementation on FPGAs (with content optimization using the synthesis tools), the proposed method leads to 2.5 reduction factor in size and 40% speed improvement.
