Abstract-A new method for the high-speed computation of double-precision floating-point reciprocal, division, square root, and inverse square root operations is presented in this paper. This method employs a second-degree minimax polynomial approximation to obtain an accurate initial estimate of the reciprocal and the inverse square root values, and then performs a modified Goldschmidt iteration. The high accuracy of the initial approximation allows us to obtain double-precision results by computing a single Goldschmidt iteration, significantly reducing the latency of the algorithm. Two unfolded architectures are proposed: the first one computing only reciprocal and division operations, and the second one also including the computation of square root and inverse square root. The execution times and area costs for both architectures are estimated, and a comparison with other multiplicative-based methods is presented. The results of this comparison show the achievement of a lower latency than these methods, with similar hardware requirements.
Index Terms-Computer arithmetic, Goldschmidt iteration, table-based methods, double-precision operations, division, square root, inverse square root. ae 1 INTRODUCTION R ECIPROCAL, division, square root, and inverse square root are important operations for several applications such as digital signal processing, multimedia, computer graphics, and scientific computing [7] , [12] , [18] , [22] . Although they are less frequent than the two basic arithmetic operations, the poor performance of many processors when computing these operations can make their overall execution time comparable to the time spent performing addition and multiplication [16] .
For a low precision computation of these functions, it is possible to employ direct table look-up, bipartite tables [2] , [21] (table look-up and addition), or low-degree polynomial or rational approximations [11] , [15] , [23] , [24] , but the area requirements become prohibitive for table-based methods when performing high precision computations (such as the 53-bit accuracy double-precision floating-point format). Iterative algorithms are employed for these calculations. On one hand, digit-recurrence methods [4] , [8] , such as the SRT algorithm, result in small units, but their linear convergence sometimes leads to long latencies and makes them inadequate methods for these computations. High-radix digit-recurrence methods [13] result in faster but bigger designs. On the other hand, multiplicative-based methods [3] , [6] , [7] , such as the Newton-Raphson and Goldschmidt algorithms, have quadratic convergence, which leads to faster execution times, usually at the expense of greater hardware requirements.
These methods employ an initial table-based low-precision approximation (seed value) of the final result to reduce the number of iterations to be performed, thus reducing the overall latency of the algorithm.
A new multiplicative-based method is proposed in this paper. Our method combines an efficient second-degree minimax polynomial approximation [20] to obtain the seed values and a modified Goldschmidt iteration to provide double-precision floating-point results. The high accuracy (about 30-bits of precision) of the second-degree approximation allows the computation of a single iteration, significantly reducing the overall latency of the standard Goldschmidt algorithm. On the other hand, the modification performed in the iteration allows an important reduction on the hardware requirements. We have chosen this algorithm instead of Newton-Raphson due to its higher intrinsic parallelism, which leads to lower execution times.
The second-degree minimax polynomial approximation [20] consists of three look-up tables storing C 0 , C 1 , and C 2 , the coefficients of a second-degree polynomial. This approximation combines the speed of linear approximations and the reduced area of the second-degree interpolations. The lookup tables are addressed with the upper part of the significand input, X 1 (having a wordlength of m 1 ¼ 9 bits), and the evaluation of the quadratic polynomial is carried out by a specialized squaring unit and a multioperand adder. When obtaining the initial approximation for several different functions, only the look-up tables storing the coefficients must be replicated because the squaring unit and multioperand adder can be shared for the different computations.
Since there are two different implementations of the Goldschmidt algorithm for division and square root functions, we propose two architectures in this paper. The first one deals only with the computation of reciprocal and division operations, with a very low latency and reasonable hardware requirements. The second proposed architecture also deals with the computation of square root and inverse square root, with an increased latency (but still a fast execution time compared with previous methods) and about the same amount of hardware as the first scheme. The modification performed in the Goldschmidt iteration for computing square root and inverse square root allows the reutilization of the logic blocks employed in the first architecture, saving an important amount of area.
The rest of this paper is structured as follows: In Section 2, a brief description of the algorithm is presented, including an overview of the second-degree minimax approximation; the two unfolded architectures are proposed and implementation details are explained in Section 3; estimates of the execution time and hardware requirements for our proposed architectures are presented and a comparison with some previous multiplicative-based methods is outlined in Section 4; finally, the main contributions made in this work are summarized in Section 5.
ALGORITHM
The method proposed in this paper deals with the computation of the reciprocal function (1=X), division (Y =X), square root ( ffiffiffiffi ffi X p ), and inverse square root (1= ffiffiffiffi ffi X p ) for input operands in the IEEE double-precision floatingpoint format. With this format, a floating-point number M is represented using a sign bit s m , an 11-bit biased exponent e m , and a 53-bit significand X. If M is a normalized number, it represents the following value:
where X ¼ 1 þ f m , 1 X < 2, and f m is the fractional part of the normalized number (the 52-bit stored word). The computation of these functions is performed only for the input significand, Z ¼ fðXÞ, since the sign and exponent treatment is straightforward and can be performed in parallel.
Our method consists of the following steps:
. Computing an initial approximation R f % fðX XÞ, withX X a truncated version of the input operand, accurate to 30 bits for the reciprocal and to 29 bits for the inverse square root, 1 by employing a seconddegree minimax polynomial approximation. . Performing a modified Goldschmidt iteration, employing R f as a seed, to produce the final doubleprecision result Z ¼ fðXÞ. In the next subsection, we briefly describe the seconddegree minimax approximation employed (a detailed description can be found in [20] ). After this description, the modified Goldschmidt algorithm for the computation of reciprocation/division and also the one for the computation of square root/inverse square root will be presented.
Second-Degree Minimax Polynomial Approximation
A new method for the computation of powering function (X p ), in a single-precision floating-point format, by means of a second-degree minimax polynomial approximation with table look-up and multioperand accumulation, has been proposed in [20] . This algorithm allows an important reduction in the total area regarding linear approximations, with no delay increase. It combines the speed of first-degree approximation methods [23] and the reduced size of second-degree interpolation algorithms [1] , [10] . Since reciprocal and inverse square root functions are specific cases of powering, this second-degree approximation can be effectively employed to obtain accurate initial estimates for both functions. These estimates are the seed values required for the modified Goldschmidt algorithms (R d ¼ 1=X X will be the 30-bit seed value for division and reciprocal computation, while R s ¼ 1= ffiffiffiffi f X X p will be the 29-bit initial value employed for the square root and inverse square root computation). Since the minimax approximation is known to be the optimal polynomial approximation of a function [15] , we take advantange of its high accuracy to significantly reduce the wordlengths of the coefficients to be employed, reducing the size of the required look-up tables.
As shown in Fig. 1 , the n-bit binary input significand X is split into an upper part X 1 , a middle part X 2 , and a lower part X 3 . 1 . These values will be justified in Sections 2.2 and 2.3, where the error computation for the Goldschmidt algorithm is explained. 
An approximation to X p in the range X 1 X < X 1 þ 2 Àm 1 can be performed evaluating the expression
where the coefficients C 0 , C 1 , and C 2 are obtained employing the computer algebra system Maple [25], which performs minimax approximations using the Remez algorithm. X 3 is not involved in any of these computations, having no effect on the value of the initial approximation R f . The values of these coefficients depend only on the value of X 1 , the m 1 most significant bits of X, and on the parameter p. Therefore, C 0 , C 1 , and C 2 can be stored in lookup tables of 2 m 1 input values.
2
Apart from the table look-up, the quadratic polynomial (1) has to be efficiently evaluated. There are two main problems to be overcome: the squaring (X 2 2 ) and the two multiplications plus two additions to be computed. Rather than having a multiplier to compute the squaring function, some important properties can be used to significantly reduce the amount of area of the resulting unit and its associated delay. The techniques employed to obtain this area and delay reduction are considering the leading zeros of X 2 , rearranging the matrix of partial products, and truncating this matrix within the error bounds. These strategies allow the achievement of lower delays and of a total area slightly over half of the area of the corresponding multiplier unit [20] . For the computation of the quadratic polynomial, we propose employing a unified multioperand accumulation tree. The inputs of this adder are the partial products of both C 1 X 2 and C 2 X 2 2 , plus the coefficient C 0 . Signed-digit radix 4 (SD-4) representation [14] of the multiplier operands is employed to reduce the number of partial products to be accumulated.
Error Computation
The precision of the initial estimates obtained by computing the second-degree minimax approximation can be calculated taking into account the two main error sources: the error in the minimax approximation itself ( approx ) and the error due to the employment of finite arithmetic in the computation of the quadratic polynomial. The error in the initial approximation is:
Since the error due to the finite wordlength of the coefficients is obtained as a result of the Maple program implementing our method [20] :
therefore:
The minimum value of m 1 that allows the achievement of the target precision ( R d < 2 À30 and R s < 2 À29 ) is m 1 ¼ 9 for both cases. Once this value has been set, the minimax approximation is computed by employing Maple, minimizing the wordlengths of the coefficients C 0 , C 1 , and C 2 within the error bounds.
The wordlength of the coefficients is 30, 20, and 12 bits, respectively, when computing the reciprocal approximation, while 29, 18, and 10 bits when computing the inverse square root. Therefore, the bus sizes have to be set to 30, 20, and 12 bits. The size of the tables to be employed for the reciprocal computation is 2 9 Â ð30 þ 20 þ 12Þ ¼ 31Kb, i.e., less than 4KB. For the inverse square root computation, since the input operand X is in the range 1 X < 4, two sets of tables are needed, resulting in a table size of 2 Â 2 9 Â ð29 þ 18 þ 10Þ ¼ 57Kb, i.e., about 7KB. The total table size is therefore of 31 þ 57 ¼ 88Kb, i.e., 11KB.
The target precision in the worst case (reciprocal approximation, with jC 1 j max ¼ jC 2 j max ¼ 1) sets a minimum value both of m 2 and of the position j where the squaring operation X 2 2 is truncated [20] . These values are m 2 ¼ 33, which makes X 2 to have a wordlength of m 2 À m 1 ¼ 24 bits and j ¼ 34, which means that no bits of X after the position with weight 2 À25 are required for the squaring computation. The wordlength of X 2 2 will therefore be 16 bits. In spite of the fact that the wordlengths of the coefficients C 1 and C 2 are shorter than those of X 2 and X 2 2 , respectively, we employ these values as multipliers in the partial product generation for various reasons: The most important one is that the assimilation of X 2 2 from CS to binary representation can be avoided and substituted by a CS to SD-4 recoding; it is also interesting that the recoding of X 2 from binary to SD-4 representation can be performed in parallel with the squaring operation and the table look-up, introducing no extra delay in the critical path. The size of the fused accumulation tree remains the same, regardless of which values are employed as multipliers and which ones as multiplicands.
The total number of partial products to be accumulated is 21 (eight from the C 2 X 2 2 product, 12 from C 1 X 2 , plus the independent coefficient C 0 ), which causes the multioperand adder tree to have four levels of carry-save adders, as shown in Fig. 1 , with a total delay of
with t pp gen the delay of the partial product generation stage. The result is provided in CS representation.
Modified Reciprocal/Division Goldschmidt Algorithm
Assume two n-bit inputs Y and X satisfying 1 Y ; X < 2. The Goldschmidt algorithm [3] for computing the division operation (Z ¼ Y =X) consists of finding a sequence
and, therefore,
The reciprocal (Z ¼ 1=X) can be computed as a specific case of division:
The high accuracy (30 bits) of the first factor
obtained by employing the second-degree minimax approximation, guarantees that a single Goldschmidt iteration suffices to obtain double-precision results. This is due to the fact that the error i on the iteration i follows the quadratic relation [3] :
for the division operation. Our method for computing reciprocal and division operations consists of a reorganization of the steps to be performed in the traditional Goldschmidt algorithm and can be summarized as follows:
. Obtaining a seed value R d ¼ K 1 by performing a second-degree minimax approximation of the reciprocal (1=X X). Since R d has an accuracy of 30 bits,
when computing the reciprocal. After performing the last computation, the maximum absolute error Z is proportional to
, but also includes the error ( comp ) due to the employment of finite arithmetic in the computation of the steps of our method:
Let Gt , V t , and Zt be the errors introduced by employing units with wordlength t. The value of t has to be set so that double-precision results accurate to 1 ulp are guaranteed. The value of comp is:
Since the range of the results for the division operation is 0:5 < Z < 2, the error in the final result must be bounded by:
Assuming that Z is obtained in the last step after performing rounding to the nearest of the final result ( Zt 2 À55 ), the target precision can only be met, according
The bounds on Gt and V t can be met by employing truncation on the multipliers which carry out with G d and V d computation if a t ¼ 57-bit wordlength is employed. Another alternative is employing rounding to nearest and a t ¼ 56-bit wordlength, which guarantees the same bound on the errors.
Modified Square Root/Inverse Square Root Goldschmidt Algorithm
Assume an n-bit input operand X satisfying 1 X < 2. In this case, a sequence K 1 ; K 2 ; K 3 ; . . . should be found such that
We follow the same strategy in this case as that employed for the reciprocal/division computation: providing an accurate initial approximation obtained by performing a second-degree minimax approximation and then performing a single Goldschmidt iteration to obtain the double-precision results Z ¼ ffiffiffiffi ffi
. This is possible due to the quadratic convergence of the Goldschmidt algorithm [3] :
for inverse square root computation and
for square root computation. In the traditional Goldschmidt algorithm for square root/inverse square root computation, both the seed value
and its square 1=X X are required. Two look-up tables can be employed to obtain these values, but these tables must be designed so that each table look-up value 1=X X corresponds at full target precision to the square of the 
while, for the inverse square root, we have
The worst case in the error computation corresponds to the inverse square root operation. Since the output range for this function is 0:5 < Z 1, the maximum absolute error Z is bounded in this case by:
In this expression, the effect of employing a finite wordlength datapath has already been taken into account.
Since the values of Zt 2 À55 , Gt 2 À57 , and V t 2
À57
have been set by the error bounds in the division and reciprocal computation, the only bound to be determined here is R s . According to (5), a maximum absolute error in the initial approximation of Rs < 2 À29 suffices for guarantees the achievement of double-precision results with an error of less than 1 ulp for the four functions to be computed.
ARCHITECTURE
In this section, we propose an unfolded architecture for the computation of reciprocal and division operations and another unfolded architecture for the computation of all four functions: reciprocal, division, square root, and inverse square root. The design of both architectures is based on the method and error analysis presented in the previous section.
Pipelining could be applied to both methods, resulting in the corresponding pipelined architectures. Furthermore, both methods could also be implemented in a pipelined multiplier-based processor, such as the AMD-K7 or similar [17] .
The final results produced are guaranteed to be accurate to 1 ulp. As happens to other multiplicative-based methods, it is not possible to directly obtain exactly rounded results without an important overhead (an error less than 2 À2n or 2 À3n should be provided, for n-bit operands, depending on the function to be computed). Another alternative is to produce a result accurate to 1 ulp (as we guarantee) and determine the exactly rounded value by computing the corresponding remainder. This strategy is employed by the AMD-K7 [17] , computing rem ¼ X Â Z À Y for the division operation and rem ¼ Z Â Z À X for the square root. The sign of rem, together with the information on the rounding mode to be employed, determines the rounding direction and guarantees the computation of exactly rounded IEEE double-precision results.
Reciprocal/Division Unit
The steps to be performed for computing division and reciprocal operations can be summarized as:
where G d and V d can be computed in parallel and G f ¼ G d when computing division operation, while G f ¼ R d when computing the reciprocal 1=X. Fig. 2 shows the block diagram of the unit implementing this algorithm. The m 1 ¼ 9 most significant bits of the input operand X are employed to address the second-degree approximation look-up tables. The squaring operation is performed in parallel with the table look-up and all partial products of polynomial (1) are accumulated with a multioperand adder, whose output R d is in CS representation.
The 30-bit operand R d is employed as a multiplier in the computation of both G d and V d . Therefore, a CS to SD-4 recoding unit is required. When computing the reciprocal operation, R d is also employed as an addend and multiplicand in the computation of Z, so its two's-complement representation is needed. This assimilation to nonredundant representation is carried out by a CPA in parallel with the CS to SD recoding operation and the computation of G d and V d .
The computation of G d is carried out by a 30 Â 53 bit multiplier unit with SD representation for the multiplier operand and two's-complement rounded to nearest output. Nonredundant representation is used because G d is going to be employed as addend and multiplicand in the multiply-add unit which computes Z. The computation of V d is carried out by a 30 Â 53 bit multiply-complement unit, also with an SD representation for the multiplier and a two's-complement output. Since jV d j 2 À29 , it has at least 29 leading zeros (or ones), only one of them needed as sign bit. Therefore, V d has only 56 À 28 ¼ 28 significant bits and an important amount of area can be reduced by omitting the 29 most significant columns of the multiplication matrix. Besides, the short wordlength of V d allows the employment of a smaller multiply-add unit for the computation of Z.
A 56-bit 2:1 multiplexer is employed to select between G d and R d . When computing division, G f should be chosen as G d , so the control signal ct1 has to be set to the appropriate value (ct1 ¼ 0), while ct1 ¼ 1 sets G f ¼ R d and the computed operation is Z ¼ 1=X.
The computation of Z is carried out by a 28 Â 56 þ 56 bit multiply-add unit with two's complement inputs and output. The recoding from nonredundant representation to SD-4 is performed internally in this unit. The output of this unit is the 53-bit final result Z ¼ Y =X, when computing division, or Z ¼ 1=X, when computing the reciprocal operation.
There are two main paths in this architecture since G d and V d are computed in parallel. Anyway, as both units are 30 Â 53 bit multipliers, the critical path consists of the second-degree approximation look-up tables, the fused accumulation tree with CS result, the CS to SD recoding unit, a 30 Â 53 bit multiplier (G d computation), a 2:1 multiplexer, and a 28 Â 56 þ 56 bit multiply-add unit (Z computation).
Reciprocal/Division/Square Root/Inverse
Square Root Unit
The steps to be performed for computing the square root or the inverse square root operation with our method can be summarized as:
where G s and V s obviously cannot be computed in parallel and G f ¼ G s when computing the square root operation, while G f ¼ R s when computing the inverse square root 1= ffiffiffiffi ffi X p . The modified Goldschmidt iteration for square root and inverse square root computation has been designed so that it involves similar steps as the ones required in the reciprocal and division computation. The logic blocks employed in the previously proposed architecture can thus be reutilized in a complete architecture computing the four operations. A diagram block of this architecture (i.e., the architecture that deals with the computation of reciprocal, division, square root, and inverse square root operations) is shown in Fig. 3 , together with a summary of the control signals encoding for an appropriate behavior of the unit. Table 1 shows the operations performed in each logic block, depending on the function to be computed.
The second-degree approximation of the inverse square 
SD representation of
The computation of G s requires a 30 Â 53 bit multiplication, as the one in the reciprocal/division architecture. Since R f is selected by ct0 and G ¼ R f H, another multiplexer controlled by ct0 is needed to select between H ¼ Y and H ¼ X as multiplicand operand. The output result is a 56-bit rounded to the nearest value in nonredundant representation.
A 30 Â 56 bit multiply-complement unit is required for the computation of V . For the computation of square root/ inverse square root, V s ¼ 1 À R s G s , while, for reciprocal/ division computation, we had
is already selected at the output of the accumulation tree, it is possible to use the same control signal ct0 to select between W ¼ X and W ¼ G s as multiplicand operand for this unit, as shown in Table 1 .
The final 28 Â 56 þ 56 bit multiply-add operation required for the computation of Z is Z ¼ G f þ G f V s =2 for the square root/inverse square root computation, while we had
Since the computation of V is on the critical path of the new architecture, the best solution is inserting multiplexers which select between G f and G f =2 as the multiplicand and addend operand. G f takes the value of R f or G, depending on the control signal ct1, and the input operand to the unit is selected between G f or G f =2, depending on the control signal ct0. Therefore, as shown in Fig. 3 and Table 1 , the combination of the control signals ct0 and ct1 allows the computation of
. The critical path of this architecture consists of the second-degree approximation look-up tables, a multiplexer to select between reciprocal and inverse square root approximation coefficients, the fused accumulation tree with CS result, the CS to SD recoding unit, a 30 Â 53 bit multiplier (G computation), a 30 Â 56 bit multiplier-complement unit (V computation), and a 28 Â 56 þ 56 bit multiplier-add unit (Z computation). The execution time of this unit is slower than the one for reciprocal/division architecture since the computation of V is now included in the critical path, but the hardware requirements remain similar due to the reutilization of the main logic blocks for the new computations. 
EXECUTION TIME AND AREA COST ESTIMATES AND COMPARISON
In this section, estimates of the execution time and the area costs of the proposed architectures are presented and a comparison with some of the most efficient previous multiplicative-based methods for the computation of reciprocal, division, square root, and inverse square root operations is outlined. These estimates are based on a rough model for the cost and delay of the main logic blocks employed, taken from [5] , [26] . In this model, the unit employed for the delay estimates is (, the delay of a complex gate, such as a full-adder, while the area cost estimates are expressed as a multiple of fa, the area of one full-adder. The interconnections between logic blocks are not included in this model. A detailed explanation of the assumptions made can be found in [19] . Table 2 shows the delay and area estimates for the main blocks employed in our design and the architectures to be compared. These logic blocks are conventional multipliers, look-up tables, the multioperand adder, and the specialized squaring unit employed in our minimax approximation, recoding units, multiplexers, and SD adders and multipliers. The conventional multipliers are supposed to employ SD-4 recoding [14] of the multiplier operand, a Wallace accumulation tree, and a final CPA assimilation stage, when needed. The delay estimates for the look-up tables are taken from a previous model [5] and have been validated by synthesis results obtained from implementation using a family of standard gates from the AMS 0:35"m CMOS library. These logic blocks are usually the most difficult ones to model since they cannot be easily described in terms of full-adders.
The actual execution times and hardware requirements of the compared methods depend both on the technology employed and on the actual implementation, but, since all the schemes employ similar logic blocks, 3 the relative values may express a general trend among the different designs and make a good first-order approximation to the actual execution times and area cost values.
Execution Time Estimates
Recalling that the critical path of the unit for reciprocal/ division computation consists of the initial 9-input bit lookup table, the fused accumulation tree with CS output, a CS to SD recoding unit, a 30 Â 53 bit multiplier (G d computation), a 2:1 multiplexer, and a 28 Â 56 þ 56 bit multiply-add unit (Z computation), the execution time can be estimated as 34:5(, as shown in Fig. 4 . The critical path for the complete unit includes an extra multiplexer to select between the reciprocal and inverse square root initial approximations and another 30 Â 56 bit multiplication (V computation). Summarizing, this critical path consists of the initial look-up table, a 2:1 multiplexer, the fused accumulation tree with CS output, a CS to SD recoding unit, two 30 Â 56 bit multipliers, another 2:1 multiplexer, and a 28 Â 56 þ 56 bit multiply-add unit and, therefore, its execution time can be estimated as 45(, as shown in Fig. 4 .
Area Cost Estimates
Recalling the size of the look-up tables and the combinational logic blocks employed in the architecture implementing our method for the computation of all four functions (reciprocal/division/square root/inverse square root), we can summarize its hardware requirements as two 9-input bit sets of look-up tables (initial approximation), a specialized squaring unit, a fused accumulation tree, two 30 Â 56 bit multipliers, and a 28 Â 56 þ 56 bit multiply-add unit. One of the 30 Â 56 bit multipliers (the multiply-complement unit employed for V computation) can have an important area reduction (about 30 percent) by taking advantage of the bound jV j 2 À28 . This property allows the elimination of the 28 most significant colums of the multiplication matrix.
According to the estimates given in Table 2 , the total area required by our method can be estimated as around 7; 345fa, with a contribution of 3; 080fa from the 11KB look-up tables and of 4; 265fa from the combinational logic blocks employed, as shown in Fig. 5 . Table 3 shows the execution time and area cost estimates for the following methods: Newton-Raphson traditional algorithm (NR), Wong and Goto (WG) [26] , Ito, Takagi, and Yajima (ITY) [9] , Ercegovac, Muller, Lang, and Tisserand (EMLT) [5] , and AMD-K7 implementation (AMD) [17] , together with the estimates for our architectures. All these methods provide results accurate to 1 ulp. For obtaining exactly rounded results, a remainder should be computed. 4 Fig . 4 details the execution time 3. All compared methods employ a similar number of tables and multipliers of similar sizes. Therefore, comparison results are relatively technology-independent, and the impact of considering interconnections between blocks can be neglected.
Comparison with Other Methods
4. The AMD-K7 actual implementation includes this remainder computation and final rounding stage. For comparison reasons, the estimates shown here do not include both features and a nonpipelined (unfolded) architecture is considered.
estimates for these methods, while Fig. 5 gives an explanation of the area cost ones.
In the NR algorithm that we are considering for comparison, bipartite tables [2] are employed to obtain a seed value and then two NR iterations are performed. The WG method employs a polynomial approximation combined with table look-up to obtain double-precision results. The ITY method employs a linear approximation to obtain the initial value and a single iteration of a third-degree convergent algorithm is performed, involving three 56 Â 56 bit multiplications. In ELMT, small multipliers (15 Â 15 and 15 Â 45 bit multipliers) are employed, after an initial reduction stage, to compute the functions. The main drawback of this method is the size of the required lookup tables, which makes its hardware requirements prohibitive (its area cost is about three times bigger than the other ones). The AMD-K7 method is an implementation of the traditional Goldschmidt algorithm: It employs bipartite tables to obtain the initial approximation and then computes two Goldschmidt iterations to calculate the double-precision results. A detailed description of any of these methods can be found in [19] .
Hardware reutilization is a strategy employed for some of the previous methods (NR, WG, ITY, and AMD), and can also be employed for our algorithm, allowing a significant reduction in the total area at the expense of increasing the 
This strategy leads to a reduced total area of about 5; 030fa (3; 080fa for the look-up tables, 600fa for the accumulation tree and the squaring unit of the minimax approximation, and 1; 350fa for the reused 30 Â 56 þ 56 multiply-add unit), as shown in Fig. 5 .
Our method presents a speed-up of about 25 percent over the previous methods when computing the reciprocal function. The speed-up increases to about 40 percent when computing division, due to the fact that most methods require an extra multiplication for division computation, while our method and WG provide the division result with the same execution time as for the reciprocal computation. Our execution times for square root and inverse square root computation are also faster than the previous ones (with the only exception of ELMT, but the area requirements of this method are prohibitive). AMD is the method with lower hardware requirements (taking into account that it employs an existing multiplier, as NR and ITY). However, our method has a similar area cost, especially if we employ a single multiply-add unit to compute G, V , and Z.
As has been said, the actual speed-ups and area ratios depend on the technology employed and the implementation, but the estimated values presented here are reliable approximations since all compared methods employ similar logic blocks and look-up tables of similar sizes (except for ELMT). This means that the speed-ups obtained come directly from the reduction on the latency achieved by performing a single Goldschmidt iteration.
CONCLUSION
A new method for the efficient computation of doubleprecision floating-point reciprocal, division, square root, and inverse square root functions has been presented in this paper. This method employs a second-degree minimax polynomial approximation, with table look-up, a specialized squaring unit, and a multioperand adder, to obtain an accurate (around 30 bits) initial approximation R f of the reciprocal and the inverse square root values. The specialized squaring unit and the accumulation tree are shared for the computations of the reciprocal (R d ) and inverse square root (R s ) initial approximations. The size of the employed look-up tables is 11KB.
After obtaining R f , a modified Goldschmidt iteration is performed to obtain the double-precision result. A single iteration is required, due to the high accuracy of the initial approximation, which significantly reduces the latency of the algorithm. The modified Goldschmidt iterations for computing reciprocal/division and for computing square root/inverse square root have been combined to allow the design of a single architecture efficiently computing the four operations. A faster architecture computing only reciprocal and division has also been proposed.
A rough model for the delay and area cost of the main logic blocks employed has been outlined and the execution time and area costs for both architectures have been estimated: 34:5( for reciprocal/division computation and 45( for square root/inverse square root computation. The total amount of area required for the complete architecture is about 5; 030fa when employing hardware reutilization strategies.
For comparison purposes, other efficient methods for the computation of these functions have been described and their execution times and area costs have also been estimated. Comparison results show that our method achieves an important speed-up over previous methods, with similar hardware requirements. . For more information on this or any computing topic, please visit our Digital Library at http://computer.org/publications/dlib.
