Abstract-A table-based method for high-speed function approximation in single-precision floating-point format is presented in this paper. Our focus is the approximation of reciprocal, square root, square root reciprocal, exponentials, logarithms, trigonometric functions, powering (with a fixed exponent p), or special functions. The algorithm presented here combines table look-up, an enhanced minimax quadratic approximation, and an efficient evaluation of the second-degree polynomial (using a specialized squaring unit, redundant arithmetic, and multioperand addition). The execution times and area costs of an architecture implementing our method are estimated, showing the achievement of the fast execution times of linear approximation methods and the reduced area requirements of other second-degree interpolation algorithms. Moreover, the use of an enhanced minimax approximation which, through an iterative process, takes into account the effect of rounding the polynomial coefficients to a finite size allows for a further reduction in the size of the look-up tables to be used, making our method very suitable for the implementation of an elementary function generator in state-ofthe-art DSPs or graphics processing units (GPUs).
INTRODUCTION
T HE increasing speed and performance constraints of computer 3D graphics, animation, digital signal processing (DSP), computer-assisted design (CAD), and virtual reality [7] , [12] , [14] , [26] , have led to the development of hardware-oriented methods for high-speed function approximation. An accurate and fast computation of division, square root, and, in some cases, exponential, logarithm, and trigonometric functions has therefore become mandatory in platforms such as graphics processing units (GPUs), digital signal processors (DSPs), floating-point units of generalpurpose processors (FPUs), or application-specific circuits (ASICs) [18] , [19] , [27] , [39] .
The method presented in this paper is a table-driven algorithm based on an enhanced minimax quadratic approximation which allows such high-speed computations in single-precision (SP) floating-point format. High-fidelity audio and high-quality 3D graphics or speech recognition, among other applications, require the use of singleprecision FP computations [7] . Thus, while 10-16 bit approximations were accurate enough in early graphics cards, with the evolution of graphics applications, higherprecision computations have become mandatory.
For instance, a frequent operation in 3D rendering is the interpolation of attribute values across a primitive, colors and texture coordinates being common attributes requiring such interpolation. To obtain perspective correct results, it is necessary to interpolate a function of the attributes, rather than the attributes directly, and the interpolated result is later transformed by the inverse function to get the final value at a desired point in screen space [11] . Thus, the attributes at the primitive's vertices must first be divided through by the homogeneous coordinate w (the coordinate w itself is replaced by 1=w) and these modified attributes are then linearly interpolated to determine the correct value at a given pixel within the primitive. At each pixel, the newly interpolated value is then divided by the interpolated 1=w, via multiplication by w, to obtain the correct final value. This set of operations requires that, per-pixel, the newly interpolated inverse homogeneous coordinate 1=w must be reciprocated to reform w. The precision of this reciprocal operation must be as close as possible to the working precision of the attributes being interpolated to guarantee correct results. In the case of per-pixel texture coordinate interpolation, the coordinates in modern GPUs are computed and stored as single-precision floating-point values. The reciprocal operation itself, therefore, must return a value close to the exactly-rounded single-precision floating-point result. In practice, an error in the computation near 1 ulp for singleprecision is sufficient, while errors greater than this can result in visible seams between textured-primitives.
Other transcendental functions are frequently used in modern vertex and pixel shading programs. Normalizing vectors using the reciprocal square root operator is a common function in lighting shaders. The instruction sets of modern GPUs and APIs, such as Microsoft's DirectX9 [22] , also incorporate exp 2, log 2, sin , and cos . These operations are important to many classes of shading programs and they are all required to produce results as close as possible to the processor's working precision.
The proposed method is suitable for approximating any function fðXÞ and, therefore, can be applied to the computation of reciprocal, square root, square root reciprocal, logarithms, exponentials, trigonometric functions, powering 1 (with a fixed exponent p), or special functions. Furthermore, this method can be used for fixed-point DSP applications with a target precision of 16 to 32 bits as well or to obtain initial approximations (seed values) for higher precision computations (as done in [30] for doubleprecision computation of reciprocal, division, square root, and square root reciprocal).
The input operand X is split into two parts, the most and least significant fields: X ¼ X 1 þ X 2 , X 1 being 6-7 bits wide, and the function fðXÞ is approximated as a quadratic polynomial C 2 X 2 2 þ C 1 X 2 þ C 0 . For each input subinterval, the coefficients of the minimax approximation C 2 , C 1 , and C 0 , which only depend on X 1 , are obtained through an iterative process that takes into account the effect of rounding such coefficients to a finite wordlength and are then stored in look-up tables. The evaluation of the polynomial is performed using redundant arithmetic and multioperand addition, while X 2 2 is generated by a specialized squaring unit with reduced size and delay regarding a standard squarer. The multioperand addition is carried out by a fused accumulation tree, which accumulates the partial products of C 2 X 2 2 and C 1 X 2 , together with C 0 , and requires less area than a ðn Â nÞ-bit standard multiplier while having about the same delay. The rounding scheme is an enhancement of rounding to the nearest and consists of injecting a function-specific rounding-bias before truncating the intermediate result to the target precision. The output results can be guaranteed accurate to 1 ulp, 2 ulps, or 4 ulps, depending on the accuracy constraints of a specific application for any of the functions fðXÞ considered.
The organization of this paper is the following: Background information about hardware-oriented methods is given in Section 2; the main features of our method, error computation, iterative method for selecting the polynomial coefficients, and efficient polynomial evaluation, are described in Section 3; an architecture for the computation of reciprocal, square root, exponential, and logarithm, together with delay and area cost estimates, is shown in Section 4; in Section 5, a comparison with bipartite tables methods and linear and quadratic approximations is presented; finally, the main contributions made in this work are summarized in Section 6.
BACKGROUND
Hardware-oriented methods for function approximation can be separated into two main groups: iterative and noniterative methods. In the first group belong digit-recurrence and online algorithms [8] , [9] , based on the subtraction operation and with linear convergence, and hardware implementations of functional iteration methods, such as Newton-Raphson and Goldschmidt algorithms [10] , [27] , which are multiplicative-based and quadratically convergent. This type of method can be used for both lowprecision and high-precision computations. On the other hand, we have direct table look-up, polynomial and rational approximations [17] , [23] , and table-based methods [2] , [5] , [6] , [15] , [34] , [40] , [42] , suitable only for low-precision operations and usually employed for computing the seed values in functional iteration methods.
When aiming at single-precision (SP) computations, table-based methods have inherent advantages over other types of algorithms. First of all, they are halfway between direct table look-up, whose enormous memory requirements make them inefficient for SP computations, and polynomial/rational approximations, which involve a high number of multiplications and additions, resulting in long execution times. Table- driven methods combine the use of smaller tables with the evaluation of a low-degree polynomial, achieving both a reduction in size regarding direct table look-up and significant speed-ups regarding pure polynomial approximations. The fact of being noniterative makes table-based algorithms also preferable regarding digit-recurrence methods, due to the linear convergence of these conventional shift-and-add implementations, which usually leads to long execution times. Something similar happens to functional iteration methods, despite being quadratically convergent, since, for each iteration, a number of multiplications are involved, resulting again in long execution times. Moreover, functional iteration methods are oriented to division and square root computations only and do not allow computation of logarithm, exponential or trigonometric functions. Table- based methods can be further subdivided, depending on the size of the tables employed and the amount of computations involved, into compute-bound methods [41] , [42] , table-bound methods [6] , [24] , [34] , [35] , [38] , and inbetween methods [1] , [2] , [5] , [15] , [40] .
. Compute-bound methods use table look-up in a verysmall table to obtain parameters which are used afterward in cubic or higher-degree polynomial approximations [41] , [42] . The polynomial is usually evaluated using Horner's rule and an important amount of additions and multiplications is involved. This type of method is favored with the availability of a fused multiply-add unit, as happens in architectures such as PowerPC and Intel IA64 (now IPF) [7] , [20] , [21] . . Table- fast, but their use is limited to computations accurate up to 16-bits (maybe 20-bits) with current VLSI technology, due to the growth in the size of the tables with the accuracy of the result.
2
. In-between methods use medium size tables and a significant yet reduced amount of computation (e.g., one or two multiplications or several small/rectangular multiplications). This type of method can be further subdivided into linear approximations [5] , [40] and second-degree interpolation methods [1] , [2] , [15] , [31] , [36] , depending on the degree of the polynomial approximation employed. The intermediate size of the look-up tables employed makes them suitable for performing single-precision computations, achieving fast execution times with reasonable hardware requirements. As explained above, the best alternative when aiming at SP computations are in-between methods (compute-bound methods may be preferable if the area constraints of a specific application are really tight). Within this group, the main advantage of linear approximations is their speed since they consist of a table look-up and the computation of a multiply-add operation [5] (or a single multiplication, if a slight modification of the input operand is performed [40] ), while the main advantage of quadratic approximations is the reduced size of the look-up tables (around 12-15Kbits per function [1] , [2] , [15] , [36] versus 50-75Kb [5] , [40] in linear approximations).
The synthesis of the main advantages of both linear and conventional quadratic approximations can be reached by using the small tables of a quadratic interpolator and by emulating the behavior of linear approximations, computing the second-degree polynomial with a similar delay to that of an SP multiply operation. Such acceleration can be achieved [29] , [31] , [36] by combining the use of redundant arithmetic, a specialized squaring unit, and the accumulation of all partial products in a multioperand adder.
MINIMAX QUADRATIC INTERPOLATOR
In this section, we propose a table-based method for highspeed function approximation in single-precision floatingpoint format. We consider, in this paper, the computation of reciprocal, square root, square root reciprocal, exponential (2 X ), logarithm (log 2 X), and sine/cosine operations, although the proposed method could be used for the computation of other trigonometric functions (tan , arctan , . . . ), powering (with a fixed exponent p), or special functions. The results can be guaranteed accurate to 1 ulp, 2 ulps, or 4 ulps for any of these operations, depending on the accuracy constraints of a specific application. This algorithm is a generalization and optimization of those previously proposed by Piñ eiro et al. [29] , [31] and by Muller [25] .
Range Reduction and Notation
Three steps are usually carried out when approximating a function [17] , [42] : 1) range reduction of the argument to a predetermined input interval, 2) function approximation of the reduced argument, and 3) reconstruction, normalization, and rounding of the final result. Thus, the approximation is performed in an input interval ½a; bÞ and range reduction is used for values outside this interval. For the elementary functions there are natural selections of the intervals ½a; bÞ which simplify both the steps of range reduction and function approximation, leading to smaller tables and faster execution times, and also allow avoidance of singular points in the function domain [4] , [13] , [36] .
Our method deals with the step of function approximation within a reduced input interval and standard techniques are applied for the range reduction and reconstruction steps. A summary of these well-known range reduction schemes is shown in Fig. 1 , where the following notation is used: An IEEE single-precision floating-point number M x is composed of a sign bit s x , an 8-bit biased exponent E x , and a 24-bit significand X, and represents the value:
where 1 X < 2, with an implicit leading 1 (only the fractional 23-bits are stored). Leading zeros may appear when approximating the logarithm if E x ¼ 0, causing a loss of precision. In that case, 1 log 2 ðXÞ=ðX À 1Þ < 2 is computed instead since it eliminates the leading zeros and the result is later multiplied by ðX À 1Þ, processing the exponent of the result to account for this normalization [36] .
Overview of the Method
As explained in Section 2, a thorough analysis of the design space of hardware-oriented methods for function approximation shows that, when aiming at single-precision computations, the best trade off between area and speed is obtained by using a quadratic interpolator, provided that the degree-2 polynomial is efficiently computed [29] , [31] , 2 . The size of the tables required for SP computations with the SBTM method is over 1; 300Kbits per function [34] . [36]. Therefore, our method for high-speed function approximation consists of a quadratic polynomial approximation, followed by a fast evaluation of the polynomial. A specialized squaring unit, redundant arithmetic, and a multioperand adder are employed to carry out with such evaluation. On the other hand, the high accuracy of an enhanced minimax approximation, which takes into account the effect of rounding the coefficients to a finite wordlength, allows for a significant reduction in the size of the tables required to store the polynomial coefficients regarding similar quadratic interpolators.
As shown in Fig. 2 , in our method, the n-bit binary significand of the input operand X is split into an upper part, X 1 , and a lower part, X 2 :
An approximation to fðXÞ in the range
Àm can be performed by evaluating the expression
where the coefficients C 0 , C 1 , and C 2 are obtained for each input subinterval using the computer algebra system Maple [43] . The method for obtaining those coefficients, corresponding to an enhanced minimax approximation, will be explained in Section 3.4 and is one of the fundamental contributions of this work.
The values of C 0 , C 1 , and C 2 for each function fðXÞ depend only on X 1 , the m most significant bits of X. Therefore, these polynomial coefficients can be stored in look-up tables of 2 m entries. 3 The exact value of m depends on the function to be computed and on the target precision. For instance, it will be shown in Section 3.4 that, for approximating the square root and exponential, with an accuracy of 1 ulp in a singleprecision format, m ¼ 6 suffices, while m ¼ 7 is necessary for computing the reciprocal, logarithm, and reciprocal square root. Note that minimizing m is crucial to the area requirements of an interpolator since a high-accuracy approximation that allows for using m 0 ¼ m À 1 would require look-up tables with 2 mÀ1 entries, that is, half the number of entries in the table, and, therefore, tables with roughly half the size of the original ones.
The main features of our method can be summarized as:
. An enhanced minimax approximation is computed to obtain the sets of coefficients of a quadratic interpolation for the considered function, within the error bounds set by a specific application. Such an approximation consists of 3-passes of the minimax approximation in order to compensate for the rounding errors introduced by having finite-size coefficients. This enhanced minimax approximation is performed with the computer algebra system Maple.
The value of m is minimized to allow for a significant reduction in the size of the look-up tables storing the coefficients.
-
The wordlengths of the polynomial coefficients are also minimized, after m has been set, to allow for a further reduction in the table size and also in the size of the logic blocks to be used in the polynomial evaluation.
The range of the coefficients is noted in order to safely remove, if possible, some initial bits from the stored coefficients. These bits are treated as implicit bits and can be concatenated at the output of the look-up tables, allowing for a further slight reduction in their size. . A high-speed evaluation of the degree-2 polynomial is carried out in order to obtain execution times similar to those of linear interpolation methods, which only require the computation of one multiplication (and, possibly, one addition).
-A specialized squaring unit is used for generating X 2 2 . Since this value is used as a multiplier operand in the generation of the partial products of C 2 X 2 2 , its assimilation to nonredundant representation (which would increase the overall delay of the polynomial evaluation) is avoided by using CS to SD recoding instead.
-SD radix-4 recoding of the multipliers is used in the generation of the partial products of C 2 X 2 2
and C 1 X 2 , which allows the reduction by half of the number of partial products to be accumulated. The alternative of using SD radix-8 recoding instead is discussed in Section 3.5.
The accumulation of all partial products of C 2 X 2 2
and C 1 X 2 , together with the degree-0 term C 0 , is carried out by a multioperand adder. -Assimilation into nonredundant representation is performed just once, by a fast adder (a CPA), at the output of the fused accumulation tree. If 3. In some cases, a bigger range is covered for ease of implementation and some extra bit(s) may be necessary for addressing the tables. For instance, when approximating the square root or square root reciprocal, the least significant bit of the exponent is employed to select between the tables covering the intervals ð1; 2Þ and ð2; 4Þ. two multiplications and two additions were performed, several sequential assimilations would be necessary, slowing down the polynomial computation significantly. . Rounding is performed by injecting a function-specific rounding bias before performing truncation of the result to the target precision. This is an enhancement of rounding to the nearest since it subsumes the former, but also allows for reducing the maximum absolute error for each specific function and compensating for the effect of truncation errors in the polynomial evaluation, such as using finite wordlength in the computation of X 2 2 . This rounding scheme is not suitable for higher-precision computations since exhaustive simulation is required for the choice of the rounding bias. The results of the function approximation obtained with our quadratic interpolator can be guaranteed accurate to 1 ulp, which allows faithful rounding for the computation of reciprocal, square root, square root reciprocal, and exponential. 4 For many applications, guaranteeing faithful rounding suffices and speeds up the computations, while significantly reducing the hardware requirements of the circuit. For instance, it is a common practice in computer graphics applications not to perform exact rounding since the inherent loss of precision does not lead to degradation in the quality of the results, usually employed only for displaying purposes [14] , [18] .
Error Computation
The total error in the final result of the function approximation step can be expressed as the accumulation of the error in the result before rounding, " interm , and the rounding error, " round :
where r depends on the input and output ranges of the function to be approximated and on the target accuracy and defines a specific bound on the final error. The error on the intermediate result comes from two sources: the error in the minimax quadratic approximation itself, " approx , and the error due to the use of finite arithmetic in the evaluation of the degree-2 polynomial:
The error of the approximation, " approx , depends on the value of m and on the function to be interpolated. As pointed out in Section 3.2, the minimum value of m compatible with the error constraints must be used in order to allow for reduced size look-up tables storing the polynomial coefficients.
The Maple program we use for obtaining the coefficients (see Section 3.4) gives the contribution to the intermediate error of the minimax approximation performed with rounded coefficients and, therefore, we can define:
and, since " X 2 ¼ 0, in this case:
with " squaring ¼ jC 2 j" X 2 2 . On the other hand, " round depends on how the rounding is carried out. Conventional rounding schemes are truncation and rounding to the nearest. If using truncation of the intermediate result at position 2 Àr , the associated error would be bounded by " round 2
Àr , while, if performing rounding to the nearest by adding a one at position 2 ÀrÀ1 before such truncation, the rounding error would be bounded by 2
ÀrÀ1 instead. The rounding scheme employed in our minimax quadratic interpolator is an enhancement of rounding to the nearest, made possible by exhaustive simulation, and consists of adding a function-specific rounding bias before performing the truncation of the intermediate result at position 2
Àr . A judicious choice of the bias, based on the error distribution of the intermediate result for each specific function, allows reduction of the maximum absolute error in the final result while compensating for the truncation carried out in the squaring unit. The injection of such a bias constant can be easily performed within the accumulation tree, as will be explained in Section 3.5.
Summarizing, the total error in the final result can be expressed as:
with AE" round instead of þ" round due to the ability of our rounding scheme to compensate in some cases for the truncation errors in both the squaring computation and the enhanced minimax approximation.
Enhanced Minimax Approximation
In this section, we describe the algorithm for obtaining the coefficients of the quadratic polynomial to be stored in the look-up tables. This algorithm consists of three passes of the minimax approximation in order to compensate for the errors introduced by rounding each of the polynomial coefficients to a finite precision. More information about the minimax approximation can be found in [23] , [25] , [29] , [31] , [33] . Fig. 3 describes the heuristic employed to obtain the minimum value of m which guarantees results accurate to the target precision, the set of coefficients C 0 ; C 1 ; C 2 with minimum wordlengths ðt; p; qÞ, and the function-specific rounding bias. As Step 2 in our heuristic, a 3-passes hybrid algorithm is computed by using Maple, as shown in Fig. 4 . This hybrid algorithm will be explained in detail below. As
Step 3, exhaustive simulation across the approximation interval is performed using a C program which describes the hardware functionality of our quadratic interpolator. Note that exhaustive simulation is possible due to the fact of dealing with single-precision computations.
When the configuration which leads to a lower table size has been obtained (optimal configuration), the binary representation of the coefficients is generated to be used in the synthesis process. All coefficients are kept in fixedpoint form and the range of the coefficients is noted in order to safely remove some bits from the stored coefficients (to be dealt with as implicit initial bits, which are later concatenated at the output of the tables), allowing a slight further reduction in the table sizes.
The parameter has been introduced in the heuristic described in Fig. 3 in order to be aggressive in the minimization of m and ðt; p; qÞ since it prevents early discarding of configurations which might be allowed later in combination with a specific rounding bias. A value of which allows obtaining a good degree of optimization is ¼ 2
ÀrÀ2 . Note that, at the end of Step 2 in Fig. 3 , there may be several combinations of ðt; p; qÞ which are candidates to be the optimal configuration. In such a case, priority could be given to those combinations which minimize ðp þ qÞ, allowing a higher value for t since the coefficient C 0 is not involved in any partial product generation and, therefore, it has a slightly lower impact on the size of the structures used within the accumulator tree. Alternatively, the bit-patterns of the coefficients for each configuration (in binary representation) can be analyzed to see if there is any configuration which allows treating a higher number of bits as implicit, thus saving some extra bits of storage.
An interesting artifact can occur for some configurations with specific functions: All values of a coefficient share some initial common bits (this may include the sign bit as well) except the one corresponding to the first subinterval. In such a case, the conflictive value can be speculatively substituted by the closest binary number sharing those initial common bits and the exhaustive simulation must be performed again, assuming this new value, to check whether the maximum absolute error has not increased. This trick allows the saving of extra storage of implicit bits in the look-up tables without affecting the accuracy of the approximation. In order to illustrate this artifact, the implicit bits for the optimal configurations obtained with 1 ulp accuracy for some elementary functions are shown in Table 1 . It happens that the value of the coefficient C 0 in the first subinterval for sinðXÞ, with m ¼ 6 and configuration (27; 18; 13) , is À2:235 Á 10 À8 , while all other values of C 0 are positive and belong to the interval ½0; 1Þ. Keeping such a value for the first interval would make storage of both the sign and the integer bits for all C 0 coefficients in the corresponding look-up table necessary. However, this value can be safely substituted by 0 and, since the maximum error of the approximation does not belong in that first subinterval, a pattern þ0:xxxxx can be assumed for C 0 in that case. Thus, the initial þ0: need not be stored since it can be concatenated at the output of the table with the stored :xxxxx bits.
Three-Passes Hybrid Algorithm
A minimax approximation with polynomial coefficients (a 0 , a 1 , a 2 ) of unrestricted wordlength yields very accurate results. However, the coefficients must be rounded to a finite size to be stored in look-up tables and such rounding may significantly affect the precision of the original approximation. Let us illustrate this with an example: Consider the computation of the function 1= ffiffiffiffi ffi X p on the interval ð1; 2Þ, for a random value of m, say 8. At address i in the table, we will find the coefficients of the approximation for the interval ½1 þ i=256; 1 þ ði þ 1Þ=256Þ. Let us see how to compute coefficients for i ¼ 37.
The Maple input line
asks for the minimax approximation in that domain (the variable x represents X 2 , the lower part of the input operand X). We get the approximation > 0.93472998018 + (-0.40834453917 + 0.26644775593 x ) x, with an error " approx ¼ 3:61 Â 10 À9 . However, if the linear coefficient a 1 is rounded to p ¼ 14 bits to become C 1 and the quadratic coefficient a 2 is rounded to q ¼ 6 bits to form C 2 , the approximation error using C 1 and C 2 turns out to be "
À8 , which is much larger than " approx , the error of the approximation using a 1 and a 2 .
A second pass of the minimax approximation allows for the computation of the best approximation among the polynomials with p-bit C 1 coefficients (this new polynomial has coefficients a 0 0 and a 0 2 ). After rounding the newly obtained coefficient a 0 2 to q ¼ 6 bits to form C 2 , the approximation error has been reduced to " 0 approx ¼ 5:01 Â 10 À8 . A third pass of the minimax approximation is necessary to take into account the effect of this rounding as well. The idea is now obtaining the best polynomial among those polynomials whose order-1 coefficient is C 1 and whose order-2 coefficient is C 2 . This is done as follows:
À8 , which is half the error of the original approximation with a single pass of minimax, when the effect of rounding the coefficients to finite wordlengths was not taken into account. Finally, in both cases, the degree-0 coefficient must also be rounded to t bits to be stored in the look-up table corresponding to C 0 .
In Piñ eiro et al. [29] , [31] , an algorithm was proposed consisting of performing these 3-passes of the minimax approximation. However, some numerical problems arose in such an algorithm when considering specific configurations and/or specific functions, originated by the minimax approximation performed as second step. 5 In such a step, an
The conclusion is that certain functions are not readily amenable to being approximated by automated methods since the function needs to be expressable in a particularly simple form.
Those numerical problems referred to were later addressed by Muller [25] by showing that the degree-1 minimax approximation to ffiffiffi ffi L p in the interval ½0; 2 À2d corresponds to
with an error in the approximation of 2 ÀdÀ3 , which makes available an analytical expression for performing the second pass. The algorithm proposed in [25] , however, does not provide a higher overall accuracy since it performs only 2-passes of the minimax approximation instead of three.
Summarizing, we propose here a hybrid algorithm consisting of 3-passes of the minimax approximation, with the second pass performed by making use of the analytical expressions introduced in [25] 1. Using minimax to find the original approximation, with nontruncated coefficients, and rounding the degree-1 coefficient to p bits to obtain C 1 . 
where a 1 , a 2 are the degree-1 and degree-2 coefficients of the original approximation, and rounding a 0 2 to q bits to obtain C 2 . 3. Using minimax to compute the degree-0 coefficient, based on C 1 and C 2 above, and then rounding it to t bits to obtain C 0 . Fig. 4 shows a Maple program which implements our method, for the computation of square root, with m ¼ 6, within the interval ð1; 2Þ. The number of correct bits are obtained as goodbits and the error of the approximation, " 0 approx , is shown as errmax. Table 2 shows a summary of the parameters to be employed for the approximation of some elementary functions with our method, with an accuracy of 1 ulp. The error bounds in the prenormalized result are shown, together with the minimum values of m for each function, the number of fractional bits of the polynomial coefficients in the optimal configurations, the accuracies yielded by such optimal configurations, and the corresponding look-up table sizes for each function.
Table Sizes Obtained with Our Algorithm
Remember that the number of fractional bits of the coefficients (configuration) is not coincident in all cases with the number of stored bits since the ranges of some coefficients allow removal of some initial bits, which are left as implicit bits, while, in other cases, it is necessary to store an integer bit as well, as shown in Table 1 .
Smaller tables could be employed for approximations with looser accuracy constraints, as shown in Table 3 for the cases of 2 ulps and 4 ulps. Note that, when a decrement by one in the value of parameter m is made possible by lower accuracy constraints, a reduction by a factor of about 2 can be achieved in the size of the look-up tables. This happens, for instance, for reciprocal computations when going from 2 ulps to 4 ulps and for exponential (2 X ) or reciprocal square root approximation when moving from 1 ulp to 2 ulps.
Fast Evaluation of the Quadratic Polynomial
As explained in Section 2, many quadratic interpolators [1] , [2] , [15] have, as a main drawback, their long execution times, in spite of the advantage of using small tables for storing the polynomial coefficients. In order to emulate the behavior of linear approximations, the polynomial evaluation must be carried out in an efficient manner. For that purpose, we use a specialized squaring unit, redundant arithmetic, and multioperand addition, as shown in Fig. 2. 
Specialized Squaring Unit
Rather than having a multiplier to compute X 2 2 , two techniques can be used to design a unit with significantly reduced area and delay [16] , [37] : rearranging the partial product matrix and considering the leading zeros of X 2 . The former is well-known and we refer the reader to [3] , [29] , [36] for more information. In the latter, advantage is taken of the fact that X 2 ¼ ½:x mþ1 . . . x n Â 2 Àm (i.e., X 2 has m leading zeros) and, therefore, X 14 bits must be kept for the square with m ¼ 7. Only two levels of 4:2 adders, plus an initial and stage (x ij ¼ x i Á x j ), are required to generate the CS representation of X 2 2 , which is directly recoded to SD-4, to be used as multiplier operand in the fused accumulation tree.
When sharing the squaring unit for the approximation of several functions with different values of m, the squarer must be designed for the lowest m since all other cases are subsumed just by inserting leading zeros, as will be shown in Section 4.
Fused Accumulation Tree
Apart from the squaring evaluation, the other main problem to be overcome in order to guarantee high-speed function approximations is the computation of the quadratic polynomial once the values of C 0 , C 1 , and C 2 have been read from the look-up tables and X 2 2 has been calculated in parallel using the specialized squaring unit.
Following [36] , we propose employing a unified tree to accumulate the partial products of both C 1 X 2 and C 2 X 2 2 , plus the coefficient C 0 . The use of redundant arithmetic can help to significantly reduce the number of partial products to be accumulated: SD radix-4 reduces such a number by half, while SD radix-8 may do it by a factor of 2/3.
The total number of partial products to be accumulated is the sum of the number of partial products of C 1 X 2 (generated by X 2 ), the number of partial products of C 2 X 2 2 (generated by X 2 2 ), plus 1 (the degree-0 coefficient C 0 ). Thus, if the total number of partial products goes from 10 to 12, a first level of 3:2 adders, plus two levels of 4:2 adders are necessary. If the range is 13 to 16 pps, three levels of 4:2 CSA adders are necessary and, for higher values, at least four levels of adders must be employed in the accumulation tree. Table 4 shows different combinations of SD-4 and SD-8 representation for X 2 and X 2 2 which may be considered. When m ¼ 7, the best alternative consists of using SD-4 for both X 2 and X 2 2 since their wordlengths of 16 and 14 bits, respectively, lead to the generation of 8 þ 7 þ 1 ¼ 16 partial products to be accumulated. No speed-up can be obtained from using SD-8 unless it is employed for both X 2 and X 2 2 , which would increase the table sizes significantly since the multiples ð3ÂÞ for C 1 and C 2 must be also stored in order not to increase the delay of the partial product generation scheme.
When m ¼ 6, X 2 has 17 significant bits and X 2 2 has 12 leading zeros, with a wordlength of 16 bits if truncation at position 2 À28 is performed in the specialized squaring unit. The first alternative, following [29] , [31] , consists of again using SD-4 representation for both X 2 and X 2 2 , which results in the generation of nine partial products to compute C 1 X 2 and eight partial products to compute C 2 X 2 2 . The number of operands to be accumulated is, therefore, 18 (9 þ 8 þ 1, the last one being the coefficient C 0 ) and, therefore, four levels of adders are required, but a minimum size for the look-up tables is guaranteed. The second alternative consists of using SD-8 representation for X 2 2 , reducing the contribution of C 2 X 2 2 from eight to six partial products, which leads to a total number of 16 pps to be accumulated and helps to reduce the delay of the accumulation tree by adjusting to just three levels of 4:2 adders. The selection between both alternatives depends on implementation and technology constraints.
In our method, the first alternative is employed because it allows a minimum table size, while the accumulation of partial products can be arranged so that the effective delay corresponds to only three levels of 4:2 adders. Note that, as will be shown in Section 4, the generation of the partial products of C 2 X 2 2 usually takes longer than the generation of those corresponding to C 1 X 2 , due to the computation of the squaring and later recoding into SD-4 representation. Thus, a first level of three 3:2 CSA adders can reduce the nine partial products of C 1 X 2 into six pps, in parallel with the recoding of X 2 2 and generation of C 2 X 2 2 , and, therefore, out of the critical path of the interpolator. Note also that, in the resulting accumulation tree, only the leftmost carry-save adders in levels second to fourth (see Fig. 6 ) must have full singleprecision wordlength, while all other adders benefit from the reduced wordlength of the polynomial coefficients.
By using this strategy, the effective delay of the accumulation tree for m ¼ 6 is exactly the same as that for m ¼ 7, that is:
with t pp gen the delay of the partial products generation stage, while the delay of a standard ð24 Â 24Þ-bit multiplier corresponds to
Therefore, the proposed fused accumulation tree is only about 0:5 slower than a standard single-precision multiplier ( is the delay of a full-adder), with less area because of the reduced wordlengths of C 2 and C 1 , the coefficients of our enhanced minimax polynomial approximation. Fig. 5 shows the arrangement of the matrix of partial products to be accumulated for m ¼ 6 and gives an idea of the effect of reducing the wordlengths of these two coefficients on the total area of the accumulation tree (C 2 TABLE 4 Use of Redundant Arithmetic in the Generation of the Partial Products affects eight partial products, while C 1 is involved in the generation of nine). Sign extension of the operands [28] is performed as shown in Fig. 5 , with the circles at the beginning of each word meaning a complement of the sign bit. Bits c i and d j correspond to the LSB to be added to partial products i or j when multiples À1 and À2 are generated. c i s correspond to the partial products of C 1 X 2 , while d j s are related to the partial products of
The injection of the function-specific rounding bias can be easily accomodated within the accumulation tree by a variety of techniques. As shown in Fig. 5 , we have chosen to split such bias into upper and lower fields, adding the upper field to C 0 before storing it in the corresponding look-up table and appending to it the lower field within the accumulation tree.
ARCHITECTURE
In this section, an architecture for the computation of reciprocal, square root, exponential, and logarithm, in single-precision floating-point format is proposed. The block diagram of such an unfolded architecture, 6 which implements our table-based method for function approximation, is shown in Fig. 6 . Sign, exponent, and exception logic are not shown.
When implementing a single unit approximating several functions, only the look-up tables storing the polynomial coefficients must be replicated since the squaring unit, recoders, CPA, and multioperand adder can be shared for the different computations. Some kind of selection logic must be inserted to distinguish among the polynomial coefficients corresponding to different functions or, alternatively, a single ROM structure could be employed to store all coefficients, with the addressing acting as the multiplexing scheme.
Since m ¼ 6 for square root and exponential and m ¼ 7 for reciprocal and logarithm, the input operand X is split into X 1 and X 2 as follows:
with x 7 not affecting the coefficient selection for square root and exponential, and
with x 7 set to 0 when approximating reciprocal and logarithm. The size of the buses, the squaring unit, and the accumulation tree have been set according to the error computation in order to guarantee an accuracy of 1 ulp for all functions considered: 27 bits for C 0 (one integer and 26 fractional bits, with an implicit positive sign), 18 bits for C 1 (one sign, one integer, and 16 fractional bits), and 13 bits for C 2 (one sign, one integer, and 11 fractional bits), as shown in Fig. 6 . On the other hand, the squaring unit and accumulation tree employed are those corresponding to m ¼ 6 since the case m ¼ 7 is thus subsumed.
The look-up tables storing the polynomial coefficients are addressed using X 1 , two control bits (op) which select among the four functions to be approximated, and, for square root computations, also the least significant bit of the input exponent E x , which determines whether the exponent is even or odd. In parallel, X 2 is recoded from binary representation to SD-4 and the computation of X 2 2 is carried out in carry-save form and then recoded to SD-4 representation. The sign, exponent, and exception logic also operate in parallel with the table look-up. The total table size is about 22:2Kb, with a contribution of 2 Â 3:0625KB for the square root, 3:1875Kb for the exponential, 6:375Kb for the reciprocal, and 6:5Kb for the logarithm, as shown in Table 2 .
All partial products are accumulated together using the multioperand adder, with a first level of 3:2 adders reducing the number of partial products of C 1 X 2 from nine to six, and then three levels of 4:2 adders. Such a structure, as explained in Section 3.5, accomodates the delay of the required extra level of 3:2 adders and compensates for the delay of the path consisting of squaring computation and recoding. A function-specific compensation constant (rounding bias) is injected within the accumulation tree in order to perform the rounding while minimizing the maximum 6 . Note that pipelining this architecture into a three-stage structure is straightforward, as shown in [29] , [31] . absolute error for each function. Finally, the assimilation from CS into nonredundant representation and normalization of the result are carried out by a fast adder.
Evaluation
Estimates of the cycle time and area costs of the proposed architecture are presented now, based on an approximate technology-independent model for the cost and delay of the main logic blocks employed. In this model, whose description can be found in [29] , [30] , the unit employed for the delay estimates is , while the area estimates are expressed as a multiple of fa, the delay and area, respectively, of a full-adder. Table 5 shows the area and delay estimates for the main logic blocks in our architecture, together with those corresponding to the logic blocks employed in the methods included in the comparison performed in Section 5. The delay estimates for the look-up tables already include an extra 0:5 to account for the selection scheme required when several functions are approximated.
The critical path in our architecture is the slowest between two main candidates: 
According to the delay estimates of the individual components shown in Table 5 , the critical path in our architecture corresponds to path B and the cycle time can be estimated as t path B ¼ 14:5.
According to the area estimates of the individual components shown in Table 5 , the total area can be estimated as 1; 291fa, with a contribution of 776fa from the 22:2Kbit lookup tables and of 515fa from the combinational logic blocks employed: a specialized squaring unit with m ¼ 6 (52fa), a recoder from CS to SD-4 (25fa), a recoder from binary to SD-4 (5fa), the fused accumulation tree corresponding to m ¼ 6 (400fa), a CPA (21fa), and registers (12fa). The exponent and sign logic, and the range reduction schemes, are not included in the evaluation.
COMPARISON
A comparison of the minimax quadratic interpolator with existing methods for function approximation is presented in this section. The area and delay estimates have been obtained using the same technology-independent model used for the evaluation of our architecture [29] , [30] , and are explained in detail in [32] . Table 5 shows the estimates for the main logic blocks employed in the compared methods. All considered interpolators employ similar tables and multipliers and, therefore, comparison results are technology-independent, with the relative values expressing trends among the different designs and making good first-order approximations to the actual execution times and area costs.
The comparison is organized in two parts. On the one hand, we compare the minimax quadratic interpolator with an optimized bipartite table method (SBTM) [34] and two linear approximation algorithms (DM97 and Tak98) [5] , [40] , when computing a single operation: reciprocal. This is due to the fact that DM97 [5] has been proposed only for performing such an operation and that Tak98 [40] has been proposed for powering computation with a fixed exponent (X p , thus including the reciprocal when p is set to À1), but neither one is a general method for elementary function approximation. On the other hand, we compare the minimax interpolator with other quadratic methods (CW97, CWCh01, JWL93, and SS94) [1] , [2] , [15] , [36] when employed as an elementary function generator, that is, when computing several operations (in this case, four operations) with the combinational logic shared for the different computations and a replicated set of look-up tables. In JWL93 [15] the functions approximated are reciprocal, square root, arctangent, and sine/cosine, CW97 [1] and CWCh01 [2] approximate reciprocal, square root, sin/cos and 2 X , and, in both SS94 [36] and our quadratic minimax interpolator, the functions computed are reciprocal, square root, exponential (2 X ), and logarithm. We have chosen to conform to the original methods in order to obtain the area and delay estimates based on the table sizes reported by their respective authors. Table 6 shows the table sizes, estimated area costs, and execution times of the minimax quadratic interpolator for the computation of the reciprocal when compared with those corresponding to SBTM [34] , DM97 [5] , and Tak98 [40] . Some assumptions have been made to allow for a fair comparison: Table sizes correspond to a final result accurate to 1 ulp and exponent and exception logic are not included in the area estimates.
The main conclusion to be drawn from this comparison is that bipartite table methods are simply unfeasible for single-precision computation due to the huge size of the tables to be employed. It is also noticeable that fast execution times can be obtained with linear approximation methods, but their hardware requirements are two to three times higher per function than those corresponding to the minimax quadratic interpolator, which, on the other hand, allows for similar execution times. When several functions must be approximated, the look-up tables storing the coefficients need to be replicated and the benefits of a fast quadratic interpolator regarding bipartite tables and linear approximation methods are significantly increased.
In order to compare the minimax quadratic interpolator with other second-degree approximation methods, when used as an elementary function generator, the total table size, execution time, and area costs, both from the look-up tables and from the combinational logic, are shown in Table 7 . To allow for a fair comparison, table sizes correspond in all cases to final results accurate to 1 ulp. Note also that exponent and exception logic are not included in the area estimates and that execution times correspond to reciprocal and square root computation (for exponential, logarithm, or trigonometric functions, the delay of the range reduction and/or reconstruction steps should be added).
The analysis of approximations with Chebyshev polynomials of various degrees performed in [36] shows that a cubic approximation allows some area savings for 24-bits of precision and exactly rounded results. However, when results accurate to 1 ulp are allowed, a quadratic approximation is preferable in terms of both area and speed (see [36, Fig. 14]) . Therefore, we have included in our comparison their quadratic interpolator (SS94) for the approximation of reciprocal, square root, exponential (2 X ), and logarithm, with the table sizes reported in [36, Fig. 18 ], which correspond to a target precision of 24-bits, with 1 ulp accuracy.
The analysis of the estimates shown in Table 7 points out some interesting trends. Regarding execution time, JWL93, CW97, and CWCh01 show the main drawback of quadratic interpolators when the degree-2 polynomial is evaluated using a conventional scheme consisting of performing two serial multiplications and additions: low speed. Conversely, SS94 and the minimax quadratic interpolator use optimized schemes to deal with the polynomial evaluation, resulting in execution times similar to those achieved by linear approximation methods (about 14). In SS94, the two multiplications and additions are concurrently computed and the final result is obtained by using a 3-input multioperand adder, while, in our minimax quadratic interpolator, all partial products of the multiplications are added together using merged arithmetic and a final CPA assimilates the output of the fused accumulation tree into binary representation.
In terms of area, CW97 and CWCh01 methods show reduced hardware requirements (around 1; 400fa). This is due to the fact that those methods use a hybrid scheme where some function values are stored in the look-up tables instead of the polynomial coefficients, which allows a reduction in the total table size. However, such a reduction comes at the expense of computing the coefficients on-thefly, which adds extra delay to the critical path and results in long execution times. The SS94 and JWL93 methods require bigger tables per function and, therefore, their hardware requirements are significantly larger.
The main difference between SS94 and the minimax quadratic interpolator is subtle, but important, and explains the different table size required for each method. While both methods rely on accurate mathematical approximations which outperform Taylor approximations (although, as shown by Piñ eiro et al. [29] , [31] and Muller [23] , [25] , the use of minimax yields slightly better accuracy than Chebyshev), the approximation in SS94 is computed without accounting for the effect of rounding the polynomial coefficients to a finite wordlength. However, three passes of the minimax approximation are performed in our algorithm, which allows compensating in each pass for the effect of rounding a coefficient to a finite wordlength (see Section 3.4). This property, combined with the intrinsic higher accuracy of the minimax approximation, results in a more accurate departing polynomial approximation, and makes feasible a reduction by one of the value of m for most functions. Note that a decrement in m by one results in a reduction by half in the size of the look-up tables to be employed, which grow exponentially with m (each table  has 2 m entries), and, therefore, has a strong impact on the hardware requirements of the architecture. The final step of performing exhaustive simulation in order to minimize the width of the coefficients to be employed, which is common to both methods, has a much lower impact on the table size since marginal improvements can be achieved at that stage. Such a step is necessary, however, since any reduction in the wordlength of the coefficients may also help in reducing the size of the accumulation tree to be employed for the polynomial evaluation.
We can conclude, therefore, that our enhanced minimax interpolator shows the lowest overall area requirements (about 1; 291fa), similar to those of the best quadratic interpolators, which, on the other hand, have execution times over two times longer than those of our method. Such an area reduction, made possible by the high accuracy of the minimax approximation and to the 3-passes iterative algorithm employed to compute the polynomial coefficients, can be achieved without compromising speed since similar execution times to those of the linear approximations have been obtained. Moreover, when implementing an interpolator for the computation of different operations, the impact of reducing the table size in the overall area increases significantly since one set of look-up tables storing C 0 , C 1 , and C 2 must be used per function.
CONCLUSION
A table-based method for high-speed function approximation in single-precision floating-point format has been presented in this paper. Our method combines the use of an enhanced minimax quadratic approximation, which allows the use of reduced size look-up tables, and a fast evaluation of the degree-2 polynomial, which allows us to obtain execution times similar to those of linear approximations. An architecture implementing our method has been proposed for the computation of reciprocal, square root, exponential, and logarithm, with an accuracy of 1 ulp.
The significand of the input operand is split into two fields, X 1 and X 2 , and an iterative algorithm which consists of 3-passes of the minimax approximation to compensate for the effect of rounding each coefficient to a finite wordlength is performed for each input subinterval. The coefficients of such quadratic approximation, C 0 , C 1 , and C 2 , are stored in lookup tables addressed by the m-bit word X 1 (m being 6 for square root and exponential, while m ¼ 7 for reciprocal and logarithm computation). In parallel with the table look-up, a specialized squaring unit with low area cost and delay performs the calculation of X accumulation tree has less area and a similar delay as a standard ð24 Â 24Þ-bit multiplier.
Estimates of the execution times and area costs for the proposed architecture have been obtained and a comparison with other quadratic interpolators, linear approximations, and bipartite tables methods has been shown, based on a technology-independent model for the area and delay of the main logic blocks employed. The conclusion to such a comparison is that our method combines the main advantage of linear approximations, the speed, and the main advantage of the second-degree approximations, the reduced size of the circuit, achieving the best trade off between area and performance.
The proposed method can be applied not only to the computation of reciprocal, square root, exponentials, and logarithms, but also for square root reciprocal, trigonometric functions, powering, or special functions, in singleprecision format with an accuracy of 1 ulp, 2 ulps, or 4 ulps. It can also be employed for fixed-point computations or to obtain high-accuracy seeds for functional iteration methods such as Newton-Raphson or Goldschmidt, allowing significant latency reductions for those algorithms. All these features make our quadratic interpolator very suitable for implementing an elementary function generator in state-ofthe-art DSPs or GPUs. . For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
