Abstract-We present a new micro-architecture for evaluating functions based on piecewise quadratic interpolation. The micro-architecture consists mainly of a look-up table and two multiply-accumulate units. Previous micro-architectures based on piecewise quadratic interpolation have been shown to be efficient for small precision (e.g., single precision) computations. Moreover, they are as fast as piecewise linear interpolation while requiring smaller tables. Our main contribution is in circumventing the need for the additional squaring unit that appears in previous micro-architectures.
I. INTRODUCTION
Many applications, such as computer graphics and DSP, require single-precision approximation of basic functions such as $1/x$, $\sqrt{x}$, $1/\sqrt{x}$, $2^x$, $\log_2 x$, and trigonometric functions. According to several recent papers [1], [2], [3], [4], [5], [6], approximation by quadratic interpolation can be computed within a single clock cycle. Such designs require small look-up tables and a small amount of logic.
A block diagram of a typical piecewise quadratic interpolator is depicted in Fig. 1. The input $X$ is partitioned into two parts; the upper bits are denoted by $X_1$ and the lower bits are denoted by $X_2$. The coefficients $a_0, a_1, a_2$ of the polynomial $p(X_2) = a_0 + a_1 X_2 + a_2 X_2^2$ are read from a look-up table indexed by the upper bits $X_1$. The circuit evaluates an approximation of $p(X_2)$. Note that the range of inputs is partitioned into $2^m$ subintervals, where $m$ denotes the length of $X_1$. Within each subinterval, the function is approximated by a quadratic polynomial, hence the term piecewise quadratic interpolation.
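The table-plus-polynomial scheme described above can be sketched in software as follows. This is only an illustration under stated assumptions: the value m = 7 (`M_BITS`) is hypothetical, the coefficients are obtained by interpolation at three Chebyshev nodes rather than by the min-max fit used in the paper, and all arithmetic is done in double precision instead of with bounded-length coefficients.

```python
import math

N_BITS = 23  # fraction bits of X in [1, 2), as in the text
M_BITS = 7   # hypothetical length m of the upper part X1


def fit_quadratic(x1, width):
    """Interpolate 1/(x1 + t) at three Chebyshev nodes in [0, width].

    Stand-in for a min-max fit; returns (a0, a1, a2) in monomial form.
    """
    t = [width / 2 * (1 + math.cos((2 * k + 1) * math.pi / 6)) for k in range(3)]
    f = [1.0 / (x1 + tk) for tk in t]
    # Newton divided differences
    d01 = (f[1] - f[0]) / (t[1] - t[0])
    d12 = (f[2] - f[1]) / (t[2] - t[1])
    d012 = (d12 - d01) / (t[2] - t[0])
    # Convert the Newton form to a0 + a1*t + a2*t^2
    a2 = d012
    a1 = d01 - d012 * (t[0] + t[1])
    a0 = f[0] - d01 * t[0] + d012 * t[0] * t[1]
    return a0, a1, a2


WIDTH = 2.0 ** -M_BITS
TABLE = [fit_quadratic(1.0 + i * WIDTH, WIDTH) for i in range(2 ** M_BITS)]


def approx_reciprocal(x_int):
    """x_int holds the 23 fraction bits of X = 1 + x_int / 2**23."""
    low_bits = N_BITS - M_BITS
    index = x_int >> low_bits                              # X1 selects the table row
    x2 = (x_int & ((1 << low_bits) - 1)) / 2.0 ** N_BITS   # X2 as a fraction
    a0, a1, a2 = TABLE[index]
    return a0 + a1 * x2 + a2 * x2 * x2


def max_error(step=4099):
    """Largest observed error over a sample of the 2**23 inputs."""
    return max(abs(approx_reciprocal(x) - 1.0 / (1.0 + x / 2.0 ** N_BITS))
               for x in range(0, 2 ** N_BITS, step))
```

For the hypothetical m = 7, the interpolation error at Chebyshev nodes is bounded by $w^3/32 = 2^{-26}$ with $w = 2^{-7}$, before any coefficient or multiplier truncation is taken into account.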
The evaluation of the quadratic polynomial takes place by one squaring ($Z \leftarrow X_2^2$) and two multiplications ($a_2 Z$ and $a_1 X_2$). The latency of the squaring unit is smaller than that of the look-up and hence, roughly speaking, the latency is the sum of the delay of the look-up table and the delay of a multiplier. In addition, symmetry and truncation are employed to reduce the area of the squaring unit. In Walters and Schulte [6], the issue of truncating the multipliers is explored with the aim of reducing the amount of logic required for squaring and multiplications. In Piñeiro et al. [5], a systematic method is presented for computing the coefficients of the quadratic polynomial. In addition, in [5] a unified micro-architecture is presented for evaluating multiple single-precision functions.
In this paper we propose evaluating the quadratic polynomial using Horner's method, which requires only two multiplications and no squaring. Namely, $p(X_2) = a_0 + X_2 \cdot (a_1 + a_2 \cdot X_2)$. A block diagram of the suggested micro-architecture is depicted in Fig. 2. The aim of the proposed micro-architecture is to reduce the amount of logic so as to reduce area and power consumption. The design has a higher latency and is suited for situations where the clock period allows for the extra latency.
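The difference between the two evaluation orders can be illustrated with a small sketch in exact rational arithmetic. The function names are hypothetical, and the sketch ignores truncation; it only shows that the Horner form yields the same value with two multiply-accumulate steps and no squaring.

```python
from fractions import Fraction


def eval_direct(a0, a1, a2, x2):
    """Previous scheme: one squaring followed by two multiplications."""
    z = x2 * x2                  # squaring unit
    return a0 + a1 * x2 + a2 * z


def eval_horner(a0, a1, a2, x2):
    """Horner's rule: two chained multiply-accumulate steps."""
    inner = a1 + a2 * x2         # first multiply-accumulate (MAC-2 in the text)
    return a0 + x2 * inner       # second multiply-accumulate (MAC-1 in the text)
```

The chaining is also what makes the latency higher: the second multiply-accumulate cannot start until the first has produced its result.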
For the purpose of simplicity we demonstrate the proposed micro-architecture with a detailed description for single precision reciprocal approximation ($1/x$). The argument $X$ is in the range $[1, 2)$ and is represented by 23 bits to the right of the binary point. The output is also represented in the same way. This implies that unless $X = 1$, we actually compute $2/X$, i.e., the reciprocal is normalized to the range $[1, 2)$. Hence, the maximum allowed error is $2^{-24}$.
II. MICRO-ARCHITECTURE
A detailed diagram of the proposed micro-architecture for single precision reciprocal approximation is depicted in Fig. 3. The lengths of the coefficients are 10, 18, and 26 bits. Note that the truncation boxes are trivial (i.e., zero cost and zero delay); they are depicted to emphasize the fact that bits in lower positions are discarded. The output $\hat{p}[1:24]$ satisfies $|1/x - \hat{p}[1:24]| < 2^{-24}$. The extra precision of one bit is required due to the normalization shift that guarantees that $2/x \in (1, 2)$.
III. DESIGN
In this section we deal with three issues. First, we bound the error introduced by the truncated multipliers. The truncation positions of both multipliers are determined by this error analysis. Second, we present the method used for computing the coefficients that are stored in the look-up tables. This method is based on [5]. Finally, the bias correction is computed by exhaustive search. The bias correction consists of two parts: the most significant part is added to the coefficient $a_0$, and the least significant part (4 bits) is a constant that is hardwired to the addition tree of the MAC-1 unit. We use the following notation for truncation and round-up. Let $\lfloor z \rfloor_k$ and $\lceil z \rceil_k$ denote the round-down and round-up of a number $z$ to the closest multiple of $2^{-k}$.
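The round-down and round-up notation can be made concrete with a short sketch. The helper names `round_down` and `round_up` are hypothetical; exact rational arithmetic is used so that no floating-point rounding interferes.

```python
import math
from fractions import Fraction


def round_down(z, k):
    """Largest multiple of 2**-k that is <= z."""
    return Fraction(math.floor(Fraction(z) * 2 ** k), 2 ** k)


def round_up(z, k):
    """Smallest multiple of 2**-k that is >= z."""
    return Fraction(math.ceil(Fraction(z) * 2 ** k), 2 ** k)
```

For any $z$, the two results differ by at most $2^{-k}$, and they coincide exactly when $z$ is already a multiple of $2^{-k}$.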
A. Truncation: analytic bounds
In this section we describe simple bounds on the errors introduced by truncations. These bounds are subtracted from the required maximum error $2^{-24}$ to determine the allowed error caused by the quadratic polynomial approximation. The bounds presented here are fairly tight: at best, one could hope to truncate the multipliers by one more position.
Consider a truncated multiplier whose multiplicands are
If the multiplier is truncated in positions to the right of position t, then it computes the truncated product P t = ∑
The error introduced by truncation is bounded in the following claim.
Claim 1:
To save hardware we divide the error evenly among the two MACs and the quadratic polynomial approximation with bounded-length coefficients. Namely, the min-max error of each MAC and of the quadratic polynomial is bounded by roughly $2^{-24}/3$.
The lengths of the coefficients for single precision reciprocal approximation reported in [5] 
Numerical evaluation implies that the multiplier in MAC-2 can be safely truncated in positions to the right of t = 21. We preferred to reduce the length of the input to MAC-1 to 20 bits, and hence had to move the truncation position in MAC-2 to the right by one position, i.e. 
Numerical evaluation implies that the multiplier in MAC-1 can be safely truncated in positions to the right of t = 30.
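The kind of numerical evaluation mentioned above can be sketched with a bit-level model of a truncated multiplier. The model is an assumption, not the paper's definition: operands are unsigned fractions whose bit $i$ has weight $2^{-i}$, the partial-product bit $a_i b_j$ has weight $2^{-(i+j)}$, and every partial-product bit to the right of position $t$ is dropped. The Booth radix 4 multipliers used in the actual design would need a different model.

```python
from fractions import Fraction


def bits(value, n):
    """Bits of the n-bit unsigned fraction value / 2**n, position 1 (MSB) first."""
    return [(value >> (n - i)) & 1 for i in range(1, n + 1)]


def truncated_product(a, b, n_a, n_b, t):
    """Sum of the partial-product bits a_i * b_j whose position i + j is <= t."""
    total = Fraction(0)
    for i, ai in enumerate(bits(a, n_a), start=1):
        for j, bj in enumerate(bits(b, n_b), start=1):
            if ai and bj and i + j <= t:
                total += Fraction(1, 2 ** (i + j))
    return total


def truncation_error(a, b, n_a, n_b, t):
    """Exact product minus truncated product (always >= 0 in this model)."""
    exact = Fraction(a, 2 ** n_a) * Fraction(b, 2 ** n_b)
    return exact - truncated_product(a, b, n_a, n_b, t)


def worst_case_error(n_a, n_b, t):
    """All-ones operands switch on every dropped bit, maximizing the error."""
    return truncation_error(2 ** n_a - 1, 2 ** n_b - 1, n_a, n_b, t)
```

Sweeping `t` with `worst_case_error` for the operand lengths of interest gives the leftmost truncation position whose error still fits within a given budget.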
B. Initial computation
The three-step process for computing the truncated coefficients is listed as Algorithm 1. The parameters of the algorithm are the coefficient lengths $n_0, n_1, n_2$ and $X_1$ (which determines the subinterval). In each phase, the Remez algorithm for min-max approximation is applied [7]. In the first phase (Line 1), a quadratic polynomial $q(x) = q_0 + q_1 x + q_2 x^2$ that approximates $1/(X_1 + x)$ is computed. In Line 2 the coefficient $q_1$ is truncated in position $n_1$ to obtain $a_1(X_1)$. The second phase proceeds as follows. Having fixed $a_1(X_1)$, we substitute $z = X_2^2$ to obtain the target function
2 is computed by exhaustive simulation. This error should be at most roughly $2^{-24}/3$.
Algorithm 1
Compute coefficients a 0 , a 1 , a 2 with lengths n 0 , n 1 , n 2 with respect to X 1 .
C. Bias Correction -Computation
The use of truncated multipliers and truncation of the outputs of the MAC units introduces errors that can be partly corrected by fine tuning the coefficient $a_0$. This fine tuning reduces the error by more than $2^{-25}$.
Let $p(X)$ denote the quadratic polynomial $p(X_2) = a_0 + a_1 X_2 + a_2 X_2^2$ based on the truncated coefficients $(a_0, a_1, a_2)$ that were computed for the prefix $X_1$ of $X$. Let $\hat{p}(X)$ denote the number computed by the circuit. Let $\hat{p}_{30}(X)$ denote the number computed by the 2:1-adder before the truncation to 24 bits (see Fig. 3). Note that $\hat{p}(X)$ does not equal $p(X)$ due to the truncated multipliers and the truncation of $a_1 + a_2 \cdot X_2$.
The error of the final 24-bit result is bounded by 2 −24 if 1
(Recall that the precision of the MAC-1 unit is 30 bits to the right of the binary point).
For every value of X, define min-bias(X) and max-bias(X) as follows:
min-bias(X)
Hence, if $\text{bias}(X) \in [\text{min-bias}(X), \text{max-bias}(X)]$, then setting $a_0(X) \leftarrow a_0(X) + \text{bias}(X)$ guarantees that $\hat{p}_{30}(X)$ satisfies Eq. 1. Since we wish to add the same bias to all numbers in the same subinterval, we need to check, for each $X_1$, that the intersection $\bigcap_{X_2} [\text{min-bias}(X_1 + X_2), \text{max-bias}(X_1 + X_2)]$ is not empty. If this is the case, then we add a bias to each subinterval that is determined by $X_1$.
Another problem that we need to address is that one might need more than 26 bits to represent the bias, in which case we would need to increase the number of bits stored in the table for $a_0 + \text{bias}$. To avoid increasing the table, we partition the bias into two parts, bias[1:26] and bias[27:30]. Our goal is to find a common 4-bit suffix bias[27:30] for all values of $X$. We search exhaustively for such a fixed suffix and, if one exists, include it in the row of $a_0$ in the adder tree of MAC-1. A similar technique is briefly outlined in [5].
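The two searches described above, the per-subinterval intersection and the common 4-bit suffix, can be sketched as follows. The representation is an assumption made for illustration: each bias bound is an integer number of units of $2^{-30}$, so the suffix bias[27:30] is the value modulo 16. The functions `intersect` and `common_suffix` are hypothetical names, and the computation of min-bias and max-bias themselves is not shown.

```python
def intersect(intervals):
    """Intersection of closed intervals [lo, hi]; None if it is empty."""
    lo = max(l for l, _ in intervals)
    hi = min(h for _, h in intervals)
    return (lo, hi) if lo <= hi else None


def common_suffix(per_subinterval, suffix_bits=4):
    """Find one suffix shared by an admissible bias in every subinterval.

    per_subinterval: list of (lo, hi) integer bounds, one per value of X1.
    Returns (suffix, biases) or None if no common suffix exists.
    """
    step = 1 << suffix_bits
    for suffix in range(step):
        biases = []
        for lo, hi in per_subinterval:
            # Smallest value >= lo that is congruent to suffix modulo step
            candidate = lo + ((suffix - lo) % step)
            if candidate > hi:
                break
            biases.append(candidate)
        else:
            return suffix, biases
    return None
```

In this sketch the upper part of each returned bias (the value with the suffix removed) is what would be folded into the stored $a_0$, while the shared suffix is the hardwired constant.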
IV. EVALUATION AND COMPARISON

A. Table size
The lengths of the coefficients that we were able to obtain are 10, 18, and 26 bits. This is two bits more than the lengths reported in [5]. There are tradeoffs between coefficient lengths and truncation positions. In light of the huge gap between the area of a full-adder and the area of a bit stored in the table (i.e., 35 full-adders per kilo-bit), we preferred less logic over smaller tables.
B. Number of partial products
In Table I we compare the number of partial products computed in our design with the number of partial products generated in previous micro-architectures [6], [5]. The previous designs compute the square $Z \leftarrow X_2^2$ and then perform two multiplications $a_1 \cdot X_2$ and $a_2 \cdot Z$. The products are then added with $a_0$ to produce the reciprocal approximation. The cost of the squaring unit is relatively small thanks to the symmetry and truncation [6]. We could not reconstruct exactly the number of partial products reported in [6]; our numbers are smaller, probably due to additions that take place in their design and that we did not include.
C. Delay analysis
In this section we give an overview of a delay-optimized implementation of the proposed micro-architecture and analyze its delay. The MACs are based on truncated Booth radix 4 multipliers [8] to save cost, power, and delay.

[Table I: number of partial products; columns WS05 [6], POMB05 [5], and here.]

We apply the delay model used in [5] to a delay-optimized version of the proposed micro-architecture. According to this model, the basic unit of delay is denoted by τ and it corresponds to the delay of a full-adder. The micro-architecture is optimized as follows:
1) The MAC-2 unit that computes $(a_1 + a_2 \cdot X_2)$ is designed as follows. In parallel to the table look-up, the multiplier $X_2$ is Booth radix 4 recoded. This reduces the number of rows in the addition tree to $1 + \lceil (16+1)/2 \rceil = 10$ (the additional row is needed to add $a_1$). The 10 rows are reduced to a borrow-save number by a sequence of a 4:2-adder, a 3:2-adder, and a 4:2-adder. Thus the delay associated with this computation is $\max\{t_{table}, t_{recode}\}\,(3.5\tau) + t_{pp\text{-}gen}\,(1\tau) + 2 \cdot t_{4:2}\,(3\tau) + t_{3:2}\,(1\tau) = 8.5\tau$.
2) The term $(a_1 + a_2 \cdot X_2)$ is output in borrow-save representation. We recode it to a Booth-4 representation using the equivalent of 3 half-adders [9]. After recoding, we truncate the Booth-4 number and consider only the 11 most significant Booth-4 digits. We now compute $a_0 + X_2 \cdot (a_1 + a_2 \cdot X_2)$. Together with $a_0$, the addition tree in the multiply-accumulate unit has 12 rows. These rows are reduced to two rows by one 3:2-adder and two 4:2-adders. Thus the delay associated with this computation is $t_{recode}\,(1.5\tau) + t_{pp\text{-}gen}\,(1\tau) + 2 \cdot t_{4:2}\,(3\tau) + t_{3:2}\,(1\tau) = 6.5\tau$.

We need to take into account the final 2:1-addition (whose delay is 3τ) and the register setup time (1τ). We conclude that the total delay of the design is 19τ. If a very short clock cycle is required, then our design can be easily pipelined into two stages, each with a delay of 10τ. The delay of 19τ is longer than the delay of 14.5τ reported in [5]. However, the design in [5] is not amenable to pipelining since all the partial products are generated simultaneously and added in a combined addition tree. No delay analysis is provided in [6]; in fact, it is not clear how the intermediate products in [6] are rounded.
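The delay bookkeeping above can be checked with a few lines of arithmetic. The constants are the per-component delays quoted in the text, in units of τ; the value 1.5τ for a single 4:2-adder is inferred from the quoted 3τ for two of them and is therefore an assumption, as are all identifier names.

```python
import math

# Delays in units of tau (one full-adder delay), as quoted in the text.
T_TABLE_OR_RECODE = 3.5   # max{t_table, t_recode} at the input of MAC-2
T_RECODE_MAC1 = 1.5       # borrow-save to Booth-4 recoding before MAC-1
T_PP_GEN = 1.0            # partial-product generation
T_4_2 = 1.5               # inferred: two 4:2-adder levels cost 3 tau
T_3_2 = 1.0               # one 3:2-adder level
T_FINAL_ADD = 3.0         # final 2:1 addition
T_SETUP = 1.0             # register setup time


def booth4_rows(multiplier_bits, extra_rows=1):
    """Rows in the addition tree after Booth radix 4 recoding, plus added rows."""
    return extra_rows + math.ceil((multiplier_bits + 1) / 2)


def mac2_delay():
    return T_TABLE_OR_RECODE + T_PP_GEN + 2 * T_4_2 + T_3_2


def mac1_delay():
    return T_RECODE_MAC1 + T_PP_GEN + 2 * T_4_2 + T_3_2


def total_delay():
    return mac2_delay() + mac1_delay() + T_FINAL_ADD + T_SETUP
```

Under these assumptions the sketch reproduces the row count of 10 for MAC-2 and the figures 8.5τ, 6.5τ, and 19τ stated in the text.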
V. CONCLUSIONS
We presented a micro-architecture for single precision approximations of functions. An implementation for approximating $1/x$ was studied in detail. We showed that the micro-architecture reduces the number of partial products: a reduction of at least 20% is achieved compared to previous designs. According to [5], the logic accounts for roughly 60% of the cost of such a design. The increase in the table size is less than 5%. On the other hand, the latency of our design is 19τ compared to 14.5τ in [5]. However, the design is well suited to pipelining into two stages with a clock period of 10τ.
We anticipate that this design can be useful in situations where the clock period is either at least 19τ (without pipelining) or 10τ (with pipelining). In such cases, the reduced amount of hardware will lead to smaller area and smaller power consumption. 
