Abstract-This paper presents a method for the automatic generation of high-performance and low-power arithmetic operators based on polynomial approximations. It deals with the bit-level representation of the polynomial coefficients, the intermediate computations width, the approximation and the rounding errors. The generated operators are small, fast and numerically validated at design time. Some examples have been implemented on FPGAs.
INTRODUCTION
The design of high-performance and pow-power arithmetic operators is an important issue in application specific integrated circuits (ASICs), systems on chip (SoCs) and fieldprogrammable gate arrays (FPGAs) implementations. Basic operations such addition/subtraction or multiplication have always been implemented as high-performance operators in digital circuits [1] . Some recent applications require fast evaluation of more complex operations such as division, reciprocal, square-root, trigonometric functions, logarithm or exponential. Function approximation is often performed using polynomial evaluation in software as well as in hardware. For instance, elementary functions (sine, cosine, exponential, logarithm...) are often evaluated using polynomials [2] . This paper presents a method for the automatic generation of arithmetic operators based on polynomial approximations. This method has been presented in French in [3] . Here we deal its power consumption advantages. The proposed method produces polynomial approximations "well-suited" for highperformance hardware implementations. It faces with two problems: the generation of the polynomial coefficients that ensure low approximation errors and the sizing of intermediate computations that provide low round-off errors.
Notations and polynomial approximation background are presented in Section I. Some previous works are summarized in Section II. The proposed method is described in Section III and illustrated on some examples in Section IV.
I. NOTATIONS AND BACKGROUND
The target function is f with input and output in 2's complement fixed-point format. f is approximated using the degree-d polynomial p (no range reduction is considered here [2] ). The argument x is in the domain [a, b] and the result p(x) is in the range [a , b ] . Extension to other forms of input/output intervals is straightforward. The argument x is a n x -bit number and the result p(x) is a m-bit number as summarized in Figure 1 . The input argument x is considered as exact. The position of the binary point is fixed by the format of the smallest representation that include both a and b. The number of fractional bits will be computed by the method to fit the target accuracy requirement. The polynomial coefficients
The coefficients p i are represented using 2's complement or borrow-save [1] notations. Several evaluation schemes may be used to compute the value p(x) in practice. In this work we only consider the direct and Horner evaluation schemes:
Evaluation schemes differ on several aspects: computation cost, internal parallelism and accuracy. The direct scheme leads to a cost of d additions and d + log 2 d multiplications while the Horner scheme only requires d additions and d multiplications. The Horner scheme is a sequential structure while the direct scheme allows some internal parallelism. From the accuracy point of view, the Horner scheme is known to be slightly more accurate than the direct scheme.
The approximation to f using the polynomial p deals with two components: the approximation error and the round-off error. The approximation error measures the distance between the mathematical function f and the function used for the approximation, here p. The theoretical approximation error app due to the use of the polynomial p to approximate the function f on [a, b], is measured using:
In 
where P d is the set of polynomials with real coefficients and degree at most d. Minimax approximations can be computed thanks to an algorithm due to Remes [4] , see also [2] . Here we use the Maple minimax command.
The round-off error or rounding error due to the finite precision of the intermediate and final values adds up to the approximation error. This error is small for one single operation, i.e. a fraction of the weight of the least significant bit (LSB). But during a sequence of operations, these small errors may accumulate themselves and significantly degrade the accuracy of the final result. In the following, errors are expressed directly or as equivalent accuracy. The accuracy is the number of correct or significant bits. The relation between the error and the accuracy μ is μ = − log 2 | |.
The notation () 2 denotes the binary representation. We also use the borrow-save representation [1] denoted () bs (i.e. radix-2 redundant representation with digits in {−1 = 1, 0, 1}).
II. SUMMARY OF PREVIOUS WORKS
Here we summarize some previous works on arithmetic operators dedicated to function approximation in hardware. Detailed presentations may be found in [1] for computer arithmetic and in [2] for elementary functions.
A. Tables Based Methods
One of the first methods proposed to approximate functions was to tabulate the values of the target function for each possible input. This leads to exponential size tables: 2 nx words of m bits. In practice the maximum number of address bits of the table n x is in the range 8-12 depending on the technology.
Higher accuracies can be reached by combining tables and arithmetic operations. Architectures based on tables and additions/subtractions are very common for function approximation [5] . The bipartite method uses two tables and only one final addition as illustrated on Figure 2 .
The bipartite method uses a decomposition of x into 3 subwords x 1 , x 2 and x 3 of length n 1 , n 2 and n 3 respectively such that n x = n 1 + n 2 + n 3 . The bipartite table method leads to tables with a total of only 2 2nx/3 address bits compared to the 2 nx address bits of the single table solution. The bipartite table method has been extended to a number of tables larger than 2, it is called the multipartite method. It uses several tables looked-up in parallel and final addition of the all tables' contributions [5] . They allow computing commonly used functions with low accuracy (up to 24 bits) with significantly lower hardware cost than that of a straightforward 
B. Polynomial or Rational Approximations
The main drawback of table based methods is the fact that the tables are dedicated to only one specific function. Sharing tables among several functions is not possible in practice. Using polynomial or rational approximations, the basic operators (mainly adders and multipliers) required for the evaluation can be shared or reused. Function approximation in software is often performed using polynomial or rational approximations [2] . Rational approximations are not frequently used in hardware due to the high latency of the final division. Here we only deal with polynomial approximations.
Polynomial approximations are used in hardware since a long time [6] . The size of the multipliers is a major concern. Several solutions have been investigated to limit their size such as weighted sum methods [7] . The coefficients of the polynomial are distributed at the bit level. Thus the polynomial is rewritten as the sum of a huge set of weighted products of the bits of x. Some of these terms are neglected. Polynomial approximations are also combined with tables [8] .
We will see below that the determination of polynomials with coefficients exactly representable in the target format is one of the main problems in the implementation of polynomial approximations. A general but quite slow method dedicated to this problem is proposed in [9] .
C. Digit-Recurrence Algorithms
Digit-recurrence algorithms, also called shift-and-add algorithms, produce one digit of the result every iteration starting from the most significant digit (MSD) (e.g. the paper and pencil division method) [1] . The two most well known digitrecurrence algorithms are: the SRT algorithm for division, square root and other algebraic functions, and the CORDIC algorithm for elementary functions.
The SRT algorithm, [1] , leads to a small number of iterations using a high radix. It uses a redundant number system for the result digits. This allows correcting some small errors from previous iterations due to the reduced internal precision. The combination of a redundant number system and reduced precision leads to very fast iterations.
The CORDIC algorithm (for COordinate Rotations on a DIgital Computer) only uses additions and shift to approximate some elementary functions, so it is very well suited for lowarea hardware implementations. A complete description of this algorithm and its numerous variations can be found in [2] .
III. PROPOSED METHOD
The proposed method has been presented in French in [3] . It provides an answer to two practical questions about implementation of polynomial approximation. The first one is "what values should be used for the implemented coefficients?". The second one is "what is the minimal size for the intermediate computations?".
The input parameters of the method are:
• f the function to be evaluated,
• the argument format x (size n x bits),
• μ the total maximal target absolute error (the accuracy constraint). The results from the method are:
• d the degree of the polynomial,
. , p d the coefficients values (representable in the target format), • n the coefficient size,
• n the data-path size 1 .
The proposed method consists in 3 main steps and an optional step detailed below.
A. Step 1: Determination of the Initial Polynomial
The first step defines a good starting point for the method. We use a minimax polynomial as a starting point (see Section I) provided by the MAPLE minimax function. We look for the minimax polynomial p * with the smallest degree d and accurate enough to approximate f on The polynomial result p * will be modified in the next steps. The next steps will degrade the accuracy (i.e. lead to errors larger than * app ). Then some "margin" between * app and μ is necessary as seen in the next sections.
B. Step 2: Coefficients Optimization
In this step we look for the size n and the values p i of the coefficients in the target format such that app < μ where
Notice that n may be smaller than the width of the target format.
The proposed solution is based on an exploration over the rounded coefficients. Each coefficient p * i may be rounded up
There are 2 choices per coefficients, then 2 d+1 different polynomials to test as illustrated in Figure 3 . For each possible polynomial p the value app . In our applications d is small (d ≤ 6) then the total exploration time is small.
We designed a MAPLE program that tests the 2 d+1 polynomials corresponding to the different rounding modes of the d+1 coefficients p * i . The program starts with n = − log 2 |μ| , tests all the rounding modes of the d + 1 coefficients for the current n size. If there are solutions, i.e. polynomials such that app < μ, it returns the list of those polynomials for the next step (those where app is minimal). If there is no solution then n is incremented.
C. Step 3: Data-path Optimization
In this step we look for the data-path size n . Only two polynomial evaluation schemes are supported in the method: the direct and the Horner schemes. The idea is very simple. We start with n = n and we check that the data-path of size n fulfills the accuracy constraint μ using GAPPA. If the total error bound computed by GAPPA is less than μ then this step is finished. If not, the size n is incremented and the new data-path should be tested using GAPPA.
The GAPPA software, presented in [10] , allows to evaluate and to produce a proof of mathematical properties on numerical codes. The main useful characteristic of GAPPA used in this work is its capability to tightly bound round-off errors and prove these bounds are below some threshold. GAPPA can generate a file for the Coq proof assistant [11] .
The difference between n and n is called the number of guard bits. Using a coefficient size n smaller than n allows a reduction in the memory size required to store the coefficients without degradation on the final accuracy.
D. Step 4 (optional): Post-Optimizations
One kind of frequent post-optimization is to simplify the hardware when some coefficients are very close to power of 2. As an example, if one coefficient from step 2 is the value 0.5002441406 = (0.100000000001) 2 and the target accuracy μ is about 12 fractional bits, this coefficient may be rounded to 0.5. In that case, the multiplication by 0.5002441406 is replaced by a simple right shift, see Section IV-B for a concrete example of this kind of post-optimization.
E. Loops
From a given step, it may be necessary to move back to a previous step. For instance, there may be no result from a given step or its result may not be considered "good enough".
As an example, the second step may not produce representable coefficients that ensure app < μ. This case is due to insufficient margin between * app and μ in the first step. In that case, one should move back to step 1 and try with a higher degree polynomial (i.e. d ← d + 1).
Another loop occurs after step 3 when n is considered too huge in front of n. Then it may be more efficient to move back to step 2 and try a larger n. By doing this, app will be smaller which leads to more margin for round-off errors.
IV. EXAMPLES
The examples below have been implemented on Xilinx FPGAs XCV200-5 using ISE8.1i tools. Synthesis and place/route have been optimized for an area target with high effort. The reported results include all the resources required for the implementation (logic cells and registers). The power consumption is estimated using the XPower software and solutions are compared on the same set of random input patterns.
A. Radix-2 Exponential over [0, 1]
Here f (x) = 2 x and x ∈ [0, 1]. The target accuracy is 12 bits, i.e. μ = 2 −12 . We report the result from the step 1 for the minimax polynomial for 1 ≤ d ≤ 5. The corresponding accuracy is reported below in number of correct bits. The values obtained for the other polynomial (p 2 = 919 4096 ) are similar. Two solutions have been implemented. The first one corresponds to the standard solution (degree-4 polynomial, n = n = 18). The second one is the result from the generation method (d = 3, n − 1 = 14, Horner evaluation scheme and n = 16). The implementation results are reported in Table I . The generation method leads to 17% smaller circuit and the degree-3 approximation saves 38% of the computation time. 
B. Square Root
Here f (x) = √ x, x ∈ [1, 2] and a target accuracy of 8 correct bits is fixed (μ = 2 −8 ). The first step leads to d = 2 and * app = 0.0007638369 which corresponds to 10.35 correct bits (for d = 1 the accuracy is only 6.81 correct bits).
