Abstract-Polynomial approximation is a very general technique for the evaluation of a wide class of numerical functions of one variable. This article details an architecture generator that inputs the specification of a function and outputs a synthesizable description of an architecture evaluating this function with guaranteed accuracy. It improves upon the literature in two aspects. Firstly, it uses better polynomials, thanks to recent advances related to constrained-coefficient polynomial approximation. Secondly, it refines the error analysis of polynomial evaluation to reduce the size of the multipliers used. An open-source implementation is provided in the FloPoCo project, including architecture exploration heuristics designed to use efficiently the embedded memories and multipliers of high-end FPGAs. High-performance pipelined architectures for precisions up to 64 bits can be obtained in seconds.
I. INTRODUCTION AND MOTIVATION
In this article, we consider real functions f (x) of one real variable x, and we are interested in a fixed-point implementation of this function over some interval. We assume that f is continuously differentiable over this interval up to a certain order. The literature provides many examples of such functions for which a hardware implementation is required.
• Fixed-point sine, cosine, exponential and logarithms are routinely used in signal processing algorithms.
• Random number generators with a Gaussian distribution may be built using the Box-Muller method, which requires logarithm, square root, sine and cosine [1] . Arbitrary distributions may be obtained by the inversion method, in which case one needs a fixed-point evaluator for the inverse cumulative distribution function (ICDF) of the required distribution [2] . There are as many ICDF as there are statistical distributions.
• Approximations of the inverse 1/x and inverse square root 1/ √ x functions are used in recent floating-point units to bootstrap division and square root computation [3] .
• f log (x) = log(x + 1/2)/(x − 1/2) over [0, 1] , and f exp (x) = e x − 1 − x over [0, 2 −k ] for some small k, are used to build hardware floating-point logarithm and exponential in [4] .
• f cos (x) = 1 − cos • s 2 (x) = log 2 (1 + 2 x ) and d 2 (x) = log 2 (1 + 2 x ) are used to build adders and subtracters in the Logarithm Number System (LNS), and many more functions are needed for Complex LNS [6] . Many function-specific algorithms exist, for example variations on the CORDIC algorithm provide low-area, longlatency evaluation of most elementary functions [7] . Our purpose here is to provide a generic method, that is a method that works for a very large class of functions. The main motivation of this work is to facilitate the implementation of a full hardware mathematical library (libm) in FloPoCo, a core generator for high-performance computing on FPGAs 1 . We present a complete implementation in this context, however, most of the methodology is independent of the FPGA target and could apply to other hardware targets such as ASIC circuits.
A. Related work and contributions
Articles describing specific polynomial evaluators are too numerous to be mentioned here, and we just review works that describe generic methods.
Several table-based, multiplier-less methods for linear (or degree-1) approximation have evolved from the original paper by Sunderland et al [8] . See [9] or [7] for a review. These methods have very low latency but do not scale well beyond 20 bits: the table sizes scale exponentially, and so does the design-space exploration time.
The High-Order Table-Based Method (HOTBM) by Detrey and one of us [10] extended the previous methods to higher-degree polynomial approximation. An open-source implementation is available in FloPoCo. However it is not suited to recent FPGAs with powerful DSP blocks and large embedded memories. In addition, it doesn't scale beyond 32 bits.
Lee et al [11] have published many variations on a generic datapath optimization tool called MiniBit to optimize polynomial approximation. They use ad-hoc mixes of analytical techniques such as interval analysis, and heuristics such as simulated annealing to explore the design space. However, the design space explored in these articles does not include the architectures we describe in the present paper: All the multipliers in these papers are larger than strictly needed, therefore they miss the optimal. In addition, this tool is closed-source and difficult to evaluate from the publications, in particular it is unclear if it scales beyond 32 bits.
Tisserand studied the optimization of low-precision (less than 10 bits) polynomial evaluators [12] . He finetunes a rounded minimax approximation using an exhaustive exploration of neighboring polynomials. He also uses other tricks on smaller (5-bit or less) coefficients to replace the multiplication by such a coefficient by very few additions. Such tricks do not scale to larger precisions.
Compared to these publications, the present work has the following distinctive features.
• This approach scales to precisions of 64 bits or more, while being equivalent or better than the previous approaches for smaller precisions.
• We use for polynomial approximation minimax polynomials provided by the Sollya tool 2 , which is the state-of-the-art for this application, as detailed in Section II-B.
• We attempt to use the smallest possible multipliers. As others, we attempt to minimize the coefficient sizes. In addition, we also truncate, at each computation step, the input argument to the bare minimum of bits that are needed at this step. Besides, we will use truncated multipliers [19] , [14] • The resulting architecture evaluates the function with last-bit accuracy. It may be automatically pipelined to a user-specified frequency thanks FloPoCo's pipelining framework [13] .
B. Relevant features of recent FPGAs
Here are some of the features of recent FPGAs that can be used in polynomial evaluators.
• Embedded multipliers features are summed up in Table. I It is possible to build larger multipliers by assembling these embedded multipliers. The DSP blocks include specific adders and shifters designed for this purpose [14] . A given FPGA typically contains a comparable number of memory blocks and multipliers. It therefore makes sense to try and balance the consumption of these two resources. However, the availability of these resources also depends on the wider context of the application, and it is even better to expose a range of trade-offs between them.
II. FUNCTION EVALUATION BY POLYNOMIAL

APPROXIMATION
Polynomial approximation is the generic mathematical tool that reduces the evaluation of a function to additions and multiplications. For these operations, we can either build architectures (in FPGAs or ASICs), or use built-in operators (in processors or DSP-enabled FPGAs). A good primer on polynomial approximation for function evaluation is Muller's book [7] .
Building a polynomial evaluator for a function may be decomposed into two subproblems: 1/ approximation: finding a good approximation polynomial, and 2/ evaluation: evaluating it using adders and multipliers. The smaller the input argument, the better these two steps will behave, therefore a range reduction may be applied first if the input interval is large.
We now discuss each of these steps in more detail, to build the implementation flow depicted on Figure 1 . In this paper we will consider, without loss of generality, a function f over the input interval x ∈ [0, 1).
In our implementation, the user inputs a function (assumed on [0, 1), the input and output precisions (both expressed as LSB weight), and the degree d of the polynomials used. This last parameter could be determined heuristically, but we leave it as a means for the user to trade-off multipliers and latency for memory size.
A. Range reduction
In this work, we use the simple range reduction that consists in splitting the input interval in 2 k sub-intervals, indexed by i ∈ {0, 1, ..., 2 k − 1}. The index i may be obtained as the leading bits of the binary representation of the input:
This decomposition comes at no hardware cost. We now have ∀i ∈ {0, . . . ,
, and we may approximate each f i by a polynomial p i . A table will hold the coefficients of all these polynomials, and the evaluation of each polynomial will share the same hardware (adders and multipliers), which therefore have to be built to accommodate the worst-case among these polynomial. Figure 3 describes the resulting architecture.
Compared to a single polynomial on the interval, this range reduction increases the storage space required, but decreases the cost of the evaluation hardware for two reasons. First, for a given target accuracy ε total , the degree of each of the p i decreases with k. There is a strong threshold effect here, and for a given degree there is a minimal k that allows to achieve the accuracy. Second, the reduced argument y has k bits less than the input argument x, which will reduce the input size of the corresponding multipliers. If we target an FPGA with DSP blocks, there will also be a threshold effect here on the number of DSP blocks used. Many other range reductions are possible, most related to a given function or class of functions, like the logarithmic segmentation used in [2] . For an overview, see Muller [7] . Most of our contributions are independent of the range reduction used.
B. Polynomial approximation
One may use the well-known Taylor or Chebyshev approximation polynomials of arbitrary degree d [7] . These polynomials can be obtained analytically, or using computer algebra systems. A third method of polynomial approximation is Remez' algorithm, a numerical process that, under some conditions, converges to the minimax approximation: the polynomial of degree d that minimizes the maximal difference between the polynomial and the function. We denote ε approx the approximation error, defined as the maximum absolute difference between the polynomial and the function.
Between approximation and evaluation, for an efficient machine implementation, one has to round the coefficients of the minimax polynomial (which has real numbers in theory, and are computed with large precision in practice) to smaller-precision numbers suitable for efficient evaluation. On a processor, one will typically try to round to singleor double-precision numbers. On an FPGA, we may build adders and multipliers of arbitrary size, so we have one more question to answer: what is the optimal size of these coefficients? In [11] , this question is answered by an error analysis that considers separately the error of rounding each coefficient of the minimax polynomial (considered as a realcoefficient one) and tries to minimize the bit-width of the rounded coefficients while remaining within acceptable error bounds.
However, there is no guarantee that the polynomial obtained by rounding the coefficients of the real minimax polynomial is the minimax among the polynomials with coefficients constrained to these bit-width. Indeed, this assumption is generally wrong. One may obtain much more accurate polynomials for the same coefficient bit-width using a modified Remez algorithm due to Brisebarre and Chevillard [15] and implemented as the fpminimax command of the Sollya tool. This command inputs a function, an interval and a list of constraints on the coefficient (e.g. constraints on bitwidths), and returns a polynomial that is very close to the best minimax approximation polynomial among those with such constrained coefficients.
Since the approximation polynomial now has constrained coefficients, we will not round these coefficients anymore. In other words, we have merged the approximation error and the coefficient truncation error of [11] into a single error, which we still denote ε approx . The only remaining rounding or truncation errors to consider are those that happen during the evaluation of the polynomial.
Let us now provide a good heuristic for determining the coefficient constraints. of the least significant bit (LSB) of each coefficient. To reach some target precision 2 −p , we need the LSB of a 0 to be of weight at most 2 −p . This provides the constraint on a 0 . Now consider the developed form of the polynomial, as illustrated by Figure 2 . As coefficient a j is multiplied by y j which is smaller than 2 −kj , the accuracy of the monomial a j y j will be aligned on that of the monomial a 0 if its LSB is of weight 2 −p+kj . This provides a constraint on a j . The heuristic used is therefore the following. Remember that the degree d is provided by the user. The constraints on the d + 1 coefficients are set as just explained. For increasing k, we try to find 2 k approximation polynomials p i of degree d respecting the constraints, and fulfilling the target approximation error (which will be defined in Section II-D). We stop at the first k that succeeds. Then, the 2 k polynomials are scanned, and the maximum magnitude of all the coefficients of degree j provides the most significant bit that must be tabulated, hence the memory consumed by this coefficient.
C. Polynomial evaluation
Given a polynomial, there are many possible ways to evaluate it. The HOTBM method [10] uses the developed form p(y) = a 0 + a 1 y + a 2 y 2 + ... + a d y d and attempts to tabulate as much of the computation as possible. This leads to short-latency architecture since each of the a i y i may be evaluated in parallel and added thanks to an adder tree, but at a high hardware cost.
In this article, we chose a more classical Horner evaluation scheme, which minimizes the number of operations, at the expense of the latency: p(y) = a 0 + y × (a 1 + y × (a 2 + .... + y × a d )...). Our contribution is essentially a fine error analysis that allows us to minimize the size of each of the operations. It is presented below in II-D.
There are intermediate schemes that could be explored. For large degrees, the polynomial may be decomposed into an odd and an even part: p(y) = p e (y 2 ) + y × p o (y 2 ). The two sub-polynomial may be evaluated in parallel, so this scheme has a shorter latency than Horner, at the expense of the precomputation of x 2 and a slightly degraded accuracy. Many variations on this idea, e.g. the Estrin scheme, exist [7] , and this should be the subject of future work. A polynomial may also be refactored to trade multiplications for more additions [16] , but this idea is mostly incompatible with range reduction.
D. Accuracy and error analysis
The maximal error target ε total is an input to the algorithm. Typically, we aim at faithful rounding, which means that ε total must be smaller than the weight of the LSB of the result, noted u. In other words, all the bits returned hold useful information. This error is decomposed as follows: ε total = ε approx + ε eval + ε finalround where • ε approx is the approximation error, the maximum absolute difference between any of the p i and the corresponding f i over their respective intervals. This computation belongs to the approximation step and is also performed using Sollya [17] .
• ε eval is the total of all rounding errors during the evaluation; • ε finalround is the error corresponding to the final rounding of the evaluated polynomial to the target format. It is bounded by u/2. We therefore need to ensure ε approx + ε eval < u/2. The polynomial approximation algorithm iterates until ε approx < u/4, then reports ε approx . The error budget that remains for the evaluation is therefore ε eval < u/2 − ε approx and is between u/4 and u/2.
In p(y) = a 0 + a 1 y + a 2 y 2 + ... + a d y d , the input y is considered exact, so p(y) is the value of the polynomial if evaluated in infinite precision. What the architecture evaluates is p (y), and our purpose here is to compute a bound on ε eval (y) = p (y) − p(y).
Let us decompose the Horner evaluation of p as a recurrence:
This would compute the exact value of the polynomial, but at each evaluation step, we may perform two truncations, one on y, and one on π j . As a rule of thumb, each step should balance the effect of these two truncations on the final error. For instance, in an addition, if one of the addends is much more accurate than the other one, it probably means that it was computed too accurately, wasting resources.
To understand what is going on, consider step j. In the addition σ j = a d−j +π j , the π j should be at least as accurate as a d−j , but not much more accurate: let us keep g defines the truncation of π j , and also the size of σ j (which also depends on the weight of the MSB of a d−j ). Now since we are going to truncate π j = y × σ j−1 , there is no need to input to this computation a fully accurate y. Instead, y should be truncated to the size of the truncated π j , plus a small number g y j of guard bits. The computation actually performed is therefore the following:
In both previous equations, the additions and multiplications should be viewed as exact: the truncations are explicited by the tilded variables, e.g.π j is the truncation of π j to g π j bits beyond the LSB of a d−j . There is no need to truncate the result of the addition, as the truncation of π j serves this purpose already.
We may now compute the rounding error:
Here we have a sum of two errors. The first,π j − π j , is the truncation error on π and is bounded by a power of two depending on the parameter g π j . The second is computed as
Again, we have two error terms which we may bound separately. The first bound is the truncation error on y, which depends on the parameter g y j , and is multiplied by a bound on σ j−1 which has to be computed recursively itself. The second term recursively uses the computation of σ j − σ j , and the bound y < 2 −k . The previous error computation is implemented in C++.
From the values of the parameters g π j and g y j , it decides if the architecture defined by these parameters is accurate enough.
E. Parameter space exploration for the FPGA target
The last problem to solve is to find values of these parameters that minimize the cost of an implementation. This optimization problem is very dependent on the target technology, and we now present an exploration heuristic that is specific to DSP-enabled FPGAs: our objective will be to minimize the number of DSP blocks.
Let us first consider the g y j parameter. The size of this truncation directly influences the DSP count. Here, we observe that once a DSP block is used, it saves us almost nothing to under-use it. We therefore favor truncations which reduce the size of y to the smallest multiple of a multiplier input size that allows us to reach the target accuracy. For Virtex4 and StratixII, the size of y should target a multiple of 17 and 18 respectively. On Virtex5 and Virtex6, multiples of 17 or 24 should be investigated. Consequently, each g y j can take a maximum of three possible values: 0, corresponding to no truncation, and one or two soft spots corresponding to multiples of multiplier input size.
The determination of the possible values of g π j also depends on the DSP multiplier size, as the truncation of π j defines the size of the sum σ j , which is input to a multiplier. There are two considerations to be made: First, it makes no sense to keep guard bits to the right of the LSB ofπ j . This gives us an upper bound on g π j . Secondly, as we are trying to reduce DSP count, we should not allow a number of guard bits that increases the size of σ j over a multiple of the multiplier input size. This gives us a second upper bound on g π j . The real upper-bound in computed as a minimum of the two precomputed upper-bounds.
These upper bounds define the parameter space to explore. We also observe that the size of the multiplications increases with j in our Horner evaluation scheme. We therefore favor truncations in the last Horner steps, as these truncations can save more DSP blocks. This defines the order of exploration of the parameter space. The parameters g π j and g y j are explored using the above rules until the error ε eval satisfies the bound ε eval < u/2 − ε approx . This is a fairly small parameter space exploration, and its execution time is negligible with respect to the few seconds it may take to compute all the constrained minimax approximations. Table II presents the input and output parameters for obtaining the approximation polynomials for several representative functions mentioned in the introduction. The functions f are all considered over [0, 1], with identical input and output precision. Three precisions are given in Table 1 . Table 2 provides synthesis results for the same experiments.
III. EXAMPLES AND COMPARISONS
It is difficult to compare to previous works, especially as none of them reaches the large precisions we attain. Our approach brings no savings in terms of DSP blocks for precisions below 17 bits. We may compare to the logarithm unit in [1] which computes log(1 + x) on 27 bits using a degree-2 approximation. Our tool instantly finds the same coefficient sizes of 30, 22 and 13, and our implementation uses 5 DSP blocks where [1] uses 6: one multiplier is saved thanks to the truncation of y. For larger precisions, the savings would also be larger.
We should compare the polynomial approach to the CORDIC family of algorithm which can be used for many elementary functions [7] , [18] . Table IV compares implementations for 32-bit sine and cosine, using for CORDIC the implementation from Xilinx LogiCore [18] . This table illustrates that these two approaches address different ends of the implementation spectrum. The polynomial approach provides smaller latency, higher frequency and low logic consumption (hence predictability in performance independently of routing pressure). The CORDIC approach consumes no DSP nor memory block. Variations on CORDIC using higher radices could improve frequency and reduce latency, but at the expense of an even higher logic cost. A deeper comparison remains to be done.
IV. CONCLUSION, OPEN ISSUES AND FUTURE WORK
Application-specific systems sometimes need applicationspecific operators, and this includes operators for function evaluation. This work has presented a fully automatic design tool that allows one to quickly obtain architectures for the evaluation of a polynomial approximation with a uniform range reduction for large precisions, up to 64 bits. The resulting architectures are better optimized than what the literature offers, firstly thanks to state-of-the-art polynomial approximation tools, and secondly thanks to a finer error analysis that allows for truncating the reduced argument. They may be fully pipelined to a frequency close to the nominal frequency of current FPGAs. This work will enable the design, in the near future, of elementary function libraries for reconfigurable computing that scale to double precision. However, we also wish to offer to the designer a tool that goes beyond a library: a generator that produces carefully optimized hardware for his very function. Such application-specific hardware may be more efficient than the composition of library components.
Towards this goal, this work can be extended in several directions.
• There is one simple way to further reduce the multiplier cost, by the careful use of truncated multipliers [19] , [14] . Technically, this only changes the bound on the multiplier truncation error in the error analysis of II-D. This improvement should be implemented soon.
• Another way, for large multiplications, is the use of the Karatsuba technique, which is also implemented in FloPoCo [20] . It is even compatible with the previous one.
• Non-uniform range reduction schemes should be explored. The power-of-two segmentation of the input interval used in [2] has a fairly simple hardware implementation using a leading zero or one counter. This will enable more efficient implementation of some functions.
• More parallel versions of the Horner scheme should be explored to reduce the latency.
• Parameter space exploration is tuned for minimizing DSP usage, it should also be tuned to make the best possible usage of available configurations of embedded memory blocks.
• Our tools could attempt to detect if the function is odd or even [21] , and consider only odd or even polynomials for such case [7] , [21] . Whether this works along with range reduction remains to be explored.
• We currently only consider a constant target error corresponding to faithful rounding, but a target error function could also be input.
• Designing a pleasant and universal interface for such a tool is a surprisingly difficult task. Currently, we require the user to input a function on [0, 1), and the input and output LSB weight. Most functions can be trivially scaled to fit in this framework, but many other specific situations exist.
