This paper presents an architecture and a synthesis method for programmable numerical function generators (NFGs) for trigonometric, logarithmic, square root, and reciprocal functions. Our NFG partitions a given domain of the function into non-uniform segments using an LUT cascade, and approximates the given function by a quadratic polynomial for each segment. Thus, we can implement fast and compact NFGs for a wide range of functions. Implementation results on an FPGA show that: 1) our NFGs require only 4% of the memory needed by NFGs based on the linear approximation with non-uniform segmentation; and 2) our NFGs require only 22% of the memory needed by NFGs based on the 5th-order approximation with uniform segmentation. Our automatic synthesis system generates such compact NFGs quickly.
I. INTRODUCTION
Numerical function generators (NFGs) are often used in computer graphics, digital signal processing, communication systems, robotics, astrophysics, fluid physics, etc. The functions realized include trigonometric, logarithmic, square root, and reciprocal functions. High-performance CPUs usually have numerical coprocessors. However, embedded CPUs and CPUs on FPGAs do not have such coprocessors. Thus, FPGA implementation of numerical functions f (x) is needed. Implementation by a single lookup table for f (x) is simple and fast. For low-precision computations of f (x) (e.g. x and f (x) have 8 bits), this implementation is straightforward. For highprecision computations, however, the single lookup table implementation is impractical due to the huge table size. For such applications, the CORDIC (COordinate Rotation DIgital Computer) algorithm [1, 21] has been often used. Although CORDIC is implemented with compact hardware, it is iterative and therefore slow. For numerically intensive applications, faster evaluation of numerical function is required.
For fast evaluation of numerical functions, polynomial approximations have been used [9, 10, 19, 20] . These methods approximate the given numerical functions by piecewise polynomials, and realize the polynomials with hardware. Linear or quadratic approximations offer fast and relatively highprecision evaluation of numerical functions. However, the methods proposed so far are ad-hoc and not systematic. This paper proposes an architecture and a systematic synthesis method for NFGs based on quadratic approximation. By using the LUT cascade [8] , many numerical functions are efficiently approximated by piecewise quadratic functions. Our synthesis method can be automated, so that fast and compact NFGs can be produced by non-experts. Fig. 1 shows the synthesis flow for the NFG. It converts the Design Specification described by Scilab [18] , a MATLAB-like software, into HDL code. The Design Specification consists of a function f (x), a domain for x, and an acceptable error. This system first partitions the domain into segments, and then approximates f (x) by a quadratic function for each segment. Next, it analyzes the errors, and derives the necessary precision for computing units in the NFG. Then, it generates HDL code to be mapped into an FPGA using an FPGA vendor tool. Due to the page limitation, the error analysis for our NFGs is omitted here, but it is available in [14] . This paper extends [17] to quadratic approximations.
II. PRELIMINARIES

Definition 2.1
The binary fixed-point representation of a value r has the form
where d i ∈ {0, 1}, n int is the number of bits for the integer part, and n f rac is the number of bits for the fractional part of r. The representation in (1) is two's complement, and so
Definition 2.2
Error is the absolute difference between the original value and the approximated value. Approximation error is the error caused by a function approximation, and rounding error is the error caused by a binary fixed-point representation. Acceptable error is the maximum error that an NFG may assume. Acceptable approximation error (AAE) is the maximum approximation error that a function approximation may assume.
Definition 2.3
Precision is the total number of bits for a binary fixed-point representation. Specially, n-bit precision specifies that n bits are used to represent the number; that is, n = n int + n f rac. An n-bit precision NFG has an n-bit input.
Definition 2.4
Accuracy is the number of bits in the fractional part of a binary fixed-point representation. Specially, m-bit accuracy specifies that m bits are used to represent the fractional part of the number; that is, m = n f rac. An m-bit accuracy NFG is an NFG with m-bit fractional part of the input, m-bit fractional part of the output, and a 2 −m acceptable error.
III. QUADRATIC APPROXIMATION ALGORITHM
To approximate the numerical function f (x) using quadratic functions, first, we partition the domain for x into segments. For each segment, we approximate f (x) using a quadratic function g(x) = c 2 x 2 + c 1 x + c 0 . In this case, the approximation error depends on the segmentation method and the values of coefficients c 2 , c 1 , and c 0 in the approximation polynomial.
For piecewise polynomial approximations, in many cases, the domain is partitioned into uniform segments [2, 6, 19] . Such methods are simple and fast, but for some kinds of numerical functions, too many segments are required, resulting in large memory.
For a given error, non-uniform segmentation of the domain uses fewer segments than the uniform segmentation [9, 17] . However, a non-uniform segmentation often requires a complicated segment index encoder (see Section IV), and results in larger and slower NFGs. To overcome this problem, a special non-uniform segmentation has been proposed [9] . This method produces a simple segment index encoder by restricting the segmentation points, and results in fewer segments as well as faster and more compact NFGs than produced by uniform segmentation. However, it is ad-hoc and non-optimum for the given function. Our NFG can implement any nonuniform segmentation with a fast and compact segment index encoder by using an LUT cascade [17] with a synthesis method that can be automated.
Selection of the approximation polynomial influences the number of non-uniform segments as well as the approximation error. In this paper, we use the 2nd-order Chebyshev approximation to approximate f (x) with fewer non-uniform segments, and compute the approximated value. Since coefficients of the Chebyshev approximation polynomial are easily computed, it is suitable for automatic synthesis.
A. Segmentation Algorithm
For a segment [s, e] of f (x), the maximum approximation error ε 2 (s, e) of the 2nd-order Chebyshev approximation [11] is given by
where f (3) is the 3rd-order derivative of f . From (2), ε 2 (s, e) is a monotone increasing function of segment width e − s. Using this property, we partition a domain into as wide segments as possible such that the approximation error is less
Process:
1.
If p > b, then let p = b.
4.
Let e i = p and i = i + 1.
5.
If p = b, then let t = i, and stop the process. 6.
Else, let s i = p, and go to step 2. than the specified error. Fig. 2 shows the non-uniform segmentation algorithm. The inputs for this algorithm are a numerical function f (x), a domain [a, b] for x, and an acceptable approximation error ε. Then, this algorithm approximates f (x) with the acceptable approximation error ε, and produces t segments
Such p can be found by scanning values of n-bit input x. However, it requires O(2 n ) search, and is timeconsuming. Therefore, we compute the maximum value p by setting 0 or 1 from MSB to LSB of x such that
is computed by the nonlinear programming algorithm, which is one of the most efficient [7] .
B. Computation of Approximate value
For each [s i , e i ], f (x) is approximated by the corresponding quadratic function g i (x). That is, the approximated value y of f (x) is computed as follows:
where the coefficients c 2i , c 1i , and c 0i are derived from the 2ndorder Chebyshev approximation polynomial [11] . Substituting
In (4),
This transformation reduces the multiplier size.
IV. ARCHITECTURE FOR NFGS Fig. 3 shows the architecture that realizes (5) . It uses 7 units: the segment index encoder that computes the index i for segment [s i , e i ] including the input value x; the coefficients table for −q i , c 2i , c 1i , and c 0i ; the adder for x + (−q i ); the squaring unit; two multipliers; and the final adder.
Segment Index
Encoder (LUT Cascade) Coefficients A segment index encoder converts x into a segment index i. It realizes the segment index function seg f unc(x) : Fig. 4 (a) , where x has n bits, B = {0, 1}, and t denotes the number of segments. In [9] , to simplify the segment index encoder, the values of s i and e i are restricted to what can be produced by a simple combinational logic circuit. Such a segmentation method results in many segments since it does not adapt to the given function. Our synthesis system uses the LUT cascade [8, 15, 16] shown in Fig. 4 (b) to realize arbitrary seg f unc(x). It can be designed by functional decomposition using BDDs (Binary Decision Diagrams) representing seg f unc(x). Our synthesis system uses a nonrestrictive segmentation. It is suitable for automatic synthesis. In LUT cascades, the interconnecting lines between adjacent LUTs are called rails. The size of an LUT cascade depends on the number of rails. The next theorem shows that the segment index functions are realized by compact LUT cascades.
Theorem 4.1 [16] Let seg f unc(x) be a segment index function with t segments. Then, there exists an LUT cascade for seg f unc(x) with at most log 2 t rails.
Our synthesis system uses heterogeneous MDDs (Multi-valued Decision Diagrams) [13] to find compact LUT cascades. Since the LUT cascade is suitable for the pipeline processing, it offers a fast and compact circuit. In Section VI, we will show that our architecture produces fast and compact NFGs for various numerical functions.
V. IMPLEMENTATION WITH FPGA
Modern FPGAs consist of logic elements (LEs) or configurable logic blocks (CLBs), synchronous memory blocks, multipliers (DSP units), etc. Our synthesis system efficiently generates NFGs using these components. Each unit for the NFG shown in Fig. 3 is implemented by the following components in an FPGA: 1) Segment index encoder (LUT cascade) and coefficients table: by synchronous memory blocks; 2) Squaring unit: by logic elements; 3) Multiplier: by DSP units; and 4) Adder: by logic elements. Our synthesis system derives the appropriate bit-width for each component by automatic error analysis.
A. Size Reduction of Multiplier
Although modern FPGAs have dedicated multipliers, large multipliers are slow. In our architecture, the multiplier often has the longest delay time among all the units. Thus, to implement a fast NFG, reducing multiplier size is important. Since the size of multipliers depends on the number of bits for c 2i , c 1i , and x − q i , it is important to reduce the number of bits to represent these values.
First, we consider the case where the absolute values of c 2i and c 1i are large. Our synthesis method uses a scaling method [9] . We represent c 2i and c 1i as c 2i = c 2i × 2 −l 2i × 2 l 2i and c 1i = c 1i × 2 −l 1i × 2 l 1i , respectively. That is, instead of the original values of c 2i and c 1i , we store the values of c 2i × 2 −l 2i , l 2i , c 1i × 2 −l 1i , and l 1i in the coefficients table. In this case, the products c 2i (x − q i ) 2 and c 1i (x − q i ) are computed using multipliers and shifters. The use of l 2i and l 1i reduces the number of bits to represent the values of c 2i × 2 −l 2i and c 1i × 2 −l 1i , but increases the rounding errors. Our synthesis method finds optimum values of l 2i and l 1i for each segment such that an acceptable error is achieved. When l 2i and l 1i are 0 for all the segments, no shifter is implemented, that is, c 2i (x − q i ) 2 and c 1i (x − q i ) are directly implemented with multipliers.
Next, we consider the value of x − q i . The number of bits for x − q i influences the sizes of the squaring unit and multipliers. Thus, reducing the value of x − q i reduces the sizes of the squaring unit and multipliers, and also the error. From (5), we can choose any value for q i . To reduce the value of x − q i , for a segment [s i , e i ], we set q i = (s i + e i )/2. Then, we have |x − q i | ≤ (e i − s i )/2. Thus, reducing the segment width e i − s i reduces the value for x − q i . However, this also increases the number of segments, and results in increased memory size.
The rest of this section shows a reduction method of segment width without increasing the memory size.
The coefficients table in Fig. 3 has 2 k words, where k = log 2 t and t is the number of segments. Therefore, we can increase the number of segments up to t = 2 k without increasing the memory size. From Theorem 4.1, the size of LUT cascade also depends on the value of k. However, increasing the number of segments to t = 2 k seldom increases the size of the LUT cascade. We reduce the size of segments by dividing the largest segment into two equal sized segments up to t = 2 k . This method reduces both the number of bits for x − q i and the error without increasing the memory size.
B. Pipeline Processing
To implement a high-throughput NFG in an FPGA, our synthesis system inserts pipeline registers between all units in the architecture. Since all units operate in parallel, and each unit has a short delay time, our NFGs achieves high throughput. Table I shows the units and the number of pipeline stages for them. Our NFGs have n cas + (5 or 6) pipeline stages, where n cas is the number of LUTs for the LUT cascade. Table II compares the number of segments for various approximation methods for the functions in [16] . In this table, Entropy, Sigmoid, and Gaussian are
VI. EXPERIMENTAL RESULTS
A. Number of Segments and Computation Time of Algorithm
In Table II , the columns "Linear Non" show the number of non-uniform segments for linear approximation in [17] , and the columns "2nd-Chebyshev Uniform" and "2nd-Chebyshev Non" show the number of uniform segments and non-uniform segments for 2nd-order Chebyshev approximation, respectively. The columns "Time" show the CPU time for our nonuniform segmentation algorithm applied to functions, in milliseconds. Table II shows that, for many functions, the 2nd-order Chebyshev approximations require many fewer segments than the linear approximation. However, for some functions, such as − ln(x), the 2nd-order Chebyshev approximation based on uniform segmentation requires many more segments than the linear and 2nd-order Chebyshev approximations based on non-uniform segmentations. Many existing polynomial approximation methods are based on uniform segmentation. For trigonometric and exponential functions, approximation methods based on uniform segmentation require relatively few segments. However, for some kinds of functions such as − ln(x), the uniform 2nd-order approximation method requires excessively many segments. On the other hand, our quadratic approximation based on non-uniform segmentation requires fewer segments for a wide range of functions. Also, Table II shows that the CPU time is strongly correlated to the number of segments. Smaller acceptable approximation error (AAE) requires more segments and longer computation time. However, Table II shows that, for all functions in the table, the CPU times are shorter than 1 second when the acceptable approximation error is 2 −25 .
These results show that, for various functions, our segmentation algorithm partitions a domain into fewer non-uniform segments quickly, and it is useful for automatic synthesis.
B. Memory Sizes of Various NFGs
This section compares the memory sizes of our NFGs with three existing NFGs [17, 3, 4] . Table III compares NFGs using linear approximation shown in [17] . This linear approximation is based on non-uniform segmentation. In Table III , the columns "R" show the following values: R = memory size of quadratic approximation memory size of linear approximation × 100. Table III shows that NFGs using quadratic approximation require much smaller memory than ones using linear approximation. Especially, 24-bit precision NFGs using quadratic approximation can be implemented with only 4% of the memory size needed for a linear approximation. From the relation between precision and memory size shown in Table III , we can see that increasing the precision decreases the ratio of memory sizes in NFGs. [3] and NFGs using 2nd-order minimax approximation by the Remez algorithm [4] , respectively. Both approximations in [3, 4] are based on uniform segmentation. Thus, their NFGs require no segment index encoder. On the other hand, since our approximation is based on nonuniform segmentation, the memory size is obtained by the sum of the coefficients table and the segment index encoder. As shown in [17] and Table II , for trigonometric and exponential functions, the difference of the number of uniform segments and non-uniform segments is not so large under the same approximation polynomial. For such functions, NFGs based on uniform segmentation (needing no segment index encoder) often require smaller memory than non-uniform segmentations. Although our NFGs require the segment index encoder and use approximation polynomials with larger approximation error than approximation polynomials in [3, 4] , our NFGs for such functions are implemented with only 22% to 52% of the memory sizes of NFGs in [3] , and with memory size comparable to [4] . In [3, 4] , memory sizes of NFGs for √ x and − ln(x) are unavailable. However, from Table II, we can see that the memory size of their NFGs for √ x and − ln(x) is excessively large. On the other hand, our NFGs can realize a wide range of functions with small memory size.
C. FPGA Implementation Results
Table VI compares the FPGA implementation results of our NFGs with NFGs using linear approximation [17] .
Since the architecture of linear NFG is simpler than quadratic NFG, linear NFGs are faster, and require fewer logic elements and DSP units than quadratic NFGs. However, linear approximation requires more segments and larger memory than quadratic approximation, as shown in Table II and  Table III . Table VI shows that 24-bit precision linear NFGs cannot realize any function except Gaussian with the FPGA (the smallest device in the Stratix family) due to the excessive memory size although many logic elements and DSP units are unused. The most crucial issue in the FPGA implementation is the constraints on these hardware resources. For 24-bit precision, the linear approximation requires a larger FPGA due to the excessive memory size. However, in the larger FPGA, more logic elements and DSP units are left unused and wasted. On the other hand, the quadratic NFGs can be implemented with a smaller FPGA since they require much less memory size than the linear NFGs and reasonable sizes of logic elements and DSP units. In fact, 24-bit precision quadratic NFGs can be implemented with lower cost and more compact FPGAs (Cyclone II).
VII. CONCLUSION AND COMMENTS
We have demonstrated an architecture and a synthesis method for programmable NFGs for trigonometric functions, logarithm functions, square root, reciprocal, etc. Our architecture can efficiently realize any non-uniform segmentation using a compact LUT cascade, and approximate many numerical functions by quadratic polynomials. Therefore, our architecture is suitable for automatic synthesis of fast and compact NFGs. Implementation results on an FPGA show that our synthesis method can approximate a wide range of functions with a small number of non-uniform segments, and generate NFGs with small memory size. For 24-bit precision, our NFGs can be implemented with only 4% of the memory size of NFGs based on the linear approximation with non-uniform segmentation, and with only 22% of the memory size of NFGs based on the 5th-order approximation with uniform segmentation. NFGs based on the linear approximation are faster than the quadratic ones, but for high-precision, they require a large FPGA due to the excessive memory size. On the other hand, our quadratic NFGs can be implemented with more compact and low-cost FPGA by using hardware resources on the FPGA efficiently.
