This paper presents an architecture and a synthesis method for compact numerical function generators (NFGs) for trigonometric, logarithmic, square root, reciprocal, and combinations of these functions. Our NFG partitions a given domain of the function into non-uniform segments using an LUT cascade, and approximates the given function by a quadratic polynomial for each segment. Thus, we can implement fast and compact NFGs for a wide range of functions. Experimental results show that: 1) our NFGs require, on average, only 4% of the memory needed by NFGs based on the linear approximation with non-uniform segmentation; 2) our NFG for 2 x − 1 requires only 22% of the memory needed by the NFG based on a 5th-order approximation with uniform segmentation; and 3) our NFGs achieve about 70% of the throughput of the existing table-based NFGs using only a few percent of the memory. Thus, our NFGs can be implemented with more compact FPGAs than needed for the existing NFGs. Our automatic synthesis system generates such compact NFGs quickly. key words: LUT cascades, nonuniform segmentation, NFGs, automatic synthesis, FPGA 
Introduction
Numerical functions f (x), such as trigonometric, logarithmic, square root, reciprocal, and combinations of these functions, are extensively used in computer graphics, digital signal processing, communication systems, robotics, astrophysics, fluid physics, etc. To compute elementary functions, iterative algorithms, such as the CORDIC (COordinate Rotation DIgital Computer) algorithm [1] , [23] , have been often used. Although the CORDIC algorithm achieves accuracy with compact hardware, its computation time is proportional to the precision (i.e. the number of bits). For a function composed of elementary functions, the CORDIC algorithm is slower, since it computes each elementary function sequentially. It is too slow for numerically intensive applications. Implementation by a single lookup table for f (x) is simple and very fast. For low-precision computations Manuscript To reduce the memory size, polynomial approximations have been used [9] , [10] , [21] , [22] . These methods approximate the given numerical functions by piecewise polynomials, and realize the polynomials with hardware. Linear approximations offer fast and middle-precision evaluation of numerical functions. However, for high-precision, these methods also require excessive memory size. And, the methods proposed so far are ad-hoc and not systematic. This paper proposes an architecture and a systematic synthesis method for NFGs based on quadratic approximation. By using the LUT cascade [8] , many numerical functions are efficiently approximated by piecewise quadratic functions, and are realized by NFGs with small memory size. Our synthesis method is fully automated, so that fast and compact NFGs can be produced by non-experts. Figure 1 shows the synthesis flow for the NFG. It converts the Design Specification described by Scilab [19] , a MATLAB-like software, into HDL code. The Design Specification consists of a function f (x), a domain over x, and an accuracy. This system first partitions the domain into segments, and then approximates f (x) by a quadratic function in each segment. Next, it analyzes the errors, and derives the necessary precision for computing units in the NFG. Then, it generates HDL code to be mapped into an FPGA using an FPGA vendor tool.
Preliminaries

Definition 1:
The binary fixed-point representation of a value r has the form
where d i ∈ {0, 1}, n int is the number of bits for the integer part, and n f rac is the number of bits for the fractional part of r. The representation in (1) is two's complement, and so
Copyright c 2006 The Institute of Electronics, Information and Communication Engineers Definition 2: Error is the absolute difference between the original value and the approximated value. Approximation error is the error caused by a function approximation. Rounding error is the error caused by a binary fixed-point representation. It is the result of truncation or rounding, whichever is applied. However, both operations yield an error that is called rounding error. Acceptable error is the maximum error that an NFG may assume. Acceptable approximation error is the maximum approximation error that a function approximation may assume.
For example, when a function is approximated by a piecewise linear function, approximation error occurs because the approximating function does not equal the approximated function everywhere. Typically, there is a non-zero error over most of the function values; indeed, this error is usually 0 only for a very few function values. Rounding error occurs because the binary representation of the function (2) takes on only specific values. For example, if the function value is √ 2, there is no representation in the form (2) , where n f rac is finite, that is exactly √ 2.
Definition 3:
Precision is the total number of bits for a binary fixed-point representation. Specially, n-bit precision specifies that n bits are used to represent the number; that is, n = n int + n f rac. An n-bit precision NFG has an n-bit input.
Definition 4:
Accuracy is the number of bits in the fractional part of a binary fixed-point representation. Specially, m-bit accuracy specifies that m bits are used to represent the fractional part of the number; that is, m = n f rac. An m-bit accuracy NFG is an NFG with an m-bit fractional part of the input, an m-bit fractional part of the output, and a 2 −m acceptable error.
Definition 5:
Truncation is the process of removing lower order bits from a binary fixed-point number. If r is represented as r = d n int−1 d n int−2 . . . d 0 . d −1 . . . d −n f rac and is truncated to r = d n int−1 d n int−2 . . . d 0 . d −1 . . . d −i , then the resulting rounding error is at most 2 −i − 2 −n f rac , where i ≤ n f rac. Truncation of r never produces a value larger than r.
r is rounded to
The error caused by rounding is at most 2 −(i+1) .
Truncation can cause a larger error than rounding. On the other hand, truncation requires less hardware. In our architecture, we use both truncation and rounding. This is discussed in Appendix.
Piecewise Quadratic Approximation
To approximate the numerical function f (x) using quadratic functions, we first partition the domain for x into segments. For each segment, we approximate f (x) using a quadratic function g(x) = c 2 x 2 + c 1 x + c 0 . In this case, we seek the fewest segments, since this reduces the memory size needed for storing the coefficients of the quadratic functions.
For piecewise polynomial approximations, in many cases, the domain is partitioned into uniform segments [2] - [4] , [21] , [22] . Such methods are simple and fast, but for some kinds of numerical functions, too many segments are required, resulting in large memory.
For a given error, non-uniform segmentation of the domain uses fewer segments than uniform segmentation [9] , [18] . However, a non-uniform segmentation requires an additional circuit that maps values of x to a segment number. Potentially, this is a complex circuit. To simplify the additional circuit, Lee et al. [9] have proposed a special non-uniform segmentation for the √ − ln(x) function. Their method produces a simple circuit by restricting the segmentation points, and results in fewer segments as well as faster and more compact NFG than produced by uniform segmentation. However, it is not always optimum for the given function f (x). For a fast and compact realization of any nonuniform segmentations, we use an LUT cascade proposed by Sasao et al. [17] , [18] (see Sect. 4). By using the LUT cascade, we can use an optimum non-uniform segmentation for the given function f (x). As far as we know, there is no other method that uses an optimum non-uniform segmentation.
Non-uniform Segmentation Algorithm
The number of non-uniform segments depends on the approximation polynomial. A more accurate approximation polynomial requires fewer segments. In this paper, we use the 2nd-order Chebyshev polynomials to approximate f (x). We show that its coefficients are computed easily and quickly: it is suitable for fast automatic synthesis.
For a segment [s, e] of f (x), the maximum approximation error 2 (s, e) of the 2nd-order Chebyshev approximation [11] is given by
where f (3) is the 3rd-order derivative of f . From (3), 2 (s, e) is a monotone increasing function of segment width e − s. Using this property, we partition a domain into as wide segments as possible such that the approximation error is less than the specified approximation error. Figure 2 shows the non-uniform segmentation algorithm. 1. Let s 0 = a and i = 0.
2.
Find a value p (≥ s i ) where 2 (s i , p) = a . 3.
If p > b, then let p = b.
4.
Let e i = p and i = i + 1.
5.
If p = b, then let t = i, and stop the process. 6.
Else, let s i = p, and go to step 2. the value p where 2 (s i , p) = a is difficult. Thus, we obtain the maximum value p satisfying 2 (s i , p ) ≤ a . Such p can be found by scanning values of n-bit input x. However, it requires O(2 n ) search, and is time-consuming. Therefore, we compute the maximum value p by specifying the bits of x from the MSB to the LSB such that 2 (s i , p ) ≤ a . This requires a search with time complexity O(n). In the computation of 2 (s i , p ), the value of max s i ≤x≤p | f (3) (x)| is computed by a nonlinear programming algorithm [7] .
Computation of the Approximate Value
For each [s i , e i ], f (x) is approximated by the corresponding quadratic function g i (x). That is, the approximated value
where the coefficients c 2i , c 1i , and c 0i are derived from the 2nd-order Chebyshev approximation polynomial [11] . Substituting x − q i + q i for x yields the transformation
This transformation reduces the multiplier size. Figure 3 shows the architecture that realizes (4). It has 7 units: the segment index encoder (which produces the segment index i for [s i , e i ] given the value x); the coefficients table (which stores −q i , c 2i , c 1i , and c 0i ); an adder (which produces x + (−q i )); a squaring unit; two multipliers; and the output adder.
Architecture for NFGs
A segment index encoder converts x into a segment index i. It realizes the segment index function seg f unc(x) : Fig. 4 (a) , where x has n bits, B = {0, 1}, and t denotes the number of segments. NFGs in which the most significant bits directly drive the address inputs of a coefficient memory need no segment index encoder. On the other hand, our NFGs based on non-uniform segmentation require this additional circuit. Although NFGs based on non-uniform segmentation have a smaller coefficients table than those based on uniform segmentation, the size of segment index encoder may have to be considered to provide a fair comparison. In [9] , to simplify the segment index encoder, a special non-uniform segmentation is used for √ − ln(x). However, it does not correspond to an optimum segmentation. To produce compact NFGs for a wide range of functions, it is important to guarantee that the size of the segment index encoder is reasonable for any non-uniform segmentation. Our synthesis system uses the LUT cascade [8] , [16] , [17] shown in Fig. 4 (b) to realize any given (optimum) seg f unc(x). In an LUT cascade, the interconnecting lines between adjacent LUTs are called rails. The size of an LUT cascade depends on the number of rails. Sasao et al. [17] have shown that the size of an LUT cascade depends on the number of segments. Specifically, Theorem 1: Let seg f unc(x) be a segment index function with t segments. Then, there exists an LUT cascade for seg f unc(x) with at most log 2 t rails. [18] shows that by using an LUT cascade, we can generate compact NFGs for a wide range of functions. In this paper, we significantly reduce memory sizes of the coefficients table and the LUT cascade by reducing the number of segments using the quadratic approximation. Our synthesis system uses heterogeneous MDDs (Multi-valued Decision Diagrams) [13] , [14] to find compact LUT cascades. An LUT cascade passes data from the leftmost LUT to the rightmost LUT. Since each LUT in an LUT cascade operates independently and concurrently, we can easily obtain a pipeline structure by assigning each LUT to one pipeline stage. By using LUTs with the same size, we can easily achieve an efficient pipeline processing. To the best of our knowledge, this is the first method for realizing any seg f unc(x).
Reduction of the Size of the Multiplier
To generate a fast NFG, reducing multiplier size is important. Since the size of multipliers depends on the number of bits for c 2i , c 1i , and x − q i , we reduce the number of bits to represent these values.
To reduce the number of bits for c 2i and c 1i , we use a scaling method [9] . We represent c 2i and c 1i as c 2i × 2 −l 2i × 2 l 2i and c 1i × 2 −l 1i × 2 l 1i , respectively. And, we compute the products c 2i (x − q i ) 2 and c 1i (x − q i ) using multipliers for the right-shifted multiplicand, c 2i ×2 −l 2i (x−q i ) 2 and c 1i ×2 −l 1i (x− q i ), and left shifters to restore the products. Applying right shifts reduces the number of bits for c 2i × 2 −l 2i and c 1i × 2 −l 1i (i.e. the multiplier sizes) by rounding the l 2i LSBs of c 2i and l 1i of c 1i , while increasing the rounding errors. In Appendix, we find the largest l 2i and l 1i for each segment that preserve the given acceptable error. When l 2i and l 1i are 0 for all the segments, we do not use scaling method.
Example 1: Consider a 24-bit precision NFG for √ − ln(x). By using the scaling method, c 2i , l 2i , c 1i , and l 1i have 13 bits, 5 bits, 20 bits, and 4 bits, respectively, and they produce an NFG with memory size: 173, 056 bits, operating frequency: 130 MHz, and 16 DSP units. On the other hand, without the scaling method, c 2i and c 1i have 37 bits and 28 bits, respectively, and they produce an NFG with larger memory size: 184, 832 bits, much lower operating frequency: 71 MHz, and more DSP units: 20 units.
(End of Example)
Next, we reduce the value of x − q i . The number of bits for x − q i influences the sizes of the squaring unit and multipliers. Thus, reducing the value of x − q i reduces the sizes of the squaring unit and multipliers, and also the error. From (4), we can choose any value for q i . To reduce the value of x − q i , for a segment [s i , e i ], we set q i = (s i + e i )/2. Then, we have |x − q i | ≤ (e i − s i )/2. Thus, reducing the segment width e i − s i reduces the value for x − q i . However, this also increases the number of segments, and results in increased memory size. We show a reduction method of segment width without increasing the memory size.
The coefficients table in Fig. 3 has 2 k words, where k = log 2 t and t is the number of segments. Therefore, we can increase the number of segments up to t = 2 k without increasing the memory size. From Theorem 1, the size of the LUT cascade also depends on the value of k. Thus, increasing the number of segments to t = 2 k rarely increases the size of the LUT cascade. We reduce the size of segments by dividing the largest segment into two equal sized segments up to t = 2 k . This reduces the rounding error as well (see Appendix). Table 1 compares the number of segments for various approximation methods for the functions in [17] . In this table, Entropy, Sigmoid, and Gaussian are
Experimental Results
Number of Segments and Computation Time
In Table 1 , the columns "Linear Uniform" and "Linear Non" show the number of uniform and non-uniform segments [18] for linear approximation, respectively. The columns "2nd-Chebyshev Uniform" and "2nd-Chebyshev Non" show the number of uniform and non-uniform segments for the 2ndorder Chebyshev approximation, respectively. The columns "Time" show the CPU time for our non-uniform segmentation algorithm applied to functions, in milliseconds. Table 1 shows that, for many functions, the 2nd-order Chebyshev approximations require many fewer segments than the linear approximations. However, for some functions, such as √ − ln(x), the 2nd-order Chebyshev approximation based on uniform segmentation requires many more segments than the linear and 2nd-order Chebyshev approximations based on non-uniform segmentations. Many existing polynomial approximation methods are based on uniform segmentation. For trigonometric and exponential functions, the methods based on uniform segmentation require relatively few segments. However, for some kinds of functions such as √ − ln(x), the uniform methods require excessively many segments, even if quadratic approximation is used. On the other hand, our quadratic approximation based on non-uniform segmentation requires many fewer segments for a wide range of functions. Also, Table 1 shows that the CPU time to compute the segmentation strongly depends on the number of segments. Smaller acceptable approximation error requires more segments and longer computation time. However, Table 1 shows that, for all functions in the table, the CPU times are shorter than 1 second when the acceptable approximation error is 2 −25 . These results show that, for various functions, our segmentation algorithm partitions a domain into fewer nonuniform segments quickly, and it is useful for fast synthesis.
Memory Sizes of Various NFGs
This section compares the memory sizes of our NFGs with three existing NFGs [3] , [4] , [18] . Table 2 compares with NFGs based on linear approximation and non-uniform segmentation shown in [18] . In Table 2 , the columns "R" show the following values: R = memory size of quadratic approximation memory size of linear approximation × 100 (%). Table 2 shows that NFGs using quadratic approximation require much smaller memory than ones using linear approximation. Especially, 24-bit precision NFGs using quadratic approximation can be implemented with, on average, only 4% of the memory size needed for a linear approximation. From the relation between precision and memory size shown in Table 2 , we can see that increasing the precision decreases the ratio of memory sizes in NFGs. Figure 5 shows a plot of the total memory size and the required precision for NFGs based on linear and quadratic approximations of 1/x. Since memory size of quadratic NFG increases more slowly than that of linear NFG, as the precision increases, we conjecture that for precisions higher than 24-bit, our quadratic NFGs will require reasonable memory size. Unfortunately, we could not verify that because of the precision of our NFG synthesis tool. Table 3 and Table 4 compare our NFGs with NFGs using a 5thorder Taylor expansion [3] and NFGs using 2nd-order min- imax approximation by the Remez algorithm [4] , respectively. Both approximations in [3] , [4] are based on uniform segmentation. Thus, their NFGs require no segment index encoder. On the other hand, since our approximation is based on non-uniform segmentation, the memory size is obtained by summing the memory sizes of the coefficients table and the segment index encoder. As shown in Table 1 , for trigonometric and exponential functions, the difference of the number of uniform segments and non-uniform segments is not so large under the same approximation polynomial. For such functions, NFGs based on uniform segmentation (needing no segment index encoder) often require smaller memory than non-uniform segmentations. Although our NFGs require the segment index encoder and use approximation polynomials with larger approximation error than approximation polynomials in [3] , [4] , our NFGs for such functions are implemented with only 22% to 52% of the memory sizes of NFGs in [3] , and with memory size comparable to [4] . In [3] , [4] , memory sizes of NFGs for √
x and √ − ln(x) are unavailable. However, from Table 1 , we can see that the memory size of their NFGs for √ x and √ − ln(x) is excessively large. On the other hand, our NFGs can realize a wide range of functions with small memory size.
FPGA Implementation
This section compares the FPGA implementation results of our NFGs with two existing NFGs [18] , [22] . Table 5 shows FPGA implementation results of our NFGs and NFGs based on linear approximation using non-uniform segmentation [18] .
Since the architecture of a linear NFG [18] is simpler than that of a quadratic NFG, linear NFGs are faster, and require fewer logic elements and DSP units than quadratic NFGs. However, linear approximations require more segments and larger memory than quadratic approximations, as shown in Table 1 and Table 2 . Table 5 shows that there are not enough resources in the smallest device in the Stratix family to achieve 24-bit precision using linear approximation for all functions, except Gaussian. This is due to the excessive memory size (although many logic elements and DSP units are unused). The most crucial issue in the FPGA implementation is the constraints on these hardware resources. For 24-bit precision, the linear approximation requires a larger FPGA due to the excessive memory size. However, in larger FPGAs, more logic elements and DSP units are left unused. On the other hand, a quadratic NFG can be implemented with a smaller FPGA, since it requires much smaller memory size than a linear NFG. In fact, we implemented 24-bit precision NFGs with lower cost and smaller FPGAs (Cyclone II), using quadratic approximation instead of linear approximation.
To show the performance of our NFGs, we compare with a well-known table-based NFG, STAM [22] , that uses 1st-order Taylor expansion and uniform segmentation. Table 6 compares memory size and operating frequency. Since the STAM requires tables and a multiple-input adder, but requires no multiplier, it is faster. However, it requires excessive memory size and therefore larger FPGAs. On the other hand, our NFGs achieve about 60% to 70% of the operating frequency of the STAM with much smaller memory size.
Conclusion and Comments
We have demonstrated an architecture and a synthesis method for compact NFGs for trigonometric, logarithmic, square root, reciprocal, and combinations of these functions. Our architecture can efficiently realize any non-uniform segmentation using a compact LUT cascade, and can approximate many numerical functions by quadratic polynomials. Therefore, our architecture is suitable for automatic synthesis of fast and compact NFGs. Experimental results show that our synthesis method can approximate a wide range of functions with a small number of non-uniform segments, and generate NFGs with small memory size. For 24-bit precision, our NFGs can be implemented with, on average, only 4% of the memory size of NFGs based on linear approximation and non-uniform segmentation, and with, on average, only 33% of the memory size of NFGs based on the 5th-order approximation and uniform segmentation. NFGs based on the linear approximation are faster than the quadratic ones, but for high-precision, they require a large FPGA due to the excessive memory size. On the other hand, our quadratic NFGs can achieve about 70% of the throughput of the table-based NFG using only a few percent of the memory, and can be implemented with a more compact and low-cost FPGA.
Our 24-bit precision NFGs can be used to compute a significand of an IEEE single-precision floating-point number. Since our NFGs are compact, we conjecture that our NFGs will be practical also for higher-precision than 24-bit. However, we could not verify the accuracy of NFGs with higher-precision than 24-bit due to the errors of NFG synthesis tool developed by C language. Thus, we have to develop a more accurate NFG synthesis tool and a verification tool in the future.
where α 2 , α 1 , and α 0 are rounding errors of c 2i , c 1i , and c 0i , respectively. In this case, we need no hardware for rounding the coefficients because it is precomputed before storing in the coefficients table. Since rounding yields a more accurate result than the truncation, we choose rounding. Error in Squaring Unit: When input x has m in bits, our synthesis system ensures that q i has also m in bits. Therefore, the value of (x − q i ) also has m in bits, and has no rounding error. Although the value of (x − q i ) 2 has 2m in bits, the value of (x − q i ) 2 is truncated to u 3 bits in order to reduce the size of succeeding multiplier, where u 3 ≤ 2m in . We choose truncation because it does not require additional hardware, while rounding requires half adders at the output of squaring unit. Thus, the succeeding multiplier uses the value of
where α 3 is the rounding error for the truncation.
Errors in Multipliers:
As shown in the two previous paragraphs, c 2i and c 1i are stored as c 2i + α 2 and c 1i + α 1 in the ROM, and (x − q i ) 2 is changed to (x − q i ) 2 − α 3 by truncation. These values change the original products c 2i (x − q i ) 2 and c 1i (x − q i ) to
The values of (A· 1) and (A· 2) have (u 2 + u 3 ) bits and (u 1 + m in ) bits, respectively, and they are truncated to u 0 bits in order to match the addition with c 0i . Note that u 0 ≤ min(u 2 + u 3 , u 1 + m in ). Following a method shown in [20] , the (u 0 + 1)th-fractional bit d −(u 0 +1) of (A· 1) is set to 1 after the truncation. Thus, the adder uses the values of
where α 4 and α 5 are the rounding errors for the truncations of (A· 1) and (A· 2
Errors in Adder:
From previous paragraphs, the original sum c 2i (x − q i ) 2 + c 1i (x − q i ) + c 0i is changed to
This value has u 0 bits, and is rounded to m out bits (output accuracy given in the specification), where m out ≤ u 0 . Thus, the value of g i (x) is changed to
where α 6 is the error for the rounding to m out bits. Since d −(u 0 +1) of (A· 1) is set to 1, d −(u 0 +1) of (A· 3) is also 1, and we have |α 6 | ≤ 2 −(m out +1) − 2 −(u 0 +1) . Note that d −(u 0 +1) is not implemented in hardware [20] . By expanding and rearranging this, we have
This is an output value of the NFG including rounding error. (A· 4) has the maximum rounding error when α 0 = −2 −(u 0 +1) , α 1 = −2 −(u 1 +1) , α 2 = −2 −(u 2 +1) , α 3 = 2 −u 3 − 2 −2m in , α 4 = 2 −(u 0 +1) − 2 −(u 2 +u 3 ) , α 5 = 2 −u 0 − 2 −(u 1 +m in ) , and α 6 = −(2 −(m out +1) − 2 −(u 0 +1) ), where we assume that the values of (x − q i ) and c 2i are positive. Therefore, the maximum rounding error r is
where max seg and max c 2 are the maximum values of |x − q i | and |c 2i |, respectively.
A.2 Calculation of Bit Sizes for Units
The number of bits for the integer part is calculated as log 2 (max value + 1) + 1, where max value is an integer, and denotes the maximum absolute value of the range. On the other hand, the number of bits for the fractional part is calculated using the result of the error analysis. From the error analysis, an NFG with an acceptable error is achieved when a + r < , where a and r are the maximum approximation error and the rounding error, respectively. To generate fast and compact NFGs, we find the minimum u 0 , u 1 , u 2 , and u 3 that satisfy this relation using nonlinear programming [7] that minimizes u 0 + u 1 + u 2 + u 3 with the constraint a + r < . Calculation of Shift Bits l 2i and l 1i : From (A· 1) and (A· 2), the maximum errors in the multipliers 1 and 2 are 1 = 2 −(u 2 +1) (max seg 2 − 2 −u 3 + 2 −2m in ) +max c 2 (2 −u 3 − 2 −2m in ) and 2 = 2 −(u 1 +1) max seg, respectively. Since max seg and max c 2 are the maximum values of |x − q i | and |c 2i |, respectively, there are positive integers l 2i and l 1i that satisfy the following relations for each segment:
We transform these into 2 −(u 2 −l 2i +1) {|x − q i | 2 − 2 −u 3 + 2 −2m in } +|c 2i |(2 −u 3 − 2 −2m in ) ≤ 1 and (A· 5) 2 −(u 1 −l 1i +1) |x − q i | ≤ 2 .
(A· 6)
From (A· 5) and (A· 6), for each segment, coefficients c 2i and
