Abstract-This paper proposes an architecture and a synthesis method for high-speed computation of fixed-point numerical functions such as trigonometric, logarithmic, sigmoidal, square root, and combinations of these functions. Our architecture is based on the lookup table (LUT) cascade, which results in a significant reduction in circuit complexity compared to traditional approaches. This is suitable for automatic synthesis and we show a synthesis method that converts a Matlab-like specification into an LUT cascade design. Experimental results show the efficiency of our approach as implemented on a field-programmable gate array (FPGA).
Ç

INTRODUCTION
N UMERICAL functions such as trigonometric, logarithmic, square root, reciprocal, and combinations of these functions are extensively used in computer graphics, digital signal processing, communication systems, robotics, astrophysics, fluid physics, and so forth. To compute elementary functions, iterative algorithms such as the COordinate Rotation DIgital Computer (CORDIC) algorithm [1] , [35] have often been used. Although the CORDIC algorithm achieves accuracy with compact hardware, its computation time is proportional to the precision (that is, the number of bits) [34] . For a function composed of elementary functions, the CORDIC algorithm is slower since it has to compute each elementary function sequentially. It is too slow for numerically intensive applications.
Implementation by a single lookup table (LUT) for a numerical function fðxÞ is simple and fast. For lowprecision computations of fðxÞ (for example, x and fðxÞ have 8 bits), this implementation is efficient since the size of the LUT is small. For high-precision computations, however, the single LUT implementation is impractical due to the huge table size.
To reduce the memory size, polynomial approximations have been used [4] , [6] , [13] , [14] , [19] , [30] , [31] , [32] , [33] , [36] . These methods approximate the given numerical functions by piecewise polynomials and realize the polynomials with hardware. Linear or quadratic approximations offer fast and relatively high-precision evaluation of numerical functions. However, most existing methods are intended only for standard elementary functions and no systematic method for arbitrary functions exists. This paper considers fast and compact numerical function generators (NFGs) for arbitrary functions and proposes an architecture and a systematic synthesis method for them. Our architecture and synthesis method use linear approximations and are applicable not only to continuous functions but also to noncontinuous functions. The circuit's compactness is due, in large part, to the LUT cascade [12] , [25] , [26] . Our synthesis method is automated so that fast and compact NFGs can be produced by nonexperts. Fig. 1 shows our synthesis system. The user specifies only the numerical function fðxÞ, the domain over x, and accuracies for input x and output fðxÞ. Our system can accept an arbitrary numerical function expressed either algebraically (for example, sinðxÞ) or as a table of input/ output values. The user defines the numerical function by using the syntax of Scilab [29] , a free Matlab-like numerical software. This is applied directly to our system, along with the domain and accuracy. The user can either use a defined function in Scilab or specify it directly. Note that, by changing the parser of our system, any format can be used for the design entry.
First, our system partitions the given domain over x into segments and generates the begin/end points of all the segments. Next, an error analysis is performed to determine the precision of the various internal computations, as well as the word width of the memory needed to store the various coefficients. The smallest internal precision needed to achieve the (external) accuracy specified by the user is chosen. Finally, the Hardware Description Language (HDL) code is generated, which is then applied to a vendorsupplied tool that produces the field-programmable gate array (FPGA) implementation.
This paper is organized as follows: Section 2 introduces terminology. Section 3 discusses a piecewise linear approximation for numerical functions and proposes a nonuniform segmentation algorithm. Section 4 shows two different architectures for NFGs. Section 5 presents an example NFG design. Section 6 evaluates the performance of our architecture and synthesis system. The error analysis for NFGs is discussed in the Appendix.
PRELIMINARIES
Number Representation and Precision
Definition 1. The binary fixed-point representation of a value r has the form
where d i 2 f0; 1g, l is the number of bits for the integer part, and m is the number of bits for the fractional part of r. This representation is two's complement and, so,
Definition 2. Error is the absolute difference between the exact value and the value produced by the hardware. Approximation error is the error caused by a function approximation. Rounding error is the error caused by a binary fixed-point representation. It is the result of truncation or rounding, whichever is applied. However, both operations yield an error that is called rounding error. Acceptable error is the maximum error that an NFG may assume. Acceptable approximation error is the maximum approximation error that a function approximation may assume.
Definition 3. Precision is the total number of bits for a binary fixed-point representation. Specially, n-bit precision specifies that n bits are used to represent the number, that is, n ¼ l þ m. An n-bit precision NFG has an n-bit input.
Definition 4. Accuracy is the number of bits in the fractional part of a binary fixed-point representation. Specially, m-bit accuracy specifies that m bits are used to represent the fractional part of the number. In this paper, an m-bit accuracy NFG is an NFG with an m-bit fractional part of the input, an m-bit fractional part of the output, and a 2 Àm acceptable error. 
Àm is added to the result of the truncation. That is, if r is represented as
then r is rounded to
The error caused by rounding is at most 2 Àðmþ1Þ .
Truncation can cause a larger error than rounding. On the other hand, truncation requires less hardware. In our architecture, we use both truncation and rounding. This is discussed in the Appendix.
Binary Decision Diagram (BDD)
Definition 7. A binary decision diagram (BDD) [2] is a rooted directed acyclic graph representing a logic function f0; 1g n ! f0; 1g. The BDD is obtained by repeatedly applying the Shannon expansion to the logic function. It consists of two terminal nodes representing function values 0 and 1, respectively, and nonterminal nodes representing input variables. Each nonterminal node has two outgoing edges, namely, 0-edge and 1-edge, that correspond to the values of input variables.
Definition 8.
A multiterminal BDD (MTBDD) [5] is an extension of the BDD and represents an integer function f0; 1g n ! Z, where Z is a set of integers. The MTBDD has multiple terminal nodes representing integer values.
Example 1. Fig. 2a shows an integer function of five variables. It represents a segment index function, which will be explained in Section 4.2. Fig. 2b shows an MTBDD for the integer function. In Fig. 2b , dashed lines and solid lines denote 0-edges and 1-edges, respectively. In the MTBDD, terminal nodes represent function values. Thus, to evaluate the function, we traverse the MTBDD from the root node to a terminal node according to the input values and obtain the function value (an integer) at a terminal node. This MTBDD will be used for the design of the segment index encoder, as discussed in Section 4.2. 
PIECEWISE LINEAR APPROXIMATION BASED ON NONUNIFORM SEGMENTATION
Segmentation Problem
A straightforward method to realize a given function fðxÞ is to use a single memory in which the address is the binary representation of the value of x and the content of that address is the corresponding value of fðxÞ. However, with this simple approach, the memory size can be very large. For example, if x and fðxÞ are represented by 24-bit binary numbers, then 2 24 Â 24 ¼ 48 Mbytes of memory is needed. In addition, memory is not used efficiently since adjacent locations contain nearly the same value of fðxÞ.
A more efficient realization takes advantage of the last observation. Divide a domain over x into segments and realize fðxÞ in each segment as the linear function c 1 x þ c 0 . In this way, the number of memory locations is the number of segments. Although it is now necessary to store two numbers, c 1 and c 0 , the total memory needed is typically much lower.
Many existing methods [4] , [6] , [13] , [14] , [19] , [30] , [31] , [32] , [33] , [36] use uniform segmentation, which divides a domain over x into segments with the same size. In such a segmentation, the segment size is determined by the least significant bits of x. In this case, the most significant bits can be used to specify a segment number, which is used to find a memory location that stores c 1 and c 0 . The number of uniform segments is determined by the smallest segment size needed to achieve the specified approximation error. One then must choose the size of all segments to be the same as this smallest size. This can yield too many segments, depending on the function. Since the memory size depends on the number of segments, one seeks a segmentation with the fewest segments.
A more sophisticated approach is to choose all segment sizes to be as large as possible while maintaining the specified approximation accuracy. In this case, the segments are likely to have different widths. Cantoni [3] has proposed an algorithm to approximate a given function by using nonuniform segments. This introduces a problem of designing an additional circuit that maps values of x to a segment number. Potentially, this is a complex circuit.
To simplify the additional circuit, Lee et al. [16] have proposed a special nonuniform segmentation called the hierarchical segmentation scheme. The hierarchical segmentation partitions a domain into nonuniform segments by using one of four segmentation types, {P2S(US), P2SL(US), P2SR(US), US(US)}. Since these nonuniform segmentations can be realized by simple circuits, the hierarchical segmentation method results in fewer segments, as well as faster and more compact NFGs, than produced by existing uniform segmentation methods. However, in their method, the designer has to choose the segmentation type and segment widths for the given function by trial and error.
We seek a method in which the optimum segmentation is generated automatically, with no input from the user. It was shown in [26] that an LUT cascade can compactly realize any nonuniform segmentation in which the memory size depends on the number of segments. As a result, for fast and compact NFG design, we can use any nonuniform segmentation algorithm, such as Cantoni's algorithm [3] , and can design the nonuniform segmentation circuit for the given function. , where x has a 5-bit accuracy and the acceptable approximation error is 2 À7 . The number of uniform segments is 32, whereas the number of nonuniform segments is 6.
Note that the segmentation shown in Fig. 3b is identical to the segmentation represented by the table and MTBDD shown in Fig. 2 .
Nonuniform Segmentation Algorithm
The algorithms proposed by Cantoni [3] find the piecewise linear functions with the minimum approximation error for the given number of segments. However, we want to find the fewest segments to approximate a given function for a given approximation error. Thus, we use another algorithm. Fig. 4 presents a heuristic nonuniform segmentation algorithm based on the Douglas-Peucker algorithm, which has been used in rendering curves for graphics displays [9] . Since this is a dedicated algorithm for piecewise linear approximation, this cannot be directly extended to higher order polynomial approximations. However, since this algorithm is simple and robust, it can be applied to a wide range of functions (even noncontinuous functions) and it is fast. The inputs of the segmentation algorithm are a numerical function fðxÞ, a domain ½a; b over x, an accuracy m in of x, and an acceptable approximation error " a . Note that " a has to be smaller than 2
Àðmoutþ1Þ to produce an m out -bit accuracy NFG because the total rounding error is larger than 2
Àðm out þ1Þ (see the Appendix). In our system, " a is set to 2
Àðmoutþ2Þ by default, but the user can choose any " a < 2
Àðmoutþ1Þ . This algorithm begins by forming one segment over the whole domain ½a; b. This is an initial piecewise approximation by a linear function whose end points are ða; fðaÞÞ and ðb; fðbÞÞ. If the current segment fails to provide an acceptable approximation error, then it is partitioned into two segments joined at a point ðp; fðpÞÞ where the maximum error occurs. By iterating this process until the two subsegments approximate fðxÞ to within the acceptable approximation error, this algorithm finds the nonuniform segmentation with a small number of segments. Note that this algorithm does not always produce the fewest segments because it is a heuristic. In the DouglasPeucker algorithm, once a point is chosen, it never moves. For example, in the sinðxÞ function for 0 x 1, the point chosen after the two end points is x ¼ 1=2. Therefore, if the optimum segmentation requires a segment with x ¼ 1=2 internal to the segment, then the optimum segmentation is not achievable. However, the Douglas-Peucker algorithm will place many segments in regions where the function is nonlinear and fewer segments where it is close to linear. Therefore, it is close to optimum. The correction values v i are used to reduce the approximation errors. In Fig. 4 , max fg and min fg denote the maximum positive error and the maximum negative error, respectively. These errors are equalized by a vertical shift of linear function gðxÞ with v i . This vertical shift can produce minimax approximations when fðxÞ is convex or concave on the segment ½s; e. Remez's algorithm is usually used to obtain minimax approximations [20] . However, our algorithm can produce minimax approximations without using Remez's algorithm because our algorithm is based on a linear approximation [21] . In Fig. 4 , p max and p min can be found by scanning values of x over ½s; e. However, it is time consuming. We use a nonlinear programming algorithm [11] to find these values efficiently.
Example 3. Fig. 5 shows the segmentation process when the algorithm in Fig. 4 is applied to ffiffiffi x p over the domain [0, 1). First, we compute a linear function gðxÞ and find p max and p min (Fig. 5a ). In this case, the entire function is approximated as a single line and p max is the point causing the maximum error. In the next iteration of the algorithm, the function is approximated by two segments joined at p max .
Then, we compute the correction value v and the approximation error (Fig. 5b) . If the approximation error is larger than the given acceptable approximation error " a , then the segment is partitioned at the point p max that causes the maximum error (Fig. 5c) . Finally, the nonuniform segmentation is obtained by iterating this step recursively (Fig. 5d ). 
Computation of Approximated Values
For each segment ½s i ; e i , the numerical function fðxÞ is approximated by the corresponding linear function g i ðxÞ.
That is, the approximated value y of fðxÞ is computed as
By substituting c 0i into (1) and, simplifying, we have
Note that, in this form of the piecewise linear approximation, x is offset by s i , which is the start of the segment.
Comparing (2) with (1) reveals that ðx À s i Þ will be (often much) smaller than x. As a result, a smaller multiplier is needed to realize (2) than (1). This results in a lower multiplication delay. When fðxÞ is convex or concave on the segment ½s i ; e i , this linear function is identical to the minimax approximation to fðxÞ [21] . For many functions, there are few inflection points and, so, in most segments, the function is entirely concave or convex. Therefore, our piecewise linear approximation yields an accurate representation of the given function with few segments.
ARCHITECTURE FOR NFGS 4.1 Architectures Based on Nonuniform and
Uniform Segmentations Fig. 6 shows two architectures for the NFGs realizing (2). Fig. 6a shows an architecture based on nonuniform segmentation. It uses five units: the segment index encoder that produces the segment index i for ½s i ; e i , given the value x, the coefficients table for Às i , c 1i , and fðs i Þ þ v i , the adder for x þ ðÀs i Þ, the multiplier, and the output adder. Fig. 6b shows an architecture based on uniform segmentation. In uniform segmentation, the segment index i of ½s i ; e i and the value of ðx À s i Þ can be obtained from the most significant bits and the least significant bits of x, respectively. Therefore, the architecture in Fig. 6b does not use the segment index encoder. that is, the function realized by the segment index encoder, where x has n-bit precision and t denotes the number of segments. This is used in the case of nonuniform segmentation. It converts a value of x into a segment index, which is then applied to the address inputs of the coefficients memory. Fig. 7b shows the circuit. Each block labeled LUT is a combinational logic circuit. Thus, the complete circuit is combinational logic. For all but the leftmost LUT, part of the input comes from the output of another LUT and the other part of the input comes from x. For the leftmost LUT, all inputs come from x. The outputs of the rightmost LUT are the segment index encoder outputs and they encode the segment index. These outputs and all connections between LUTs are called rails. With a fixed number of input lines representing x, the number of rails determines the complexity of this circuit. It was shown in [26] that the LUT cascade in Fig. 7b has reasonable complexity.
Architecture of Segment Index Encoder
Specifically, [26] shows the following lemma:
Lemma 1. Let seg funcðxÞ be a segment index function with t segments. Then, there exists an LUT cascade for seg funcðxÞ with at most dlog 2 te rails.
From Lemma 1, we have the following theorem:
Consider a segment index function seg funcðxÞ : f0; 1g n ! f0; 1; . . . ; t À 1g, where n is the precision of x and t is the number of segments. Then, seg funcðxÞ can be implemented by an LUT cascade, as shown in Fig. 7 , where each LUT has k þ 2 inputs and k outputs (rails), the number of LUTs is d nÀk 2 e, the total memory size is 2 kþ1 kðn À kÞ bits, and k ¼ dlog 2 te.
Proof. By Lemma 1, the number of the rails of the LUT cascade is at most k ¼ dlog 2 te. In addition, we can easily see that the LUT cascade consists of d
When ðn À kÞ is an odd number, the memory size of the rightmost LUT is 2 kþ1 Â k bits. Thus, the total memory size of the LUT cascade is
Hence, we have the theorem. t u
The above theorem shows that the memory size of the LUT cascade depends only on the precision n and the number of segments t. Thus, an approximation with fewer segments requires a smaller segment index encoder. We can use Theorem 1 to estimate the memory size of the segment index encoder from n and t. This will be shown in Section 4.4.
The LUT cascade is obtained by functional decomposition using an MTBDD for seg funcðxÞ [12] , [25] , [26] . Our synthesis system first finds a nonuniform segmentation, then generates the MTBDD, and, finally, decomposes it to find the LUT cascade.
Example 5. Consider the segment index function in Fig. 2a .
Let the binary fixed-point representation of x be ð: Fig. 2b is the corresponding MTBDD. By decomposing the MTBDD in Fig. 2 , we obtain two different LUT cascades in Fig. 8 . Fig. 8 illustrates the relationship between each LUT and decompositions of the MTBDD. In these figures, the "r i " in each LUT denotes the rails that represent subfunctions in the MTBDD. In the MTBDD, numbers assigned to edges that cut across the horizontal lines represent subfunctions. The LUT cascade in Fig. 8a requires a memory size of 2 3 Â 3 þ 2 4 Â 3 þ 2 4 Â 3 ¼ 120 bits and three LUTs. On the other hand, the LUT cascade in Fig. 8b requires a memory size of 2 4 Â 3 þ 2 4 Â 3 ¼ 96 bits and two LUTs. Note that the best realization is the LUT cascade consisting of a single LUT. As shown in Theorem 1, since k ¼ dlog 2 6e ¼ 3, there exists an LUT cascade consisting of a single LUT with 3 þ 2 inputs and 3 outputs. Its memory size is 2 5 Â 3 ¼ 96 bits.
As shown in Example 5, the memory size and the number of LUTs in an LUT cascade also depend on the decomposition and the variable order of the MTBDD. To find the best LUT cascade, we use an optimization algorithm for heterogeneous multivalued decision diagrams (MDDs) [22] , [23] . In an FPGA implementation of an NFG, each LUT in an LUT cascade is implemented with a block RAM in an FPGA. In this case, our synthesis system decomposes an MTBDD so that each LUT has the same size as the available block RAM (for example, 4 or 18 Kbits). Note that this automatically produces a pipelined implementation of an LUT cascade since block RAMs in a modern FPGA are synchronous.
To the best of our knowledge, this is the first realization method for any general segment index function.
Size Reduction of Multiplier
The coefficients table in Fig. 6a has 2 k words, where k ¼ dlog 2 te, and t is the number of segments. FPGA technology restricts memory sizes to a power of 2. Therefore, if t < 2 k , then we can increase the number of segments up to t ¼ 2 k , without increasing the memory size. From Lemma 1, the size of the LUT cascade also depends on the value of k rather than t. Thus, increasing the number of segments to t ¼ 2 k rarely increases the size of the LUT cascade. To reduce the multiplier size, we reduce the maximum value of ðx À s i Þ (that is, the maximum width of segments ½s i ; e i that is defined as ðe i À s i Þ) by dividing the widest segment into two equal-sized segments up to t ¼ 2 k . This reduction of the segment width reduces the rounding error as well (see the Appendix).
To reduce the multiplier size further, we use the scaling method shown in [15] . We represent c 1i as c 1i ¼ c 1i Â 2 Àh i Â 2 hi and compute c 1i ðx À s i Þ by using the multiplier for the right-shifted multiplicand c 1i Â 2 Àhi ðx À s i Þ and a left shifter to restore the term. Applying a right shift reduces the number of bits for c 1i Â 2
Àh i (that is, the multiplier size) by rounding the least significant h i bits of c 1i while increasing the rounding error. In the Appendix, we find the largest h i for each segment i that preserves the given acceptable error. When the largest h i s are 0 for all of the segments, we do not use the scaling method. For simplicity, Fig. 6a omits the scheme for the scaling method.
Note that these techniques utilizing the variation in segment sizes do not apply to the architecture based on uniform segmentation.
Example 6. Consider a 16-bit-precision NFG for ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi À lnðxÞ p for 0 < x 1 and its FPGA implementation with the Altera Stratix EP1S20F484C5. By using the scaling method, c 1i and h i have 16 and 3 bits, respectively, and they produce an NFG with 241 logic elements (LEs), two multipliers (digital signal processors (DSPs)), a memory size of 168,960 bits, an operating frequency of 208 MHz, and a delay of 38 ns. On the other hand, without the scaling method, c 1i has 22 bits and it produces an NFG with fewer LEs (153 LEs) but more DSPs (eight DSPs), a larger memory size (172,032 bits), a lower operating frequency (130 MHz), and a longer delay (54 ns).
Memory Size Estimate for Each Architecture
Our synthesis system estimates the memory size for each of the two architectures in Fig. 6 and generates the HDL code that implements the architecture with the smaller memory size. That is, our synthesis system chooses an architecture based on the estimate of the memory size. The estimate is discussed below.
Memory Size Estimate for Nonuniform Segmentation
The coefficients table in Fig. 6a has 2 k words, where k ¼ dlog 2 te, and t is the number of nonuniform segments obtained by the algorithm in Fig. 4 . We assume that three coefficients, Às i , c 1i , and fðs i Þ þ v i , have m out bits, respectively, where m out is the output accuracy of NFG given by the specification. Then, the memory size of the coefficients table is estimated as
We assume that when we generate an LUT cascade with the minimum memory size, each LUT in the LUT cascade has k þ 2 inputs and k outputs (the number 2 of k þ 2 is based on the results in [28] ). Then, from Theorem 1, the memory size of the LUT cascade (segment index encoder) is estimated as where n in is the input precision of NFG given by the specification. The total memory size of an NFG based on nonuniform segmentation is estimated as (3) + (4):
Memory Size Estimate for Uniform Segmentation
First, we estimate the number of uniform segments by using the nonuniform segmentation results obtained by the algorithm in Fig. 4 . Let ½a; b and ½s min ; e min be the given domain and the narrowest segment produced by the algorithm, respectively. Then, the number of uniform segments t u is estimated as
where
and m in is the input accuracy of the NFG given by the specification. Thus, the coefficients table in Fig. 6b has 2 ku words, where k u ¼ dlog 2 t u e. We assume that the coefficients have the same number of bits as in the nonuniform case. Then, the memory size of the coefficients table is estimated as
This is the total memory size estimate of the NFG based on uniform segmentation. Note that, when k u ¼ n in , the total memory size is estimated as a single LUT of size 2 n in n out , where n out is the output precision.
EXAMPLE DESIGN
In this section, we show in detail the design flow of a 5-bit precision (5-bit accuracy) NFG based on nonuniform segmentation for ffiffiffi x p for 0 x < 1. For simplicity, we do not use the multiplier reduction techniques shown in Section 4.3.
First, we partition the given domain 0 x < 1 into nonuniform segments by using the algorithm shown in Fig. 4 . This produces six segments, as shown in Fig. 3b . Note that the acceptable approximation error is set to 2 Àð5þ2Þ ¼ 2 À7 , as discussed in Section 3.
2. Then, we analyze the errors of the NFG and compute the number of bits needed to store coefficients by the method shown in the Appendix. Applying this analysis yields the following required number of bits: s i , 4 bits; c 1i , 9 bits (3 integer bits and 6 fractional bits); and c 0i , 7 bits (7 fractional bits).
Finally, we design the NFG shown in Fig. 6a by using these required numbers of bits. From the segments produced by the algorithm in Fig. 4 , we generate the MTBDD, as shown in Fig. 2 . By decomposing the MTBDD as shown in Fig. 8b , we obtain the LUT cascade. Moreover, from the segments and correction values, we generate the linear functions and then the memory data for the coefficients table. Since the number of segments is six, the memory size of the coefficients table is
and the memory size of the LUT cascade is
Thus, the total memory size of the NFG is 256 bits. Fig. 9 shows the generated NFG. Our synthesis system automates this design flow. Table 1 shows that, for all functions, nonuniform segmentation yields fewer segments than uniform segmentation. Especially for logarithmic, square root, reciprocal, and their compound functions, the number of segments needed in nonuniform segmentation is only a few percent of the number of segments needed in uniform segmentation. However, for the sinðxÞ and e x functions, the additional segments needed in a uniform segmentation is not so large-only about twice that needed for a nonuniform segmentation. This suggests that uniform segmentation is a better choice for these functions. Many existing polynomial approximation methods are based on uniform segmentation. For the sinðxÞ and e x functions, the methods based on uniform segmentation require relatively few segments and, therefore, a small memory size. However, for logarithmic, square root, reciprocal, and their compound functions on the domains listed in Table 1 , the uniform method requires an excessive number of segments. For standard elementary functions such as 1=x, lnðxÞ, and ffiffiffi x p , range reduction [20] is often used to translate the given domain into the domain suitable for uniform segmentation and to reduce the number of segments. Thus, the uniform method requires range reduction for these functions even if the given domain is a small range, as shown in Table 1 . On the other hand, the nonuniform method approximates a wide range of functions with fewer segments without using range reduction. That is, our method requires no range reduction when the given domain is not large. This is useful for nonexperts who are unfamiliar with range reduction and for functions for which there is no known range reduction technique, as in the case of compound functions.
EXPERIMENTAL RESULTS
Comparison of the Number of Uniform and Nonuniform Segments
In addition, Table 1 shows that the CPU time to compute the segmentation is strongly dependent on the number of segments. Therefore, a smaller acceptable approximation error requires more segments and longer CPU time. However, for all of the functions in Table 1 , the CPU times needed to compute nonuniform segmentations are shorter than 2 sec when the acceptable approximation error is 2 À25 . These results show that, for various functions, our segmentation algorithm generates fewer nonuniform segments quickly and it is useful for automatic synthesis. Table 2 compares the memory sizes for the two architectures shown in Fig. 6 . These correspond to nonuniform and uniform segmentations. In Table 2 , the values under the column "Estimated mem. size" are obtained by the estimates derived in Section 4.4. As shown in Section 4.2, the memory size of the LUT cascade depends on the decomposition and variable order of the MTBDD. In this experiment, we used the optimum decomposition and variable order that minimize the memory size [22] , [23] . Table 2 shows that, for logarithmic, square root, reciprocal, and their compound functions, the memory size for nonuniform segmentation is smaller than that for the uniform segmentation. On the other hand, for sinðxÞ and e x functions, the memory size for uniform segmentation is smaller than that for the nonuniform segmentation. As shown in Table 1 , for sinðxÞ and e x functions, the difference between the numbers of uniform and nonuniform segments is not so large. Thus, for such functions, the architecture based on uniform segmentation requires a smaller memory size than that based on nonuniform segmentation because it needs no segment index encoder.
Memory Sizes for Two Architectures
In addition, Table 2 shows that the estimated memory sizes are accurate. Designing the two architectures for each function can take minutes of CPU time. To reduce the design time, our synthesis system first estimates the memory sizes, then selects the better architecture, and, finally, designs only the better one.
As mentioned in the previous section, for 1=x, lnðxÞ, and ffiffiffi x p , the number of uniform segments can be reduced by using range reduction, thus reducing the memory size for uniform segmentation. Even in such cases (reduced domains), our memory size estimates are accurate and, so, our synthesis system accommodates range reduction.
FPGA Implementation
We implemented 16-bit-precision NFGs by using the Altera Stratix EP1S20F484C5 FPGA. Note that both of our architectures contain only combinational logic. For application-specific integrated circuit (ASIC) implementations, such a design yields the smallest delay. It represents one extreme in a range of NFG designs. However, in an FPGA implementation, this form does not always yield the smallest delay due to the interconnection delay. The pipelined design often achieves higher performance. In addition, modern FPGAs have synchronous block RAMs and this requires pipelined implementations. Thus, we adopt another extreme design, that is, a fully pipelined design. In this design, all elements, LUTs, adders, multipliers, and memory are buffered. Note that the delay of a small unit such as an adder is much smaller than a larger combinational logic circuit. Table 3 compares the FPGA implementation results of two architectures shown in Fig. 6 . In this table, the "Delay" columns show the total delay time of each NFG from the input to the output in nanoseconds. These results correspond to the delays of combinational NFGs.
The NFGs based on uniform segmentation require fewer pipeline stages and have shorter delays than the nonuniform segmentation because they have no segment index encoder. However, for logarithmic, square root, and reciprocal functions, the NFGs based on uniform segmentation are not so easily implemented in an FPGA due to the excessive memory size. Table 3 shows that they cannot be mapped into the FPGA due to the insufficient memory blocks. Note that NFGs that have only one pipelined stage in Table 3 are realized with a single LUT due to the excessively many segments. On the other hand, for various functions, the NFGs based on nonuniform segmentation achieve a high operating frequency.
To show the efficiency of our automatic synthesis system, we compare our NFGs with the ones reported in [15] and [16] . NFGs in [15] are based on nonuniform segmentation and linear approximation and they are manually designed. Moreover, NFGs in [16] are not only also based on nonuniform segmentation, but also on quadratic approximation. We generated the NFGs with the same precision as in [15] and [16] . Note that, in [15] , only one result for three functions is reported. Thus, the detailed result for each function is unavailable. Table 4 compares FPGA implementation results for the NFGs. In our NFGs, the segment index encoders are realized only by RAMs. Thus, our NFGs require more block RAMs than the NFGs in [15] and [16] . However, our NFGs require fewer slices and achieve a shorter delay time. In [16] , the 24-bit-precision NFGs for x lnðxÞ and the humps functions are implemented with the FPGA (XC2V4000) in Table 4 . On the other hand, our 24-bit NFGs for those functions require a larger FPGA (Virtex-II Pro, XC2VP70) since our NFGs are based on linear approximation and require more block RAMs. Although NFGs based on linear approximation require a larger memory size than NFGs based on quadratic approximation, they are faster. This is because linear approximation requires only one multiplier [24] .
CONCLUSION AND COMMENTS
We have proposed an architecture and a synthesis method for programmable NFGs for trigonometric functions, logarithm functions, square root, reciprocal, and so forth. Our architecture uses the LUT cascade to compactly realize various numerical functions and is suitable for automatic synthesis. To the best of our knowledge, this is the first architecture appropriate for any nonuniform segmentation. Experimental results show that our synthesis system 1) efficiently implements NFGs for wide range of numerical functions, 2) generates more suitable NFGs for the given functions between two architectures by using an accurate estimate of memory sizes, and 3) generates the NFGs with comparable performance to manually designed ones.
Since our NFGs are based on linear approximation, the practical precision for an FPGA implementation is up to 24 bits because of the memory size. Therefore, our method is useful to generate fast NFGs for a wide range of functions in up to 24-bit precision.
APPENDIX A
We analyze the error for our NFGs and show a method to obtain the appropriate bit sizes for the units in our architectures. Our synthesis system applies the error analysis method proposed in [8] . We show only the error analysis for the NFG using nonuniform segmentation in Fig. 6a . The analysis for the NFG using uniform segmentation in Fig. 6b is similar.
A.1 Error of NFG
In our NFG, there are two kinds of errors: approximation error and rounding error. The approximation error, " a , was discussed in Section 3. Thus, this section focuses on rounding error. Since truncation and rounding occur only in the fractional bits, we ignore the integer bits in this analysis. We assume that there is at least one fractional bit.
Errors in coefficients. The coefficients, c 1i and fðs i Þ þ v i , used in the linear approximation g i ðxÞ ¼ c 1i ðx À s i Þ þ fðs i Þ þ v i are rounded to u 1 and u 0 bits, respectively. That is, due to rounding, the coefficients c 1i and fðs i Þ þ v i become
where 1 and 0 are rounding errors of c 1i and fðs i Þ þ v i , respectively. In this case, we need no hardware for rounding coefficients because they are precomputed before storing in the coefficients table. Since rounding yields a more accurate result than truncation, we choose rounding.
Errors in multiplication.
When input x has m in bits, our synthesis system ensures that s i also has m in bits. Therefore, the result of the subtraction operation ðx À s i Þ also has m in bits and has no rounding error. As shown in the previous paragraph, the value of c 1i is stored as c 1i þ 1 in the coefficients table. Thus, the original product c 1i ðx À s i Þ is changed to
The value of (i) has ðu 1 þ m in Þ bits and it is truncated to u 0 bits in order to match the addition with fðs i Þ þ v i . Note that u 0 u 1 þ m in . We choose truncation because it does not require additional hardware, whereas rounding requires half adders at the multiplier output. Following the method shown in [30] , the ðu 0 þ 1Þth fractional bit d Àðu 0 þ1Þ in (i) is set to 1 after truncation. The rounding error 2 for truncation from ðu 1 þ m in Þ to u 0 bits is 0 2 2 Àðu0þ1Þ À 2 Àðu1þminÞ :
Thus, the linear part c 1i ðx À s i Þ of the approximation has the value
This is added to the constant coefficient fðs i Þ þ v i to form the complete approximation. Errors in addition. From the previous paragraphs, the original sum c 1i ðx À s i Þ þ fðs i Þ þ v i is changed to
This value has u 0 bits and this is rounded to m out bits (the output accuracy given in the specification), where m out u 0 . Since d Àðu0þ1Þ in (i) is set to 1, d Àðu0þ1Þ in (ii) is also 1. Note that d Àðu0þ1Þ is not implemented in hardware [30] . The error 3 for this rounding is
Thus, the value of the approximation g i ðxÞ is
It follows that
is the total rounding error for an NFG implemented with nonuniform segmentation. The absolute value of (iii) is maximum when the rounding errors are 0 ¼ À 
and t is the number of segments.
A.2 Calculation of Bit Sizes for Units
For any unit in Fig. 6a , the number of bits in the integer part is
where max value is an integer that denotes the maximum absolute value of output of each unit. That is, max value ¼ maxðjmax valj; jmin valjÞ, where max val and min val are the maximum and minimum values, respectively, of the unit's output. On the other hand, the number of bits for the fractional part is calculated using the result of the error analysis. From the error analysis, an NFG with an acceptable error " is achieved when
where " a and " r are the approximation error and the rounding error, respectively. To generate fast and compact NFGs, we find the minimum u 0 and u 1 that satisfy (iv) by using nonlinear programming [11] , where the objective and constraint are
Calculation of shift bit h i . From (i), the maximum error in the multiplication is Academic, 1993 Academic, , 1996 Academic, , 1999 . For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
