Abstract-In this paper, we introduce the edge-valued binary decision diagram (EVBDD) to reduce the memory and delay in numerical function generators (NFGs). An NFG realizes a function, such as a trigonometric, logarithmic, square root, or reciprocal function, in hardware. NFGs are important in, for example, digital signal applications, where high speed and accuracy are necessary. We use the EVBDD to produce a fast and compact segment index encoder (SIE), which is a key component of our NFG. We compare our approach with NFG designs based on multi-terminal BDDs (MTBDDs), and show that the EVBDD produces SIEs that have, on average, only 7% of the memory and 40% of the delay of those designed using MTBDDs. As a result, our NFGs based on EVBDDs have, on average, only 38% of the memory and 59% of the delay of NFGs based on MTBDDs.
I. INTRODUCTION
There has been significant interest recently in the realization of numeric functions, like sin(πx), ln(x), 1/x, and $\sqrt{x}$, by high-speed logic circuits. This is due, in part, to the availability of large quantities of inexpensive, programmable logic in FPGAs and, in part, to the development of realization methods based on polynomial approximations [3, 5, 6, 8, 16, 22, 23]. Until the last few years, the dominant approach has been to partition the domain into uniform segments. Within each segment, a linear [23] or higher-order [3, 5, 6, 8, 16, 22] approximation is used to represent the function. It has been shown [7] that linear approximation is well suited for certain 'simple' functions like $2^x$, sin(πx), and cos(πx), but is inappropriate for 'complex' functions like $\sqrt{x}$ and the entropy function, $-(x \log_2 x + (1 - x) \log_2 (1 - x))$.
For complex functions, optimum non-uniform segmentation produces tractable realizations [2]. In this method, segments are chosen as wide as possible, while still achieving the specified accuracy. Typically, narrow segments are used where the function changes rapidly and wide segments are used in other regions. A segment index encoder (SIE) is therefore needed to map the values in the domain to the segments. Within each segment, the coefficients are the same for all points in the segment. As in uniform segmentation, a memory stores the coefficient values, which are then used to form the polynomial approximation. Potentially, the SIE designed for an optimum non-uniform segmentation is a complex circuit. To simplify the SIE, two approaches have been proposed. One is a segmentation approach [10, 11] that uses a special (non-optimum) non-uniform segmentation. The other is a realization approach [20, 21] that uses an LUT cascade to realize the optimum non-uniform segmentation compactly.
In this paper, we use both approaches to simplify the SIE. That is, this paper proposes a new segmentation approach and a new realization approach using an edge-valued binary decision diagram (EVBDD). Our segmentation approach can also reduce the memory size and the delay time of an LUT cascade using a multi-terminal BDD (MTBDD). For both approaches, we establish a formal synthesis procedure that is easily programmed.
II. PRELIMINARIES

A. Number Representation and Precision
Definition 1 A value X represented by the binary fixed-point representation is denoted by X = ( 
Definition 3 Precision is the total number of bits for a binary fixed-point representation. Specifically, n-bit precision specifies that n bits are used to represent the number; that is, n = l + m. We assume that an n-bit precision NFG has an n-bit input.
For more detail on these BDDs, refer to [17]. Fig. 2(b) and (c) show an MTBDD and an EVBDD for the integer function f defined by Fig. 2(a).
Definition 4 Accuracy is the number of bits in the fractional
[Figure panels: (a) Terminal case. (b) Non-terminal case.]
[Fig. 2 panels: (b) MTBDD. (c) EVBDD.]
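To illustrate the EVBDD representation mentioned above, the following Python sketch builds an EVBDD from the truth table of a small integer function and evaluates it by summing edge weights along a path. It assumes the usual normalization in which every 0-edge has weight 0 and the only terminal is 0. The names `build_evbdd` and `evaluate`, the tuple layout of a node, and the sample values are illustrative assumptions and are not taken from the paper.

```python
def build_evbdd(values, level=0, unique=None):
    """Return (offset, node) for an integer function given as a truth table.

    values[k] is f at input k, with the most significant bit as variable 0;
    len(values) must be a power of two.
    A node is None (the single terminal 0) or a tuple
    (level, node0, weight1, node1): the 0-edge has weight 0 and the
    1-edge has weight weight1. `unique` is a table used to share nodes.
    """
    if unique is None:
        unique = {}
    if len(values) == 1:
        return values[0], None          # constant: whole value sits on the edge
    half = len(values) // 2
    off0, n0 = build_evbdd(values[:half], level + 1, unique)
    off1, n1 = build_evbdd(values[half:], level + 1, unique)
    weight1 = off1 - off0               # 0-edge is normalized to weight 0
    if weight1 == 0 and n0 == n1:
        return off0, n0                 # redundant node: variable has no effect
    node = (level, n0, weight1, n1)
    node = unique.setdefault(node, node)  # reuse an identical existing node
    return off0, node


def evaluate(offset, node, bits):
    """Sum the edge weights along the path selected by bits (MSB first)."""
    total = offset
    while node is not None:
        level, n0, weight1, n1 = node
        if bits[level]:
            total += weight1
            node = n1
        else:
            node = n0
    return total


# Hypothetical 3-input staircase function (a made-up segment index function).
table = [0, 0, 1, 1, 1, 2, 2, 3]
nodes = {}
offset, root = build_evbdd(table, unique=nodes)
```

In this sketch, a function such as f(k) = k needs only one node per input variable, because the differences between sub-functions are carried by the edge weights instead of by distinct terminals.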
Example 1
[Fig. 3: Recursive segmentation algorithm. Recoverable steps:]
Else, partition [A, B) into two segments [A, P) and [P, B).
Repeat Steps 1, 2, and 3 for each new segment recursively, until the maximum approximation errors are smaller than $\varepsilon_a$ in all segments.
III. PIECEWISE POLYNOMIAL APPROXIMATION BASED ON NON-UNIFORM SEGMENTATION
To approximate the numerical function f(X) using polynomial functions, we first partition the domain for X into segments. For each segment, we approximate f(X) using a polynomial function specific to that segment. In many cases, the domain is partitioned into uniform segments. Such methods are useful for elementary functions, such as sin(πX), but for some numerical functions, such as $-(X \log_2 X + (1 - X) \log_2 (1 - X))$, too many segments are required, resulting in large memory. To reduce the number of segments, we use a non-uniform segmentation, called recursive segmentation.
A. Recursive Segmentation Algorithm
Fig. 3 shows a recursive segmentation algorithm. The inputs for this algorithm are a numerical function f(X), a domain [A, B) for X, an accuracy $m_{in}$ of X, a polynomial order d, and an acceptable approximation error $\varepsilon_a$. Then, this algorithm produces t segments $[A, P_0), [P_0, P_1), \ldots, [P_{t-2}, B)$ by recursively partitioning a segment into two equal-sized segments until achieving the acceptable approximation error $\varepsilon_a$ in all segments. Note that this algorithm restricts the width $w_i$ of each segment to $w_i = 2^{h_i} \times 2^{-m_{in}}$, where $h_i$ is an integer. That is, the segmentation points $P_i$ are restricted to values whose least significant $h_i$ bits are 0.
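The recursive halving described for Fig. 3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: segment end points are integers in units of $2^{-m_{in}}$, the domain width is a power of two, recursion also stops at width 1 (a safeguard added here), and the error function uses the standard Chebyshev interpolation bound for the example f(X) = e^X. The names `recursive_segmentation` and `chebyshev_error_exp` are hypothetical.

```python
import math


def chebyshev_error_exp(s, e, d=1):
    """Assumed error bound for f(X) = e^X on [s, e] with a dth-order polynomial.

    Uses 2 (e - s)^(d+1) / (4^(d+1) (d+1)!) * max |f^(d+1)|; every derivative
    of e^X is e^X, whose maximum on [s, e] is exp(e).
    """
    width = e - s
    return 2 * width ** (d + 1) / (4 ** (d + 1) * math.factorial(d + 1)) * math.exp(e)


def recursive_segmentation(max_error, lo, hi, m_in, eps_a):
    """Halve [lo, hi) until max_error is at most eps_a in every segment.

    lo and hi are integers in units of 2**-m_in, so every segment width is a
    power of two times 2**-m_in and every start point is aligned to its width.
    """
    scale = 2.0 ** -m_in
    if hi - lo == 1 or max_error(lo * scale, hi * scale) <= eps_a:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (recursive_segmentation(max_error, lo, mid, m_in, eps_a)
            + recursive_segmentation(max_error, mid, hi, m_in, eps_a))


# Example: e^X on [0, 1) with 8 fractional input bits, 1st-order polynomials,
# and an acceptable error of 2^-10 (all values chosen only for illustration).
segments = recursive_segmentation(chebyshev_error_exp, 0, 1 << 8, 8, 2.0 ** -10)
```

For this particular example the recursion stops at 16 segments of equal width, which matches the observation elsewhere in the paper that uniform segmentation is appropriate for e^X.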
As shown in Fig. 3, the number of segments depends on the maximum approximation error $\varepsilon_d(A, B)$. In this paper, we use the Chebyshev approximation polynomials. For a segment [S, E] of f(X), the maximum approximation error $\varepsilon_d(S, E)$ of the dth-order Chebyshev approximation is given by [12]:
$$\varepsilon_d(S, E) = \frac{2\,(E - S)^{d+1}}{4^{d+1}\,(d+1)!} \max_{S \le X \le E} \left| f^{(d+1)}(X) \right|,$$
where $f^{(d+1)}$ is the (d + 1)th-order derivative of f.
B. Computation of the Approximate Value
For each segment, f(X) is approximated by the corresponding polynomial function g(X, i), where i is a segment index assigned to each segment, and the coefficients of g(X, i) are derived from the dth-order Chebyshev approximation polynomial [12].
For each segment $[S_i, E_i)$, g(X, i) is transformed into a polynomial in $X - S_i$; this is the form referred to as (1). This transformation reduces the multiplier size (see Section IV-B).
[Fig. 4(b): Computation of $X - S_i$ using AND gates.]
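As a worked illustration of such a transformation for the 2nd-order case, the following derivation substitutes $X = (X - S_i) + S_i$ into a quadratic. The coefficient names $c_2(i)$, $c_1(i)$, $c_0(i)$ are assumptions introduced here for illustration; the paper's own notation may differ.

```latex
\begin{aligned}
g(X, i) &= c_2(i)\,X^2 + c_1(i)\,X + c_0(i) \\
        &= c_2(i)\,\bigl((X - S_i) + S_i\bigr)^2 + c_1(i)\,\bigl((X - S_i) + S_i\bigr) + c_0(i) \\
        &= c_2(i)\,(X - S_i)^2
           + \bigl(c_1(i) + 2\,c_2(i)\,S_i\bigr)(X - S_i)
           + \bigl(c_0(i) + c_1(i)\,S_i + c_2(i)\,S_i^2\bigr).
\end{aligned}
```

Because $X - S_i$ is smaller than the segment width, it needs fewer bits than X, which is why a polynomial in $X - S_i$ allows smaller multipliers.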
IV. ARCHITECTURE FOR NFG
Fig. 4 shows the architecture for the NFG based on a 2nd-order polynomial. As shown in Fig. 4(a), polynomials of the form (1) are realized using a segment index encoder (SIE), a coefficients table, circuits for $X - S_i$, multipliers, and adders. This architecture can realize any non-uniform segmentation. However, when recursive segmentation is used, we can realize $X - S_i$ using 2-input AND gates instead of an adder. As mentioned in the previous section, the least significant $h_i$ bits of $S_i$ are 0, and $X - S_i < 2^{h_i} \times 2^{-m_{in}}$. Therefore, $X - S_i$ has 1's only in the least significant $h_i$ bits, and these 1's occur in exactly the same positions as the 1's in X. Thus, as shown in Fig. 4(b), we realize $X - S_i$ using AND gates driven on one side by $\overline{S_i}$, the complement of $S_i$. The SIE converts X into a segment index i. It realizes the segment index function $seg\_func(X) : \{0, 1\}^n \to \{0, 1, \ldots, t - 1\}$ shown in Fig. 5(a), where X has n bits, and t denotes the number of segments.
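The two ideas above, the segment index function and the AND-gate computation of $X - S_i$, can be checked with a small software model. This is only a behavioral sketch: `seg_func` here is a reference model built on binary search rather than an LUT cascade, inputs are treated as n-bit integers, and the 4-bit segmentation is a made-up example that satisfies the alignment restriction of recursive segmentation.

```python
from bisect import bisect_right


def seg_func(starts, x):
    """Reference model of the segment index function.

    starts is the sorted list of segment start points S_0 < S_1 < ...;
    the result is the index i with S_i <= x < S_(i+1).
    """
    return bisect_right(starts, x) - 1


def offset_by_and(x, s_i, n):
    """Compute X - S_i as X AND (NOT S_i) on n-bit values.

    Valid when the low h_i bits of S_i are 0 and x lies in
    [S_i, S_i + 2**h_i), as guaranteed by recursive segmentation.
    """
    mask = (1 << n) - 1
    return x & (~s_i & mask)


# Hypothetical 4-bit segmentation: [0, 2), [2, 4), [4, 8), [8, 16).
starts = [0, 2, 4, 8]
offsets = [offset_by_and(x, starts[seg_func(starts, x)], 4) for x in range(16)]
```

The AND works because, inside a segment, the upper bits of X equal the upper bits of $S_i$ and are cleared by the complement, while the lower $h_i$ bits of the complement are all 1 and pass the lower bits of X through unchanged.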
A. Architecture of SIE
Fig. 5(b) shows an LUT cascade [20] that realizes seg_func(X). The LUT cascade is obtained by functional decompositions using an MTBDD for seg_func(X) [18, 19], and can realize any seg_func(X), where the size depends on the number of segments. It has been shown in [15] that the size of an LUT cascade can be reduced by reducing the number of segments. This section presents a new architecture for the SIE to reduce the size and delay time further. Fig. 5(c) shows the new architecture. To realize seg_func(X) using the SIE in Fig. 5(c), we represent seg_func(X) using an EVBDD. Then, by decomposing the EVBDD, we obtain an SIE that consists of an LUT cascade and adders. In an LUT cascade, the interconnecting lines between adjacent LUTs are called rails. In this case, the rails represent sub-functions in the EVBDD, and the outputs from each LUT other than rails represent the sum of the weights of edges. In this paper, we call such outputs Arails (adder rails). To the best of our knowledge, this is the first design method using an EVBDD to produce the cascaded architecture.
Example 2 By decomposing the MTBDD and EVBDD in Fig. 2, we obtain the SIEs in Fig. 6(a) and (b), respectively. The SIE in Fig. 6(a) requires a memory size of $2^2 \times 2 + 2^3 \times 3 + 2^4 \times 3 = 80$ bits and 3 levels (3 LUTs). On the other hand, the SIE in Fig. 6(b) requires a memory size of $2^2 \times 4 + 2^2 \times 2 + 2^2 \times 1 = 28$ bits and 4 levels (3 LUTs + 1 adder).
(End of Example)
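The memory sizes in Example 2 can be reproduced with a short calculation. The sketch below assumes that each term $2^a \times b$ in the example stands for an LUT with a inputs and b outputs; the function name `cascade_memory_bits` is hypothetical.

```python
def cascade_memory_bits(luts):
    """Total memory of an LUT cascade.

    luts is a list of (number_of_inputs, number_of_outputs) pairs; an LUT
    with a inputs and b outputs stores 2**a words of b bits each.
    """
    return sum((1 << n_in) * n_out for n_in, n_out in luts)


# LUT shapes read off the two sums in Example 2.
mt_sie = [(2, 2), (3, 3), (4, 3)]   # SIE designed from the MTBDD
ev_sie = [(2, 4), (2, 2), (2, 1)]   # SIE designed from the EVBDD

mt_bits = cascade_memory_bits(mt_sie)
ev_bits = cascade_memory_bits(ev_sie)
```

In the EVBDD-based cascade every LUT has only 2 inputs, which is where the memory saving in this example comes from, at the cost of one extra adder level.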
This paper uses two terms: MT SIE and EV SIE denote the SIEs designed using an MTBDD (Fig. 5(b) ) and EVBDD ( Fig. 5(c) ), respectively. Both the MT SIE and the EV SIE can realize any non-uniform segmentation. In both cases, memory size depends on the number of segments. Specifically,
Theorem 1 Let seg_func(X) be a segment index function with t segments. Then, there exists an EV SIE for seg_func(X) with at most $\lceil \log_2 t \rceil$ rails and $\lceil \log_2 t \rceil$ Arails.
The proof is omitted because of the page limitation.
The memory size and the number of levels of an EV SIE depend on the decomposition of an EVBDD. To obtain the optimum decomposition, we use optimization algorithms for heterogeneous multi-valued decision diagrams (MDDs) [14].
[Figure panel (a): SIE using MTBDD (MT SIE).]
5C-5
B. Reduction of the Size of the Multiplier
Since large multipliers have large delay, it is important to reduce multiplier size. We do this in two ways: by reducing the number of bits needed to represent 1) the coefficients and 2) the variable $X - S_i$.
To reduce the number of bits in the coefficients, we use a scaling method [10] . We first shift right the coefficients. Then, we apply rounding. Then, we do the actual multiplication. And, finally, we shift left the product to compensate for the original shift right of the coefficients. This process is similar to floating point multiplication. A side effect is that rounding error is increased, since rounding occurs on a smaller value. In applying this method, we choose the largest exponent (right shift) that produces an error no greater than the given acceptable error [15] . If this yields an exponent of 0 (no right shift), in all segments, then we do not use the scaling method.
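The shift, round, multiply, and shift-back sequence described above can be sketched on integers as follows. This is a simplified model under assumptions made here, not the scaling method of [10] itself: coefficients and variables are non-negative integers (fixed-point values with the binary point ignored), the error is measured on a single product at the largest variable value, and the names `scale_coefficient`, `scaled_multiply`, and `largest_shift` are hypothetical.

```python
def scale_coefficient(c, shift):
    """Shift the coefficient right by `shift` bits, rounding to nearest."""
    if shift == 0:
        return c
    return (c + (1 << (shift - 1))) >> shift


def scaled_multiply(c, x, shift):
    """Multiply with the shortened coefficient, then shift the product back."""
    return (scale_coefficient(c, shift) * x) << shift


def largest_shift(c, x_max, max_error):
    """Largest right shift whose product error at x_max is within max_error.

    For non-negative x the error grows linearly with x, so x_max is the
    worst case. A shift of 0 is exact, so a result is always found.
    """
    for shift in range(c.bit_length(), -1, -1):
        if abs(scaled_multiply(c, x_max, shift) - c * x_max) <= max_error:
            return shift
    return 0


# Illustrative numbers only: a 6-bit coefficient and a 4-bit variable.
shift = largest_shift(45, 15, 20)
```

A larger shift leaves fewer coefficient bits for the multiplier, so the search starts from the largest candidate and stops at the first one that meets the error budget.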
To reduce the value of the variable $X - S_i$, we make the following observation. In each segment $[S_i, E_i)$, we have $X - S_i < E_i - S_i$. Thus, reducing the segment width reduces $X - S_i$ for X near $E_i$. However, this also increases the number of segments, and thus the memory size. We show a segment reduction technique that does not increase memory size.
In an FPGA implementation, the coefficients table in Fig. 4 has $2^u$ words, where $u = \lceil \log_2 t \rceil$ and t is the number of segments. Therefore, we can increase the number of segments up to $t = 2^u$ without increasing the memory size. From Theorem 1, the size of the EV SIE also depends on the value of u. Increasing the number of segments to $t = 2^u$ rarely increases the size of the EV SIE. We reduce the size of segments by repeatedly dividing the largest segment into two equal-sized segments until $t = 2^u$.
Table I compares the number of segments for various segmentation methods based on 2nd-order Chebyshev approximation. In Table I, "No. of uniform segs" shows the number of uniform segments, "No. of nonuni. segs" shows the number of non-uniform segments produced by [15], and "Recursive" denotes the recursive segmentation method shown in this paper. In the column "Recursive", the sub-column "No. of segs 1" shows the number of segments produced by the segmentation algorithm shown in Section III. The sub-column "No. of segs 2" shows the number of segments produced by additionally applying the multiplier size reduction method shown in Section IV. The sub-column "Time" shows the total CPU time, in milliseconds, for both the segmentation algorithm and the multiplier size reduction method.
Table I shows that uniform segmentation requires excessively many segments to approximate certain functions, such as tan(πX). Many existing NFGs are based on uniform segmentation, and have not realized tan(πX) in the domain [0, 0.5). tan(πX) in [0, 0.5) can be computed by sin(πX)/cos(πX) or by a combination of tan(πX) in [0, 0.25] and 1/tan(πX'), where X' = 0.5 − X. However, these require multiple NFGs for elementary functions, such as sin, cos, or the reciprocal function. On the other hand, methods based on non-uniform or recursive segmentation can compactly realize tan(πX) with a single NFG, since non-uniform and recursive segmentation methods require many fewer segments.
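Returning to the segment reduction step described earlier in this subsection (dividing the largest segment until t = 2^u), a minimal Python sketch is given below. It assumes a non-empty list of segments with integer end points in units of $2^{-m_{in}}$, breaks ties between equally wide segments by taking the leftmost one, and stops early if the widest segment can no longer be halved; these choices and the name `split_to_power_of_two` are assumptions made here.

```python
import heapq


def split_to_power_of_two(segments):
    """Halve the widest segment until the count reaches 2**u, u = ceil(log2 t).

    segments is a non-empty list of (start, end) pairs with integer end points.
    """
    target = 1 << (len(segments) - 1).bit_length()
    # Max-heap on width via negated widths; ties go to the smallest start.
    heap = [(-(e - s), s, e) for s, e in segments]
    heapq.heapify(heap)
    while len(heap) < target:
        neg_width, s, e = heapq.heappop(heap)
        if e - s < 2:                      # cannot halve a width-1 segment
            heapq.heappush(heap, (neg_width, s, e))
            break
        mid = (s + e) // 2
        heapq.heappush(heap, (-(mid - s), s, mid))
        heapq.heappush(heap, (-(e - mid), mid, e))
    return sorted((s, e) for _, s, e in heap)


# Hypothetical recursive segmentation of a 4-bit domain with t = 5 segments.
before = [(0, 8), (8, 12), (12, 14), (14, 15), (15, 16)]
after = split_to_power_of_two(before)
```

Since halving an aligned power-of-two segment yields two aligned power-of-two segments, the result still satisfies the restriction on segmentation points, so $X - S_i$ can still be formed with AND gates.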
For all functions in Table I, the non-uniform segmentation method [15] requires the fewest segments among the three segmentation methods. Although our recursive segmentation algorithm restricts the segmentation points, it requires only up to 2.2 times more segments than non-uniform segmentation [15]. That is, our recursive segmentation algorithm generates a segmentation appropriate to the given function, while restricting the segmentation points. For example, for $e^X$ and sin(πX), our algorithm generates uniform segmentation. As shown in [15, 21], uniform segmentation is appropriate for these functions.
V. EXPERIMENTAL RESULTS
A. Number of Segments and Computation Time
These results show that our recursive segmentation algorithm quickly generates a non-uniform segmentation appropriate to the given functions. Table II shows that the recursive segmentation algorithm also automatically generates uniform segmentation when appropriate. Table II compares the FPGA implementation results of the MT SIE and EV SIE. Note that the memory size in bits and LE, the number of logic elements, are 0 for $e^X$ and sin(πX) when recursive segmentation is used. This indicates that uniform segmentation was applied, and so an SIE was not needed. The result is a faster NFG for these functions. In the experiment that produced the data in Table II, we optimized the decomposition of the MTBDDs and EVBDDs by requiring the memory size of each LUT in the LUT cascade in these SIEs to be 4K bits, the same as the RAM block (M4K) of the FPGA. Table II shows that for optimum non-uniform segmentation, the EV SIEs have smaller memory size than the MT SIEs. For example, for tan(πX), the memory size of the EV SIE is only 10% of the memory size needed by the MT SIE. For tan(πX), the memory size of the MT SIE is quite large because the number of non-uniform segments is large. From experiments with uniform segmentation, we know that the NFG for tan(πX) using the MT SIE requires only 1.5% of the memory size needed by the NFG based on uniform segmentation. However, this is still too large to implement with an FPGA. By using the EV SIE, we can reduce the memory size significantly, and make the NFG implementable with an FPGA.
B. FPGA Implementation of SIEs
Our recursive segmentation can reduce both the memory size and the delay time of the MT SIEs. In particular, for X ln(X), the MT SIE designed for recursive segmentation has only 34% of the memory and 63% of the delay of the MT SIE designed for optimum non-uniform segmentation.
By using recursive segmentation and the EV SIE, we can reduce both memory size and delay time of SIEs significantly. For all functions in Table II, both the memory size and the delay time of the EV SIEs for recursive segmentation are much smaller than those of the MT SIEs. In terms of the number of LEs, the EV SIEs require only up to 1.5 times more LEs than the MT SIEs. Therefore, designing an EV SIE for recursive segmentation yields faster and more compact SIEs than those obtained by previous methods. The design is formal and is easily programmed. Table III compares the FPGA implementation results of our NFGs using EV SIE (EVNFGs) with the existing NFGs using MT SIE (MTNFGs) [15], where EVNFGs are based on recursive segmentation and MTNFGs are based on the optimum non-uniform segmentation. Both NFGs have 23-bit precision (23-bit accuracy).
C. FPGA Implementation of NFGs
From Table II and Table III, we can see that the memory size of the MT SIE accounts for more than 2/3 of the total memory size of the MTNFG. On the other hand, by using recursive segmentation and the EV SIE, the memory size needed for the SIE can be reduced to less than 1/4 of the total memory size of the EVNFG. As a result, the EVNFGs require only 21% to 63% of the memory size needed for the MTNFGs. For arcsin(X) and $\sqrt{X}$, as shown in Table I, our recursive segmentation requires a coefficients table that is about twice as large as that needed for the optimum non-uniform segmentation. Nevertheless, by using EV SIEs, the memory sizes of EVNFGs can be reduced to about 63% of the memory sizes of MTNFGs. Further, Table III shows that the EVNFGs require fewer LEs and levels (i.e., shorter latency) than the MTNFGs, and the delay time of EVNFGs is only about 25% to 89% of the delay time of the MTNFGs.
To compare our NFG with another existing NFG based on a segmentation approach (hierarchical segmentation) shown in [11], we implemented our 24-bit precision NFG for X ln(X) using the Xilinx Virtex-II FPGA (XC2V4000-6) and Synplify Premier 8.5. The memory size and delay time of the NFG described in [11] are 40,446 bits and 103.7 nsec., respectively. On the other hand, the memory size and delay time of our EVNFG based on recursive segmentation are 30,976 bits (77%) and 63.3 nsec. (61%).
From these results, we can see that our NFGs using recursive segmentation and the EV SIE can realize a wide range of functions faster and more compactly than existing NFGs.
VI. CONCLUSION AND COMMENTS
We have presented an architecture and a synthesis method for fast and compact NFGs for trigonometric, logarithmic, square root, reciprocal, and combinations of these functions. Our NFG partitions a given domain of the function into non-uniform segments using recursive segmentation, and approximates the given function by a polynomial function for each segment. By using an EVBDD to realize the recursive segmentation, we can implement fast and compact NFGs for a wide range of functions. Experimental results showed that: 1) By using recursive segmentation, we reduced the memory size and delay time needed for the MT SIE, and produced MT SIEs that have, on average, only 49% of the memory and 53% of the delay of MT SIEs for the optimum non-uniform segmentation. 2) By using an EVBDD to realize recursive segmentation, we further reduced the memory size and delay time needed for the SIE. Our SIEs using the EVBDDs require, on average, only 7% of the memory and 40% of the delay of MT SIEs for the optimum non-uniform segmentation. 3) Therefore, our NFGs require, on average, only 38% of the memory and 59% of the delay needed by the existing NFGs based on the MT SIE and the optimum non-uniform segmentation.
