Numerical function generators using edge-valued binary decision diagrams by Nagayama, Shinobu et al.
Calhoun: The NPS Institutional Archive
Faculty and Researcher Publications
2007-01
S. Nagayama, T. Sasao, and J. T. Butler, "Numerical function generators using edge-valued
binary decision diagrams," ASPDAC-2007, Yokohama, Jan. 25, 2007, pp. 535-540.
http://hdl.handle.net/10945/35860
Numerical Function Generators Using Edge-Valued Binary Decision Diagrams
Shinobu Nagayama, Dept. of Computer Engineering, Hiroshima City University, Hiroshima 731-3194, Japan
Tsutomu Sasao, Dept. of Computer Science and Electronics, Kyushu Institute of Technology, Iizuka 820-8502, Japan
Jon T. Butler, Dept. of Electrical and Computer Engineering, Naval Postgraduate School, CA 93943-5121, USA
Abstract— In this paper, we introduce the edge-valued binary
decision diagram (EVBDD) to reduce the memory and delay in
numerical function generators (NFGs). An NFG realizes a func-
tion, such as a trigonometric, logarithmic, square root, or recip-
rocal function, in hardware. NFGs are important in, for exam-
ple, digital signal applications, where high speed and accuracy are
necessary. We use the EVBDD to produce a fast and compact seg-
ment index encoder (SIE) that is a key component in our NFG. We
compare our approach with NFG designs based on multi-terminal
BDDs (MTBDDs), and show that the EVBDD produces SIEs that
have, on average, only 7% of the memory and 40% of the delay
of those designed using MTBDDs. Therefore, our NFGs based on
EVBDDs have, on average, only 38% of the memory and 59% of
the delay of NFGs based on MTBDDs.
I. INTRODUCTION
There has been significant interest recently in the realization
of numeric functions, like sin(πx), ln(x), 1/x, and √x, by
high speed logic circuits. This is, in part, due to the availability
of large quantities of inexpensive, programmable logic in
FPGAs, and, in part, to the development of realization methods
based on polynomial approximations [3, 5, 6, 8, 16, 22, 23].
Until the last few years, the dominant approach has been to
partition the domain into uniform segments. Within each segment
a linear [23] or higher order [3, 5, 6, 8, 16, 22] approximation
is used to represent the function. It has been
shown [7] that linear approximation is well suited for certain
‘simple’ functions like $2^x$, sin(πx), and cos(πx), but is inappropriate
for ‘complex’ functions like √x and the entropy function,
$-(x \log_2(x) + (1-x) \log_2(1-x))$. For complex functions,
optimum non-uniform segmentation produces tractable realiza-
tions [2]. In this method, segments are chosen as wide as possi-
ble, while still achieving the specified accuracy. Typically, nar-
row segments are used where the function changes rapidly and
wide segments are used in other regions. A segment index en-
coder (SIE) is therefore needed to map the values in the domain
to the segments. Within each segment the coefficients are the
same for all points in the segment. As in uniform segmentation,
a memory stores the coefficient values, which are then used to
form the polynomial approximation. Potentially, the SIE de-
signed for an optimum non-uniform segmentation is a com-
plex circuit. To simplify the SIE, two approaches have been
proposed. One is a segmentation approach [10, 11] that uses
a special (non-optimum) non-uniform segmentation. Another
one is a realization approach [20, 21] that uses an LUT cascade
to realize the optimum non-uniform segmentation compactly.
In this paper, we use both approaches to simplify the SIE.
That is, this paper proposes a new segmentation approach and
a new realization approach using an edge-valued binary deci-
sion diagram (EVBDD). Our segmentation approach can also
reduce the memory size and the delay time of an LUT cascade
using a multi-terminal BDD (MTBDD). For both approaches,
we establish a formal synthesis procedure that is easily pro-
grammed.
II. PRELIMINARIES
A. Number Representation and Precision
Definition 1 A value X represented by the binary fixed-point
representation is denoted by
$X = (x_{l-1}\, x_{l-2} \ldots x_1\, x_0 .\, x_{-1}\, x_{-2} \ldots x_{-m})_2$,
where $x_i \in \{0,1\}$, l is the number of bits for the
integer part, and m is the number of bits for the fractional part.
This representation is two’s complement.
Definition 2 Error is the absolute difference between the exact
value and the value produced by the hardware. Approximation
error is the error caused by a function approximation. Round-
ing error is the error caused by a binary fixed-point representa-
tion. Acceptable error is the maximum error that an NFG may
assume. Acceptable approximation error is the maximum ap-
proximation error that a function approximation may assume.
Definition 3 Precision is the total number of bits for a binary
fixed-point representation. Specifically, n-bit precision specifies
that n bits are used to represent the number; that is, n = l + m.
We assume that an n-bit precision NFG has an n-bit input.
Definition 4 Accuracy is the number of bits in the fractional
part of a binary fixed-point representation. m-bit accuracy
specifies that m bits are used to represent the fractional part
of the number. When the maximum error is $2^{-m}$, the accuracy
can be expressed as 1 unit in the last place (ULP). In this paper,
an m-bit accuracy NFG is an NFG with an m-bit fractional
part of the input, an m-bit fractional part of the output, and
1 ULP accuracy.
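As a small illustration of Definitions 1 to 4, the following Python sketch decodes a two's complement fixed-point word and reports the size of one ULP. The names `fixed_point_value` and `ulp` are chosen here for illustration and do not appear in the paper; the sketch assumes l ≥ 1, so that the leading bit acts as the sign bit.

```python
def fixed_point_value(bits: str, l: int, m: int) -> float:
    """Decode an (l + m)-bit two's complement fixed-point word.

    `bits` is a string of '0'/'1' characters, most significant bit first,
    with l integer bits followed by m fractional bits (assumes l >= 1).
    """
    if l < 1 or len(bits) != l + m:
        raise ValueError("expected exactly l + m bits with l >= 1")
    raw = int(bits, 2)
    if bits[0] == "1":              # negative number in two's complement
        raw -= 1 << (l + m)
    return raw / (1 << m)           # place the binary point before the m fractional bits


def ulp(m: int) -> float:
    """One unit in the last place for m fractional bits."""
    return 2.0 ** -m


print(fixed_point_value("0110", l=2, m=2))   # 1.5
print(fixed_point_value("1110", l=2, m=2))   # -0.5
print(ulp(2))                                # 0.25
```

Here the precision is n = l + m = 4 bits and the accuracy is m = 2 bits.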
B. Edge-Valued Binary Decision Diagram
Definition 5 A binary decision diagram (BDD) [1] is a rooted
directed acyclic graph representing a logic function:
$\{0,1\}^n \to \{0,1\}$. The BDD is obtained by repeatedly applying the Shannon
expansion to the logic function. Each function, including
the original function and all sub-functions resulting from ap-
plying the Shannon expansion, is represented by a non-terminal
node, unless that function is a trivial function, 0 or 1, in which
case, it is represented by a terminal node. Each non-terminal
node has two outgoing edges, 0-edge and 1-edge, that corre-
spond to the values of input variables. Both terminal nodes





















Fig. 1. Conversion of an MTBDD node into an EVBDD node.
x3 x2 x1 x0 | f     x3 x2 x1 x0 | f
 0  0  0  0 | 0      1  0  0  0 | 4
 0  0  0  1 | 0      1  0  0  1 | 4
 0  0  1  0 | 0      1  0  1  0 | 5
 0  0  1  1 | 0      1  0  1  1 | 6
 0  1  0  0 | 1      1  1  0  0 | 7
 0  1  0  1 | 1      1  1  0  1 | 7
 0  1  1  0 | 2      1  1  1  0 | 7
 0  1  1  1 | 3      1  1  1  1 | 7
(a) Function table.
Fig. 2. MTBDD and EVBDD for an integer function.
Definition 6 A multi-terminal BDD (MTBDD) [4] is an extension
of the BDD, and represents an integer function:
$\{0,1\}^n \to Z$, where Z is a set of integers. Specifically, it is a
BDD in which the terminal nodes are not restricted to 0 and
1. Rather, they are labeled by integer values. Alternatively, we
can think of a BDD as a special case of an MTBDD, in which
there are only two terminal nodes, labeled 0 and 1.
Definition 7 An edge-valued BDD (EVBDD) [9] is an ex-
tension of the BDD, and represents an integer function. An
EVBDD consists of one terminal node representing 0 and non-
terminal nodes with a weighted 1-edge, where the weight is
an integer. An EVBDD is obtained by recursively applying the
conversion shown in Fig. 1 to each non-terminal node in an
MTBDD, where in Fig. 1, dashed lines and solid lines denote
0-edges and 1-edges, respectively. Note that, in the EVBDD,
0-edges (dashed lines) have weight 0, while the incoming edge
into the root node can have some weight.
For more detail on these BDDs, refer to [17].
Input: Numerical function f(X), domain [A, B) for X, accuracy
$m_{in}$ of X, polynomial order d, and acceptable approximation
error $\varepsilon_a$.
Output: Segments $[A, P_0), [P_0, P_1), \ldots, [P_{t-2}, B)$.
Step:
1. For [A, B), compute the maximum approximation error
$\varepsilon_d(A, B)$.
2. If $\varepsilon_d(A, B) < \varepsilon_a$ or $B - A \le 2^{-m_{in}}$, then stop.
3. Else, partition [A, B) into two segments [A, P) and [P, B),
where P = (A + B)/2.
4. Repeat Steps 1, 2, and 3 for each new segment recursively,
until the maximum approximation errors are smaller than
$\varepsilon_a$ in all segments.
Fig. 3. Recursive segmentation algorithm for the domain.
Example 1 Fig. 2(b) and (c) show an MTBDD and an EVBDD
for the integer function f defined by Fig. 2(a). In Fig. 2, dashed
lines and solid lines denote 0-edges and 1-edges, respectively.
Note that the EVBDD has weighted 1-edges. In the MTBDD,
terminal nodes represent function values. Thus, to evaluate the
function, we traverse the MTBDD from the root node to a terminal
node according to the input values, and obtain the function
value (an integer) from the terminal node. On the other
hand, in the EVBDD, we obtain the function value by summing
the weights of the edges traversed from the root node to the
terminal node. (End of Example)
III. PIECEWISE POLYNOMIAL APPROXIMATION BASED ON
NON-UNIFORM SEGMENTATION
To approximate the numerical function f (X) using polyno-
mial functions, we first partition the domain for X into seg-
ments. For each segment, we approximate f (X) using a poly-
nomial function specific to that segment. In many cases, the
domain is partitioned into uniform segments. Such methods are
useful for elementary functions, such as sin(πX), but for some
numerical functions, such as $-(X \log_2(X) + (1-X) \log_2(1-X))$,
too many segments are required, resulting in large memory.
To reduce the number of segments, we use a non-uniform
segmentation, called recursive segmentation.
A. Recursive Segmentation Algorithm
Fig. 3 shows a recursive segmentation algorithm. The inputs
for this algorithm are a numerical function f(X), a domain
[A, B) for X, an accuracy $m_{in}$ of X, a polynomial order
d, and an acceptable approximation error $\varepsilon_a$. Then, this algorithm
produces t segments $[A, P_0), [P_0, P_1), \ldots, [P_{t-2}, B)$ by recursively
partitioning a segment into two equal-sized segments
until achieving the acceptable approximation error $\varepsilon_a$ in all segments.
Note that this algorithm restricts the width $w_i$ of each
segment to $w_i = 2^{h_i} \times 2^{-m_{in}}$, where $h_i$ is an integer. That is, the
segmentation points $P_i$ are restricted to values of which the least
significant $h_i$ bits are 0 (i.e., $P_i = (\ldots p_{-j+1}\, p_{-j}\, 0 0 \ldots 0)_2$,
where $j = m_{in} - h_i$). As shown in Fig. 3, the number of segments
depends on the maximum approximation error $\varepsilon_d(A, B)$.
In this paper, we use the Chebyshev approximation polynomials.
For a segment [S, E] of f(X), the maximum approximation
error of the dth-order Chebyshev approximation $\varepsilon_d(S, E)$
is given by [12]:
$$\varepsilon_d(S, E) = \frac{2 (E - S)^{d+1}}{4^{d+1} (d+1)!} \max_{S \le X \le E} |f^{(d+1)}(X)|,$$
where $f^{(d+1)}$ is the (d+1)th-order derivative of f.
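The recursive segmentation steps and the Chebyshev error bound can be sketched as follows. This is a minimal sketch under stated assumptions: the caller supplies `deriv_max(S, E)`, a hypothetical helper returning the maximum of |f^(d+1)| on the segment, and the function names are illustrative rather than taken from the paper. The example uses f(X) = e^X, whose third derivative e^X is largest at the right end of each segment.

```python
import math


def cheb_error(deriv_max, S, E, d):
    """Bound on the d-th order Chebyshev approximation error on [S, E]."""
    return (2 * (E - S) ** (d + 1)
            / (4 ** (d + 1) * math.factorial(d + 1))
            * deriv_max(S, E))


def recursive_segmentation(deriv_max, A, B, m_in, d, eps_a):
    """Bisect [A, B) until every segment meets the error bound eps_a
    or reaches the minimum width 2**-m_in."""
    if cheb_error(deriv_max, A, B, d) < eps_a or (B - A) <= 2.0 ** -m_in:
        return [(A, B)]
    P = (A + B) / 2
    return (recursive_segmentation(deriv_max, A, P, m_in, d, eps_a)
            + recursive_segmentation(deriv_max, P, B, m_in, d, eps_a))


# f(X) = e^X on [0, 1): the (d+1)-th derivative is e^X for every d,
# and its maximum over [S, E] is e^E.
segments = recursive_segmentation(
    deriv_max=lambda S, E: math.exp(E),
    A=0.0, B=1.0, m_in=23, d=2, eps_a=2.0 ** -25)
print(len(segments))                     # 103
print(segments[0], segments[-1])
```

With these inputs the sketch yields 103 segments (25 of width 2^-6 followed by 78 of width 2^-7), which agrees with the "No. of segs 1" entry for e^X in Table I. All segment boundaries are dyadic, so they are represented exactly in floating point.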
B. Computation of the Approximate Value
For each segment, f(X) is approximated by the corresponding
polynomial function g(X, i). That is, the approximate value
of f(X) is computed as
$$g(X, i) = C_d(i) X^d + C_{d-1}(i) X^{d-1} + \ldots + C_0(i),$$
where i is a segment index assigned
to each segment, and the coefficients $C_d(i), C_{d-1}(i), \ldots, C_0(i)$
are derived from the dth-order Chebyshev approximation polynomial [12].
For each segment $[S_i, E_i)$, substituting $X - S_i + S_i$ for X yields
the transformation
$$g(X, i) = C_d(i) (X - S_i)^d + C'_{d-1}(i) (X - S_i)^{d-1} + \ldots + C'_0(i), \qquad (1)$$
where
$$C'_j(i) = C_j(i) + \sum_{k=1}^{d-j} \binom{j+k}{k} C_{j+k}(i) S_i^k \quad (j = 0, 1, \ldots, d-1).$$
This transformation reduces the multiplier size (see Section IV-B).
(a) Architecture for NFG.
      X         = ( ... x_{-j}        x_{-j-1} ... x_{-m_in} )_2
   −  S_i       = ( ... s_{-j}        0        ... 0         )_2
                  ( ... 0             x_{-j-1} ... x_{-m_in} )_2
                                ⇓
      X         = ( ... x_{-j}        x_{-j-1} ... x_{-m_in} )_2
   &  \bar{S}_i = ( ... \bar{s}_{-j}  1        ... 1         )_2
                  ( ... 0             x_{-j-1} ... x_{-m_in} )_2
Note: $j = m_{in} - h_i$.
(b) Computation of $X - S_i$ using AND gates.
Fig. 4. Architecture for the NFG based on 2nd-order polynomial.
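Substituting X = (X − S_i) + S_i and expanding each power with the binomial theorem gives the coefficients of the polynomial in (X − S_i). The sketch below computes them and checks that both forms agree; `shift_coefficients` and `horner` are illustrative names, and exact rational arithmetic is used only to make the check exact.

```python
from fractions import Fraction
from math import comb


def shift_coefficients(C, S):
    """Re-express sum_j C[j] * X**j as sum_j Cs[j] * (X - S)**j.

    C[j] is the coefficient of X**j. Each new coefficient is
    Cs[j] = sum_{k=0}^{d-j} binom(j + k, k) * C[j + k] * S**k,
    so the leading coefficient is unchanged.
    """
    d = len(C) - 1
    return [sum(comb(j + k, k) * C[j + k] * S ** k for k in range(d - j + 1))
            for j in range(d + 1)]


def horner(coeffs, x):
    """Evaluate sum_j coeffs[j] * x**j."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


C = [Fraction(1), Fraction(2), Fraction(3)]      # 3X^2 + 2X + 1
S = Fraction(1, 2)                               # segment start point
Cs = shift_coefficients(C, S)
print(Cs)                                        # coefficients 11/4, 5, 3
X = Fraction(5, 8)
assert horner(C, X) == horner(Cs, X - S)
```

Since X − S_i is smaller than the segment width, the multipliers operate on fewer significant bits than they would for X itself.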
IV. ARCHITECTURE FOR NFG
Fig. 4 shows the architecture for the NFG based on a 2nd-order
polynomial. As shown in Fig. 4(a), polynomials of the
form (1) are realized using a segment index encoder (SIE), a
coefficients table, circuits for $(X - S_i)^k$ (k = d, d − 1, ..., 2),
multipliers, and adders. This architecture can realize any non-uniform
segmentation. However, when recursive segmentation
is used, we can realize $X - S_i$ using 2-input AND gates instead
of an adder. As mentioned in the previous section, the least significant
$h_i$ bits of $S_i$ are 0, and $X - S_i < 2^{h_i} \times 2^{-m_{in}}$. Therefore,
$X - S_i$ has 1’s only in the least significant $h_i$ bits, and these 1’s
occur in exactly the same position as the 1’s in X. Thus, as
shown in Fig. 4(b), we realize $X - S_i$ using AND gates driven
on one side by $\bar{S}_i$, the complement of $S_i$. The SIE converts X
into a segment index i. It realizes the segment index function
seg_func(X) : $\{0,1\}^n \to \{0, 1, \ldots, t-1\}$ shown in Fig. 5(a),
where X has n bits, and t denotes the number of segments.
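The two ideas in this paragraph can be checked on a small case: the segment index function, and the replacement of the subtraction X − S_i by a bitwise AND with the complement of S_i. The sketch assumes a 4-bit input and the segment start points 0, 4, 6, 7, 8, 10, 11, 12, which were read off the function table in Fig. 2(a) for this illustration; the helper names are hypothetical.

```python
from bisect import bisect_right

N_BITS = 4
# Segment start points S_i; every segment is aligned to its own
# power-of-two width, as produced by recursive bisection of [0, 16).
STARTS = [0, 4, 6, 7, 8, 10, 11, 12]


def seg_func(x, starts=STARTS):
    """Segment index i such that starts[i] <= x < starts[i + 1]."""
    return bisect_right(starts, x) - 1


def offset_by_and(x, s, n_bits=N_BITS):
    """Compute x - s as x AND (NOT s), valid when x lies in the segment starting at s."""
    mask = (1 << n_bits) - 1
    return x & ~s & mask


for x in range(1 << N_BITS):
    i = seg_func(x)
    s = STARTS[i]
    assert offset_by_and(x, s) == x - s          # AND gates replace the subtractor

print([seg_func(x) for x in range(16)])
# [0, 0, 0, 0, 1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 7]
```

The identity relies on the alignment property of recursive segmentation: inside a segment the upper bits of X equal those of S_i, and the lower h_i bits of S_i are 0. It does not hold for arbitrary segment boundaries.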
A. Architecture of SIE
Fig. 5(b) shows an LUT cascade [20] that realizes
seg_func(X). The LUT cascade is obtained by functional decompositions
using an MTBDD for seg_func(X) [18, 19], and
can realize any seg_func(X), where the size depends on the
number of segments. It has been shown in [15] that the size of an LUT
cascade can be reduced by reducing the number of segments.
Segments              Index
A ≤ X < P_0           0
P_0 ≤ X < P_1         1
...                   ...
P_{t-2} ≤ X < B       t−1
(a) Segment index function.
(b) LUT cascade (MT SIE).
(c) LUT cascade and adders (EV SIE).
Fig. 5. Segment index encoders.
This section presents a new architecture for the SIE to reduce
the size and delay time further. Fig. 5(c) shows the new architecture.
To realize seg_func(X) using the SIE in Fig. 5(c), we
represent seg_func(X) using an EVBDD. Then, by decomposing
the EVBDD, we obtain an SIE that consists of an LUT
cascade and adders. In an LUT cascade, the interconnecting
lines between adjacent LUTs are called rails. In this case, the
rails represent sub-functions in the EVBDD, and the outputs
from each LUT other than rails represent the sum of weights of
edges. In this paper, we call such outputs Arails (adder rails).
To the best of our knowledge, this is the first design method
using an EVBDD to produce the cascaded architecture.
Example 2 By decomposing the MTBDD and EVBDD in
Fig. 2, we obtain the SIEs in Fig. 6. Fig. 6(a) and (b) illustrate
the correspondences between each LUT and decompositions of
the MTBDD and the EVBDD, respectively. In these figures,
the column labeled as ‘$r_i$’ in the table of each LUT denotes the
rails that represent sub-functions in BDDs. And, the column
‘$a_i$’ in Fig. 6(b) denotes the Arails that represent the sum of
weights of edges. In the MTBDD, numbers assigned to edges
that cut across the horizontal lines represent sub-functions. In
the EVBDD, “($a_i$, $r_i$)” assigned to edges that cut across the
horizontal lines represent the sum of weights and sub-functions,
respectively. The SIE in Fig. 6(a) requires a memory size of
$2^2 \times 2 + 2^3 \times 3 + 2^4 \times 3 = 80$ bits and 3 levels (3 LUTs). On
the other hand, the SIE in Fig. 6(b) requires a memory size of
$2^2 \times 4 + 2^2 \times 2 + 2^2 \times 1 = 28$ bits and 4 levels (3 LUTs + 1
adder). (End of Example)
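The EV SIE of Example 2 can be simulated in a few lines. The LUT contents below are an assumption: they were derived by hand from the function table in Fig. 2(a) so that each LUT emits an Arail value a_i and a rail value r_i, and they are not copied from Fig. 6. The memory formula (number of entries times output bits, summed over the LUTs) reproduces the two totals quoted in the example.

```python
# (x3, x2) -> (a1, r1): a1 needs 3 bits, r1 needs 1 bit
LUT1 = {(0, 0): (0, 0), (0, 1): (1, 1), (1, 0): (4, 1), (1, 1): (7, 0)}
# (r1, x1) -> (a2, r2): 1 bit each
LUT2 = {(0, 0): (0, 0), (0, 1): (0, 0), (1, 0): (0, 0), (1, 1): (1, 1)}
# (r2, x0) -> a3: 1 bit
LUT3 = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}


def ev_sie(x3, x2, x1, x0):
    """LUT cascade followed by an adder that sums the Arail outputs."""
    a1, r1 = LUT1[(x3, x2)]
    a2, r2 = LUT2[(r1, x1)]
    a3 = LUT3[(r2, x0)]
    return a1 + a2 + a3


def lut_memory_bits(luts):
    """Total memory of a cascade given (input bits, output bits) per LUT."""
    return sum((2 ** n_in) * n_out for n_in, n_out in luts)


F = [0, 0, 0, 0, 1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 7]   # Fig. 2(a)
for x in range(16):
    assert ev_sie((x >> 3) & 1, (x >> 2) & 1, (x >> 1) & 1, x & 1) == F[x]

print(lut_memory_bits([(2, 2), (3, 3), (4, 3)]))   # MT SIE: 80
print(lut_memory_bits([(2, 4), (2, 2), (2, 1)]))   # EV SIE: 28
```

The rails stay narrow because they only identify a sub-function, while the function value itself is accumulated by the adder.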
This paper uses two terms: MT SIE and EV SIE denote the
SIEs designed using an MTBDD (Fig. 5(b)) and EVBDD
(Fig. 5(c)), respectively. Both the MT SIE and the EV SIE can
realize any non-uniform segmentation. In both cases, memory
size depends on the number of segments. Specifically,
Theorem 1 Let seg_func(X) be a segment index function with
t segments. Then, there exists an EV SIE for seg_func(X) with
at most $\lceil \log_2 t \rceil$ rails and $\lceil \log_2 t \rceil$ Arails.
The proof is omitted because of the page limitation.
The memory size and the number of levels of an EV SIE
depend on the decomposition of an EVBDD. To obtain the optimum
decomposition, we use optimization algorithms for heterogeneous MDDs [14].





(a) SIE using MTBDD (MT SIE).
(b) SIE using EVBDD (EV SIE).
Fig. 6. Example of SIEs.
B. Reduction of the Size of the Multiplier
Since large multipliers have large delay, it is important to
reduce multiplier size. We do this in two ways: by reducing the
number of bits needed to represent (1) the coefficients and (2)
the variable $(X - S_i)$.
To reduce the number of bits in the coefficients, we use a
scaling method [10]. We first shift right the coefficients. Then,
we apply rounding. Then, we do the actual multiplication.
And, finally, we shift left the product to compensate for the
original shift right of the coefficients. This process is similar
to floating point multiplication. A side effect is that rounding
error is increased, since rounding occurs on a smaller value.
In applying this method, we choose the largest exponent (right
shift) that produces an error no greater than the given accept-
able error [15]. If this yields an exponent of 0 (no right shift),
in all segments, then we do not use the scaling method.
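The scaling steps described above (shift right, round, multiply, shift left) can be sketched with plain integers. This is only an illustration of the idea, not the method of [10] or [15]: the function names are hypothetical, coefficients are assumed to be non-negative integers in some fixed-point scale, and the error is measured against the full-width product.

```python
def scaled_multiply(c, x, e):
    """Multiply x by coefficient c after dropping e low-order bits of c."""
    if e == 0:
        return c * x
    c_rounded = (c + (1 << (e - 1))) >> e    # shift right with round-to-nearest
    return (c_rounded * x) << e              # narrower multiply, then shift back


def largest_exponent(c, x_max, max_err, e_limit=16):
    """Largest right shift whose worst-case product error stays within max_err."""
    best = 0
    for e in range(1, e_limit + 1):
        # The error grows linearly with x, so the worst case is at x_max.
        err = abs(scaled_multiply(c, x_max, e) - c * x_max)
        if err <= max_err:
            best = e
    return best


c, x = 0b1011011, 5                          # c = 91
print(scaled_multiply(c, x, 3))              # 440, versus the exact 455
print(largest_exponent(c, x_max=5, max_err=15, e_limit=6))   # 3
```

A return value of 0 from `largest_exponent` corresponds to the case in which no right shift is acceptable and the scaling method is not applied.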
To reduce the value of the variable X −Si, we make the fol-
lowing observation. In each segment [Si,Ei), we have X−Si <
Ei−Si. Thus, reducing the segment width reduces X−Si for X
near Ei. However, this also increases the number of segments,
and thus the memory size. We show a segment reduction tech-
nique that does not increase memory size.
TABLE I
NUMBER OF SEGMENTS FOR VARIOUS SEGMENTATION METHODS.
X has 23-bit accuracy. Acceptable approximation error: $2^{-25}$.
Function f(X) | Domain [A,B) | No. of uniform segs | No. of nonuni. segs | Recursive: No. of segs 1 | Recursive: No. of segs 2 | Recursive: Time [msec.]
$e^X$ | [0,1) | 128 | 67 | 103 | 128* | 10
sin(πX) | [0,0.5) | 128 | 74 | 112 | 128* | 10
tan(πX) | [0,0.5) | 4,194,304 | 4,594 | 5,723 | 8,192 | 1,600
arcsin(X) | [0,1) | 8,388,608 | 256 | 363 | 512 | 70
$\sqrt{X}$ | (0,1) | 8,388,607 | 228 | 322 | 512 | 30
$\sqrt{-\ln(X)}$ | (0,1) | 8,388,607 | 698 | 967 | 1,024 | 190
X ln(X) | (0,1) | 2,097,152 | 172 | 250 | 256 | 10
*Uniform segmentation is produced.
Environment: Sun Blade 2500 (Silver), UltraSPARC-IIIi 1.6GHz,
6GB memory, Solaris 9.
In an FPGA implementation, the coefficients table in Fig. 4
has $2^u$ words, where $u = \lceil \log_2 t \rceil$ and t is the number of segments.
Therefore, we can increase the number of segments up
to $t = 2^u$ without increasing the memory size. From Theorem 1,
the size of the EV SIE also depends on the value of u. Increasing
the number of segments to $t = 2^u$ rarely increases the size
of the EV SIE. We reduce the size of segments by dividing the
largest segment into two equal sized segments up to $t = 2^u$.
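The segment reduction step, which repeatedly halves the widest segment until the segment count reaches the next power of two, can be sketched with a heap. The function name and the tie-breaking rule (the leftmost of equally wide segments is split first) are assumptions made for this illustration, and the sketch does not check the minimum segment width.

```python
import heapq


def pad_to_power_of_two(segments):
    """Split the widest segment in half until the count is a power of two.

    `segments` is a list of (start, end) pairs; the target count is the
    smallest power of two that is >= len(segments).
    """
    if not segments:
        raise ValueError("at least one segment is required")
    t = len(segments)
    target = 1 << (t - 1).bit_length()
    heap = [(-(e - s), s, e) for s, e in segments]   # max-heap on width
    heapq.heapify(heap)
    while len(heap) < target:
        _, s, e = heapq.heappop(heap)                # widest segment
        mid = (s + e) / 2
        heapq.heappush(heap, (-(mid - s), s, mid))
        heapq.heappush(heap, (-(e - mid), mid, e))
    return sorted((s, e) for _, s, e in heap)


print(pad_to_power_of_two([(0.0, 0.5), (0.5, 0.75), (0.75, 1.0)]))
# [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
```

In this small case three segments become four, so a 4-word coefficients table is fully used and the largest value of X − S_i is halved.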
V. EXPERIMENTAL RESULTS
A. Number of Segments and Computation Time
Table I compares the number of segments for various seg-
mentation methods based on 2nd-order Chebyshev approxima-
tion. In Table I, “No. of uniform segs” shows the number
of uniform segments, “No. of nonuni. segs” shows the num-
ber of non-uniform segments produced by [15], and “Recur-
sive” denotes the recursive segmentation method shown in this
paper. In the column “Recursive”, the sub-column “No. of
segs 1” shows the number of segments produced by the seg-
mentation algorithm shown in Section III. The sub-column
“No. of segs 2” shows the number of segments produced by
additionally applying the reduction method of multiplier size
shown in Section IV. The sub-column “Time” shows the total
CPU time, in milliseconds, for both the segmentation algorithm
and the reduction method of multiplier size.
Table I shows that uniform segmentation requires exces-
sively many segments to approximate certain functions, such
as tan(πX). Many existing NFGs are based on uniform seg-
mentation, and have not realized tan(πX) in domain [0,0.5).
tan(πX) in [0,0.5) can be computed by sin(πX)/cos(πX) or
a combination of tan(πX) in [0,0.25] and 1/ tan(πX ′), where
X ′ = 0.5−X . However, these require multiple NFGs for ele-
mentary functions, such as sin, cos, or the reciprocal function.
On the other hand, methods based on non-uniform or recur-
sive segmentation can compactly realize tan(πX) with a sin-
gle NFG, since non-uniform and recursive segmentation meth-
ods require many fewer segments. For all functions in Table I,
the non-uniform segmentation method [15] requires the fewest
segments among the three segmentation methods. Although
our recursive segmentation algorithm restricts the segmenta-
tion points, it requires only up to 2.2 times more segments than
non-uniform segmentation [15]. That is, our recursive segmen-
tation algorithm generates a segmentation appropriate to the
given function, while restricting the segmentation points. For
example, for $e^X$ and sin(πX), our algorithm generates uniform
segmentation. As shown in [15, 21], uniform segmentation is
appropriate for these functions.
These results show that our recursive segmentation algorithm also
automatically generates uniform segmentation when appropriate.
TABLE II
FPGA IMPLEMENTATION OF SIES.
FPGA device: Altera Stratix EP1S10F484C5 (LE: 10,570, M4K: 60, M512: 90)
Logic synthesis tool: Altera QuartusII 5.0 (speed optimization, timing requirement of 200MHz)
Optimum non-uniform segmentation:
Function f(X) | MT SIE: Memory [bits] | LE | Level | Delay [nsec.] | EV SIE: Memory [bits] | LE | Level | Delay [nsec.]
$e^X$ | 26,368 | 73 | 8 | 27.5 | 23,040 | 119 | 7 | 24.1
sin(πX) | 26,880 | 71 | 8 | 27.5 | 23,552 | 114 | 7 | 24.1
tan(πX) | 1,802,240 | – | 5 | – | 179,968 | 207 | 10 | 36.3
arcsin(X) | 61,440 | 72 | 8 | 27.5 | 53,824 | 217 | 13 | 44.7
$\sqrt{X}$ | 61,440 | 72 | 8 | 27.5 | 57,408 | 183 | 11 | 40.3
$\sqrt{-\ln(X)}$ | 266,240 | 81 | 7 | 33.2 | 116,160 | 204 | 11 | 40.2
X ln(X) | 61,440 | 79 | 8 | 27.5 | 48,384 | 160 | 9 | 30.9
Recursive segmentation:
Function f(X) | MT SIE: Memory [bits] | LE | Level | Delay [nsec.] | EV SIE: Memory [bits] | LE | Level | Delay [nsec.]
$e^X$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
sin(πX) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
tan(πX) | 1,687,552 | – | 5 | – | 15,108 | 142 | 7 | 24.1
arcsin(X) | 49,152 | 72 | 7 | 24.3 | 9,984 | 88 | 5 | 17.2
$\sqrt{X}$ | 44,544 | 71 | 7 | 24.1 | 9,216 | 91 | 5 | 17.2
$\sqrt{-\ln(X)}$ | 172,032 | 71 | 7 | 28.1 | 12,736 | 109 | 6 | 20.6
X ln(X) | 20,992 | 49 | 5 | 17.2 | 6,912 | 65 | 4 | 13.8
–: It cannot be mapped into the FPGA due to insufficient RAM blocks.
B. FPGA Implementation of SIEs
Table II compares the FPGA implementation results of the
MT SIE and EV SIE. Note that the memory size in bits and
LE, the number of logic elements, are 0 for $e^X$ and sin(πX)
when recursive segmentation is used. This indicates that uni-
form segmentation was applied, and so an SIE was not needed.
The result is a faster NFG for these functions. In the experiment
that produced the data in Table II, we optimized the decompo-
sition of the MTBDDs and EVBDDs by requiring the memory
size of each LUT in the LUT cascade in these SIEs to be 4K
bits, the same as the RAM block (M4K) of the FPGA.
Table II shows that for optimum non-uniform segmentation,
the EV SIEs have smaller memory size than the MT SIEs. For
example, for tan(πX), the memory size of the EV SIE is only
10% of the memory size needed by the MT SIE. For tan(πX),
the memory size of MT SIE is quite large because the number
of non-uniform segments is large. From experiments with uni-
form segmentation we know that the NFG for tan(πX) using
the MT SIE requires only 1.5% of the memory size needed by
the NFG based on uniform segmentation. However, this is still
too large to implement with an FPGA. By using the EV SIE,
we can reduce the memory size significantly, and make the
NFG implementable with an FPGA.
Our recursive segmentation can reduce both the memory size
and the delay time of the MT SIEs. In particular, for X ln(X), the
MT SIE designed for recursive segmentation has only
34% of the memory and 63% of the delay of the MT SIE designed
for optimum non-uniform segmentation.
By using recursive segmentation and the EV SIE, we can re-
duce both memory size and delay time of SIEs significantly.
For all functions in Table II, both memory size and delay time
of the EV SIEs for recursive segmentation are much smaller than
those of the MT SIEs. In terms of the number of LEs, the EV SIEs
require only up to 1.5 times more LEs than the MT SIEs. Therefore,
designing an EV SIE for recursive segmentation yields
faster and more compact SIEs than those obtained by previous methods.
The design is formal and is easily programmed.
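The averages quoted in the abstract for the SIEs can be reproduced from Table II. The sketch below compares the EV SIE for recursive segmentation against the MT SIE for optimum non-uniform segmentation; it assumes that tan(πX) is left out of the delay average because the MT SIE delay is not available ("–") for that function, which is a guess about how the averages were formed.

```python
# (memory [bits], delay [nsec.]) taken from Table II; None marks a "–" entry.
MT_NONUNIFORM = {
    "e^X": (26368, 27.5), "sin(pi X)": (26880, 27.5),
    "tan(pi X)": (1802240, None), "arcsin(X)": (61440, 27.5),
    "sqrt(X)": (61440, 27.5), "sqrt(-ln X)": (266240, 33.2),
    "X ln X": (61440, 27.5),
}
EV_RECURSIVE = {
    "e^X": (0, 0.0), "sin(pi X)": (0, 0.0),
    "tan(pi X)": (15108, 24.1), "arcsin(X)": (9984, 17.2),
    "sqrt(X)": (9216, 17.2), "sqrt(-ln X)": (12736, 20.6),
    "X ln X": (6912, 13.8),
}


def mean_ratio(index):
    """Average of EV/MT ratios over the functions with an available MT value."""
    ratios = [EV_RECURSIVE[name][index] / values[index]
              for name, values in MT_NONUNIFORM.items()
              if values[index] is not None]
    return sum(ratios) / len(ratios)


print(f"memory: {100 * mean_ratio(0):.1f}%")   # memory: 6.9%
print(f"delay:  {100 * mean_ratio(1):.1f}%")   # delay:  39.6%
```

Rounded to whole percentages, these figures match the 7% of the memory and 40% of the delay reported for the SIEs.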
C. FPGA Implementation of NFGs
Table III compares the FPGA implementation results of our
NFGs using EV SIE (EVNFGs) with the existing NFGs us-
ing MT SIE (MTNFGs) [15], where EVNFGs are based on re-
cursive segmentation and MTNFGs are based on the optimum
non-uniform segmentation. Both NFGs have 23-bit precision
(23-bit accuracy).
From Table II and Table III, we can see that the memory
size of MT SIE accounts for more than 2/3 of the total mem-
ory size of the MTNFG. On the other hand, by using recursive
segmentation and EV SIE, the memory size needed for the SIE
can be reduced to less than 1/4 of the total memory size of
the EVNFG. Thereby, the EVNFGs require only 21% to 63%
of memory size needed for the MTNFGs. For arcsin(X) and
$\sqrt{X}$, as shown in Table I, our recursive segmentation requires
a coefficients table that is about twice as large as needed for
the optimum non-uniform segmentation. Nevertheless, by us-
ing EV SIEs, the memory sizes of EVNFGs can be reduced
to about 63% of the memory sizes of MTNFGs. Further, Ta-
ble III shows that the EVNFGs require fewer LEs and levels
(i.e., shorter latency) than the MTNFGs, and the delay time of
EVNFGs is only about 25% to 89% of the delay time of the
MTNFGs.
To compare our NFG with another existing NFG based on
a segmentation approach (hierarchical segmentation) shown in
[11], we implemented our 24-bit precision NFG for X ln(X)
using the Xilinx Virtex-II FPGA (XC2V4000-6) and
Synplify Premier 8.5. The memory size and delay time of the NFG
described in [11] are 40,446 bits and 103.7 nsec. On the other
hand, the memory size and delay time of our EVNFG based on
recursive segmentation are 30,976 bits (77%) and 63.3 nsec.
(61%).
From these results, we can see that our NFGs using recur-
sive segmentation and the EV SIE can realize a wide range of
functions faster and more compactly than existing NFGs.
VI. CONCLUSION AND COMMENTS
We have presented an architecture and a synthesis method
for fast and compact NFGs for trigonometric, logarithmic,
square root, reciprocal, and combinations of these functions.
Our NFG partitions a given domain of the function into non-
uniform segments using recursive segmentation, and approx-
imates the given function by a polynomial function for each
segment. By using an EVBDD to realize the recursive segmen-
tation, we can implement fast and compact NFGs for a wide
range of functions. Experimental results showed that: 1) By
using recursive segmentation, we reduced memory size and de-
lay time needed for the MT SIE, and produced MT SIEs that
have, on average, only 49% of the memory and 53% of the
delay of MT SIEs for the optimum non-uniform segmentation.
2) By using EVBDD to realize recursive segmentation, we further
reduced memory size and delay time needed for the SIE.
Our SIEs using the EVBDDs require, on average, only 7% of
the memory and 40% of the delay of MT SIEs for the optimum
non-uniform segmentation. And therefore, 3) our NFGs
require, on average, only 38% of the memory and 59% of the
delay needed by the existing NFGs based on MT SIE and the
optimum non-uniform segmentation.
TABLE III
FPGA IMPLEMENTATION OF 23-BIT PRECISION (23-BIT ACCURACY) NFGS.
FPGA device: Altera Stratix EP1S60F1020C5
(LE: 57,120, DSP: 144, M4K: 292, M512: 574)
Logic synthesis tool: Altera QuartusII 5.0
(speed optimization, timing requirement of 200MHz)
Function f(X) | MTNFG based on optimum nonuni.: Memory [bits] | LE | DSP | Level | Delay [nsec.] | EVNFG based on recursive: Memory [bits] | LE | DSP | Level | Delay [nsec.]
$e^X$ | 39,040 | 689 | 10 | 13 | 99.6 | 8,064 | 432 | 10 | 3 | 25.1
sin(πX) | 36,864 | 635 | 10 | 13 | 99.1 | 7,936 | 395 | 10 | 3 | 28.3
tan(πX) | 2,867,200 | – | 16 | 11 | – | 973,572 | 1,059 | 16 | 12 | 92.3
arcsin(X) | 84,736 | 1,301 | 16 | 14 | 107.3 | 53,504 | 937 | 16 | 10 | 80.3
$\sqrt{X}$ | 83,712 | 1,041 | 16 | 14 | 116.5 | 52,224 | 962 | 16 | 10 | 85.5
$\sqrt{-\ln(X)}$ | 357,376 | 950 | 16 | 13 | 99.8 | 103,872 | 972 | 16 | 11 | 88.3
X ln(X) | 83,200 | 988 | 16 | 14 | 116.0 | 29,696 | 997 | 16 | 9 | 70.3
–: It cannot be mapped into the FPGA due to insufficient RAM blocks.
ACKNOWLEDGMENTS
This research is partly supported by the Grant in Aid for
Scientific Research of the Japan Society for the Promotion of
Science (JSPS), funds from Ministry of Education, Culture,
Sports, Science, and Technology (MEXT) via Kitakyushu in-
novative cluster project, a contract with the National Secu-
rity Agency, the MEXT Grant-in-Aid for Young Scientists (B),
18700048, 2006, and Hiroshima City University Grant for Spe-
cial Academic Research (General Studies), 6101, 2006.
REFERENCES
[1] R. E. Bryant, “Graph-based algorithms for boolean function manipula-
tion,” IEEE Trans. Comput., Vol. C-35, No. 8, pp. 677–691, Aug. 1986.
[2] A. Cantoni, “Optimal curve fitting with piecewise linear functions,”
IEEE Trans. on Comp., Vol. 20, No. 1, pp. 59–67, Jan. 1971.
[3] J. Cao, B. W. Y. Wei, and J. Cheng, “High-performance architectures
for elementary function generation,” Proc. of the 15th IEEE Symp. on
Computer Arithmetic (ARITH’01), Vail, Colorado, pp. 136–144, June
2001.
[4] E. M. Clarke, K. L. McMillan, X. Zhao, M. Fujita, and J. Yang, “Spec-
tral transforms for large Boolean functions with applications to tech-
nology mapping,” Proc. of 30th ACM/IEEE Design Automation Confer-
ence, pp. 54–60, June 1993.
[5] D. Defour, F. de Dinechin, and J.-M. Muller, “A new scheme for table-based
evaluation of functions,” 36th Asilomar Conference on Signals,
Systems, and Computers, Pacific Grove, California, pp. 1608–1613,
Nov. 2002.
[6] J. Detrey and F. de Dinechin, “Table-based polynomials for fast hard-
ware function evaluation,” 16th IEEE Inter. Conf. on Application-
Specific Systems, Architectures, and Processors (ASAP’05), pp. 328–
333, 2005.
[7] C. L. Frenzen, T. Sasao, and J. T. Butler, “The tradeoff between memory
size and approximation error in numerical function generators based on
lookup tables,” preprint.
[8] V. K. Jain, S. A. Wadekar, and L. Lin, “A universal nonlinear component
and its application to WSI,” IEEE Trans. on Components, Hybrids, and
Manufacturing Technology, Vol. 16, No. 7, pp. 656–664, Nov. 1993.
[9] Y-T. Lai, M. Pedram, and S. B. Vrudhula, “EVBDD-based algorithms
for linear integer programming, spectral transformation and functional
decomposition,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.,
Vol. 13, No. 8, pp. 959–975, Aug. 1994.
[10] D.-U. Lee, W. Luk, J. Villasenor, and P. Y. K. Cheung, “Non-uniform
segmentation for hardware function evaluation,” Proc. Inter. Conf. on
Field Programmable Logic and Applications, pp. 796–807, Lisbon, Por-
tugal, Sept. 2003.
[11] D.-U. Lee, W. Luk, J. Villasenor, and P. Y. K. Cheung, “Hierarchi-
cal segmentation schemes for function evaluation,” Proc. of the IEEE
Conf. on Field-Programmable Technology, Tokyo, Japan, pp. 92–99,
Dec. 2003.
[12] J. H. Mathews, Numerical Methods for Computer Science, Engineering
and Mathematics, Prentice-Hall, Inc., Englewood Cliffs, NJ, 1987.
[13] J.-M. Muller, Elementary Functions: Algorithms and Implementation,
Birkhäuser Boston, Inc., Secaucus, NJ, 1997.
[14] S. Nagayama and T. Sasao, “On the optimization of heterogeneous
MDDs,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., Vol. 24,
No. 11, pp. 1645–1659, Nov. 2005.
[15] S. Nagayama, T. Sasao, and J. T. Butler, “Programmable numerical
function generators based on quadratic approximation: architecture and
synthesis method,” Proc. of Asia and South Pacific Design Automation
Conference (ASPDAC’06), Yokohama, Japan, pp. 378–383, 2006.
[16] J.-A. Piñeiro, S. F. Oberman, J.-M. Muller, and J. D. Bruguera, “High-speed
function approximation using a minimax quadratic interpolator,”
IEEE Trans. on Comp., Vol. 54, No. 3, pp. 304–318, Mar. 2005.
[17] T. Sasao and M. Fujita (eds.), Representations of Discrete Functions,
Kluwer Academic Publishers 1996.
[18] T. Sasao, M. Matsuura, and Y. Iguchi, “A cascade realization of
multiple-output function for reconfigurable hardware,” Inter. Workshop
on Logic Synthesis (IWLS’01), Lake Tahoe, CA, pp. 225–230, June 12–
15, 2001.
[19] T. Sasao and M. Matsuura, “A method to decompose multiple-output
logic functions,” 41st Design Automation Conference, San Diego, CA,
pp. 428–433, June 2–6, 2004.
[20] T. Sasao, J. T. Butler, and M. D. Riedel, “Application of LUT cas-
cades to numerical function generators,” Proc. the 12th workshop on
Synthesis And System Integration of Mixed Information technologies
(SASIMI’04), Kanazawa, Japan, pp. 422–429, Oct. 2004.
[21] T. Sasao, S. Nagayama, and J. T. Butler, “Programmable numerical
function generators: architectures and synthesis method,” Proc. Inter.
Conf. on Field Programmable Logic and Applications (FPL’05), Tam-
pere, Finland, pp. 118–123, Aug. 2005.
[22] M. J. Schulte and E. E. Swartzlander, “Hardware designs for exactly
rounded elementary functions,” IEEE Trans. on Comp., Vol. 43, No. 8,
pp. 964–973, Aug. 1994.
[23] J. E. Stine and M. J. Schulte, “The symmetric table addition method
for accurate function approximation,” Jour. of VLSI Signal Processing,
Vol. 21, No. 2, pp. 167–177, June 1999.
