Compact numerical function generators based on quadratic approximation: architecture and synthesis method by Nagayama, Shinobu et al.
Calhoun: The NPS Institutional Archive
Faculty and Researcher Publications Faculty and Researcher Publications Collection
2006-12
Compact numerical function generators based
on quadratic approximation: architecture and
synthesis method
Nagayama, Shinobu
S. Nagayama, T. Sasao, and J. T. Butler "Compact numerical function generators
based on quadratic approximation: architecture and synthesis method," IEICE
Transactions on Fundamentals of Electronics, Communications and Computer
Sciences, Vol. E89-A, No.12, Dec. 2006, pp.3510-3518,Special Section on VLSI
Design and CAD Algorithms.
http://hdl.handle.net/10945/35830
3510
IEICE TRANS. FUNDAMENTALS, VOL.E89–A, NO.12 DECEMBER 2006
PAPER Special Section on VLSI Design and CAD Algorithms
Compact Numerical Function Generators Based on Quadratic
Approximation: Architecture and Synthesis Method∗
Shinobu NAGAYAMA†a), Tsutomu SASAO††b), Members, and Jon T. BUTLER†††c), Nonmember
SUMMARY This paper presents an architecture and a synthesis method
for compact numerical function generators (NFGs) for trigonometric, log-
arithmic, square root, reciprocal, and combinations of these functions. Our
NFG partitions a given domain of the function into non-uniform segments
using an LUT cascade, and approximates the given function by a quadratic
polynomial for each segment. Thus, we can implement fast and compact
NFGs for a wide range of functions. Experimental results show that: 1) our
NFGs require, on average, only 4% of the memory needed by NFGs based
on the linear approximation with non-uniform segmentation; 2) our NFG
for 2x − 1 requires only 22% of the memory needed by the NFG based on
a 5th-order approximation with uniform segmentation; and 3) our NFGs
achieve about 70% of the throughput of the existing table-based NFGs us-
ing only a few percent of the memory. Thus, our NFGs can be implemented
with more compact FPGAs than needed for the existing NFGs. Our auto-
matic synthesis system generates such compact NFGs quickly.
key words: LUT cascades, 2nd-order Chebyshev approximation, non-
uniform segmentation, NFGs, automatic synthesis, FPGA
1. Introduction
Numerical functions f (x), such as trigonometric, logarith-
mic, square root, reciprocal, and combinations of these func-
tions, are extensively used in computer graphics, digital
signal processing, communication systems, robotics, astro-
physics, fluid physics, etc. To compute elementary func-
tions, iterative algorithms, such as the CORDIC (COordi-
nate Rotation DIgital Computer) algorithm [1], [23], have
been often used. Although the CORDIC algorithm achieves
accuracy with compact hardware, its computation time is
proportional to the precision (i.e. the number of bits). For
a function composed of elementary functions, the CORDIC
algorithm is slower, since it computes each elementary func-
tion sequentially. It is too slow for numerically intensive
applications. Implementation by a single lookup table for
f (x) is simple and very fast. For low-precision computations
Manuscript received March 3, 2006.
Manuscript revised June 8, 2006.
Final manuscript received July 28, 2006.
†The author is with the Department of Computer Engineering,
Hiroshima City University, Hiroshima-shi, 731–3194 Japan.
††The author is with the Department of Computer Science and
Electronics, Kyushu Institute of Technology, Iizuka-shi, 820–8502
Japan.
†††The author is with the Department of Electrical and Computer
Engineering, Naval Postgraduate School, Monterey, CA 93943-
5121 USA.
∗This paper is an extension of [15].
a) E-mail: nagayama@ieee.org
b) E-mail: sasao@cse.kyutech.ac.jp
c) E-mail: jon butler@msn.com
DOI: 10.1093/ietfec/e89–a.12.3510
Fig. 1 Synthesis flow for NFGs.
of f (x) (e.g. x and f (x) have 8 bits), this implementation is
straightforward. For high-precision computations, however,
the single lookup table implementation is impractical due to
the huge table size.
To reduce the memory size, polynomial approxima-
tions have been used [9], [10], [21], [22]. These methods ap-
proximate the given numerical functions by piecewise poly-
nomials, and realize the polynomials with hardware. Lin-
ear approximations oﬀer fast and middle-precision evalua-
tion of numerical functions. However, for high-precision,
these methods also require excessive memory size. And,
the methods proposed so far are ad-hoc and not systematic.
This paper proposes an architecture and a systematic synthe-
sis method for NFGs based on quadratic approximation. By
using the LUT cascade [8], many numerical functions are
eﬃciently approximated by piecewise quadratic functions,
and are realized by NFGs with small memory size. Our syn-
thesis method is fully automated, so that fast and compact
NFGs can be produced by non-experts. Figure 1 shows the
synthesis flow for the NFG. It converts the Design Specifi-
cation described by Scilab [19], a MATLAB-like software,
into HDL code. The Design Specification consists of a func-
tion f (x), a domain over x, and an accuracy. This system
first partitions the domain into segments, and then approxi-
mates f (x) by a quadratic function in each segment. Next, it
analyzes the errors, and derives the necessary precision for
computing units in the NFG. Then, it generates HDL code
to be mapped into an FPGA using an FPGA vendor tool.
2. Preliminaries
Definition 1: The binary fixed-point representation of a
value r has the form
dn int−1 dn int−2 . . . d1 d0. d−1 d−2 . . . d−n f rac, (1)
where di ∈ {0, 1}, n int is the number of bits for the integer
part, and n f rac is the number of bits for the fractional part
of r. The representation in (1) is two’s complement, and so




Copyright c© 2006 The Institute of Electronics, Information and Communication Engineers
NAGAYAMA et al.: COMPACT NFGS BASED ON QUADRATIC APPROXIMATION
3511
Definition 2: Error is the absolute diﬀerence between the
original value and the approximated value. Approxima-
tion error is the error caused by a function approximation.
Rounding error is the error caused by a binary fixed-point
representation. It is the result of truncation or rounding,
whichever is applied. However, both operations yield an
error that is called rounding error. Acceptable error is the
maximum error that an NFG may assume. Acceptable ap-
proximation error is the maximum approximation error that
a function approximation may assume.
For example, when a function is approximated by a
piecewise linear function, approximation error occurs be-
cause the approximating function does not equal the approx-
imated function everywhere. Typically, there is a non-zero
error over most of the function values; indeed, this error
is usually 0 only for a very few function values. Round-
ing error occurs because the binary representation of the
function (2) takes on only specific values. For example, if
the function value is
√
2, there is no representation in the
form (2), where n f rac is finite, that is exactly √2.
Definition 3: Precision is the total number of bits for a bi-
nary fixed-point representation. Specially, n-bit precision
specifies that n bits are used to represent the number; that is,
n = n int + n f rac. An n-bit precision NFG has an n-bit
input.
Definition 4: Accuracy is the number of bits in the frac-
tional part of a binary fixed-point representation. Specially,
m-bit accuracy specifies that m bits are used to represent the
fractional part of the number; that is, m = n f rac. An m-bit
accuracy NFG is an NFG with an m-bit fractional part of
the input, an m-bit fractional part of the output, and a 2−m
acceptable error.
Definition 5: Truncation is the process of removing lower
order bits from a binary fixed-point number. If r is repre-
sented as r = dn int−1 dn int−2 . . . d0. d−1 . . . d−n f rac and is
truncated to r′ = dn int−1 dn int−2 . . . d0. d−1 . . . d−i, then
the resulting rounding error is at most 2−i − 2−n f rac, where
i ≤ n f rac. Truncation of r never produces a value larger
than r.
Definition 6: Rounding is truncation if d−(i+1) = 0. If
d−(i+1) = 1, 2−i is added to the result of truncation. That
is, if r is represented as
r = dn int−1 dn int−2 . . . d0. d−1 . . . d−n f rac,
r is rounded to
r′ = dn int−1 dn int−2 . . . d0. d−1 . . . d−i + d−(i+1)2−i.
The error caused by rounding is at most 2−(i+1).
Truncation can cause a larger error than rounding. On
the other hand, truncation requires less hardware. In our
architecture, we use both truncation and rounding. This is
discussed in Appendix.
3. Piecewise Quadratic Approximation
To approximate the numerical function f (x) using quadratic
functions, we first partition the domain for x into segments.
For each segment, we approximate f (x) using a quadratic
function g(x) = c2x2 + c1x + c0. In this case, we seek the
fewest segments, since this reduces the memory size needed
for storing the coeﬃcients of the quadratic functions.
For piecewise polynomial approximations, in many
cases, the domain is partitioned into uniform segments [2]–
[4], [21], [22]. Such methods are simple and fast, but for
some kinds of numerical functions, too many segments are
required, resulting in large memory.
For a given error, non-uniform segmentation of the do-
main uses fewer segments than uniform segmentation [9],
[18]. However, a non-uniform segmentation requires an ad-
ditional circuit that maps values of x to a segment num-
ber. Potentially, this is a complex circuit. To simplify the
additional circuit, Lee et al. [9] have proposed a special
non-uniform segmentation for the √− ln(x) function. Their
method produces a simple circuit by restricting the segmen-
tation points, and results in fewer segments as well as faster
and more compact NFG than produced by uniform segmen-
tation. However, it is not always optimum for the given
function f (x). For a fast and compact realization of any non-
uniform segmentations, we use an LUT cascade proposed by
Sasao et al. [17], [18] (see Sect. 4). By using the LUT cas-
cade, we can use an optimum non-uniform segmentation for
the given function f (x). As far as we know, there is no other
method that uses an optimum non-uniform segmentation.
3.1 Non-uniform Segmentation Algorithm
The number of non-uniform segments depends on the ap-
proximation polynomial. A more accurate approximation
polynomial requires fewer segments. In this paper, we
use the 2nd-order Chebyshev polynomials to approximate
f (x). We show that its coeﬃcients are computed easily and
quickly: it is suitable for fast automatic synthesis.
For a segment [s, e] of f (x), the maximum approxima-
tion error 2(s, e) of the 2nd-order Chebyshev approximation
[11] is given by
2(s, e) = (e − s)
3
192 maxs≤x≤e | f
(3)(x)|, (3)
where f (3) is the 3rd-order derivative of f . From (3), 2(s, e)
is a monotone increasing function of segment width e − s.
Using this property, we partition a domain into as wide seg-
ments as possible such that the approximation error is less
than the specified approximation error. Figure 2 shows the
non-uniform segmentation algorithm. The inputs for this al-
gorithm are a numerical function f (x), a domain [a, b] for
x, and an acceptable approximation error a. Then, this al-
gorithm approximates f (x) with the acceptable approxima-
tion error a, and produces t segments [s0, e0], [s1, e1], . . .,
[st−1, et−1]. For step 2 in Fig. 2, the accurate computation of
3512
IEICE TRANS. FUNDAMENTALS, VOL.E89–A, NO.12 DECEMBER 2006
Input: Numerical function f (x), Domain [a, b] for x,
Acceptable approximation error a.
Output: Segments [s0, e0], [s1, e1], . . . , [st−1, et−1].
Process:
1. Let s0 = a and i = 0.
2. Find a value p (≥ si) where 2(si, p) = a.
3. If p > b, then let p = b.
4. Let ei = p and i = i + 1.
5. If p = b, then let t = i, and stop the process.
6. Else, let si = p, and go to step 2.
Fig. 2 Non-uniform segmentation algorithm for the domain.
the value p where 2(si, p) = a is diﬃcult. Thus, we obtain
the maximum value p′ satisfying 2(si, p′) ≤ a. Such p′
can be found by scanning values of n-bit input x. However,
it requires O(2n) search, and is time-consuming. Therefore,
we compute the maximum value p′ by specifying the bits
of x from the MSB to the LSB such that 2(si, p′) ≤ a.
This requires a search with time complexity O(n). In the
computation of 2(si, p′), the value of maxsi≤x≤p′ | f (3)(x)| is
computed by a nonlinear programming algorithm [7].
3.2 Computation of the Approximate Value
For each [si, ei], f (x) is approximated by the corresponding
quadratic function gi(x). That is, the approximated value
y of f (x) is computed by y = gi(x) = c2ix2 + c1ix + c0i,
where the coeﬃcients c2i, c1i, and c0i are derived from the
2nd-order Chebyshev approximation polynomial [11]. Sub-
stituting x − qi + qi for x yields the transformation
gi(x) = c2i(x − qi)2 + (c1i + 2c2iqi)(x − qi)
+c0i + c1iqi + c2iq2i .
Let c′1i = c1i + 2c2iqi and c′0i = c0i + c1iqi + c2iq2i . Then, we
have
gi(x) = c2i(x − qi)2 + c′1i(x − qi) + c′0i. (4)
This transformation reduces the multiplier size.
4. Architecture for NFGs
Figure 3 shows the architecture that realizes (4). It has 7
units: the segment index encoder (which produces the seg-
ment index i for [si, ei] given the value x); the coeﬃcients
table (which stores −qi, c2i, c′1i, and c′0i); an adder (which
produces x + (−qi)); a squaring unit; two multipliers; and
the output adder.
A segment index encoder converts x into a segment in-
dex i. It realizes the segment index function seg f unc(x) :
Bn → {0, 1, . . . , t−1} shown in Fig. 4 (a), where x has n bits,
B = {0, 1}, and t denotes the number of segments. NFGs
in which the most significant bits directly drive the address
inputs of a coeﬃcient memory need no segment index en-
coder. On the other hand, our NFGs based on non-uniform
segmentation require this additional circuit. Although NFGs
based on non-uniform segmentation have a smaller coeﬃ-
cients table than those based on uniform segmentation, the
size of segment index encoder may have to be considered to
Fig. 3 Architecture for NFGs.
(a) Segment index function. (b) LUT cascade.
Fig. 4 Segment index encoder.
provide a fair comparison. In [9], to simplify the segment in-
dex encoder, a special non-uniform segmentation is used for√− ln(x). However, it does not correspond to an optimum
segmentation. To produce compact NFGs for a wide range
of functions, it is important to guarantee that the size of the
segment index encoder is reasonable for any non-uniform
segmentation. Our synthesis system uses the LUT cascade
[8], [16], [17] shown in Fig. 4(b) to realize any given (opti-
mum) seg f unc(x). In an LUT cascade, the interconnecting
lines between adjacent LUTs are called rails. The size of
an LUT cascade depends on the number of rails. Sasao et
al. [17] have shown that the size of an LUT cascade depends
on the number of segments. Specifically,
Theorem 1: Let seg f unc(x) be a segment index function
with t segments. Then, there exists an LUT cascade for
seg f unc(x) with at most 	log2 t
 rails.
[18] shows that by using an LUT cascade, we can generate
compact NFGs for a wide range of functions. In this pa-
per, we significantly reduce memory sizes of the coeﬃcients
table and the LUT cascade by reducing the number of seg-
ments using the quadratic approximation. Our synthesis sys-
tem uses heterogeneous MDDs (Multi-valued Decision Di-
agrams) [13], [14] to find compact LUT cascades. An LUT
cascade passes data from the leftmost LUT to the rightmost
LUT. Since each LUT in an LUT cascade operates inde-
pendently and concurrently, we can easily obtain a pipeline
structure by assigning each LUT to one pipeline stage. By
using LUTs with the same size, we can easily achieve an
eﬃcient pipeline processing. To the best of our knowledge,
this is the first method for realizing any seg f unc(x).
4.1 Reduction of the Size of the Multiplier
To generate a fast NFG, reducing multiplier size is impor-
tant. Since the size of multipliers depends on the number of
bits for c2i, c′1i, and x − qi, we reduce the number of bits to
represent these values.
To reduce the number of bits for c2i and c′1i, we use a
scaling method [9]. We represent c2i and c′1i as c2i × 2−l2i ×
2l2i and c′1i × 2−l1i × 2l1i , respectively. And, we compute the
NAGAYAMA et al.: COMPACT NFGS BASED ON QUADRATIC APPROXIMATION
3513
products c2i(x − qi)2 and c′1i(x − qi) using multipliers for the
right-shifted multiplicand, c2i×2−l2i (x−qi)2 and c′1i×2−l1i (x−
qi), and left shifters to restore the products. Applying right
shifts reduces the number of bits for c2i × 2−l2i and c′1i × 2−l1i(i.e. the multiplier sizes) by rounding the l2i LSBs of c2i and
l1i of c′1i, while increasing the rounding errors. In Appendix,
we find the largest l2i and l1i for each segment that preserve
the given acceptable error. When l2i and l1i are 0 for all the
segments, we do not use scaling method.
Example 1: Consider a 24-bit precision NFG for
√− ln(x).
By using the scaling method, c2i, l2i, c′1i, and l1i have 13 bits,
5 bits, 20 bits, and 4 bits, respectively, and they produce an
NFG with memory size: 173, 056 bits, operating frequency:
130 MHz, and 16 DSP units. On the other hand, without the
scaling method, c2i and c′1i have 37 bits and 28 bits, respec-
tively, and they produce an NFG with larger memory size:
184, 832 bits, much lower operating frequency: 71 MHz,
and more DSP units: 20 units. (End of Example)
Next, we reduce the value of x − qi. The number of
bits for x − qi influences the sizes of the squaring unit and
multipliers. Thus, reducing the value of x − qi reduces the
sizes of the squaring unit and multipliers, and also the error.
From (4), we can choose any value for qi. To reduce the
value of x − qi, for a segment [si, ei], we set qi = (si + ei)/2.
Then, we have |x − qi| ≤ (ei − si)/2. Thus, reducing the
segment width ei − si reduces the value for x − qi. However,
this also increases the number of segments, and results in
increased memory size. We show a reduction method of
segment width without increasing the memory size.
The coeﬃcients table in Fig. 3 has 2k words, where
k = 	log2 t
 and t is the number of segments. Therefore,
we can increase the number of segments up to t = 2k with-
out increasing the memory size. From Theorem 1, the size
of the LUT cascade also depends on the value of k. Thus,
increasing the number of segments to t = 2k rarely increases
the size of the LUT cascade. We reduce the size of seg-
ments by dividing the largest segment into two equal sized
Table 1 Number of segments for various approximation methods.
Function Domain Acceptable approximation error: 2−17 Acceptable approximation error: 2−25
f (x) (x has 15-bit accuracy) (x has 23-bit accuracy)
Linear 2nd-Chebyshev Time Linear 2nd-Chebyshev Time
Uniform Non Uniform Non [msec] Uniform Non Uniform Non [msec]
2x [0, 1] 129 128 9 7 10 2,049 2,048 65 44 100
1/x [1, 2) 128 124 16 11 10 2,048 1,982 128 64 90√
x [1/32, 2) 2,016 193 252 24 10 32,256 3,082 2,016 138 130
1/
√
x [1, 2) 128 46 16 8 10 2,048 1,024 128 46 70
log2(x) [1, 2) 128 128 16 10 10 2,048 2,048 128 56 90
ln(x) [1, 2) 128 89 16 9 10 2,048 1,437 128 50 80
sin(πx) [0, 1/2) 257 127 17 12 10 4,097 2,027 129 74 70
cos(πx) [0, 1/2) 257 127 17 12 10 4,097 2,027 129 74 80
tan(πx) [0, 1/4) 257 112 33 12 10 4,097 1,787 129 73 140√− ln(x) [1/32, 1) 31,744 354 31,744 52 90 8,126,464 5,933 8,126,464 331 840
tan2(πx) + 1 [0, 1/4) 513 256 33 17 20 8,193 4,096 257 101 200
Entropy [1/256, 255/256] 2,033 520 509 40 30 32,513 8,320 4,065 234 260
Sigmoid [0, 1] 129 127 33 13 20 2,049 2,020 129 76 220
Gaussian [0, 1/2] 33 32 5 4 10 513 512 33 18 30
Average 2,706 170 2,337 17 19 587,466 2,739 580,995 99 171
Linear: Linear approximation. 2nd-Chebyshev: 2nd-order Chebyshev approximation.
Uniform: Uniform segmentation. Non: Non-uniform segmentation.
Time: CPU time for our non-uniform segmentation algorithm conducted on the following environment.
System: Sun Blade 2500 (Silver), CPU: UltraSPARC-IIIi 1.6 GHz, Memory: 6 GB, OS: Solaris 9, and C compiler: gcc -O2.
segments up to t = 2k. This reduces the rounding error as
well (see Appendix).
5. Experimental Results
5.1 Number of Segments and Computation Time
Table 1 compares the number of segments for various ap-
proximation methods for the functions in [17]. In this table,
Entropy, Sigmoid, and Gaussian are
Entropy = −x log2 x − (1 − x) log2(1 − x),
Sigmoid = 1
1 + e−4x





In Table 1, the columns “Linear Uniform” and “Linear Non”
show the number of uniform and non-uniform segments [18]
for linear approximation, respectively. The columns “2nd-
Chebyshev Uniform” and “2nd-Chebyshev Non” show the
number of uniform and non-uniform segments for the 2nd-
order Chebyshev approximation, respectively. The columns
“Time” show the CPU time for our non-uniform segmenta-
tion algorithm applied to functions, in milliseconds.
Table 1 shows that, for many functions, the 2nd-order
Chebyshev approximations require many fewer segments
than the linear approximations. However, for some func-
tions, such as
√− ln(x), the 2nd-order Chebyshev approxi-
mation based on uniform segmentation requires many more
segments than the linear and 2nd-order Chebyshev approx-
imations based on non-uniform segmentations. Many ex-
isting polynomial approximation methods are based on uni-
form segmentation. For trigonometric and exponential func-
tions, the methods based on uniform segmentation require
relatively few segments. However, for some kinds of func-
tions such as
√− ln(x), the uniform methods require ex-
cessively many segments, even if quadratic approximation
is used. On the other hand, our quadratic approximation
based on non-uniform segmentation requires many fewer
3514
IEICE TRANS. FUNDAMENTALS, VOL.E89–A, NO.12 DECEMBER 2006
Table 2 Comparison with linear approximation based on non-uniform
segmentation.
Function 16-bit precision 24-bit precision
f (x) (15-bit accuracy) (23-bit accuracy)
Memory [bits] R Memory [bits] R
Linear Quad. [%] Linear Quad. [%]
2x 20,992 1,112 5 696,320 19,072 3
1/x 21,248 2,432 11 700,416 19,136 3√
x 43,776 5,536 13 1,425,408 86,784 6
1/
√
x 10,176 1,104 11 343,040 19,008 6
log2(x) 20,864 2,464 12 694,272 19,072 3
ln(x) 20,096 2,448 12 700,416 19,136 3
sin(πx) 19,456 2,336 12 661,504 38,656 6
cos(πx) 19,584 2,336 12 663,552 38,784 6
tan(πx) 19,712 2,304 12 667,648 38,272 6√− ln(x) 74,240 11,264 15 2,662,400 173,056 7
tan2(πx) + 1 37,632 4,960 13 1,290,240 39,040 3
Entropy 106,496 10,688 10 3,768,320 83,968 2
Sigmoid 21,120 2,432 12 702,464 40,320 6
Gaussian 4,416 444 10 156,672 8,384 5
Average 31,415 3,704 11 1,080,905 45,906 4
Memory: Memory size. Linear: Linear approximation [18].
Quad.: 2nd-order Chebyshev approximation. R: Ratio.
segments for a wide range of functions. Also, Table 1 shows
that the CPU time to compute the segmentation strongly de-
pends on the number of segments. Smaller acceptable ap-
proximation error requires more segments and longer com-
putation time. However, Table 1 shows that, for all functions
in the table, the CPU times are shorter than 1 second when
the acceptable approximation error is 2−25.
These results show that, for various functions, our seg-
mentation algorithm partitions a domain into fewer non-
uniform segments quickly, and it is useful for fast synthesis.
5.2 Memory Sizes of Various NFGs
This section compares the memory sizes of our NFGs with
three existing NFGs [3], [4], [18]. Table 2 compares with
NFGs based on linear approximation and non-uniform seg-
mentation shown in [18]. In Table 2, the columns “R” show
the following values:
R =
memory size of quadratic approximation
memory size of linear approximation
× 100 (%).
Table 2 shows that NFGs using quadratic approximation re-
quire much smaller memory than ones using linear approx-
imation. Especially, 24-bit precision NFGs using quadratic
approximation can be implemented with, on average, only
4% of the memory size needed for a linear approxima-
tion. From the relation between precision and memory size
shown in Table 2, we can see that increasing the precision
decreases the ratio of memory sizes in NFGs.
Figure 5 shows a plot of the total memory size and the
required precision for NFGs based on linear and quadratic
approximations of 1/x. Since memory size of quadratic
NFG increases more slowly than that of linear NFG, as
the precision increases, we conjecture that for precisions
higher than 24-bit, our quadratic NFGs will require reason-
able memory size. Unfortunately, we could not verify that
because of the precision of our NFG synthesis tool. Table 3
and Table 4 compare our NFGs with NFGs using a 5th-
order Taylor expansion [3] and NFGs using 2nd-order min-
Fig. 5 Memory sizes of linear NFGs and quadratic NFGs for 1/x.
Table 3 Comparison with 5th-order approximation based on uniform
segmentation.
Func. Domain Accuracy Memory size [bits] Ratio
f (x) 5th-order Quad. [%]
(Uniform) (Non)
sin(πx) [0, 1/4] 2−23 70,528 18,048 26
exp(x) [0, 1] 2−24 82,432 43,136 52
2x − 1 [0, 1] 2−24 89,600 19,968 22
5th-order: 5th-order approximation [3].
Quad.: 2nd-order Chebyshev approximation.
Table 4 Comparison with quadratic approximation based on uniform
segmentation.
Func. Domain Accuracy Memory size [bits] Ratio
f (x) Minimax Cheb. [%]
(Uniform) (Non)
sin(πx/4) [0, 1) 2−24 16,288 19,200 118
2x − 1 [0, 1) 2−16 2,208 2,512 114
Minimax: 2nd-order minimax approximation [4].
Cheb.: 2nd-order Chebyshev approximation.
imax approximation by the Remez algorithm [4], respec-
tively. Both approximations in [3], [4] are based on uni-
form segmentation. Thus, their NFGs require no segment
index encoder. On the other hand, since our approximation
is based on non-uniform segmentation, the memory size is
obtained by summing the memory sizes of the coeﬃcients
table and the segment index encoder. As shown in Table 1,
for trigonometric and exponential functions, the diﬀerence
of the number of uniform segments and non-uniform seg-
ments is not so large under the same approximation poly-
nomial. For such functions, NFGs based on uniform seg-
mentation (needing no segment index encoder) often require
smaller memory than non-uniform segmentations. Although
our NFGs require the segment index encoder and use ap-
proximation polynomials with larger approximation error
than approximation polynomials in [3], [4], our NFGs for
such functions are implemented with only 22% to 52% of
the memory sizes of NFGs in [3], and with memory size
comparable to [4]. In [3], [4], memory sizes of NFGs for√
x and
√− ln(x) are unavailable. However, from Table 1,
we can see that the memory size of their NFGs for
√
x and√− ln(x) is excessively large. On the other hand, our NFGs
can realize a wide range of functions with small memory
NAGAYAMA et al.: COMPACT NFGS BASED ON QUADRATIC APPROXIMATION
3515
Table 5 FPGA implementation of NFGs for linear and quadratic approximations.
FPGA device: Altera Stratix (EP1S10F484C5: 10,570 logic elements, 48 DSP units)
Logic synthesis tool: Altera QuartusII 5.0 with speed optimization option and timing requirement of 200 MHz
Function 16-bit precision 24-bit precision
f (x) Logic elements DSP units Freq. [MHz] Logic elements DSP units Freq. [MHz]
Linear Quad. Linear Quad. Linear Quad. Linear Quad. Linear Quad. Linear Quad.
2x 167 482 2 4 195 185 604 758 2 10 – 131
1/x 204 376 2 4 234 186 636 859 2 10 – 134√
x 270 496 2 4 237 179 1,211 822 2 16 – 124
1/
√
x 186 475 2 4 237 186 402 753 2 10 – 131
log2(x) 163 381 2 4 194 186 597 757 2 10 – 131
ln(x) 170 379 2 4 197 185 416 863 2 10 – 131
sin(πx) 154 424 2 4 197 192 480 646 8 10 – 134
cos(πx) 172 354 2 4 237 179 412 647 8 10 – 131
tan(πx) 234 382 2 4 237 178 655 604 2 10 – 131√− ln(x) 304 623 2 10 215 135 854 942 8 16 – 130
tan2(πx) + 1 132 282 2 4 194 215 991 720 2 10 – 135
Entropy 141 403 2 4 235 206 1,370 914 2 16 – 128
Sigmoid 167 430 2 4 194 191 627 706 2 10 – 131
Gaussian 181 419 2 4 237 186 303 747 2 10 216 129
Average 189 422 2 4 217 185 683 767 3 11 – 131
Linear: Linear approximation [18]. Quad.: 2nd-order Chebyshev approximation. Freq. : Operating frequency.
–: NFGs cannot be mapped into the FPGA due to the excessive memory size.
Table 6 Comparison with table-based method: STAM [22].
FPGA device: Altera Stratix (EP1S25F780C5: 25,660 logic elements, 80 DSP units)
Logic synthesis tool: Altera QuartusII 5.0 with speed optimization option and timing requirement of 200 MHz
Function Domain In prec. Out prec. Memory size [bits] Operating frequency [MHz]
f (x) Int Frac Int Frac STAM Quad. Ratio [%] STAM Quad. Ratio [%]
1/x [1, 2) 1 23 1 23 651,264 19,136 3 184 131 71
sin(x) [0, 1) 0 24 0 24 491,520 40,832 8 205 131 64
2x [0, 1) 0 24 1 23 356,352 19,136 5 198 132 66
Quad.: our NFG. In prec.: Input precision. Out prec.: Output precision.
Int: Number of integer bits. Frac: Number of fractional bits.
size.
5.3 FPGA Implementation
This section compares the FPGA implementation results of
our NFGs with two existing NFGs [18], [22]. Table 5 shows
FPGA implementation results of our NFGs and NFGs based
on linear approximation using non-uniform segmentation
[18].
Since the architecture of a linear NFG [18] is simpler
than that of a quadratic NFG, linear NFGs are faster, and
require fewer logic elements and DSP units than quadratic
NFGs. However, linear approximations require more seg-
ments and larger memory than quadratic approximations, as
shown in Table 1 and Table 2. Table 5 shows that there are
not enough resources in the smallest device in the Stratix
family to achieve 24-bit precision using linear approxima-
tion for all functions, except Gaussian. This is due to
the excessive memory size (although many logic elements
and DSP units are unused). The most crucial issue in the
FPGA implementation is the constraints on these hardware
resources. For 24-bit precision, the linear approximation
requires a larger FPGA due to the excessive memory size.
However, in larger FPGAs, more logic elements and DSP
units are left unused. On the other hand, a quadratic NFG
can be implemented with a smaller FPGA, since it requires
much smaller memory size than a linear NFG. In fact, we
implemented 24-bit precision NFGs with lower cost and
smaller FPGAs (Cyclone II), using quadratic approximation
instead of linear approximation.
To show the performance of our NFGs, we compare
with a well-known table-based NFG, STAM [22], that uses
1st-order Taylor expansion and uniform segmentation. Ta-
ble 6 compares memory size and operating frequency. Since
the STAM requires tables and a multiple-input adder, but re-
quires no multiplier, it is faster. However, it requires exces-
sive memory size and therefore larger FPGAs. On the other
hand, our NFGs achieve about 60% to 70% of the operating
frequency of the STAM with much smaller memory size.
6. Conclusion and Comments
We have demonstrated an architecture and a synthesis
method for compact NFGs for trigonometric, logarithmic,
square root, reciprocal, and combinations of these functions.
Our architecture can eﬃciently realize any non-uniform seg-
mentation using a compact LUT cascade, and can approx-
imate many numerical functions by quadratic polynomials.
Therefore, our architecture is suitable for automatic synthe-
sis of fast and compact NFGs. Experimental results show
that our synthesis method can approximate a wide range of
functions with a small number of non-uniform segments,
and generate NFGs with small memory size. For 24-bit
precision, our NFGs can be implemented with, on aver-
age, only 4% of the memory size of NFGs based on lin-
ear approximation and non-uniform segmentation, and with,
on average, only 33% of the memory size of NFGs based
on the 5th-order approximation and uniform segmentation.
NFGs based on the linear approximation are faster than the
quadratic ones, but for high-precision, they require a large
3516
IEICE TRANS. FUNDAMENTALS, VOL.E89–A, NO.12 DECEMBER 2006
FPGA due to the excessive memory size. On the other hand,
our quadratic NFGs can achieve about 70% of the through-
put of the table-based NFG using only a few percent of the
memory, and can be implemented with a more compact and
low-cost FPGA.
Our 24-bit precision NFGs can be used to compute a
significand of an IEEE single-precision floating-point num-
ber. Since our NFGs are compact, we conjecture that our
NFGs will be practical also for higher-precision than 24-bit.
However, we could not verify the accuracy of NFGs with
higher-precision than 24-bit due to the errors of NFG syn-
thesis tool developed by C language. Thus, we have to de-
velop a more accurate NFG synthesis tool and a verification
tool in the future.
Acknowledgments
This research is partly supported by the Grant in Aid for Sci-
entific Research of the Japan Society for the Promotion of
Science (JSPS), funds from Ministry of Education, Culture,
Sports, Science, and Technology (MEXT) via Kitakyushu
innovative cluster project, NSA Contract RM A-54, and the
Ministry of Education, Culture, Sports, Science, and Tech-
nology (MEXT), Grant-in-Aid for Young Scientists (B),
18700048, 2006. The comments of two referees were useful
in improving this paper.
References
[1] R. Andraka, “A survey of CORDIC algorithms for FPGA based
computers,” Proc. 1998 ACM/SIGDA Sixth Inter. Symp. on Field
Programmable Gate Array (FPGA’98), pp.191–200, Monterey, CA,
Feb. 1998.
[2] J. Cao, B.W.Y. Wei, and J. Cheng, “High-performance architec-
tures for elementary function generation,” Proc. 15th IEEE Symp.
on Computer Arithmetic (ARITH’01), pp.136–144, Vail, CO, June
2001.
[3] D. Defour, F. de Dinechin, and J.-M. Muller, “A new scheme for
table-based evaluation of functions,” 36th Asilomar Conference on
Signals, Systems, and Computers, pp.1608–1613, Pacific Grove,
California, Nov. 2002.
[4] J. Detrey and F. de Dinechin, “Second order function approxima-
tion using a single multiplication on FPGAs,” Proc. Inter. Conf. on
Field Programmable Logic and Applications (FPL’04), pp.221–230,
Leuven, Belgium, 2004.
[5] N. Doi, T. Horiyama, M. Nakanishi, and S. Kimura, “Minimiza-
tion of fractional wordlength on fixed-point conversion for high-
level synthesis,” Proc. Asia and South Pacific Design Automation
Conference (ASPDAC’04), pp.80–85, Yokohama, Japan, 2004.
[6] H. Hassler and N. Takagi, “Function evaluation by table look-up
and addition,” Proc. 12th IEEE Symp. on Computer Arithmetic
(ARITH’95), pp.10–16, Bath, England, July 1995.
[7] T. Ibaraki and M. Fukushima, FORTRAN 77 Optimization Program-
ming, Iwanami, 1991.
[8] Y. Iguchi, T. Sasao, and M. Matsuura, “Realization of multiple-
output functions by reconfigurable cascades,” International Con-
ference on Computer Design: VLSI in Computers and Processors
(ICCD’01), pp.388–393, Austin, TX, Sept. 2001.
[9] D.-U. Lee, W. Luk, J. Villasenor, and P.Y.K. Cheung, “Non-uniform
segmentation for hardware function evaluation,” Proc. Inter. Conf.
on Field Programmable Logic and Applications, pp.796–807, Lis-
bon, Portugal, Sept. 2003.
[10] D.-U. Lee, W. Luk, J. Villasenor, and P.Y.K. Cheung, “A Gaussian
noise generator for hardware-based simulations,” IEEE Trans. Com-
put., vol.53, no.12, pp.1523–1534, Dec. 2004.
[11] J.H. Mathews, Numerical Methods for Computer Science, Engineer-
ing and Mathematics, Prentice-Hall, Englewood Cliﬀs, NJ, 1987.
[12] J.-M. Muller, Elementary Function: Algorithms and Implementa-
tion, Birkhauser Boston, Secaucus, NJ, 1997.
[13] S. Nagayama and T. Sasao, “Compact representations of logic func-
tions using heterogeneous MDDs,” IEICE Trans. Fundamentals,
vol.E86-A, no.12, pp.3168–3175, Dec. 2003.
[14] S. Nagayama and T. Sasao, “On the optimization of heterogeneous
MDDs,” IEEE Trans. Comput.-Aided Des. Integrated Circuits Syst.,
vol.24, no.11, pp.1645–1659, Nov. 2005.
[15] S. Nagayama, T. Sasao, and J.T. Butler, “Programmable numeri-
cal function generators based on quadratic approximation: Archi-
tecture and synthesis method,” Proc. Asia and South Pacific De-
sign Automation Conference (ASPDAC’06), pp.378–383, Yoko-
hama, Japan, 2006.
[16] T. Sasao, M. Matsuura, and Y. Iguchi, “A cascade realization of
multiple-output function for reconfigurable hardware,” Inter. Work-
shop on Logic and Synthesis (IWLS’01), pp.225–230, Lake Tahoe,
CA, June 2001.
[17] T. Sasao, J.T. Butler, and M.D. Riedel, “Application of LUT cas-
cades to numerical function generators,” Proc. 12th Workshop on
Synthesis and System Integration of Mixed Information Technolo-
gies (SASIMI’04), pp.422–429, Kanazawa, Japan, Oct. 2004.
[18] T. Sasao, S. Nagayama, and J.T. Butler, “Programmable numerical
function generators: Architectures and synthesis method,” Proc. In-
ter. Conf. on Field Programmable Logic and Applications (FPL’05),
pp.118–123, Tampere, Finland, Aug. 2005.
[19] Scilab 3.0, INRIA-ENPC, France, http://scilabsoft.inria.fr/
[20] M.J. Schulte and J.E. Stine, “Symmetric bipartite tables for accu-
rate function approximation,” 13th IEEE Symp. on Comput. Arith.,
vol.48, no.9, pp.175–183, Asilomar, CA, 1997.
[21] M.J. Schulte and J.E. Stine, “Approximating elementary functions
with symmetric bipartite tables,” IEEE Trans. Comput., vol.48, no.8,
pp.842–847, Aug. 1999.
[22] J.E. Stine and M.J. Schulte, “The symmetric table addition method
for accurate function approximation,” J. VLSI Signal Processing,
vol.21, no.2, pp.167–177, June 1999.
[23] J.E. Volder, “The CORDIC trigonometric computing technique,”
IRE Trans. Electronic Comput., vol.EC-820, no.3, pp.330–334,
Sept. 1959.
Appendix: Error Analysis
We analyze the error for our NFGs and show a method to
obtain the appropriate bit sizes for the units in our architec-
ture. Our synthesis system applies the method in [5] to the
NFG shown in Fig. 3.
A.1 Error of NFG
There are two kinds of errors: approximation error and
rounding error. The approximation error is given by a, as
shown in Sect. 3. Thus, this section focuses on rounding
error. We ignore the integer bits in this analysis because
rounding errors are independent of integer bits. We assume
there is at least one fractional bit.
Errors in Coeﬃcients: The coeﬃcients c2i, c′1i, and c′0i for
the quadratic approximation gi(x) are rounded to u2-bit, u1-
bit, and u0-bit accuracy, respectively. That is, due to round-
ing, the coeﬃcients become
NAGAYAMA et al.: COMPACT NFGS BASED ON QUADRATIC APPROXIMATION
3517
c2i + α2, c
′
1i + α1, and c′0i + α0,
(|α2| ≤ 2−(u2+1), |α1| ≤ 2−(u1+1), |α0| ≤ 2−(u0+1)),
where α2, α1, and α0 are rounding errors of c2i, c′1i, and c′0i,
respectively. In this case, we need no hardware for rounding
the coeﬃcients because it is precomputed before storing in
the coeﬃcients table. Since rounding yields a more accurate
result than the truncation, we choose rounding.
Error in Squaring Unit: When input x has min bits, our
synthesis system ensures that qi has also min bits. Therefore,
the value of (x − qi) also has min bits, and has no rounding
error. Although the value of (x−qi)2 has 2min bits, the value
of (x − qi)2 is truncated to u3 bits in order to reduce the
size of succeeding multiplier, where u3 ≤ 2min. We choose
truncation because it does not require additional hardware,
while rounding requires half adders at the output of squaring
unit. Thus, the succeeding multiplier uses the value of
(x − qi)2 − α3 (0 ≤ α3 ≤ 2−u3 − 2−2min ),
where α3 is the rounding error for the truncation.
Errors in Multipliers: As shown in the two previous para-
graphs, c2i and c′1i are stored as c2i + α2 and c′1i + α1 in the
ROM, and (x − qi)2 is changed to (x − qi)2 − α3 by trunca-
tion. These values change the original products c2i(x − qi)2
and c′1i(x − qi) to
(c2i + α2){(x − qi)2 − α3} and (A· 1)
(c′1i + α1)(x − qi). (A· 2)
The values of (A· 1) and (A· 2) have (u2 + u3) bits and
(u1 + min) bits, respectively, and they are truncated to u0
bits in order to match the addition with c′0i. Note that
u0 ≤ min(u2 + u3, u1 + min). Following a method shown
in [20], the (u0 + 1)th-fractional bit d−(u0+1) of (A· 1) is set to
1 after the truncation. Thus, the adder uses the values of
(c2i + α2){(x − qi)2 − α3} − α4 and
(c′1i + α1)(x − qi) − α5,
where α4 and α5 are the rounding errors for the truncations
of (A· 1) and (A· 2): 0 ≤ α4 ≤ 2−(u0+1) − 2−(u2+u3) and 0 ≤
α5 ≤ 2−u0 − 2−(u1+min).
Errors in Adder: From previous paragraphs, the original
sum c2i(x − qi)2 + c′1i(x − qi) + c′0i is changed to
(c2i + α2){(x − qi)2 − α3} + (c′1i + α1)(x − qi) + c′0i
+ α0 − α4 − α5. (A· 3)
This value has u0 bits, and is rounded to mout bits (output
accuracy given in the specification), where mout ≤ u0. Thus,
the value of gi(x) is changed to
(c2i + α2){(x − qi)2 − α3} + (c′1i + α1)(x − qi) + c′0i
+ α0 − α4 − α5 + α6,
where α6 is the error for the rounding to mout bits. Since
d−(u0+1) of (A· 1) is set to 1, d−(u0+1) of (A· 3) is also 1, and
we have |α6| ≤ 2−(mout+1) − 2−(u0+1). Note that d−(u0+1) is not
implemented in hardware [20]. By expanding and rearrang-
ing this, we have
gi(x) + α2{(x − qi)2 − α3} + α1(x − qi) + α0
−c2iα3 − α4 − α5 + α6. (A· 4)
This is an output value of the NFG including rounding er-
ror. (A· 4) has the maximum rounding error when α0 =
−2−(u0+1), α1 = −2−(u1+1), α2 = −2−(u2+1), α3 = 2−u3 − 2−2min ,
α4 = 2−(u0+1) − 2−(u2+u3), α5 = 2−u0 − 2−(u1+min), and α6 =
−(2−(mout+1) − 2−(u0+1)), where we assume that the values of
(x−qi) and c2i are positive. Therefore, the maximum round-
ing error r is
r = 2−(u2+1)(max seg2 − 3 · 2−u3 + 2−2min )
+2−(u1+1)max seg + 3 · 2−(u0+1)
+max c2(2−u3 − 2−2min ) − 2−(u1+min) + 2−(mout+1),
where max seg and max c2 are the maximum values of |x −
qi| and |c2i|, respectively.
A.2 Calculation of Bit Sizes for Units
The number of bits for the integer part is calculated as
	log2(max value + 1)
 + 1, where max value is an integer,
and denotes the maximum absolute value of the range.
On the other hand, the number of bits for the frac-
tional part is calculated using the result of the error anal-
ysis. From the error analysis, an NFG with an acceptable
error  is achieved when a + r < , where a and r are the
maximum approximation error and the rounding error, re-
spectively. To generate fast and compact NFGs, we find the
minimum u0, u1, u2, and u3 that satisfy this relation using
nonlinear programming [7] that minimizes u0 + u1 + u2 + u3
with the constraint a + r < .
Calculation of Shift Bits l2i and l1i: From (A· 1) and (A· 2),
the maximum errors in the multipliers 1 and 2 are
1 = 2−(u2+1)(max seg2 − 2−u3 + 2−2min )
+max c2(2−u3 − 2−2min ) and
2 = 2−(u1+1)max seg,
respectively. Since max seg and max c2 are the maximum
values of |x − qi| and |c2i|, respectively, there are positive in-
tegers l2i and l1i that satisfy the following relations for each
segment:
2l2i 2−(u2+1){|x − qi|2 − 2−u3 + 2−2min }
+|c2i|(2−u3 − 2−2min ) ≤ 1 and
2l1i 2−(u1+1)|x − qi| ≤ 2.
We transform these into
2−(u2−l2i+1){|x − qi|2 − 2−u3 + 2−2min }
+|c2i|(2−u3 − 2−2min ) ≤ 1 and (A· 5)
2−(u1−l1i+1)|x − qi| ≤ 2. (A· 6)
From (A· 5) and (A· 6), for each segment, coeﬃcients c2i and
3518
IEICE TRANS. FUNDAMENTALS, VOL.E89–A, NO.12 DECEMBER 2006
c′1i can be rounded to (u2 − l2i) and (u1 − l1i) bits within an
acceptable error, respectively. That is, l2i and l1i can be used














nume. = 2−(u2+1)(max seg2 − 2−u3 + 2−2min )
+ (max c2 − c2i)(2−u3 − 2−2min ) and

















Shinobu Nagayama received the B.S. and
M.E. degrees from the Meiji University, Kana-
gawa, Japan, in 2000 and 2002, respectively, and
the Ph.D. degree in computer science from the
Kyushu Institute of Technology, Iizuka, Japan,
in 2004. He is now a research associate at the
Hiroshima City University, Hiroshima, Japan.
He received the Outstanding Contribution Pa-
per Award from the IEEE Computer Society
Technical Committee on Multiple-Valued Logic
(MVL-TC) in 2005 for a paper presented at the
International Symposium on Multiple-Valued Logic in 2004, and the Excel-
lent Paper Award from the Information Processing Society of Japan (IPS)
in 2006. His research interest includes numerical function generators, de-
cision diagrams, software synthesis, and embedded systems.
Tsutomu Sasao received the B.E., M.E.,
and Ph.D. degrees in electronics engineering
from Osaka University, Osaka, Japan, in 1972,
1974, and 1977, respectively. He has held
faculty/research positions at Osaka University,
Japan, the IBM T.J. Watson Research Center,
Yorktown Heights, New York, and the Naval
Postgraduate School, Monterey, California. He
is now a Professor of the Department of Com-
puter Science and Electronics at the Kyushu In-
stitute of Technology, Iizuka, Japan. His re-
search areas include logic design and switching theory, representations of
logic functions, and multiple-valued logic. He has published more than
nine books on logic design, including Logic Synthesis and Optimization,
Representation of Discrete Functions, Switching Theory for Logic Synthe-
sis, and Logic Synthesis and Verification, Kluwer Academic Publishers,
1993, 1996, 1999, and 2001, respectively. He has served as Program Chair-
man for the IEEE International Symposium on Multiple-Valued Logic (IS-
MVL) many times. Also, he was the Symposium Chairman of the 28th
ISMVL held in Fukuoka, Japan, in 1998. He received the NIWA Memorial
Award in 1979, Distinctive Contribution Awards from the IEEE Computer
Society MVL-TC for papers presented at ISMVLs in 1986, 1996, 2003 and
2004, and Takeda Techno-Entrepreneurship Award in 2001. He has served
as an Associate Editor of the IEEE Transactions on Computers. He is a
fellow of the IEEE.
Jon T. Butler received the BEE and MEngr
degrees from Rensselaer Polytechnic Institute,
Troy, New York, in 1966 and 1967, respectively.
He received the Ph.D. degree from The Ohio
State University, Columbus, in 1973. Since
1987, he has been a professor at the Naval Post-
graduate School, Monterey, California. From
1974 to 1987, he was at Northwestern Univer-
sity, Evanston, Illinois. During that time, he
served two periods of leave at the Naval Post-
graduate School, first as a National Research
Council Senior Postdoctoral Associate (1980–1981) and second as the
NAVALEX Chair Professor (1985–1987). He served one period of leave
as a foreign visiting professor at the Kyushu Institute of Technology, Ii-
zuka, Japan. His research interests include logic optimization and multiple-
valued logic. He has served on the editorial boards of the IEEE Transac-
tions on Computers, Computer, and IEEE Computer Society Press. He
has served as the editor-in-chief of Computer and IEEE Computer Society
Press. He received the Award of Excellence, the Outstanding Contributed
Paper Award, and a Distinctive Contributed Paper Award for papers pre-
sented at the International Symposium on Multiple-Valued Logic. He re-
ceived the Distinguished Service Award, two Meritorious Awards, and nine
Certificates of Appreciation for service to the IEEE Computer Society. He
is a fellow of the IEEE.
