Programmable numerical function generators based on quadratic approximation: Architecture and synthesis method by Nagayama, Shinobu et al.
Calhoun: The NPS Institutional Archive
Faculty and Researcher Publications
2006-01
S. Nagayama, T. Sasao, and J. T. Butler, "Programmable numerical function generators based
on quadratic approximation: Architecture and synthesis method," ASPDAC 2006, Yokohama
Jan. 2006, pp. 378-383.
http://hdl.handle.net/10945/35859
Programmable Numerical Function Generators Based on Quadratic
Approximation: Architecture and Synthesis Method
Shinobu Nagayama, Dept. of CE, Hiroshima City Univ., Hiroshima 731-3194, Japan
Tsutomu Sasao, Dept. of CSE, Kyushu Inst. of Tech., Iizuka 820-8502, Japan
Jon T. Butler, Dept. of ECE, Naval Postgraduate School, CA 93943-5121, USA
Abstract— This paper presents an architecture and a synthesis
method for programmable numerical function generators (NFGs)
for trigonometric, logarithmic, square root, and reciprocal func-
tions. Our NFG partitions a given domain of the function into
non-uniform segments using an LUT cascade, and approximates
the given function by a quadratic polynomial for each segment.
Thus, we can implement fast and compact NFGs for a wide range
of functions. Implementation results on an FPGA show that:
1) our NFGs require only 4% of the memory needed by NFGs
based on the linear approximation with non-uniform segmenta-
tion; and 2) our NFGs require only 22% of the memory needed
by NFGs based on the 5th-order approximation with uniform seg-
mentation. Our automatic synthesis system generates such com-
pact NFGs quickly.
I. INTRODUCTION
Numerical function generators (NFGs) are often used in
computer graphics, digital signal processing, communication
systems, robotics, astrophysics, fluid physics, etc. The func-
tions realized include trigonometric, logarithmic, square root,
and reciprocal functions. High-performance CPUs usually
have numerical coprocessors. However, embedded CPUs and
CPUs on FPGAs do not have such coprocessors. Thus, FPGA
implementation of numerical functions f (x) is needed. Im-
plementation by a single lookup table for f (x) is simple and
fast. For low-precision computations of f (x) (e.g. x and f (x)
have 8 bits), this implementation is straightforward. For
high-precision computations, however, the single lookup table
implementation is impractical because the table needs 2^n words
for an n-bit input. For such applications, the CORDIC
(COordinate Rotation DIgital Computer) algorithm [1, 21] has
often been used. Although CORDIC is implemented with compact
hardware, it is iterative and therefore slow. For numerically
intensive applications, faster evaluation of numerical functions
is required.
For fast evaluation of numerical functions, polynomial ap-
proximations have been used [9, 10, 19, 20]. These meth-
ods approximate the given numerical functions by piecewise
polynomials, and realize the polynomials with hardware. Lin-
ear or quadratic approximations offer fast and relatively high-
precision evaluation of numerical functions. However, the
methods proposed so far are ad-hoc and not systematic. This
paper proposes an architecture and a systematic synthesis
method for NFGs based on quadratic approximation. By using
Fig. 1. Synthesis flow for NFGs.
approximated by piecewise quadratic functions. Our synthesis
method can be automated, so that fast and compact NFGs can
be produced by non-experts. Fig. 1 shows the synthesis flow
for the NFG. It converts the Design Specification, written in
Scilab [18] (a MATLAB-like software package), into HDL code. The
Design Specification consists of a function f (x), a domain for
x, and an acceptable error. This system first partitions the do-
main into segments, and then approximates f (x) by a quadratic
function for each segment. Next, it analyzes the errors, and de-
rives the necessary precision for computing units in the NFG.
Then, it generates HDL code to be mapped into an FPGA us-
ing an FPGA vendor tool. Due to the page limitation, the error
analysis for our NFGs is omitted here, but it is available in
[14]. This paper extends [17] to quadratic approximations.
II. PRELIMINARIES
Definition 2.1 The binary fixed-point representation of a
value r has the form
$$ d_{n_{int}-1}\, d_{n_{int}-2} \ldots d_1\, d_0 \,.\, d_{-1}\, d_{-2} \ldots d_{-n_{frac}}, \qquad (1) $$
where $d_i \in \{0,1\}$, $n_{int}$ is the number of bits for the integer
part, and $n_{frac}$ is the number of bits for the fractional part of
r. The representation in (1) is two's complement, and so
$$ r = -2^{n_{int}-1} d_{n_{int}-1} + \sum_{i=-n_{frac}}^{n_{int}-2} 2^{i} d_i. $$
Definition 2.2 Error is the absolute difference between the
original value and the approximated value. Approximation
error is the error caused by a function approximation, and
rounding error is the error caused by a binary fixed-point rep-
resentation. Acceptable error is the maximum error that an
NFG may assume. Acceptable approximation error (AAE) is
the maximum approximation error that a function approxima-
tion may assume.
Definition 2.3 Precision is the total number of bits for a
binary fixed-point representation. Specifically, n-bit precision
specifies that n bits are used to represent the number; that is,
$n = n_{int} + n_{frac}$. An n-bit precision NFG has an n-bit input.
Definition 2.4 Accuracy is the number of bits in the fractional
part of a binary fixed-point representation. Specifically, m-bit
accuracy specifies that m bits are used to represent the fractional
part of the number; that is, $m = n_{frac}$. An m-bit accuracy
NFG is an NFG with an m-bit fractional part of the input, an m-bit
fractional part of the output, and a $2^{-m}$ acceptable error.
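As an illustration of Definitions 2.1 to 2.4, the following Python sketch decodes a two's-complement fixed-point bit string and rounds a real value to a given accuracy. It is not part of the synthesis system described here; the function names `decode_twos_complement` and `quantize` are hypothetical, and round-to-nearest is an assumption.

```python
def decode_twos_complement(bits: str, n_frac: int) -> float:
    """Value of a two's-complement fixed-point bit string.

    bits   : string of '0'/'1', MSB first (n_int + n_frac characters)
    n_frac : number of fractional bits (the accuracy of Definition 2.4)
    """
    raw = int(bits, 2)
    if bits[0] == "1":
        # The sign bit carries weight -2^(n_int - 1), so subtract 2^n.
        raw -= 1 << len(bits)
    return raw / (1 << n_frac)


def quantize(value: float, n_frac: int) -> float:
    """Round value to the nearest multiple of 2^-n_frac (assumed rounding mode)."""
    scale = 1 << n_frac
    return round(value * scale) / scale


# 4-bit precision, 2-bit accuracy: 01.10 and 11.10
print(decode_twos_complement("0110", 2))
print(decode_twos_complement("1110", 2))
# Rounding error of an 8-bit-accuracy representation of 0.3
print(abs(quantize(0.3, 8) - 0.3))
```

With round-to-nearest, the rounding error of an m-bit accuracy value is at most half of the least significant bit, that is, $2^{-(m+1)}$.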
III. QUADRATIC APPROXIMATION ALGORITHM
To approximate the numerical function f(x) using quadratic
functions, first, we partition the domain for x into segments.
For each segment, we approximate f(x) using a quadratic
function $g(x) = c_2 x^2 + c_1 x + c_0$. In this case, the
approximation error depends on the segmentation method and the
values of the coefficients $c_2$, $c_1$, and $c_0$ in the
approximation polynomial.
For piecewise polynomial approximations, in many cases,
the domain is partitioned into uniform segments [2, 6, 19].
Such methods are simple and fast, but for some kinds of nu-
merical functions, too many segments are required, resulting
in large memory.
For a given error, non-uniform segmentation of the domain
uses fewer segments than the uniform segmentation [9, 17].
However, a non-uniform segmentation often requires a com-
plicated segment index encoder (see Section IV), and results
in larger and slower NFGs. To overcome this problem, a spe-
cial non-uniform segmentation has been proposed [9]. This
method produces a simple segment index encoder by restrict-
ing the segmentation points, and results in fewer segments as
well as faster and more compact NFGs than produced by uni-
form segmentation. However, it is ad-hoc and non-optimum
for the given function. Our NFG can implement any non-
uniform segmentation with a fast and compact segment index
encoder by using an LUT cascade [17] with a synthesis method
that can be automated.
Selection of the approximation polynomial influences the
number of non-uniform segments as well as the approximation
error. In this paper, we use the 2nd-order Chebyshev approxi-
mation to approximate f (x) with fewer non-uniform segments,
and compute the approximated value. Since coefficients of the
Chebyshev approximation polynomial are easily computed, it
is suitable for automatic synthesis.
A. Segmentation Algorithm
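For a segment [s,e] of f(x), the maximum approximation
error $\varepsilon_2(s,e)$ of the 2nd-order Chebyshev approximation is
given by
$$ \varepsilon_2(s,e) = \frac{(e-s)^3}{192} \max_{s \le x \le e} |f^{(3)}(x)|, \qquad (2) $$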
where $f^{(3)}$ is the 3rd-order derivative of f. From (2), $\varepsilon_2(s,e)$
is a monotone increasing function of the segment width e − s.
Using this property, we partition a domain into segments that are
as wide as possible while keeping the approximation error below
the specified error. Fig. 2 shows the non-uniform segmentation
algorithm.

Input: Numerical function f(x), Domain [a,b] for x,
Acceptable approximation error ε.
Output: Segments $[s_0,e_0], [s_1,e_1], \ldots, [s_{t-1},e_{t-1}]$.
Process:
1. Let $s_0 = a$ and i = 0.
2. Find a value p (≥ $s_i$) where $\varepsilon_2(s_i, p) = \varepsilon$.
3. If p > b, then let p = b.
4. Let $e_i = p$ and i = i+1.
5. If p = b, then let t = i, and stop the process.
6. Else, let $s_i = p$, and go to step 2.
Fig. 2. Non-uniform segmentation algorithm for the domain.

The inputs for this algorithm are a numerical function f(x), a
domain [a,b] for x, and an acceptable approximation error ε.
The algorithm approximates f(x) within the acceptable
approximation error ε, and produces t segments
$[s_0,e_0], [s_1,e_1], \ldots, [s_{t-1},e_{t-1}]$. For step 2 in Fig. 2,
the exact computation of the value p where $\varepsilon_2(s_i, p) = \varepsilon$
is difficult. Thus, we obtain the maximum value p′ satisfying
$\varepsilon_2(s_i, p') \le \varepsilon$. Such a p′ can be found by scanning all
values of the n-bit input x. However, this requires an $O(2^n)$
search, and is time-consuming. Therefore, we compute the maximum
value p′ by setting each bit of x to 0 or 1, from the MSB to the
LSB, such that $\varepsilon_2(s_i, p') \le \varepsilon$. This requires only an
O(n) search. In the computation of $\varepsilon_2(s_i, p')$, the value of
$\max_{s_i \le x \le p'} |f^{(3)}(x)|$ is computed by an efficient nonlinear
programming algorithm [7].
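The segmentation procedure of Fig. 2, with the MSB-to-LSB search for p′, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `eps2` and `segment_domain` are hypothetical, the error bound is taken to be $(e-s)^3/192 \cdot \max|f^{(3)}|$, and the maximum of $|f^{(3)}|$ is supplied by the caller as a function instead of being found by nonlinear programming.

```python
def eps2(s, e, max_abs_f3):
    """Assumed error bound of the quadratic Chebyshev approximation on [s, e]."""
    return (e - s) ** 3 / 192.0 * max_abs_f3(s, e)


def segment_domain(a, b, n_frac, eps, max_abs_f3):
    """Greedy non-uniform segmentation of [a, b] on a grid of step 2^-n_frac.

    max_abs_f3(s, e) must return max |f'''(x)| for s <= x <= e (caller-supplied
    stand-in for the nonlinear programming step).
    Returns a list of (s_i, e_i) with e_i == s_{i+1}.
    """
    step = 2.0 ** -n_frac
    a_code, b_code = round(a / step), round(b / step)
    n_bits = max(1, (b_code - a_code).bit_length())
    segments = []
    s_code = a_code
    while s_code < b_code:
        width = 0
        # Decide the bits of the segment width from MSB to LSB: keep a bit
        # set only if the segment stays inside the domain and within eps.
        for bit in range(n_bits - 1, -1, -1):
            cand = width | (1 << bit)
            p_code = s_code + cand
            if p_code <= b_code and eps2(s_code * step, p_code * step, max_abs_f3) <= eps:
                width = cand
        if width == 0:
            raise ValueError("eps cannot be met even by a one-step segment")
        segments.append((s_code * step, (s_code + width) * step))
        s_code += width
    return segments


# Example: f(x) = 1/x on [1, 2), where |f'''(x)| = 6/x^4 is largest at x = s.
segs = segment_domain(1.0, 2.0 - 2.0 ** -16, 16, 2.0 ** -17, lambda s, e: 6.0 / s ** 4)
print(len(segs), segs[0], segs[-1])
```

The bitwise search is valid because the acceptance test is monotone in the segment width: once a width is too large, every larger width is too large as well.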
B. Computation of Approximate value
For each [si,ei], f (x) is approximated by the corresponding
quadratic function gi(x). That is, the approximated value y of
f (x) is computed as follows:
$$ y = g_i(x) = c_{2i} x^2 + c_{1i} x + c_{0i}, \qquad (3) $$
where the coefficients $c_{2i}$, $c_{1i}$, and $c_{0i}$ are derived from the
2nd-order Chebyshev approximation polynomial [11]. Substituting
$x - q_i + q_i$ for x in (3) yields the transformation
$$ g_i(x) = c_{2i}(x-q_i)^2 + (c_{1i} + 2c_{2i}q_i)(x-q_i) + c_{0i} + c_{1i}q_i + c_{2i}q_i^2. \qquad (4) $$
In (4), let $c'_{1i} = c_{1i} + 2c_{2i}q_i$ and
$c'_{0i} = c_{0i} + c_{1i}q_i + c_{2i}q_i^2$. Then, we have
$$ g_i(x) = c_{2i}(x-q_i)^2 + c'_{1i}(x-q_i) + c'_{0i}. \qquad (5) $$
This transformation reduces the multiplier size.
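The coefficient computation and the change of variable from (3) to (5) can be illustrated with the Python sketch below. It assumes that the 2nd-order Chebyshev approximation is the quadratic interpolating f at the three Chebyshev nodes of the segment; the function names are hypothetical and the code is an illustration, not the synthesis system itself.

```python
import math


def chebyshev_quadratic(f, s, e):
    """(c2, c1, c0) of the quadratic interpolating f at the 3 Chebyshev nodes of [s, e]."""
    mid, half = (s + e) / 2.0, (e - s) / 2.0
    xs = [mid + half * math.cos((2 * k + 1) * math.pi / 6.0) for k in range(3)]
    ys = [f(x) for x in xs]
    # Newton divided differences, then expansion into powers of x.
    d01 = (ys[1] - ys[0]) / (xs[1] - xs[0])
    d12 = (ys[2] - ys[1]) / (xs[2] - xs[1])
    c2 = (d12 - d01) / (xs[2] - xs[0])
    c1 = d01 - c2 * (xs[0] + xs[1])
    c0 = ys[0] - d01 * xs[0] + c2 * xs[0] * xs[1]
    return c2, c1, c0


def shift_coefficients(c2, c1, c0, q):
    """Coefficients of form (5): c2, c1' = c1 + 2*c2*q, c0' = c0 + c1*q + c2*q^2."""
    return c2, c1 + 2.0 * c2 * q, c0 + c1 * q + c2 * q * q


# Example: sqrt(x) on the segment [1, 1.25] with q at the segment midpoint.
s, e = 1.0, 1.25
c2, c1, c0 = chebyshev_quadratic(math.sqrt, s, e)
q = (s + e) / 2.0
c2s, c1p, c0p = shift_coefficients(c2, c1, c0, q)
x = 1.1
print(c2 * x * x + c1 * x + c0)                       # form (3)
print(c2s * (x - q) ** 2 + c1p * (x - q) + c0p)       # form (5)
```

Both forms give the same value up to floating-point rounding; form (5) multiplies by x − q, whose magnitude is at most half the segment width, which is why it needs fewer bits than multiplying by x.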
IV. ARCHITECTURE FOR NFGS
Fig. 3 shows the architecture that realizes (5). It uses 7 units:
the segment index encoder that computes the index i for the
segment $[s_i,e_i]$ including the input value x; the coefficients table
for $-q_i$, $c_{2i}$, $c'_{1i}$, and $c'_{0i}$; the adder for $x+(-q_i)$; the squaring
unit; the multipliers; the optional shifter; and the final adder
(see Table I).

Fig. 3. Architecture for NFGs.
Interval                     Index
s_0 ≤ x ≤ e_0                0
s_1 < x ≤ e_1                1
...                          ...
s_{t-1} < x ≤ e_{t-1}        t−1
(a) Segment index function.

[Diagram: LUTs connected in series.]
(b) LUT cascade.

Fig. 4. Segment index encoder.
A segment index encoder converts x into a segment index i.
It realizes the segment index function
$seg\_func(x) : B^n \to \{0,1,\ldots,t-1\}$ shown in Fig. 4 (a), where x
has n bits, B = {0,1}, and t denotes the number of segments. In
[9], to simplify the segment index encoder, the values of $s_i$ and
$e_i$ are restricted to those that can be produced by a simple
combinational logic circuit. Such a segmentation method results in
many segments since it does not adapt to the given function. Our
synthesis system uses the LUT cascade [8, 15, 16] shown in Fig. 4
(b) to realize an arbitrary seg_func(x). The cascade can be
designed by functional decomposition using BDDs (Binary Decision
Diagrams) representing seg_func(x). Our synthesis system thus
places no restriction on the segmentation, which makes it suitable
for automatic synthesis.
In LUT cascades, the interconnecting lines between adjacent
LUTs are called rails. The size of an LUT cascade depends on
the number of rails. The next theorem shows that the segment
index functions are realized by compact LUT cascades.
Theorem 4.1 [16] Let seg_func(x) be a segment index function
with t segments. Then, there exists an LUT cascade for
seg_func(x) with at most $\lceil \log_2 t \rceil$ rails.
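A software reference model of the segment index function in Fig. 4 (a), together with the rail bound of Theorem 4.1, might look like the following Python sketch. It models only the input-output behaviour, not the LUT cascade itself; `seg_func` and `max_rails` are hypothetical names, and the segment end points are assumed to be given as a sorted list with $s_{i+1} = e_i$.

```python
from bisect import bisect_left


def seg_func(x, ends):
    """Segment index of x, given the sorted segment end points e_0 < e_1 < ... < e_{t-1}.

    Segment 0 is s_0 <= x <= e_0 and segment i > 0 is e_{i-1} < x <= e_i,
    so the index is the smallest i with x <= e_i.
    """
    i = bisect_left(ends, x)
    if i == len(ends):
        raise ValueError("x lies beyond the last segment")
    return i


def max_rails(t):
    """Upper bound ceil(log2 t) on the number of rails, using integer arithmetic."""
    return (t - 1).bit_length()


ends = [3, 7, 15]                      # three segments over 4-bit inputs
print([seg_func(x, ends) for x in range(16)])
print(max_rails(len(ends)))
```

The bound reflects that a rail bundle only has to carry enough information to distinguish the t possible indices.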
Our synthesis system uses heterogeneous MDDs (Multi-valued
Decision Diagrams) [13] to find compact LUT cascades. Since
the LUT cascade is suitable for pipeline processing, it offers
a fast and compact circuit. In Section VI, we will show
that our architecture produces fast and compact NFGs for various
numerical functions.
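The data flow through the units of the architecture can be summarized by a floating-point reference model in Python. This is a sketch for illustration only: it ignores bit-widths, rounding, the optional shifter, and pipelining, and the name `nfg_eval` and the table layout are assumptions. The example table encodes f(x) = x^2 on two segments, for which form (5) is exact.

```python
from bisect import bisect_left


def nfg_eval(x, ends, table):
    """Evaluate g_i(x) = c2*(x-q)^2 + c1'*(x-q) + c0' for the segment containing x.

    ends  : sorted segment end points e_0 < ... < e_{t-1}
    table : one tuple (-q_i, c2_i, c1'_i, c0'_i) per segment
    """
    i = bisect_left(ends, x)             # segment index encoder
    neg_q, c2, c1p, c0p = table[i]       # coefficients table
    d = x + neg_q                        # adder for x + (-q_i)
    d2 = d * d                           # squaring unit
    p2, p1 = c2 * d2, c1p * d            # two multipliers working in parallel
    return p2 + p1 + c0p                 # final adder


# f(x) = x^2 on [0, 0.5] and (0.5, 1] with q at each segment midpoint:
# c2 = 1, c1' = 2q, c0' = q^2.
ends = [0.5, 1.0]
table = [(-0.25, 1.0, 0.5, 0.0625), (-0.75, 1.0, 1.5, 0.5625)]
print(nfg_eval(0.3, ends, table), nfg_eval(0.9, ends, table))
```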
V. IMPLEMENTATION WITH FPGA
Modern FPGAs consist of logic elements (LEs) or configurable
logic blocks (CLBs), synchronous memory blocks, multipliers
(DSP units), etc. Our synthesis system efficiently generates
NFGs using these components. Each unit for the NFG
shown in Fig. 3 is implemented by the following components
in an FPGA: 1) Segment index encoder (LUT cascade) and
coefficients table: by synchronous memory blocks; 2) Squar-
ing unit: by logic elements; 3) Multiplier: by DSP units; and
4) Adder: by logic elements. Our synthesis system derives the
appropriate bit-width for each component by automatic error
analysis.
A. Size Reduction of Multiplier
Although modern FPGAs have dedicated multipliers, large
multipliers are slow. In our architecture, the multiplier often
has the longest delay time among all the units. Thus, to imple-
ment a fast NFG, reducing multiplier size is important. Since
the size of multipliers depends on the number of bits for c2i,
c′1i, and x− qi, it is important to reduce the number of bits to
represent these values.
First, we consider the case where the absolute values of
$c_{2i}$ and $c'_{1i}$ are large. Our synthesis method uses a scaling
method [9]. We represent $c_{2i}$ and $c'_{1i}$ as
$c_{2i} = (c_{2i} \times 2^{-l_{2i}}) \times 2^{l_{2i}}$ and
$c'_{1i} = (c'_{1i} \times 2^{-l_{1i}}) \times 2^{l_{1i}}$, respectively. That is,
instead of the original values of $c_{2i}$ and $c'_{1i}$, we store the
values of $c_{2i} \times 2^{-l_{2i}}$, $l_{2i}$, $c'_{1i} \times 2^{-l_{1i}}$, and $l_{1i}$ in the
coefficients table. In this case, the products $c_{2i}(x-q_i)^2$ and
$c'_{1i}(x-q_i)$ are computed using multipliers and shifters. The use
of $l_{2i}$ and $l_{1i}$ reduces the number of bits needed to represent
the values of $c_{2i} \times 2^{-l_{2i}}$ and $c'_{1i} \times 2^{-l_{1i}}$, but increases the
rounding errors. Our synthesis method finds optimum values of
$l_{2i}$ and $l_{1i}$ for each segment such that an acceptable error is
achieved. When $l_{2i}$ and $l_{1i}$ are 0 for all the segments, no
shifter is implemented; that is, $c_{2i}(x-q_i)^2$ and $c'_{1i}(x-q_i)$ are
directly implemented with multipliers.
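The effect of the scaling method can be illustrated with a small Python sketch. The rule used here to pick the shift amount (the smallest l that brings the magnitude of the stored value to at most 1) is an assumption made for the example; the synthesis method described above instead chooses the shift amounts by error analysis. The names `scale_coefficient` and `scaled_product` are hypothetical.

```python
import math


def scale_coefficient(c, n_frac):
    """Split c into (m, l) with c ~= m * 2^l, |m| <= 1, and m rounded to n_frac fractional bits."""
    if c == 0:
        return 0.0, 0
    # Assumed rule: smallest non-negative shift that brings |c| * 2^-l below 1.
    l = max(0, math.floor(math.log2(abs(c))) + 1)
    m = round(c * 2.0 ** -l * 2 ** n_frac) / 2 ** n_frac
    return m, l


def scaled_product(m, l, v):
    """Multiplier followed by a left shifter: (m * v) * 2^l."""
    return (m * v) * 2 ** l


m, l = scale_coefficient(37.3, 8)
print(m, l)
print(scaled_product(m, l, 0.125), 37.3 * 0.125)
```

The stored value m needs no integer bits, but its rounding error of at most $2^{-9}$ is magnified by $2^{l}$ after the shift, which is the trade-off between multiplier size and rounding error noted above.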
Next, we consider the value of $x - q_i$. The number of bits
for $x - q_i$ influences the sizes of the squaring unit and
multipliers. Thus, reducing the magnitude of $x - q_i$ reduces the
sizes of the squaring unit and multipliers, and also the error.
From (5), we can choose any value for $q_i$. To reduce the
magnitude of $x - q_i$, for a segment $[s_i,e_i]$, we set
$q_i = (s_i + e_i)/2$. Then, we have $|x - q_i| \le (e_i - s_i)/2$. Thus,
reducing the segment width $e_i - s_i$ reduces the magnitude of
$x - q_i$. However, this also increases the number of segments, and
results in increased memory size. The rest of this section shows a
method for reducing the segment width without increasing the
memory size.
The coefficients table in Fig. 3 has $2^k$ words, where
$k = \lceil \log_2 t \rceil$ and t is the number of segments. Therefore, we
can increase the number of segments up to $t = 2^k$ without
increasing the memory size. From Theorem 4.1, the size of the LUT
cascade also depends on the value of k. However, increasing the
number of segments to $t = 2^k$ seldom increases the size of the
LUT cascade. We reduce the size of segments by repeatedly dividing
the largest segment into two equal-sized segments until $t = 2^k$.
This method reduces both the number of bits for $x - q_i$ and the
error without increasing the memory size.
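The splitting step can be sketched in Python with a max-heap keyed on segment width. This is an illustrative sketch, not the authors' code: `pad_segments` is a hypothetical name, and the midpoint is computed in floating point, whereas a real implementation would have to keep it on the fixed-point grid of x.

```python
import heapq


def pad_segments(segments):
    """Split the widest segment in half until the count reaches 2^ceil(log2 t)."""
    t = len(segments)
    target = 1 << (t - 1).bit_length()           # 2^k with k = ceil(log2 t)
    # heapq is a min-heap, so store negated widths to pop the widest first.
    heap = [(-(e - s), s, e) for s, e in segments]
    heapq.heapify(heap)
    while len(heap) < target:
        _, s, e = heapq.heappop(heap)
        mid = (s + e) / 2.0
        heapq.heappush(heap, (-(mid - s), s, mid))
        heapq.heappush(heap, (-(e - mid), mid, e))
    return sorted((s, e) for _, s, e in heap)


print(pad_segments([(0, 1), (1, 4), (4, 5)]))
```

Three segments occupy a 4-word table anyway, so the widest segment (1, 4) is split once and the fourth word is put to use.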
B. Pipeline Processing
To implement a high-throughput NFG in an FPGA, our synthesis
system inserts pipeline registers between all units in the
architecture. Since all units operate in parallel, and each unit
has a short delay time, our NFGs achieve high throughput.
Table I shows the units and the number of pipeline stages for
them. Our NFGs have $n_{cas}$ + (5 or 6) pipeline stages, where
$n_{cas}$ is the number of LUTs in the LUT cascade.
TABLE II
NUMBER OF SEGMENTS FOR VARIOUS APPROXIMATION METHODS.

                                            AAE = 2^-17                          AAE = 2^-25
Function      Domain               Linear  Cheb.    Cheb.     Time     Linear  Cheb.    Cheb.     Time
f(x)                                  Non  Uniform    Non   [msec]        Non  Uniform    Non   [msec]
2^x           [0, 1]                  128        9      7      0.1       2048       65     44       70
1/x           [1, 2)                  124       16     11      0.1       1982      128     64       60
√x            [1/32, 2)               193      252     24       10       3082     2016    138      150
1/√x          [1, 2)                   46       16      8      0.1       1024      128     46       50
log2(x)       [1, 2)                  128       16     10       10       2048      128     56       70
ln(x)         [1, 2)                   89       16      9       10       1437      128     50       50
sin(πx)       [0, 1/2)                127       17     12       10       2027      129     74       90
cos(πx)       [0, 1/2)                127       17     12       10       2027      129     74       90
tan(πx)       [0, 1/4)                112       33     12       10       1787      129     73      110
√(−ln(x))     [1/32, 1)               354    31744     52       70       5933  8126464    331      720
tan^2(πx)+1   [0, 1/4)                256       33     17       20       4096      257    101      170
Entropy       [1/256, 255/256]        520      509     40       30       8320     4065    234      300
Sigmoid       [0, 1]                  127       33     13       20       2020      129     76      160
Gaussian      [0, 1/2]                 32        5      4      0.1        512       33     18       30
Average                               170     2337     17       20       2739   580995     99      100

AAE: Acceptable Approximation Error. Time: CPU time for our non-uniform segmentation algorithm.
Linear: Linear approximation. Cheb.: 2nd-order Chebyshev approximation.
Uniform: Uniform segmentation. Non: Non-uniform segmentation.
Experiment environment: CPU: Pentium4 Xeon 2.8GHz Memory: 4GB
OS: Redhat (Linux 7.3) C compiler: gcc -O2
TABLE I
NUMBER OF PIPELINE STAGES FOR NFGS.

Name of units                  Pipeline stages
1. Segment index encoder       n_cas
2. Coefficients table          1
3. Adder for x+(−q_i)          1
4. Squaring unit               1
5. Multipliers (parallel)      1
6. Shifter (optional)          0 or 1
7. Final adder                 1
Total pipeline stages          n_cas + (5 or 6)

n_cas: Number of LUTs for LUT cascade.
VI. EXPERIMENTAL RESULTS
A. Number of Segments and Computation Time of Algorithm
Table II compares the number of segments for various ap-
proximation methods for the functions in [16]. In this table,
Entropy, Sigmoid, and Gaussian are
$$ \mathrm{Entropy} = -x \log_2 x - (1-x) \log_2 (1-x), $$
$$ \mathrm{Sigmoid} = \frac{1}{1+e^{-4x}} $$






In Table II, the columns “Linear Non” show the number of
non-uniform segments for linear approximation in [17], and
the columns “2nd-Chebyshev Uniform” and “2nd-Chebyshev
Non” show the number of uniform segments and non-uniform
segments for 2nd-order Chebyshev approximation, respec-
tively. The columns “Time” show the CPU time for our non-
uniform segmentation algorithm applied to functions, in mil-
liseconds.
Table II shows that, for many functions, the 2nd-order
Chebyshev approximations require far fewer segments than
the linear approximation. However, for some functions, such
as √(−ln(x)), the 2nd-order Chebyshev approximation based
on uniform segmentation requires many more segments than
the linear and 2nd-order Chebyshev approximations based on
non-uniform segmentation. Many existing polynomial
approximation methods are based on uniform segmentation. For
trigonometric and exponential functions, approximation methods
based on uniform segmentation require relatively few segments.
For functions such as √(−ln(x)), however, the uniform 2nd-order
approximation method requires an excessive number of segments
(8126464 for an AAE of 2^-25, against 331 with non-uniform
segmentation). On the other hand, our quadratic approximation
based on non-uniform segmentation requires fewer segments for a
wide range of functions. Also, Table II shows that the CPU time
is strongly correlated with the number of segments. A smaller
acceptable approximation error (AAE) requires more segments and
a longer computation time. However, for all functions in the
table, the CPU times are shorter than 1 second even when the
acceptable approximation error is 2^-25.
These results show that, for various functions, our segmentation
algorithm quickly partitions a domain into a small number of
non-uniform segments, and that it is useful for automatic synthesis.
B. Memory Sizes of Various NFGs
This section compares the memory sizes of our NFGs with
three existing NFGs [17, 3, 4]. Table III compares NFGs using
linear approximation shown in [17]. This linear approximation
is based on non-uniform segmentation. In Table III, the
columns “R” show the following values:
$$ R = \frac{\text{memory size of quadratic approximation}}{\text{memory size of linear approximation}} \times 100. $$
Table III shows that NFGs using quadratic approximation require
much less memory than those using linear approximation. In
particular, 24-bit precision NFGs using quadratic approximation
can be implemented with, on average, only 4% of the memory
size needed for a linear approximation. The relation between
precision and memory size in Table III shows that the ratio R
decreases as the precision increases (from 11% on average at
16 bits to 4% at 24 bits).
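As a worked check of the definition of R, the short Python sketch below recomputes the ratio for two rows of Table III. The memory sizes are copied from the table; the helper name `ratio_percent` is hypothetical, and rounding to the nearest integer is an assumption about how the table entries were produced.

```python
def ratio_percent(quad_bits, linear_bits):
    """R = (memory size of quadratic NFG) / (memory size of linear NFG) * 100."""
    return 100.0 * quad_bits / linear_bits


# (linear 16-bit, quadratic 16-bit, linear 24-bit, quadratic 24-bit), in bits
rows = {
    "2^x": (20992, 1112, 696320, 19072),
    "sqrt(x)": (43776, 5536, 1425408, 86784),
}
for name, (lin16, quad16, lin24, quad24) in rows.items():
    print(name, round(ratio_percent(quad16, lin16)), round(ratio_percent(quad24, lin24)))
```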
TABLE III
COMPARISON WITH LINEAR APPROXIMATION BASED ON NON-UNIFORM
SEGMENTATION.

                    16-bit precision               24-bit precision
Function        Memory [bits]        R         Memory [bits]        R
f(x)           Linear    Quad.     [%]        Linear    Quad.     [%]
2^x             20992     1112       5        696320    19072       3
1/x             21248     2432      11        700416    19136       3
√x              43776     5536      13       1425408    86784       6
1/√x            10176     1104      11        343040    19008       6
log2(x)         20864     2464      12        694272    19072       3
ln(x)           20096     2448      12        700416    19136       3
sin(πx)         19456     2336      12        661504    38656       6
cos(πx)         19584     2336      12        663552    38784       6
tan(πx)         19712     2304      12        667648    38272       6
√(−ln(x))       74240    11264      15       2662400   173056       7
tan^2(πx)+1     37632     4960      13       1290240    39040       3
Entropy        106496    10688      10       3768320    83968       2
Sigmoid         21120     2432      12        702464    40320       6
Gaussian         4416      444      10        156672     8384       5
Average         31415     3704      11       1080905    45906       4

Memory: Memory size. Linear: Linear approximation [17].
Quad.: 2nd-order Chebyshev approximation. R: Ratio.
TABLE IV
COMPARISON WITH 5TH-ORDER APPROXIMATION BASED ON UNIFORM
SEGMENTATION.

                                    Memory size [bits]
Func.       Domain      Acc.      5th-order      Quad.      Ratio
f(x)                              (Uniform)      (Non)        [%]
sin(πx)     [0, 1/4]    2^-23         70528      18048         26
exp(x)      [0, 1]      2^-24         82432      43136         52
2^x − 1     [0, 1]      2^-24         89600      19968         22

Acc.: Accuracy.
5th-order: 5th-order approximation [3].
Quad.: 2nd-order Chebyshev approximation.
Table IV and Table V compare our NFGs with NFGs using
5th-order Taylor expansion [3] and NFGs using 2nd-order minimax
approximation by the Remez algorithm [4], respectively.
Both approximations in [3, 4] are based on uniform segmen-
tation. Thus, their NFGs require no segment index encoder.
On the other hand, since our approximation is based on non-
uniform segmentation, the memory size is obtained by the sum
of the coefficients table and the segment index encoder. As
shown in [17] and Table II, for trigonometric and exponential
functions, the difference of the number of uniform segments
and non-uniform segments is not so large under the same ap-
proximation polynomial. For such functions, NFGs based on
uniform segmentation (needing no segment index encoder) of-
ten require smaller memory than non-uniform segmentations.
Although our NFGs require the segment index encoder and
use approximation polynomials with larger approximation er-
ror than approximation polynomials in [3, 4], our NFGs for
such functions are implemented with only 22% to 52% of the
TABLE V
COMPARISON WITH QUADRATIC APPROXIMATION BASED ON UNIFORM
SEGMENTATION.

                                    Memory size [bits]
Func.        Domain     Acc.      Minimax        Cheb.      Ratio
f(x)                              (Uniform)      (Non)        [%]
sin(πx/4)    [0, 1)     2^-24         16288      19200        118
2^x − 1      [0, 1)     2^-16          2208       2512        114

Minimax: 2nd-order minimax approximation [4].
Cheb.: 2nd-order Chebyshev approximation.
memory sizes of NFGs in [3], and with memory size comparable
to [4]. In [3, 4], memory sizes of NFGs for √x and
√(−ln(x)) are unavailable. However, from Table II, we can see




excessively large. On the other hand, our NFGs can realize a
wide range of functions with small memory size.
C. FPGA Implementation Results
Table VI compares the FPGA implementation results of our
NFGs with NFGs using linear approximation [17].
Since the architecture of a linear NFG is simpler than that of a
quadratic NFG, linear NFGs are faster, and require fewer logic
elements and DSP units than quadratic NFGs. However, linear
approximation requires more segments and larger memory
than quadratic approximation, as shown in Table II and
Table III. Table VI shows that 24-bit precision linear NFGs
cannot realize any function except Gaussian with the FPGA
(the smallest device in the Stratix family) due to the excessive
memory size although many logic elements and DSP units are
unused. The most crucial issue in the FPGA implementation
is the constraints on these hardware resources. For 24-bit pre-
cision, the linear approximation requires a larger FPGA due
to the excessive memory size. However, in the larger FPGA,
more logic elements and DSP units are left unused and wasted.
On the other hand, the quadratic NFGs can be implemented
with a smaller FPGA since they require much less memory
size than the linear NFGs and reasonable sizes of logic ele-
ments and DSP units. In fact, 24-bit precision quadratic NFGs
can be implemented with lower cost and more compact FPGAs
(Cyclone II).
VII. CONCLUSION AND COMMENTS
We have demonstrated an architecture and a synthesis
method for programmable NFGs for trigonometric functions,
logarithm functions, square root, reciprocal, etc. Our archi-
tecture can efficiently realize any non-uniform segmentation
using a compact LUT cascade, and approximate many numer-
ical functions by quadratic polynomials. Therefore, our archi-
tecture is suitable for automatic synthesis of fast and compact
NFGs. Implementation results on an FPGA show that our syn-
thesis method can approximate a wide range of functions with
a small number of non-uniform segments, and generate NFGs
with small memory size. For 24-bit precision, our NFGs can be
implemented with only 4% of the memory size of NFGs based
on the linear approximation with non-uniform segmentation,
and with only 22% of the memory size of NFGs based on the
5th-order approximation with uniform segmentation. NFGs
based on the linear approximation are faster than the quadratic
ones, but for high-precision, they require a large FPGA due to
the excessive memory size. On the other hand, our quadratic
NFGs can be implemented with more compact and lower-cost
FPGAs by using the hardware resources on the FPGA efficiently.
ACKNOWLEDGMENTS
This research is partly supported by the Grant in Aid for
Scientific Research of the Japan Society for the Promotion of
TABLE VI
FPGA IMPLEMENTATION OF NFGS FOR LINEAR AND QUADRATIC APPROXIMATIONS.
FPGA device: Altera Stratix (EP1S10F484C5: 10570 logic elements, 48 DSP units)
Logic synthesis tool: Altera QuartusII 5.0
Synthesis options: speed optimization, timing requirement: 200MHz

                          16-bit precision                            24-bit precision
Function      Logic elements     DSP units   Freq. [MHz]  Logic elements     DSP units   Freq. [MHz]
f(x)          Linear  Quad. Linear  Quad. Linear  Quad. Linear  Quad. Linear  Quad. Linear  Quad.
2^x              167    482      2      4    195    185    604    758      2     10      –    131
1/x              204    376      2      4    234    186    636    859      2     10      –    134
√x               270    496      2      4    237    179   1211    822      2     16      –    124
1/√x             186    475      2      4    237    186    402    753      2     10      –    131
log2(x)          163    381      2      4    194    186    597    757      2     10      –    131
ln(x)            170    379      2      4    197    185    416    863      2     10      –    131
sin(πx)          154    424      2      4    197    192    480    646      8     10      –    134
cos(πx)          172    354      2      4    237    179    412    647      8     10      –    131
tan(πx)          234    382      2      4    237    178    655    604      2     10      –    131
√(−ln(x))        304    623      2     10    215    135    854    942      8     16      –    130
tan^2(πx)+1      132    282      2      4    194    215    991    720      2     10      –    135
Entropy          141    403      2      4    235    206   1370    914      2     16      –    128
Sigmoid          167    430      2      4    194    191    627    706      2     10      –    131
Gaussian         181    419      2      4    237    186    303    747      2     10    216    129
Average          189    422      2      4    217    185    683    767      3     11      –    131

Linear: Linear approximation [17]. Quad.: 2nd-order Chebyshev approximation. Freq.: Operating frequency.
–: NFGs cannot be mapped into the FPGA due to the excessive memory size.
Memory sizes are omitted in this table (see Table III).
Science (JSPS), funds from Ministry of Education, Culture,
Sports, Science, and Technology (MEXT) via Kitakyushu in-
novative cluster project, and NSA Contract RM A-54.
REFERENCES
[1] R. Andraka, “A survey of CORDIC algorithms for FPGA based com-
puters,” Proc. of the 1998 ACM/SIGDA Sixth Inter. Symp. on Field Pro-
grammable Gate Array (FPGA’98), pp. 191–200, Monterey, CA, Feb.
1998.
[2] J. Cao, B. W. Y. Wei, and J. Cheng, “High-performance architectures
for elementary function generation,” Proc. of the 15th IEEE Symp. on
Computer Arithmetic (ARITH’01), Vail, CO, pp. 136–144, June 2001.
[3] D. Defour, F. de Dinechin, and J.-M. Muller, “A new scheme for table-
based evaluation of functions,” 36th Asilomar Conference on Signals,
Systems, and Computers, Pacific Grove, California, pp. 1608–1613,
Nov. 2002.
[4] J. Detrey and F. de Dinechin, “Second order function approximation
using a single multiplication on FPGAs,” Proc. Inter. Conf. on Field
Programmable Logic and Applications (FPL’04), pp. 221–230, 2004.
[5] N. Doi, T. Horiyama, M. Nakanishi, and S. Kimura, “Minimization of
fractional wordlength on fixed-point conversion for high-level synthe-
sis,” Proc. of Asia and South Pacific Design Automation Conference
(ASPDAC’04), pp. 80–85, 2004.
[6] H. Hassler and N. Takagi, “Function evaluation by table look-up and
addition,” Proc. of the 12th IEEE Symp. on Computer Arithmetic
(ARITH’95), Bath, England, pp. 10–16, July 1995.
[7] T. Ibaraki and M. Fukushima, FORTRAN 77 Optimization Program-
ming, Iwanami, 1991 (in Japanese).
[8] Y. Iguchi, T. Sasao, and M. Matsuura, “Realization of multiple-output
functions by reconfigurable cascades,” International Conference on
Computer Design: VLSI in Computers and Processors (ICCD’01),
Austin, TX, pp. 388–393, Sept. 23–26, 2001.
[9] D.-U. Lee, W. Luk, J. Villasenor, and P. Y.K. Cheung, “Non-uniform
segmentation for hardware function evaluation,” Proc. Inter. Conf. on
Field Programmable Logic and Applications, pp. 796–807, Lisbon, Por-
tugal, Sept. 2003.
[10] D.-U. Lee, W. Luk, J. Villasenor, and P. Y.K. Cheung, “A hardware
Gaussian noise generator for channel code evaluation,” Proc. of the
11th Annual IEEE Symp. on Field-Programmable Custom Computing
Machines (FCCM’03), Napa, CA, pp. 69–78, April 2003.
[11] J. H. Mathews, Numerical Methods for Computer Science, Engineering
and Mathematics, Prentice-Hall, Inc., Englewood Cliffs, NJ, 1987.
[12] J.-M. Muller, Elementary Functions: Algorithms and Implementation,
Birkhäuser Boston, Inc., Secaucus, NJ, 1997.
[13] S. Nagayama and T. Sasao, “Compact representations of logic functions
using heterogeneous MDDs,” IEICE Trans. on fundamentals, Vol. E86-
A, No. 12, pp. 3168–3175, Dec. 2003.
[14] S. Nagayama, T. Sasao, and J. T. Butler, “Error analysis for pro-
grammable numerical function generators based on quadratic approx-
imation,” http://www.lsi-cad.com/Error-QNFG/.
[15] T. Sasao, M. Matsuura, and Y. Iguchi, “A cascade realization of
multiple-output function for reconfigurable hardware,” Inter. Workshop
on Logic Synthesis (IWLS’01), Lake Tahoe, CA, pp. 225–230, June 12–
15, 2001.
[16] T. Sasao, J. T. Butler, and M. D. Riedel, “Application of LUT cas-
cades to numerical function generators,” Proc. the 12th workshop on
Synthesis And System Integration of Mixed Information technologies
(SASIMI’04), Kanazawa, Japan, pp. 422–429, Oct. 2004.
[17] T. Sasao, S. Nagayama, and J. T. Butler, “Programmable numerical
function generators: architectures and synthesis method,” Proc. Inter.
Conf. on Field Programmable Logic and Applications (FPL’05), Tam-
pare, Finland, pp. 118–123, Aug. 2005.
[18] Scilab 3.0, INRIA-ENPC, France, http://scilabsoft.inria.fr/
[19] M. J. Schulte and J. E. Stine, “Approximating elementary functions
with symmetric bipartite tables,” IEEE Trans. on Comp., Vol. 48, No. 8,
pp. 842–847, Aug. 1999.
[20] J. E. Stine and M. J. Schulte, “The symmetric table addition method
for accurate function approximation,” Jour. of VLSI Signal Processing,
Vol. 21, No. 2, pp. 167–177, June 1999.
[21] J. E. Volder, “The CORDIC trigonometric computing technique,” IRE
Trans. Electronic Comput., Vol. EC-8, No. 3, pp. 330–334, Sept.
1959.
