Numerical function generators using LUT cascades by Sasao, Tsutomu et al.
Calhoun: The NPS Institutional Archive
Faculty and Researcher Publications Faculty and Researcher Publications
2007-06
Numerical function generators using
LUT cascades
Sasao, Tsutomu
T. Sasao, S. Nagayama and J. T. Butler, "Numerical function generators using LUT cascades,"




Tsutomu Sasao, Fellow, IEEE, Shinobu Nagayama, Member, IEEE, and
Jon T. Butler, Fellow, IEEE
Abstract—This paper proposes an architecture and a synthesis method for high-speed computation of fixed-point numerical functions
such as trigonometric, logarithmic, sigmoidal, square root, and combinations of these functions. Our architecture is based on the
lookup table (LUT) cascade, which results in a significant reduction in circuit complexity compared to traditional approaches. This is
suitable for automatic synthesis and we show a synthesis method that converts a Matlab-like specification into an LUT cascade design.
Experimental results show the efficiency of our approach as implemented on a field-programmable gate array (FPGA).




NUMERICAL functions such as trigonometric, logarithmic,square root, reciprocal, and combinations of these
functions are extensively used in computer graphics, digital
signal processing, communication systems, robotics, astro-
physics, fluid physics, and so forth. To compute elementary
functions, iterative algorithms such as the COordinate
Rotation DIgital Computer (CORDIC) algorithm [1], [35]
have often been used. Although the CORDIC algorithm
achieves accuracy with compact hardware, its computation
time is proportional to the precision (that is, the number of
bits) [34]. For a function composed of elementary functions,
the CORDIC algorithm is slower since it has to compute
each elementary function sequentially. It is too slow for
numerically intensive applications.
Implementation by a single lookup table (LUT) for a
numerical function fðxÞ is simple and fast. For low-
precision computations of fðxÞ (for example, x and fðxÞ
have 8 bits), this implementation is efficient since the size of
the LUT is small. For high-precision computations, how-
ever, the single LUT implementation is impractical due to
the huge table size.
To reduce the memory size, polynomial approximations
have beenused [4], [6], [13], [14], [19], [30], [31], [32], [33], [36].
These methods approximate the given numerical functions
by piecewise polynomials and realize the polynomials with
hardware.Linear orquadratic approximations offer fast and
relatively high-precision evaluation of numerical func-
tions. However, most existing methods are intended only
for standard elementary functions and no systematic
method for arbitrary functions exists. This paper considers
fast and compact numerical function generators (NFGs)
for arbitrary functions and proposes an architecture and a
systematic synthesis method for them. Our architecture
and synthesis method use linear approximations and are
applicable not only to continuous functions but also to
noncontinuous functions. The circuit’s compactness is due,
in large part, to the LUT cascade [12], [25], [26]. Our
synthesis method is automated so that fast and compact
NFGs can be produced by nonexperts.
Fig. 1 shows our synthesis system. The user specifies
only the numerical function fðxÞ, the domain over x, and
accuracies for input x and output fðxÞ. Our system can
accept an arbitrary numerical function expressed either
algebraically (for example, sinðxÞ) or as a table of input/
output values. The user defines the numerical function by
using the syntax of Scilab [29], a free Matlab-like numerical
software. This is applied directly to our system, along with
the domain and accuracy. The user can either use a defined
function in Scilab or specify it directly. Note that, by
changing the parser of our system, any format can be used
for the design entry.
First, our system partitions the given domain over x into
segments and generates the begin/end points of all the
segments. Next, an error analysis is performed to determine
the precision of the various internal computations, as well
as the word width of the memory needed to store the
various coefficients. The smallest internal precision needed
to achieve the (external) accuracy specified by the user is
chosen. Finally, the Hardware Description Language (HDL)
code is generated, which is then applied to a vendor-
supplied tool that produces the field-programmable gate
array (FPGA) implementation.
This paper is organized as follows: Section 2 introduces
terminology. Section 3 discusses a piecewise linear approx-
imation for numerical functions and proposes a nonuniform
826 IEEE TRANSACTIONS ON COMPUTERS, VOL. 56, NO. 6, JUNE 2007
. T. Sasao is with the Department of Computer Science and Electronics,
Kyushu Institute of Technology, Iizuka-shi, Fukuoka-ken, 820-8502, Japan.
E-mail: sasao@cse.kyutech.ac.jp.
. S. Nagayama is with the Department of Computer Engineering, Hiroshima
City University, Hiroshima-shi, Hiroshima-ken, 731-3194, Japan.
E-mail: s_naga@cs.hiroshima-cu.ac.jp.
. J.T. Butler is with the Department of Electrical and Computer Engineering,
Naval Postgraduate School, Code EC/Bu, Monterey, CA 93943-5121.
E-mail: Jon_Butler@msn.com.
Manuscript received 1 Feb. 2006; revised 23 Oct. 2006; accepted 9 Jan. 2007;
published online 20 Feb. 2007.
For information on obtaining reprints of this article, please send e-mail to:
tc@computer.org, and reference IEEECS Log Number TC-0034-0206.
Digital Object Identifier no. 10.1109/TC.2007.1033.
0018-9340/07/$25.00  2007 IEEE Published by the IEEE Computer Society
segmentation algorithm. Section 4 shows two different
architectures for NFGs. Section 5 presents an example NFG
design. Section 6 evaluates the performance of our
architecture and synthesis system. The error analysis for
NFGs is discussed in the Appendix.
2 PRELIMINARIES
2.1 Number Representation and Precision
Definition 1. The binary fixed-point representation of a
value r has the form
ðdl1 dl2 . . . d1 d0: d1 d2 . . . dmÞ2;
where di 2 f0; 1g, l is the number of bits for the integer part,
and m is the number of bits for the fractional part of r. This
representation is two’s complement and, so,




Definition 2. Error is the absolute difference between the exact
value and the value produced by the hardware. Approxima-
tion error is the error caused by a function approximation.
Rounding error is the error caused by a binary fixed-point
representation. It is the result of truncation or rounding,
whichever is applied. However, both operations yield an error
that is called rounding error. Acceptable error is the
maximum error that an NFG may assume. Acceptable
approximation error is the maximum approximation error
that a function approximation may assume.
Definition 3. Precision is the total number of bits for a binary
fixed-point representation. Specially, n-bit precision specifies
that n bits are used to represent the number, that is,
n ¼ lþm. An n-bit precision NFG has an n-bit input.
Definition 4. Accuracy is the number of bits in the fractional
part of a binary fixed-point representation. Specially, m-bit
accuracy specifies that m bits are used to represent the
fractional part of the number. In this paper, an m-bit
accuracy NFG is an NFG with an m-bit fractional part of the
input, an m-bit fractional part of the output, and a
2m acceptable error.
Definition 5. Truncation is the process of removing lower order
bits from a binary fixed-point number. If r is represented as
r ¼ ðdl1 dl2 . . . d0: d1 . . . dm . . . dqÞ2 and is trun-
cated to r0 ¼ ðdl1 dl2 . . . d0: d1 . . . dmÞ2, then the re-
sulting rounding error is at most 2m  2q, where m  q.
Truncation never produces a value larger than r.
Definition 6. Rounding is a truncation if dm1 ¼ 0. If
dm1 ¼ 1, then 2m is added to the result of the truncation.
That is, if r is represented as
r ¼ ðdl1 dl2 . . . d0: d1 . . . dm dm1 . . . dqÞ2;
then r is rounded to
r0 ¼ ðdl1 dl2 . . . d0: d1 . . . dmÞ2 þ dm12m:
The error caused by rounding is at most 2ðmþ1Þ.
Truncation can cause a larger error than rounding. On
the other hand, truncation requires less hardware. In our
architecture, we use both truncation and rounding. This is
discussed in the Appendix.
2.2 Binary Decision Diagram (BDD)
Definition 7. A binary decision diagram (BDD) [2] is a rooted
directed acyclic graph representing a logic function
f0; 1gn ! f0; 1g. The BDD is obtained by repeatedly applying
the Shannon expansion to the logic function. It consists of two
terminal nodes representing function values 0 and 1,
respectively, and nonterminal nodes representing input
variables. Each nonterminal node has two outgoing edges,
namely, 0-edge and 1-edge, that correspond to the values of
input variables.
Definition 8. A multiterminal BDD (MTBDD) [5] is an
extension of the BDD and represents an integer function
f0; 1gn ! Z, where Z is a set of integers. The MTBDD has
multiple terminal nodes representing integer values.
Example 1. Fig. 2a shows an integer function of five
variables. It represents a segment index function, which
will be explained in Section 4.2. Fig. 2b shows an
MTBDD for the integer function. In Fig. 2b, dashed lines
and solid lines denote 0-edges and 1-edges, respectively.
In the MTBDD, terminal nodes represent function
values. Thus, to evaluate the function, we traverse the
MTBDD from the root node to a terminal node according
to the input values and obtain the function value (an
integer) at a terminal node. This MTBDD will be used for
the design of the segment index encoder, as discussed in
Section 4.2.
SASAO ET AL.: NUMERICAL FUNCTION GENERATORS USING LUT CASCADES 827
Fig. 1. Synthesis system for NFGs.
3 PIECEWISE LINEAR APPROXIMATION BASED ON
NONUNIFORM SEGMENTATION
3.1 Segmentation Problem
A straightforward method to realize a given function fðxÞ is
to use a single memory in which the address is the binary
representation of the value of x and the content of that
address is the corresponding value of fðxÞ. However, with
this simple approach, the memory size can be very large.
For example, if x and fðxÞ are represented by 24-bit binary
numbers, then 224  24 ¼ 48 Mbytes of memory is needed.
In addition, memory is not used efficiently since adjacent
locations contain nearly the same value of fðxÞ.
A more efficient realization takes advantage of the last
observation. Divide a domain over x into segments and
realize fðxÞ in each segment as the linear function c1xþ c0.
In this way, the number of memory locations is the number
of segments. Although it is now necessary to store two
numbers, c1 and c0, the total memory needed is typically
much lower.
Many existing methods [4], [6], [13], [14], [19], [30], [31],
[32], [33], [36] use uniform segmentation, which divides a
domain over x into segments with the same size. In such a
segmentation, the segment size is determined by the least
significant bits of x. In this case, the most significant bits can
be used to specify a segment number, which is used to find
a memory location that stores c1 and c0. The number of
uniform segments is determined by the smallest segment
size needed to achieve the specified approximation error.
One then must choose the size of all segments to be the
same as this smallest size. This can yield too many
segments, depending on the function. Since the memory
size depends on the number of segments, one seeks a
segmentation with the fewest segments.
A more sophisticated approach is to choose all segment
sizes to be as large as possible while maintaining the
specified approximation accuracy. In this case, the segments
are likely to have different widths. Cantoni [3] has proposed
an algorithm to approximate a given function by using
nonuniform segments. This introduces a problem of design-
ing an additional circuit that maps values of x to a segment
number. Potentially, this is a complex circuit.
To simplify the additional circuit, Lee et al. [16] have
proposed a special nonuniform segmentation called the
hierarchical segmentation scheme. The hierarchical segmenta-
tion partitions a domain into nonuniform segments by
using one of four segmentation types, {P2S(US), P2SL(US),
P2SR(US), US(US)}. Since these nonuniform segmentations
can be realized by simple circuits, the hierarchical segmen-
tation method results in fewer segments, as well as faster
and more compact NFGs, than produced by existing
uniform segmentation methods. However, in their method,
the designer has to choose the segmentation type and
segment widths for the given function by trial and error.
We seek a method in which the optimum segmentation is
generated automatically, with no input from the user. It was
shown in [26] that an LUT cascade can compactly realize
any nonuniform segmentation in which the memory size
depends on the number of segments. As a result, for fast
and compact NFG design, we can use any nonuniform
segmentation algorithm, such as Cantoni’s algorithm [3],
and can design the nonuniform segmentation circuit for the
given function.





, where x has a 5-bit accuracy and the
acceptable approximation error is 27. The number of
uniform segments is 32, whereas the number of nonuni-
form segments is 6.
Note that the segmentation shown in Fig. 3b is
identical to the segmentation represented by the table
and MTBDD shown in Fig. 2.
3.2 Nonuniform Segmentation Algorithm
The algorithms proposed by Cantoni [3] find the piecewise
linear functions with the minimum approximation error for
the given number of segments. However, we want to find the
fewest segments to approximate a given function for a
given approximation error. Thus, we use another algorithm.
Fig. 4 presents a heuristic nonuniform segmentation
algorithm based on the Douglas-Peucker algorithm, which
has been used in rendering curves for graphics displays
[9]. Since this is a dedicated algorithm for piecewise linear
approximation, this cannot be directly extended to higher
order polynomial approximations. However, since this
algorithm is simple and robust, it can be applied to a wide
range of functions (even noncontinuous functions) and it is
fast. The inputs of the segmentation algorithm are a
numerical function fðxÞ, a domain ½a; b over x, an accuracy
min of x, and an acceptable approximation error "a. Note
that "a has to be smaller than 2
ðmoutþ1Þ to produce an
mout-bit accuracy NFG because the total rounding error is
larger than 2ðmoutþ1Þ (see the Appendix). In our system, "a is
set to 2ðmoutþ2Þ by default, but the user can choose any
"a < 2
ðmoutþ1Þ. This algorithm begins by forming one
segment over the whole domain ½a; b. This is an initial
piecewise approximation by a linear function whose end
points are ða; fðaÞÞ and ðb; fðbÞÞ. If the current segment fails
to provide an acceptable approximation error, then it is
partitioned into two segments joined at a point ðp; fðpÞÞ
828 IEEE TRANSACTIONS ON COMPUTERS, VOL. 56, NO. 6, JUNE 2007
Fig. 2. MTBDD for an integer function. (a) Function table. (b) MTBDD.
where the maximum error occurs. By iterating this process
until the two subsegments approximate fðxÞ to within the
acceptable approximation error, this algorithm finds the
nonuniform segmentation with a small number of seg-
ments. Note that this algorithm does not always produce
the fewest segments because it is a heuristic. In the Douglas-
Peucker algorithm, once a point is chosen, it never moves.
For example, in the sinðxÞ function for 0  x  1, the point
chosen after the two end points is x ¼ 1=2. Therefore, if the
optimum segmentation requires a segment with x ¼ 1=2
internal to the segment, then the optimum segmentation is
not achievable. However, the Douglas-Peucker algorithm
will place many segments in regions where the function is
nonlinear and fewer segments where it is close to linear.
Therefore, it is close to optimum. The correction values vi
are used to reduce the approximation errors. In Fig. 4,
maxfg and minfg denote the maximum positive error and
the maximum negative error, respectively. These errors are
equalized by a vertical shift of linear function gðxÞ with vi.
This vertical shift can produce minimax approximations
when fðxÞ is convex or concave on the segment ½s; e.
Remez’s algorithm is usually used to obtain minimax
approximations [20]. However, our algorithm can produce
minimax approximations without using Remez’s algorithm
because our algorithm is based on a linear approximation
[21]. In Fig. 4, pmax and pmin can be found by scanning
values of x over ½s; e. However, it is time consuming. We
use a nonlinear programming algorithm [11] to find these
values efficiently.
Example 3. Fig. 5 shows the segmentation process when the




over the domain [0, 1).
First, we compute a linear function gðxÞ and find pmax
and pmin (Fig. 5a). In this case, the entire function is
approximated as a single line and pmax is the point
causing the maximum error. In the next iteration of the
algorithm, the function is approximated by two seg-
ments joined at pmax.
Then, we compute the correction value v and the
approximation error (Fig. 5b). If the approximation error
is larger than the given acceptable approximation error
"a, then the segment is partitioned at the point pmax that
causes the maximum error (Fig. 5c). Finally, the nonuni-
form segmentation is obtained by iterating this step
recursively (Fig. 5d).
Example 4. For
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi lnðxÞp , where 232  x  1, our algo-
rithm produces 25 segments. It requires a coefficients
table with 32 words. On the other hand, the nonuniform
segmentation method of Lee et al. [17] produces
59 segments for
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi lnðxÞp . It requires a coefficients table
with 64 words. For both approximations, the maximum
approximation error is 0.020 and x has a 32-bit accuracy.
SASAO ET AL.: NUMERICAL FUNCTION GENERATORS USING LUT CASCADES 829





Fig. 4. Nonuniform segmentation algorithm for the domain.
3.3 Computation of Approximated Values
For each segment ½si; ei, the numerical function fðxÞ is
approximated by the corresponding linear function giðxÞ.
That is, the approximated value y of fðxÞ is computed as
y ¼ giðxÞ ¼ c1ixþ c0i; ð1Þ
where
c1i ¼ fðeiÞ  fðsiÞ
ei  si and c0i ¼ fðsiÞ  c1isi þ vi:
By substituting c0i into (1) and, simplifying, we have
giðxÞ ¼ c1iðx siÞ þ fðsiÞ þ vi: ð2Þ
Note that, in this form of the piecewise linear approxima-
tion, x is offset by si, which is the start of the segment.
Comparing (2) with (1) reveals that ðx siÞ will be (often
much) smaller than x. As a result, a smaller multiplier is
needed to realize (2) than (1). This results in a lower
multiplication delay.
When fðxÞ is convex or concave on the segment ½si; ei,
this linear function is identical to the minimax approxima-
tion to fðxÞ [21]. For many functions, there are few
inflection points and, so, in most segments, the function is
entirely concave or convex. Therefore, our piecewise linear
approximation yields an accurate representation of the
given function with few segments.
4 ARCHITECTURE FOR NFGS
4.1 Architectures Based on Nonuniform and
Uniform Segmentations
Fig. 6 shows two architectures for the NFGs realizing (2).
Fig. 6a shows an architecture based on nonuniform segmen-
tation. It uses five units: the segment index encoder that
produces the segment index i for ½si; ei, given the value x, the
830 IEEE TRANSACTIONS ON COMPUTERS, VOL. 56, NO. 6, JUNE 2007
Fig. 5. Process of nonuniform segmentation algorithm. (a) Linear approximation with pmax and pmin. (b) Vertical shift of gðxÞ and error. (c) First
segmentation. (d) Nonuniform segmentation.
Fig. 6. Two architectures for NFGs. (a) Architecture for nonuniform
segmentation. (b) Architecture for uniform segmentation.
coefficients table for si, c1i, and fðsiÞ þ vi, the adder for
xþ ðsiÞ, the multiplier, and the output adder. Fig. 6b
shows an architecture based on uniform segmentation. In
uniform segmentation, the segment index i of ½si; ei and the
value of ðx siÞ can be obtained from the most significant
bits and the least significant bits of x, respectively. There-
fore, the architecture in Fig. 6b does not use the segment
index encoder.
4.2 Architecture of Segment Index Encoder
Fig. 7a shows a segment index function
seg funcðxÞ : f0; 1gn ! f0; 1; . . . ; t 1g;
that is, the function realized by the segment index encoder,
where x has n-bit precision and t denotes the number of
segments. This is used in the case of nonuniform segmenta-
tion. It converts a value of x into a segment index, which is
then applied to the address inputs of the coefficients
memory. Fig. 7b shows the circuit. Each block labeled
LUT is a combinational logic circuit. Thus, the complete
circuit is combinational logic. For all but the leftmost LUT,
part of the input comes from the output of another LUT and
the other part of the input comes from x. For the leftmost
LUT, all inputs come from x. The outputs of the rightmost
LUT are the segment index encoder outputs and they
encode the segment index. These outputs and all connec-
tions between LUTs are called rails. With a fixed number of
input lines representing x, the number of rails determines
the complexity of this circuit. It was shown in [26] that the
LUT cascade in Fig. 7b has reasonable complexity.
Specifically, [26] shows the following lemma:
Lemma 1. Let seg funcðxÞ be a segment index function with
t segments. Then, there exists an LUT cascade for seg funcðxÞ
with at most dlog2 te rails.
From Lemma 1, we have the following theorem:
Theorem 1. Consider a segment index function
seg funcðxÞ : f0; 1gn ! f0; 1; . . . ; t 1g, where n is the
precision of x and t is the number of segments. Then,
seg funcðxÞ can be implemented by an LUT cascade, as
shown in Fig. 7, where each LUT has kþ 2 inputs and
k outputs (rails), the number of LUTs is dnk2 e, the total
memory size is 2kþ1kðn kÞ bits, and k ¼ dlog2 te.
Proof. By Lemma 1, the number of the rails of the LUT
cascade is at most k ¼ dlog2 te. In addition, we can easily
see that the LUT cascade consists of dnk2 e LUTs, where
each LUT has kþ 2 inputs and k outputs. When ðn kÞ is
an even number, the total memory size of the LUT
cascade is
2kþ2  k n k
2
¼ 2kþ1kðn kÞ bits:
When ðn kÞ is an odd number, the memory size of the
rightmost LUT is 2kþ1  k bits. Thus, the total memory
size of the LUT cascade is
2kþ2  k n k 1
2
þ 2kþ1  k ¼ 2kþ1kðn kÞ bits:
Hence, we have the theorem. tu
The above theorem shows that the memory size of the
LUT cascade depends only on the precision n and the
number of segments t. Thus, an approximation with fewer
segments requires a smaller segment index encoder. We can
use Theorem 1 to estimate the memory size of the segment
index encoder from n and t. This will be shown in
Section 4.4.
The LUT cascade is obtained by functional decomposi-
tion using an MTBDD for seg funcðxÞ [12], [25], [26]. Our
synthesis system first finds a nonuniform segmentation,
then generates the MTBDD, and, finally, decomposes it to
find the LUT cascade.
Example 5. Consider the segment index function in Fig. 2a.
Let the binary fixed-point representation of x be
ð: x1 x2 x3 x4 x5Þ2. Fig. 2b is the corresponding
MTBDD. By decomposing the MTBDD in Fig. 2, we
obtain two different LUT cascades in Fig. 8. Fig. 8
illustrates the relationship between each LUT and
decompositions of the MTBDD. In these figures, the
“ri” in each LUT denotes the rails that represent
subfunctions in the MTBDD. In the MTBDD, numbers
assigned to edges that cut across the horizontal lines
represent subfunctions. The LUT cascade in Fig. 8a
requires a memory size of 23  3þ 24  3þ 24  3 ¼ 120
bits and three LUTs. On the other hand, the LUT cascade
in Fig. 8b requires a memory size of 24  3þ 24  3 ¼ 96
bits and two LUTs. Note that the best realization is the
LUT cascade consisting of a single LUT. As shown in
Theorem 1, since k ¼ dlog2 6e ¼ 3, there exists an LUT
cascade consisting of a single LUT with 3þ 2 inputs and
3 outputs. Its memory size is 25  3 ¼ 96 bits.
As shown in Example 5, the memory size and the
number of LUTs in an LUT cascade also depend on the
decomposition and the variable order of the MTBDD. To
find the best LUT cascade, we use an optimization
algorithm for heterogeneous multivalued decision dia-
grams (MDDs) [22], [23]. In an FPGA implementation of
an NFG, each LUT in an LUT cascade is implemented with
SASAO ET AL.: NUMERICAL FUNCTION GENERATORS USING LUT CASCADES 831
Fig. 7. Segment index encoder. (a) Segment index function. (b) LUT
cascade.
a block RAM in an FPGA. In this case, our synthesis system
decomposes an MTBDD so that each LUT has the same size
as the available block RAM (for example, 4 or 18 Kbits).
Note that this automatically produces a pipelined imple-
mentation of an LUT cascade since block RAMs in a
modern FPGA are synchronous.
To the best of our knowledge, this is the first realization
method for any general segment index function.
4.3 Size Reduction of Multiplier
The coefficients table in Fig. 6a has 2k words, where
k ¼ dlog2 te, and t is the number of segments. FPGA
technology restricts memory sizes to a power of 2. There-
fore, if t < 2k, then we can increase the number of segments
up to t ¼ 2k, without increasing the memory size. From
Lemma 1, the size of the LUT cascade also depends on the
value of k rather than t. Thus, increasing the number of
segments to t ¼ 2k rarely increases the size of the LUT
cascade. To reduce the multiplier size, we reduce the
maximum value of ðx siÞ (that is, the maximum width of
segments ½si; ei that is defined as ðei  siÞ) by dividing the
widest segment into two equal-sized segments up to t ¼ 2k.
This reduction of the segment width reduces the rounding
error as well (see the Appendix).
To reduce the multiplier size further, we use the scaling
method shown in [15]. We represent c1i as c1i ¼ c1i  2hi 
2hi and compute c1iðx siÞ by using the multiplier for the
right-shifted multiplicand c1i  2hiðx siÞ and a left shifter
to restore the term. Applying a right shift reduces the
number of bits for c1i  2hi (that is, the multiplier size) by
rounding the least significant hi bits of c1i while increasing
the rounding error. In the Appendix, we find the largest hi
for each segment i that preserves the given acceptable error.
When the largest his are 0 for all of the segments, we do not
use the scaling method. For simplicity, Fig. 6a omits the
scheme for the scaling method.
Note that these techniques utilizing the variation in
segment sizes do not apply to the architecture based on
uniform segmentation.
Example 6. Consider a 16-bit-precision NFG for
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi lnðxÞp
for 0 < x  1 and its FPGA implementation with the
Altera Stratix EP1S20F484C5. By using the scaling
method, c1i and hi have 16 and 3 bits, respectively, and
they produce an NFG with 241 logic elements (LEs), two
multipliers (digital signal processors (DSPs)), a memory
size of 168,960 bits, an operating frequency of 208 MHz,
and a delay of 38 ns. On the other hand, without the
scaling method, c1i has 22 bits and it produces an NFG
with fewer LEs (153 LEs) but more DSPs (eight DSPs), a
larger memory size (172,032 bits), a lower operating
frequency (130 MHz), and a longer delay (54 ns).
4.4 Memory Size Estimate for Each Architecture
Our synthesis system estimates the memory size for each of
the two architectures in Fig. 6 and generates the HDL code
that implements the architecture with the smaller memory
size. That is, our synthesis system chooses an architecture
based on the estimate of the memory size. The estimate is
discussed below.
4.4.1 Memory Size Estimate for Nonuniform
Segmentation
The coefficients table in Fig. 6a has 2k words, where
k ¼ dlog2 te, and t is the number of nonuniform segments
obtained by the algorithm in Fig. 4. We assume that three
coefficients, si, c1i, and fðsiÞ þ vi, have mout bits, respec-
tively, where mout is the output accuracy of NFG given by
the specification. Then, the memory size of the coefficients
table is estimated as
2k  3mout bits: ð3Þ
We assume that when we generate an LUT cascade with
the minimum memory size, each LUT in the LUT cascade
has kþ 2 inputs and k outputs (the number 2 of kþ 2 is
based on the results in [28]). Then, from Theorem 1, the
memory size of the LUT cascade (segment index encoder) is
estimated as
2kþ1  kðnin  kÞ bits; ð4Þ
832 IEEE TRANSACTIONS ON COMPUTERS, VOL. 56, NO. 6, JUNE 2007
Fig. 8. Example of LUT cascades, where x ¼ ð: x1 x2 x3 x4 x5Þ2. (a) Decomposition into three LUTs. (b) Decomposition into two LUTs.
where nin is the input precision of NFG given by the
specification.
The total memory size of an NFG based on nonuniform
segmentation is estimated as (3) + (4):
2k  f3mout þ 2kðnin  kÞg bits:
4.4.2 Memory Size Estimate for Uniform Segmentation
First, we estimate the number of uniform segments by using
the nonuniform segmentation results obtained by the
algorithm in Fig. 4. Let ½a; b and ½smin; emin be the given
domain and the narrowest segment produced by the
algorithm, respectively. Then, the number of uniform
segments tu is estimated as






w ¼ log2ðemin  smin þ 2minÞb c
and min is the input accuracy of the NFG given by the
specification. Thus, the coefficients table in Fig. 6b has
2ku words, where ku ¼ dlog2 tue. We assume that the
coefficients have the same number of bits as in the
nonuniform case. Then, the memory size of the coefficients
table is estimated as
2ku  2mout ¼ 2kuþ1mout bits:
This is the total memory size estimate of the NFG based on
uniform segmentation. Note that, when ku ¼ nin, the total
memory size is estimated as a single LUT of size 2ninnout,
where nout is the output precision.
5 EXAMPLE DESIGN
In this section, we show in detail the design flow of a 5-bit





for 0  x < 1. For simplicity, we do
not use the multiplier reduction techniques shown in
Section 4.3.
First, we partition the given domain 0  x < 1 into
nonuniform segments by using the algorithm shown in
Fig. 4. This produces six segments, as shown in Fig. 3b.
Note that the acceptable approximation error is set to
2ð5þ2Þ ¼ 27, as discussed in Section 3.2.
Then, we analyze the errors of the NFG and compute the
number of bits needed to store coefficients by the method
shown in the Appendix. Applying this analysis yields the
following required number of bits: si, 4 bits; c1i, 9 bits
(3 integer bits and 6 fractional bits); and c0i, 7 bits
(7 fractional bits).
Finally, we design the NFG shown in Fig. 6a by using
these required numbers of bits. From the segments
produced by the algorithm in Fig. 4, we generate the
MTBDD, as shown in Fig. 2. By decomposing the MTBDD
as shown in Fig. 8b, we obtain the LUT cascade. Moreover,
from the segments and correction values, we generate the
linear functions and then the memory data for the
coefficients table. Since the number of segments is six, the
memory size of the coefficients table is
2dlog2 6e  ð4þ 9þ 7Þ ¼ 23  20 ¼ 160 bits
and the memory size of the LUT cascade is
24  3þ 24  3 ¼ 96 bits:
Thus, the total memory size of the NFG is 256 bits. Fig. 9
shows the generated NFG.
Our synthesis system automates this design flow.
6 EXPERIMENTAL RESULTS
6.1 Comparison of the Number of Uniform and
Nonuniform Segments
Table 1 compares uniform and nonuniform segmentations
for two acceptable approximation errors: 217 and 225.
Eleven functions are shown. The last one, Matlab’s
“humps” function, is a quotient of polynomials:
humps ¼ 0:0004xþ 0:0002
x4  1:96x3 þ 1:348x2  0:378xþ 0:0373 :
Table 1 shows that, for all functions, nonuniform
segmentation yields fewer segments than uniform segmen-
tation. Especially for logarithmic, square root, reciprocal,
and their compound functions, the number of segments
needed in nonuniform segmentation is only a few percent
of the number of segments needed in uniform segmenta-
tion. However, for the sinðxÞ and ex functions, the
additional segments needed in a uniform segmentation is
not so large—only about twice that needed for a nonuni-
form segmentation. This suggests that uniform segmenta-
tion is a better choice for these functions. Many existing
polynomial approximation methods are based on uniform
segmentation. For the sinðxÞ and ex functions, the methods
based on uniform segmentation require relatively few
segments and, therefore, a small memory size. However,
for logarithmic, square root, reciprocal, and their compound
functions on the domains listed in Table 1, the uniform
SASAO ET AL.: NUMERICAL FUNCTION GENERATORS USING LUT CASCADES 833





method requires an excessive number of segments. For
standard elementary functions such as 1=x, lnðxÞ, and ffiffiffixp ,
range reduction [20] is often used to translate the given
domain into the domain suitable for uniform segmentation
and to reduce the number of segments. Thus, the uniform
method requires range reduction for these functions even if
the given domain is a small range, as shown in Table 1. On
the other hand, the nonuniform method approximates a
wide range of functions with fewer segments without using
range reduction. That is, our method requires no range
reduction when the given domain is not large. This is useful
for nonexperts who are unfamiliar with range reduction
and for functions for which there is no known range
reduction technique, as in the case of compound functions.
In addition, Table 1 shows that the CPU time to compute
the segmentation is strongly dependent on the number of
segments. Therefore, a smaller acceptable approximation
error requires more segments and longer CPU time.
However, for all of the functions in Table 1, the CPU times
needed to compute nonuniform segmentations are shorter
than 2 sec when the acceptable approximation error is 225.
These results show that, for various functions, our
segmentation algorithm generates fewer nonuniform seg-
ments quickly and it is useful for automatic synthesis.
6.2 Memory Sizes for Two Architectures
Table 2 compares the memory sizes for the two architec-
tures shown in Fig. 6. These correspond to nonuniform and
uniform segmentations. In Table 2, the values under the
column “Estimated mem. size” are obtained by the
estimates derived in Section 4.4. As shown in Section 4.2,
the memory size of the LUT cascade depends on the
decomposition and variable order of the MTBDD. In this
experiment, we used the optimum decomposition and
variable order that minimize the memory size [22], [23].
Table 2 shows that, for logarithmic, square root, reciprocal,
and their compound functions, the memory size for
nonuniform segmentation is smaller than that for the
uniform segmentation. On the other hand, for sinðxÞ and
ex functions, the memory size for uniform segmentation is
smaller than that for the nonuniform segmentation. As
shown in Table 1, for sinðxÞ and ex functions, the
difference between the numbers of uniform and nonuni-
form segments is not so large. Thus, for such functions, the
architecture based on uniform segmentation requires a
smaller memory size than that based on nonuniform
segmentation because it needs no segment index encoder.
In addition, Table 2 shows that the estimated memory sizes
are accurate. Designing the two architectures for each
function can take minutes of CPU time. To reduce the
design time, our synthesis system first estimates the
memory sizes, then selects the better architecture, and,
finally, designs only the better one.
As mentioned in the previous section, for 1=x, lnðxÞ, andffiffiffi
x
p
, the number of uniform segments can be reduced by
using range reduction, thus reducing the memory size for
834 IEEE TRANSACTIONS ON COMPUTERS, VOL. 56, NO. 6, JUNE 2007
TABLE 1
Numbers of Uniform and Nonuniform Segments
TABLE 2
Memory Sizes (Bits) for Uniform and Nonuniform Segmentations
uniform segmentation. Even in such cases (reduced
domains), our memory size estimates are accurate and, so,
our synthesis system accommodates range reduction.
6.3 FPGA Implementation
We implemented 16-bit-precision NFGs by using the Altera
Stratix EP1S20F484C5 FPGA. Note that both of our
architectures contain only combinational logic. For applica-
tion-specific integrated circuit (ASIC) implementations,
such a design yields the smallest delay. It represents one
extreme in a range of NFG designs. However, in an FPGA
implementation, this form does not always yield the smallest
delay due to the interconnection delay. The pipelined design
often achieves higher performance. In addition, modern
FPGAs have synchronous block RAMs and this requires
pipelined implementations. Thus, we adopt another extreme
design, that is, a fully pipelined design. In this design, all
elements, LUTs, adders, multipliers, and memory are
buffered. Note that the delay of a small unit such as an adder
is much smaller than a larger combinational logic circuit.
Table 3 compares the FPGA implementation results of two
architectures shown in Fig. 6. In this table, the “Delay”
columns showthe total delay timeof eachNFGfrom the input
to the output in nanoseconds. These results correspond to the
delays of combinational NFGs.
The NFGs based on uniform segmentation require fewer
pipeline stages and have shorter delays than the nonuniform
segmentation because they have no segment index
encoder. However, for logarithmic, square root, and
reciprocal functions, the NFGs based on uniform segmen-
tation are not so easily implemented in an FPGA due to
the excessive memory size. Table 3 shows that they cannot
be mapped into the FPGA due to the insufficient memory
blocks. Note that NFGs that have only one pipelined stage
in Table 3 are realized with a single LUT due to the
excessively many segments. On the other hand, for various
functions, the NFGs based on nonuniform segmentation
achieve a high operating frequency.
To show the efficiency of our automatic synthesis
system, we compare our NFGs with the ones reported in
[15] and [16]. NFGs in [15] are based on nonuniform
segmentation and linear approximation and they are
manually designed. Moreover, NFGs in [16] are not only
also based on nonuniform segmentation, but also on
quadratic approximation. We generated the NFGs with the
same precision as in [15] and [16]. Note that, in [15], only
one result for three functions is reported. Thus, the detailed
result for each function is unavailable. Table 4 compares
FPGA implementation results for the NFGs. In our NFGs,
the segment index encoders are realized only by RAMs.
Thus, our NFGs require more block RAMs than the NFGs in
[15] and [16]. However, our NFGs require fewer slices and
achieve a shorter delay time.
SASAO ET AL.: NUMERICAL FUNCTION GENERATORS USING LUT CASCADES 835
TABLE 3
FPGA Implementation of NFGs Based on Two Architectures
TABLE 4
Comparison with Existing Methods
In [16], the 24-bit-precision NFGs for x lnðxÞ and the
humps functions are implemented with the FPGA
(XC2V4000) in Table 4. On the other hand, our 24-bit NFGs
for those functions require a larger FPGA (Virtex-II Pro,
XC2VP70) since our NFGs are based on linear approxima-
tion and require more block RAMs. Although NFGs based
on linear approximation require a larger memory size than
NFGs based on quadratic approximation, they are faster.
This is because linear approximation requires only one
multiplier [24].
7 CONCLUSION AND COMMENTS
We have proposed an architecture and a synthesis method
for programmable NFGs for trigonometric functions,
logarithm functions, square root, reciprocal, and so forth.
Our architecture uses the LUT cascade to compactly realize
various numerical functions and is suitable for automatic
synthesis. To the best of our knowledge, this is the first
architecture appropriate for any nonuniform segmentation.
Experimental results show that our synthesis system
1) efficiently implements NFGs for wide range of numerical
functions, 2) generates more suitable NFGs for the given
functions between two architectures by using an accurate
estimate of memory sizes, and 3) generates the NFGs with
comparable performance to manually designed ones.
Since our NFGs are based on linear approximation, the
practical precision for an FPGA implementation is up to
24 bits because of the memory size. Therefore, our method
is useful to generate fast NFGs for a wide range of functions
in up to 24-bit precision.
APPENDIX A
We analyze the error for our NFGs and show a method to
obtain the appropriate bit sizes for the units in our
architectures. Our synthesis system applies the error
analysis method proposed in [8]. We show only the error
analysis for the NFG using nonuniform segmentation in
Fig. 6a. The analysis for the NFG using uniform segmenta-
tion in Fig. 6b is similar.
A.1 Error of NFG
In our NFG, there are two kinds of errors: approximation
error and rounding error. The approximation error, "a, was
discussed in Section 3. Thus, this section focuses on
rounding error. Since truncation and rounding occur only
in the fractional bits, we ignore the integer bits in this
analysis. We assume that there is at least one fractional bit.
Errors in coefficients. The coefficients, c1i and fðsiÞ þ vi,
used in the linear approximation giðxÞ ¼ c1iðx siÞ þ
fðsiÞ þ vi are rounded to u1 and u0 bits, respectively. That
is, due to rounding, the coefficients c1i and fðsiÞ þ vi
become
c1i þ 1 ðj1j  2ðu1þ1ÞÞ
fðsiÞ þ vi þ 0 ðj0j  2ðu0þ1ÞÞ;
where 1 and 0 are rounding errors of c1i and fðsiÞ þ vi,
respectively. In this case, we need no hardware for
rounding coefficients because they are precomputed before
storing in the coefficients table. Since rounding yields a
more accurate result than truncation, we choose rounding.
Errors in multiplication. When input x has min bits, our
synthesis system ensures that si also hasmin bits. Therefore,
the result of the subtraction operation ðx siÞ also has
min bits and has no rounding error. As shown in the
previous paragraph, the value of c1i is stored as c1i þ 1 in
the coefficients table. Thus, the original product c1iðx siÞ
is changed to
ðc1i þ 1Þðx siÞ ¼ c1iðx siÞ þ 1ðx siÞ: ðiÞ
The value of (i) has ðu1 þminÞ bits and it is truncated to
u0 bits in order to match the addition with fðsiÞ þ vi. Note
that u0  u1 þmin. We choose truncation because it does
not require additional hardware, whereas rounding re-
quires half adders at the multiplier output. Following the
method shown in [30], the ðu0 þ 1Þth fractional bit dðu0þ1Þ
in (i) is set to 1 after truncation. The rounding error 2 for
truncation from ðu1 þminÞ to u0 bits is
0  2  2ðu0þ1Þ  2ðu1þminÞ:
Thus, the linear part c1iðx siÞ of the approximation has the
value
c1iðx siÞ þ 1ðx siÞ  2:
This is added to the constant coefficient fðsiÞ þ vi to form
the complete approximation.
Errors in addition. From the previous paragraphs, the
original sum c1iðx siÞ þ fðsiÞ þ vi is changed to
c1iðx siÞ þ fðsiÞ þ vi þ 0 þ 1ðx siÞ  2: ðiiÞ
This value has u0 bits and this is rounded to mout bits (the
output accuracy given in the specification), where
mout  u0. Since dðu0þ1Þ in (i) is set to 1, dðu0þ1Þ in (ii) is
also 1. Note that dðu0þ1Þ is not implemented in hardware
[30]. The error 3 for this rounding is
j3j  2ðmoutþ1Þ  2ðu0þ1Þ:
Thus, the value of the approximation giðxÞ is
c1iðx siÞ þ fðsiÞ þ vi þ 0 þ 1ðx siÞ  2 þ 3:
It follows that
0 þ 1ðx siÞ  2 þ 3 ðiiiÞ
is the total rounding error for an NFG implemented with
nonuniform segmentation. The absolute value of (iii) is
maximum when the rounding errors are
0 ¼  2ðu0þ1Þ; 1 ¼ 2ðu1þ1Þ;
2 ¼ 2ðu0þ1Þ  2ðu1þminÞ; and
3 ¼  ð2ðmoutþ1Þ  2ðu0þ1ÞÞ:
Since the smallest value of x in the ith segment is the
segment’s starting point si, the value of ðx siÞ is positive.
Therefore, the maximum total rounding error "r is
"r ¼ 2ðu0þ1Þ þ 2ðu1þ1Þmax seg 2ðu1þminÞ
þ 2ðmoutþ1Þ;
where
max seg ¼ max
0it1
ðei  siÞ
and t is the number of segments.
836 IEEE TRANSACTIONS ON COMPUTERS, VOL. 56, NO. 6, JUNE 2007
A.2 Calculation of Bit Sizes for Units
For any unit in Fig. 6a, the number of bits in the integer part is
dlog2ðmax valueþ 1Þe þ 1;
where max value is an integer that denotes the maximum
absolute value of output of each unit. That is,
max value ¼ maxðjmax valj; jmin valjÞ, where max val and
min val are the maximum and minimum values, respec-
tively, of the unit’s output.
On the other hand, the number of bits for the fractional
part is calculated using the result of the error analysis. From
the error analysis, an NFG with an acceptable error " is
achieved when
"a þ "r < "; ðivÞ
where "a and "r are the approximation error and the
rounding error, respectively. To generate fast and compact
NFGs, we find the minimum u0 and u1 that satisfy (iv) by
using nonlinear programming [11], where the objective and
constraint are
Minimize : u0 þ u1
Constraint : "a þ "r < ":
Calculation of shift bit hi. From (i), the maximum error
in the multiplication is
2ðu1þ1Þmax seg;
where max seg is the maximum value of ðei  siÞ over all
segments i. There is the largest positive integer hi that
satisfies the following relation for each segment i:
2ðu1þ1Þ2hiðei  siÞ  2ðu1þ1Þmax seg
2ðu1hiþ1Þðei  siÞ  2ðu1þ1Þmax seg:
ðvÞ
From (v), for each segment, coefficient c1i can be rounded to
ðu1  hiÞ bits within an acceptable error. Note that hi, as
shown in (vi), can be used in the multiplier scaling method













This research is partly supported by the Grant-in-Aid for
Scientific Research of the Japan Society for the Promotion of
Science (JSPS), funds from the Ministry of Education,
Culture, Sports, Science, and Technology (MEXT) via the
Kitakyushu innovative cluster project, and a contract with
the US National Security Agency. The authors appreciate
Dr. Marc D. Riedel for the work in [26]. The comments of
the three reviewers were useful in improving this paper.
This paper is an extended version of [27].
REFERENCES
[1] R. Andraka, “A Survey of CORDIC Algorithms for FPGA-Based
Computers,” Proc. Sixth ACM/SIGDA Int’l Symp. Field Program-
mable Gate Array (FPGA ’98), pp. 191-200, Feb. 1998.
[2] R.E. Bryant, “Graph-Based Algorithms for Boolean Function
Manipulation,” IEEE Trans. Computers, vol. 35, no. 8, pp. 677-
691, Aug. 1986.
[3] A. Cantoni, “Optimal Curve Fitting with Piecewise Linear
Functions,” IEEE Trans. Computers, vol. 20, no. 1, pp. 59-67, Jan.
1971.
[4] J. Cao, B.W.Y. Wei, and J. Cheng, “High-Performance Architec-
tures for Elementary Function Generation,” Proc. 15th IEEE Symp.
Computer Arithmetic (ARITH ’01), pp. 136-144, June 2001.
[5] E.M. Clarke, K.L. McMillan, X. Zhao, M. Fujita, and J. Yang,
“Spectral Transforms for Large Boolean Functions with Applica-
tions to Technology Mapping,” Proc. 30th ACM/IEEE Design
Automation Conf., pp. 54-60, June 1993.
[6] F. de Dinechin and A. Tisserand, “Multipartite Table Methods,”
IEEE Trans. Computers, vol. 54, no. 3, pp. 319-330, Mar. 2005.
[7] N. Doi, T. Horiyama, M. Nakanishi, and S. Kimura, “Minimization
of Fractional Wordlength on Fixed-Point Conversion for High-
Level Synthesis,” Proc. Asia and South Pacific Design Automation
Conf. (ASPDAC ’04), pp. 80-85, 2004.
[8] N. Doi, T. Horiyama, M. Nakanishi, and S. Kimura, “An
Optimization Method in Floating-Point to Fixed-Point Conversion
Using Positive and Negative Error Analysis and Sharing of
Operations,” Proc. 12th Workshop Synthesis and System Integration of
Mixed Information Technologies (SASIMI ’04), pp. 466-471, Oct. 2004.
[9] D.H. Douglas and T.K. Peucker, “Algorithms for the Reduction of
the Number of Points Required to Represent a Line or Its
Caricature,” The Canadian Cartographer, vol. 10, no. 2, pp. 112-122,
1973.
[10] H. Hassler and N. Takagi, “Function Evaluation by Table Look-
Up and Addition,” Proc. 12th IEEE Symp. Computer Arithmetic
(ARITH ’95), pp. 10-16, July 1995.
[11] T. Ibaraki and M. Fukushima, FORTRAN 77 Optimization
Programming, 1991 (in Japanese).
[12] Y. Iguchi, T. Sasao, and M. Matsuura, “Realization of Multiple-
Output Functions by Reconfigurable Cascades,” Proc. Int’l Conf.
Computer Design: VLSI in Computers and Processors (ICCD ’01),
pp. 388-393, Sept. 2001.
[13] V.K. Jain, S.A. Wadekar, and L. Lin, “A Universal Nonlinear
Component and Its Application to WSI,” IEEE Trans. Components,
Hybrids, and Manufacturing Technology, vol. 16, no. 7, pp. 656-664,
Nov. 1993.
[14] V.K. Jain and L. Lin, “High-Speed Double Precision Computation
of Nonlinear Functions,” Proc. 12th IEEE Symp. Computer
Arithmetic (ARITH ’95), pp. 107-114, July 1995.
[15] D.-U. Lee, W. Luk, J. Villasenor, and P.Y.K. Cheung, “Non-
Uniform Segmentation for Hardware Function Evaluation,” Proc.
11th Int’l Conf. Field Programmable Logic and Applications (FPL ’03),
pp. 796-807, Sept. 2003.
[16] D.-U. Lee, W. Luk, J. Villasenor, and P.Y.K. Cheung, “Hierarchical
Segmentation Schemes for Function Evaluation,” Proc. IEEE Conf.
Field-Programmable Technology (FPT ’03), pp. 92-99, Dec. 2003.
[17] D.-U. Lee, W. Luk, J. Villasenor, and P.Y.K. Cheung, “A Gaussian
Noise Generator for Hardware-Based Simulations,” IEEE Trans.
Computers, vol. 53, no. 12, pp. 1523-1534, Dec. 2004.
[18] D.-U. Lee, A.A. Gaffar, O. Mencer, and W. Luk, “Optimizing
Hardware Function Evaluation,” IEEE Trans. Computers, vol. 54,
no. 12, pp. 1520-1531, Dec. 2005.
[19] D.M. Lewis, “Interleaved Memory Function Interpolators with
Application to an Accurate LNS Arithmetic Unit,” IEEE Trans.
Computesr, vol. 43, no. 8, pp. 974-982, Aug. 1994.
[20] J.-M. Muller, Elementary Function: Algorithms and Implementation.
Birkhauser Boston, 1997.
[21] J.-M. Muller, “Partially Rounded Small-Order Approximations for
Architecture, Hardware-Oriented, Table-Based Methods,” Proc.
16th IEEE Symp. Computer Arithmetic (ARITH ’03), pp. 114-121,
June 2003.
[22] S. Nagayama and T. Sasao, “Compact Representations of Logic
Functions Using Heterogeneous MDDs,” IEICE Trans. Fundamen-
tals, vol. E86-A, no. 12, pp. 3168-3175, Dec. 2003.
[23] S. Nagayama, T. Sasao, “On the Optimization of Heterogeneous
MDDs,” IEEE Trans. Computer-Aided Design of Integrated Circuits
and Systems, vol. 24, no. 11, pp. 1645-1659, Nov. 2005.
SASAO ET AL.: NUMERICAL FUNCTION GENERATORS USING LUT CASCADES 837
[24] S. Nagayama, T. Sasao, and J.T. Butler, “Programmable Numerical
Function Generators Based on Quadratic Approximation: Archi-
tecture and Synthesis Method,” Proc. Asia and South Pacific Design
Automation Conf. (ASPDAC ’06), pp. 378-383, 2006.
[25] T. Sasao and M. Matsuura, “A Method to Decompose Multiple-
Output Logic Functions,” Proc. 41st Design Automation Conf. (DAC
’04), pp. 428-433, June 2004.
[26] T. Sasao, J.T. Butler, and M.D. Riedel, “Application of LUT
Cascades to Numerical Function Generators,” Proc. 12th Workshop
Synthesis and System Integration of Mixed Information Technologies
(SASIMI ’04), pp. 422-429, Oct. 2004.
[27] T. Sasao, S. Nagayama, and J.T. Butler, “Programmable Numerical
Function Generators: Architectures and Synthesis Method,” Proc.
Int’l Conf. Field-Programmable Logic and Applications (FPL ’05),
pp. 118-123, Aug. 2005.
[28] T. Sasao, “Design Methods for Multiple-Valued Input Address
Generators,” Proc. 36th IEEE Int’l Symp. Multiple-Valued Logic
(ISMVL ’06), May 2006.
[29] Scilab 3.0, INRIA-ENPC France, http://scilabsoft.inria.fr/, 2004.
[30] M.J. Schulte and J.E. Stine, “Symmetric Bipartite Tables for
Accurate Function Approximation,” Proc. 13th IEEE Symp.
Computer Arithmetic (ARITH ’97), vol. 48, no. 9, pp. 175-183, 1997.
[31] M.J. Schulte and J.E. Stine, “Approximating Elementary Functions
with Symmetric Bipartite Tables,” IEEE Trans. Computers, vol. 48,
no. 8, pp. 842-847, Aug. 1999.
[32] M.J. Schulte and E.E. Swartzlander Jr., “Hardware Designs for
Exactly Rounded Elementary Functions,” IEEE Trans. Computers,
vol. 43, no. 8, pp. 964-973, Aug. 1994.
[33] J.E. Stine and M.J. Schulte, “The Symmetric Table Addition
Method for Accurate Function Approximation,” J. VLSI Signal
Processing, vol. 21, no. 2, pp. 167-177, June 1999.
[34] N. Takagi, T. Asada, and S. Yajima, “Redundant CORDIC
Methods with a Constant Scale Factor for Sine and Cosine
Computation,” IEEE Trans. Computers, vol. 40, no. 9, pp. 989-995,
Sept. 1991.
[35] J.E. Volder, “The CORDIC Trigonometric Computing Technique,”
IRE Trans. Electronic Computers, vol. 8, no. 3, pp. 330-334, Sept.
1959.
[36] W. Wong and E. Goto, “Fast Evaluation of the Elementary
Functions in Single Precision,” IEEE Trans. Computers, vol. 44,
no. 3, pp. 453-457, Mar. 1995.
Tsutomu Sasao (S’72-M’77-SM’90-F’94) re-
ceived the BE, ME, and PhD degrees in
electronics engineering from Osaka University,
Osaka, Japan, in 1972, 1974, and 1977,
respectively. He has held faculty/research posi-
tions at Osaka University, Japan, the IBM T.J.
Watson Research Center, Yorktown Heights,
New York, and the Naval Postgraduate School,
Monterey, California. He is now a professor in
the Department of Computer Science and
Electronics at the Kyushu Institute of Technology, Iizuka, Japan. He
has served as the program chairman for the IEEE International
Symposium on Multiple-Valued Logic (ISMVL) many times. In addition,
he was the symposium chairman of the 28th ISMVL held in Fukuoka,
Japan, in 1998. He has also served as an associate editor of the IEEE
Transactions on Computers. His research areas include logic design
and switching theory, representations of logic functions, and multiple-
valued logic. He has published more than nine books on logic design,
including Logic Synthesis and Optimization, Representation of Discrete
Functions, Switching Theory for Logic Synthesis, and Logic Synthesis
and Verification (Kluwer Academic, 1993, 1996, 1999, and 2001,
respectively). He received the National Institute of Water and Atmo-
spheric Research (NIWA) Memorial Award in 1979, the Distinctive
Contribution Awards from the IEEE Computer Society Technical
Committee on Multiple-Valued Logic (MVL-TC) for papers presented
at ISMVL in 1986, 1996, 2003, and 2004, and the Takeda Techno-
Entrepreneurship Award in 2001. He is a fellow of the IEEE and a
member of the IEEE Computer Society.
Shinobu Nagayama (S’02-M’04) received the
BS and ME degrees from Meiji University,
Kanagawa, Japan, in 2000 and 2002, respec-
tively, and the PhD degree in computer science
from the Kyushu Institute of Technology, Iizuka,
Japan, in 2004. He is now a research associate
at Hiroshima City University, Hiroshima, Japan.
He received the Outstanding Contribution Paper
Award from the IEEE Computer Society Tech-
nical Committee on Multiple-Valued Logic (MVL-
TC) in 2005 for a paper presented at the 34th IEEE International
Symposium on Multiple-Valued Logic (ISMVL ’04) and the Excellent
Paper Award from the Information Processing Society of Japan (IPS) in
2006. His research interests include numerical function generators,
decision diagrams, software synthesis, and embedded systems. He is a
member of the IEEE and the IEEE Computer Society.
Jon T. Butler (S’67-M’67-SM’82-F’89) received
the BEE and MEngr degrees from Rensselaer
Polytechnic Institute, Troy, New York, in 1966
and 1967, respectively, and the PhD degree
from Ohio State University, Columbus, in 1973.
Since 1987, he has been a professor at the
Naval Postgraduate School, Monterey, Califor-
nia. From 1974 to 1987, he was with North-
western University, Evanston, Illinois. During
that time, he served two periods of leave at the
Naval Postgraduate School: first as a National Research Council senior
postdoctoral associate (1980-1981) and second as the NAVALEX chair
professor (1985-1987). He served one period of leave as a foreign
visiting professor at the Kyushu Institute of Technology, Iizuka, Japan.
His research interests include logic optimization and multiple-valued
logic. He has served on the editorial boards of the IEEE Transactions on
Computers, Computer, and IEEE Computer Society Press. He served
as the editor-in-chief of Computer and the IEEE Computer Society
Press. He received the Award of Excellence, the Outstanding
Contributed Paper Award, and a Distinctive Contributed Paper Award
for papers presented at the International Symposium on Multiple-Valued
Logic (ISMVL). He received the Distinguished Service Award, two
Meritorious Awards, and nine Certificates of Appreciation for service to
the IEEE Computer Society. He is a fellow of the IEEE and a member of
the IEEE Computer Society.
. For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.
838 IEEE TRANSACTIONS ON COMPUTERS, VOL. 56, NO. 6, JUNE 2007
