Programmable numerical function generators: Architectures and synthesis system by Sasao, Tsutomu et al.
Calhoun: The NPS Institutional Archive
Faculty and Researcher Publications Faculty and Researcher Publications
2005-08
Programmable numerical function
generators: Architectures and synthesis system
Sasao, Tsutomu
þÿ T .   S a s a o ,   S .   N a g a y a m a ,   a n d   J .   T .   B u t l e r ,    P r o g r a m m a b l e   n u m e r i c a l   f u n c t i o n   g e n e r a t o r s :
Architectures and synthesis system," FPL2005 ,Tampere, Aug.24-26, 2005, pp.118-123.
http://hdl.handle.net/10945/35861






, Jon T. Butler

 
Dept. of CSE, Kyushu Institute of Technology, Japan

Dept. of CE, Hiroshima City University, Japan

Dept. of ECE, Naval Postgraduate School, U.S.A.
ABSTRACT
This paper presents an architecture and a synthesis method
for programmable numerical function generators of trigono-
metric functions, logarithm functions, square root, recipro-
cal, etc. Our architecture uses an LUT (Look-Up Table) cas-
cade as the segment index encoder, compactly realizes var-
ious numerical functions, and is suitable for automatic syn-
thesis. We have developed a synthesis system that converts
MATLAB-like specification into HDL code. We propose
and compare three architectures implemented as a FPGA
(Field-Programmable Gate Array). Experimental results show
the efficiency of our architecture and synthesis system.
1. INTRODUCTION
Numerical functions, such as trigonometric functions, log-
arithm, square root, reciprocal, etc, are extensively used in
computer graphics, digital signal processing, communica-
tion systems, robotics, astrophysics, fluid physics, etc. High-
level programming languages, such as C and FORTRAN,
usually have software libraries for standard numerical func-
tions. However, for high-speed applications, a hardware
implementation is needed. Hardware implementation by a
single look-up table for a numerical function
	
is sim-





has a small number of bits, this implementation is
straightforward. For high-precision computations, however,
the single look-up table implementation is impractical due
to the huge table size. For such applications, the CORDIC
(COordinate Rotation DIgital Computer) algorithm [1, 16]
has been used. Although faster than software approaches, it
is iterative and therefore slow.
This paper proposes an architecture and a synthesis for
NFGs (Numerical Function Generators) using linear approx-
imations. By using the LUT cascade [7, 11], our architecture
can realize various numerical functions quickly, and is suit-
able for the automatic synthesis. Fig. 1 shows the synthe-
sis flow for the NFG. It generates HDL (Hardware Descrip-
tion Language) code from the design specification described
by Scilab [14], a MATLAB-like numerical calculation soft-




, a domain  
 for the

, and an acceptable error.
This system first partitions the given domain for

into seg-




linear function for each segment. Next, it analyzes the er-
ror, and derives the necessary precision for computing units
in the NFG. Then, it generates HDL code that maps into an
FPGA, using an FPGA vendor tool.
This paper is organized as follows. Section 2 introduces
terminology. Section 3 proposes a linear approximation al-
gorithm for numerical functions. Section 4 shows three dif-
ferent architectures for NFGs. Section 5 describes the FPGA
implementation method. Section 6 evaluates the performance
of our architecture and synthesis system. Due to the page
limitation, the error analysis for our NFGs is omitted. It is
available in [13]. This paper builds on [12].
2. PRELIMINARIES
Definition 2.1 The binary fixed-point representation of a
value  has the form
 ffflfiffi ff
! " # $%fiffi'& ()*fi+)




#687 , 9 :;9*< is the number of bits for the integer
part, and 9

5>= is the number of bits for the fractional part















Definition 2.2 Error is the absolute difference between the
original value and the approximated value. Approximation
error is the error caused by a function approximation, and
rounding error is the error caused by a binary fixed-point
representation. Acceptable error is the maximum error that
an NFG may assume. Acceptable approximation error is
the maximum approximation error that a function approxi-
mation may assume.
Definition 2.3 Precision is the total number of bits for a
binary fixed-point representation. Specially, 9 -bit precision





>=L?M9 . An 9 -bit precision NFG has an
9 -bit input.
Definition 2.4 Accuracy is the number of bits in the frac-
tional part of a binary fixed-point representation. Specially,
N
-bit accuracy specifies that N bits are used to represent
the fractional part of the number; that is, 9  5>=O? N . An
N
-bit accuracy NFG is an NFG with N -bit fractional part



















Fig. 1. Synthesis flow for NFGs.
3. LINEAR APPROXIMATION ALGORITHM








, the linear approximation method yields
small approximation error with relatively few segments. In-
deed, in such cases uniformly wide segments yield good ap-
proximations. Uniform segments have been used in previ-
ous studies [2, 5, 15] to simplify the circuits. However, for





form segmentation requires too many segments. To approx-
imate such functions using fewer segments, a partitioning
method of the domain into non-uniform segments is pro-
posed [8]. Unfortunately, their segmentation is fixed; it is
not optimized for the given function. We improve on this by
adapting the segmentation so that relatively few segments
are needed. This reduces the memory required.
3.1. Segmentation Algorithm
Fig. 2 presents the segmentation algorithm, where the inputs
are a numerical function


, a domain   J for

, and an
acceptable approximation error = . This algorithm begins by
forming one segment over the whole domain  
J . This














. If the current
segment fails to provide the acceptable approximation error,
it is partitioned into two segments joined at a point  where
the maximum error occurs. This process iterates until the
two subsegments approximate
	
to within the acceptable
approximation error. The correction values 

are used to re-





denote the maximum positive error and the maximum nega-
tive error, respectively. These errors are equalized by verti-









can be found by scanning values of

over  '" . How-
ever, it is time-consuming. We use a nonlinear programming
algorithm [6] to find these values efficiently.
The algorithm is based on the Douglas-Peucker algo-
rithm [4] that is used in rendering curves for graphics dis-
plays.
3.2. Computation of Approximated Values




 is denoted by 

; thus, the segments






 " # 

ffflfi
. For each 

, the numerical function
	














































































































with the correction value






able approximation error using sufficiently many segments.
4. ARCHITECTURE FOR NFGS
4.1. Overview
Although Equation (2) and Equation (3) represent the same
values, the architectures for the NFGs realizing them are dif-
ferent. Fig. 3 (a) shows the architecture for Equation (2); it
uses four units: the segment index encoder that computes the









; the multiplier; and the adder. On
the other hand, Fig. 3 (b) shows the architecture for Equa-
tion (3); it uses five units: the four units used in Fig. 3(a),
where A&

is stored in the coefficients table, and an addi-




















index : of the segment 















has the 9 -bit precision. Therefore, in
this case, the linear approximations are realized using only











; the multiplier; and the adder. Note that
this architecture realizes a uniform segmentation.
We use the architecture shown in Fig. 3 (b) to produce
fast and compact NFGs. In Section 6, we will compare the
performances of three different architectures.
4.2. Segment Index Encoder
A segment index encoder converts an input value

into a
segment index : for 










 # " 
 <
A 687 shown in Fig. 4
(a), where  has 9 -bit precision, 3 ? 3(4 #687 , and < denotes
the number of segments. In [8], to simplify the segment in-




are restricted. That is,
the restrictive non-uniform segmentation is used for the seg-
ment index encoder. Such segmentation increases the num-
ber of segments and is unsuitable for automatic segmenta-




, Domain  
J for

, Acceptable approximation error = .



















 # " 
 
ff*fi
Process: This is recursive procedure. Initial segment is set to  
J .

















































in   # ,










, where N 
 .	 4
.































 , and let L?ffi
P 
otherwise.




















6. If (55  = , then declare  '" to be a completed segment. If all segments are completed, stop.
7. For any segment  '# that is not completed, partition   # into two segments    - and  $# , and iterate the
same process for each new segment recursively.




































































Fig. 3. Three architectures for NFGs.
shown in Fig. 4 (b) to realize any  0/ 9fl= 
 . It can be
designed by functional decomposition using BDDs (Binary
Decision Diagrams) representing  0/ 9fl= 	 . That is, our
synthesis system uses the nonrestrictive segmentation. This
is suitable for automatic synthesis. In LUT cascades, the in-
terconnecting lines between adjacent LUTs are called rails.
The size of an LUT cascade depends on the number of rails.
Thus, to produce a compact LUT cascade, a small number
of rails is sought. The next theorem shows that the segment
index functions are realized by compact LUT cascades.
Theorem 4.1 [12] Let  0/ 9fl= 
 be a segment index func-






with at most    < rails.
Our synthesis system uses heterogeneous MDDs (Multi-valued
Decision Diagrams) [10] to find compact LUT cascades.
Since the LUT cascade is suitable for pipeline processing, it
offers a fast and compact circuit. Experimental results will
show that LUT cascades have sizes comparable for the seg-
ment index encoder using uniform segmentation for certain
functions, like trigonometric functions, and much smaller

















(a) Segment index function.
LUT LUT LUT
(b) LUT cascade.
Fig. 4. Segment index encoder.
5. IMPLEMENTATION WITH FPGAS
Modern FPGAs consist of logic elements, synchronous mem-
ory blocks, multipliers (DSP units), etc. Our synthesis sys-
tem efficiently generates NFGs using these components. Each
unit for the NFG shown in Fig. 3 (b) is implemented by the
following components in an FPGA: 1) Segment index en-
coder (LUT cascade) and coefficients table (ROM): by syn-
chronous memory blocks; 2) Multiplier: by DSP units; and
3) Adder: by logic elements. Our synthesis system derives
the optimum bit-width for each component by automatic er-
ror analysis [13].
3
Table 1. Number of pipeline stages for NFGs.
Name of units Pipeline stages
1. LUT cascade 9 =K 














5. Shifter (optional) 0 or 1
6. Two’s complementer (optional) 0 or 1













9 =K  : Number of LUTs for LUT cascade.
5.1. Size Reduction of Multiplier
Although modern FPGAs have dedicated multipliers, large
multipliers are slow. In our architecture, the multiplier of-
ten has the longest delay time among all the units. Thus, to
generate a fast NFG, reducing the size of the multiplier is
important. Since the size of multiplier depends on the num-
ber of bits for =
fi 
, we reduce the number of bits for =
fi 
.
First, we consider the case where the absolute value of
=
fi 




 is large, many bits are required to
represent =
fi 
in binary fixed-point. To reduce the number
of bits for such =
fi 


















Instead of the original value of =
fi 































the left. The increase of 


reduces the number of bits to rep-






, while increasing the rounding






within the acceptable error [13]. When





for all the segments 

, no







mented with the multiplier.
Next, we consider the case where the range of =
fi 
in-
cludes negative values. In this case, our synthesis system
stores the absolute value of =
fi 
and the sign bit for =
fi 
sep-
arately in the coefficients table, and first uses the unsigned









, and then a two’s comple-
menter to produce the signed value with the sign bit. When
=
fi 
is positive for all segments 

, no two’s complementer







with an unsigned multiplier.
For simplicity, Fig. 3 omits the schemes for the scaling
method and the two’s complementer.
5.2. Pipeline Processing
To implement a high-throughput NFG, our synthesis system
inserts pipeline registers between all the units in the archi-
tecture. Since all the units operate in parallel, and each unit
has a short delay time, our NFGs achieves high throughput.
Table 1 shows the units and the number of pipeline stages
for them. Our NFGs may have from 9 =K 
F 
to 9 =K 
F
pipeline stages, where 9 =K  is the number of LUTs for the
LUT cascade.
Table 3. Numbers of segments for non-uniform and uniform
segmentations.
Acceptable approximation error : C
*fi



























































6.1. Computation Time for Segmentation Algorithm
Table 2 shows the CPU time for the segmentation algorithm
applied to 12 of the 14 functions used in [12] with various
acceptable approximation errors. In this table, the Sigmoid













The segmentation algorithm is recursive, and computa-
tion time depends on the number of segments. Smaller ac-
ceptable approximation error requires more segments and
longer computation time. However, Table 2 shows that for
all the functions in the table, the CPU times were smaller




. These results show that our segmentation algorithm
generates non-uniform segments quickly.
6.2. Comparison of Three Architectures
This section compares the three architectures for NFGs shown
in Fig. 3. Let Arc A, Arc B, and Arc C denote the archi-
tectures shown in Fig. 3 (a), (b), and (c), respectively. To
compare these architectures for various functions, we imple-
mented 6

-bit precision NFGs on the same FPGA (Altera
Stratix EP1S10F484C5), using an the acceptable approxi-
mation error of C
*fi
for each function.
Table 3 compares the numbers of segments for the non-
uniform and the uniform segmentations. It shows that, for
all the functions, the number of non-uniform segments is
less than the half that of uniform segments. Since Arc A
and Arc B use non-uniform segmentation, they implement
various numerical functions with small coefficients tables.
On the other hand, Arc C uses uniform segmentation. Thus,
although Arc C implements functions, such as trigonomet-
ric functions, with a relatively small coefficients table, it re-
quires a large coefficients table for other functions. In this
experiment, there were not enough memory blocks in the
FPGA (EP1S10F484C5) to implement the non-trigonometric
functions using Arc C.
Tables 4 and 5 compare the amount of hardware and per-
formances for the three architectures. These tables show
that, for trigonometric functions, Arc C implements the short-
est latency and most compact NFGs among the three, since
Arc C requires no segment index encoder. Therefore, when
the number of uniform segments is relatively small, Arc C is
4
Table 2. CPU time [msec] for the segmentation algorithm.





















































































 7 0.1 112 10 1787 50





C5 2 0.1 32 10 512 10
AAE: Acceptable Approximation Error. #seg.: Number of segments.
Experiment environment
CPU: Pentium4 Xeon 2.8GHz memory : 4GB
OS: Redhat (Linux 7.3) C compiler : gcc -O2
Table 4. Amount of hardware for three architectures.
Precision, Accuracy: 6

-bit precision, 6  -bit accuracy
FPGA device: Altera Stratix (EP1S10F484C5)
Logic synthesis tool: Altera QuartusII 4.1 (default option)
Function Arc A Arc B Arc C































182 159826 8 183 160861 2 145 557119 2








226 164944 8 230 164957 2 206 1114112 0
The domains of functions


are the same as Table 3.
#LEs: Number of logic elements. Memory : Memory size [bit].
#DSPs: Number of 9-bit
)
9-bit DSP units.
smaller and faster than Arc A and Arc B. However, Arc C
cannot implement the square root or reciprocal functions us-
ing the FPGA due to the excessive size of the coefficients
tables. In Table 5, for 

and 
 A  


, Arc C used large
single look-up tables. From these results, we can see that
Arc C is suitable only for trigonometric functions, and is
unsuitable for square root, reciprocal, etc.
Arc B implements various functions with fewer DSP units
than Arc A, because Arc B requires a smaller multiplier than
Arc A. Note that the FPGA synthesis system uses more DSP
units for a multiplier with more bits. Thus, Arc B offers
a fast and compact implementation. In Arc A, for all the
functions except for 

, the multiplier has the longest de-
lay time among all units. On the other hand, in Arc B, for
all the functions, the multiplier is not the slowest unit, and
the coefficients table or the adder has the longest delay time
among all units. For 

, Arc A that has a smaller coeffi-
cients table was faster than Arc B. From these results, we
can conclude: 1) To implement a fast NFG with an FPGA,
the size reduction of multiplier size is important. 2) Arc B
is the most efficient for various numerical functions among
three architectures.
6.3. Comparison with an Existing Method
To show the efficiency of our automatic synthesis system,
we compare our NFGs with ones reported in [8]. NFGs in
[8] are also based on non-uniform segmentation, while they
are designed by hand. We generated the NFGs with the same
precision as [8]. Table 6 shows that our NFGs have compa-
rable performances to [8]. Our system generated C   -bit pre-
cision NFGs with the operating frequency of more than 6(C 
MHz for some functions in [12]. Due to the page limitation,
the results are omitted.
7. CONCLUSION AND COMMENTS
We have proposed an architecture and a synthesis method
for programmable NFGs for trigonometric functions, log-
arithm functions, square root, reciprocal, etc. Our archi-
tecture using an LUT cascade compactly realizes various
numerical functions, and is suitable for automatic synthe-
sis. Experimental results show that: 1) Our architecture effi-
ciently implements NFGs for wide range of numerical func-
tions; and 2) Our synthesis system generates the NFGs with
comparable performance to those designed by hand.
Currently, we are working for the NFGs using the quadratic
approximation algorithm to reduce the memory size.
5
Table 5. Comparison of performances for three architectures.
Precision, Accuracy: 6

-bit precision, 6  -bit accuracy
FPGA device: Altera Stratix (EP1S10F484C5)
Logic synthesis tool: Altera QuartusII 4.1 (default option)
Function Arc A Arc B Arc C






























124 9 73 178 10 56 – 4 –







125 9 72 176 10 57 – 1 –
The domains of functions
	
are the same as Table 3.
“–” shows that the function could not be implemented.
Freq.: Operating frequency [MHz]. #stages: Number of pipeline stages.
Latency: [nsec].
Table 6. Performance comparison with existing method.
FPGA device: Xilinx Virtex-II (XC2V4000-6)
Logic synthesis tool: Xilinx ISE 6.3 (default option)
Function Domain In prec. Out prec. Our method Method in [8]
	






























 0 16 1 8 164 11 67 133 14 105
In prec.: Precision of input. Out prec.: Precision of output.




This research is partly supported by the Grant in Aid for Sci-
entific Research of the Japan Society for the Promotion of
Science (JSPS), funds from Ministry of Education, Culture,
Sports, Science, and Technology (MEXT) via Kitakyushu
innovative cluster project, and NSA Contract RM A-54. We
appreciate Dr. Marc D. Riedel for the work of [12].
8. REFERENCES
[1] R. Andrata, “A survey of CORDIC algorithms for FPGA based com-
puters,” Proc. of the 1998 ACM/SIGDA Sixth Inter. Symp. on Field
Programmable Gate Array (FPGA’98), pp. 191–200, Monterey, CA,
Feb. 1998.
[2] J. Cao, B. W. Y. Wei, and J. Cheng, “High-performance architectures
for elementary function generation,” Proc. of the 15th IEEE Symp.
on Computer Arithmetic (ARITH’01), Vail, Co, pp. 136–144, June
2001.
[3] N. Doi, T. Horiyama, M. Nakanishi, and S. Kimura, “An optimiza-
tion method in floating-point to fixed-point conversion using posi-
tive and negative error analysis and sharing of operations,” Proc. the
12th workshop on Synthesis And System Integration of Mixed Infor-
mation technologies (SASIMI’04), Kanazawa, Japan, pp. 466–471,
Oct. 2004.
[4] D. H. Douglas and T. K. Peucker, “Algorithms for the reduction of
the number of points required to represent a line or its caricature,”
The Canadian Cartographer, Vol. 10, No. 2, pp. 112–122, 1973.
[5] H. Hassler and N. Takagi, “Function evaluation by table look-up and
addition,” Proc. of the 12th IEEE Symp. on Computer Arithmetic
(ARITH’95), Bath, England, pp. 10–16, July 1995.
[6] T. Ibaraki and M. Fukushima, FORTRAN 77 Optimization Program-
ming, Iwanami, 1991 (in Japanese).
[7] Y. Iguchi, T. Sasao, and M. Matsuura, “Realization of multiple-
output functions by reconfigurable cascades,” International Con-
ference on Computer Design: VLSI in Computers and Processors
(ICCD’01), Austin, TX, pp. 388–393, Sept. 23–26, 2001.
[8] D.-U. Lee, W. Luk, J. Villasenor, and P. Y.K. Cheung, “Non-uniform
segmentation for hardware function evaluation,” Proc. Inter. Conf.
on Field Programmable Logic and Applications, pp. 796–807, Lis-
bon, Portugal, Sept. 2003.
[9] J.-M. Muller, Elementary function: Algorithms and implementation,
Birkhauser Boston, Inc., Secaucus, NJ, 1997.
[10] S. Nagayama and T. Sasao, “Compact representations of logic func-
tions using heterogeneous MDDs,” IEICE Trans. on fundamentals,
Vol. E86-A, No. 12, pp. 3168–3175, Dec. 2003.
[11] T. Sasao and M. Matsuura, “A method to decompose multiple-output
logic functions,” 41st Design Automation Conference, San Diego,
CA, pp. 428–433, June 2–6, 2004.
[12] T. Sasao, J. T. Butler, and M. D. Riedel, “Application of LUT cas-
cades to numerical function generators,” Proc. the 12th workshop on
Synthesis And System Integration of Mixed Information technologies
(SASIMI’04), Kanazawa, Japan, pp. 422–429, Oct. 2004.
[13] T. Sasao, S. Nagayama, and J. T. Butler, “Error analysis
for programmable numerical function generators,” http://www.lsi-
cad.com/Error-NFG/.
[14] Scilab 3.0, INRIA-ENPC, France, http://scilabsoft.inria.fr/
[15] M. J. Schulte and J. E. Stine, “Approximating elementary functions
with symmetric bipartite tables,” IEEE Trans. on Comp., Vol. 48,
No. 8, pp. 842–847, Aug. 1999.
[16] J. E. Volder, “The CORDIC trigonometric computing technique,”
IRE Trans. Electronic Comput., Vol. EC-820, No. 3, pp. 330–334,
Sept. 1959.
6
