This paper presents an architecture and a synthesis method for programmable numerical function generators of trigonometric functions, logarithm functions, square root, reciprocal, etc. Our architecture uses an LUT (Look-Up Table) cascade as the segment index encoder, compactly realizes various numerical functions, and is suitable for automatic synthesis. We have developed a synthesis system that converts MATLAB-like specification into HDL code. We propose and compare three architectures implemented as a FPGA (Field-Programmable Gate Array). Experimental results show the efficiency of our architecture and synthesis system.
INTRODUCTION
Numerical functions, such as trigonometric functions, logarithm, square root, reciprocal, etc, are extensively used in computer graphics, digital signal processing, communication systems, robotics, astrophysics, fluid physics, etc. Highlevel programming languages, such as C and FORTRAN, usually have software libraries for standard numerical functions. However, for high-speed applications, a hardware implementation is needed. Hardware implementation by a single look-up table for a numerical function is simple and fast. For low-precision computations of , i.e., when has a small number of bits, this implementation is straightforward. For high-precision computations, however, the single look-up table implementation is impractical due to the huge table size. For such applications, the CORDIC (COordinate Rotation DIgital Computer) algorithm [1, 16] has been used. Although faster than software approaches, it is iterative and therefore slow. This paper proposes an architecture and a synthesis for NFGs (Numerical Function Generators) using linear approximations. By using the LUT cascade [7, 11] , our architecture can realize various numerical functions quickly, and is suitable for the automatic synthesis. Fig. 1 shows the synthesis flow for the NFG. It generates HDL (Hardware Description Language) code from the design specification described by Scilab [14] , a MATLAB-like numerical calculation software. The design specification includes a numerical function , a domain for the , and an acceptable error. This system first partitions the given domain for into segments, and then approximates the given function by a linear function for each segment. Next, it analyzes the error, and derives the necessary precision for computing units in the NFG. Then, it generates HDL code that maps into an FPGA, using an FPGA vendor tool. This paper is organized as follows. Section 2 introduces terminology. Section 3 proposes a linear approximation algorithm for numerical functions. Section 4 shows three different architectures for NFGs. Section 5 describes the FPGA implementation method. Section 6 evaluates the performance of our architecture and synthesis system. Due to the page limitation, the error analysis for our NFGs is omitted. It is available in [13] . This paper builds on [12] . 
PRELIMINARIES

LINEAR APPROXIMATION ALGORITHM
For functions that are approximately linear, such as , the linear approximation method yields small approximation error with relatively few segments. Indeed, in such cases uniformly wide segments yield good approximations. Uniform segments have been used in previous studies [2, 5, 15] to simplify the circuits. However, for some kinds of numerical functions such as , uniform segmentation requires too many segments. To approximate such functions using fewer segments, a partitioning method of the domain into non-uniform segments is proposed [8] . Unfortunately, their segmentation is fixed; it is not optimized for the given function. We improve on this by adapting the segmentation so that relatively few segments are needed. This reduces the memory required. Fig. 2 presents the segmentation algorithm, where the inputs are a numerical function , a domain for , and an acceptable approximation error . This algorithm begins by forming one segment over the whole domain . This is an initial piecewise approximation by a linear function whose endpoints are and . If the current segment fails to provide the acceptable approximation error, it is partitioned into two segments joined at a point where the maximum error occurs. This process iterates until the two subsegments approximate to within the acceptable approximation error. The correction values are used to reduce the approximation errors. In Fig. 2 , and denote the maximum positive error and the maximum negative error, respectively. These errors are equalized by vertical shift of linear function with . In Fig. 2 , and can be found by scanning values of over . However, it is time-consuming. We use a nonlinear programming algorithm [6] to find these values efficiently.
Segmentation Algorithm
The algorithm is based on the Douglas-Peucker algorithm [4] that is used in rendering curves for graphics displays.
Computation of Approximated Values
A segment is denoted by ; thus, the segments generated by the segmentation algorithm are denoted by , . For each , the numerical function is approximated by the corresponding linear function . Therefore, the approximated value of is computed as follows: (2) where is the linear function for the segment ,
, and
By substituting into Equation (2), and simplifying it, we have (3) Let and . Then, we have . By substituting this equation into Equation (3), we have . This is the first-order Taylor expansion around for with the correction value . Our algorithm can approximate with any acceptable approximation error using sufficiently many segments.
ARCHITECTURE FOR NFGS
Overview
Although Equation (2) and Equation (3) represent the same values, the architectures for the NFGs realizing them are different. Fig. 3 (a) shows the architecture for Equation (2); it uses four units: the segment index encoder that computes the index for segment including the value ; the coeffi- cients table for and ; the multiplier; and the adder. On the other hand, Fig. 3 (b) shows the architecture for Equation (3); it uses five units: the four units used in Fig. 3(a) , where is stored in the coefficients table, and an additional adder for computation of . In Equation (3), when the most significant bits of , the index of the segment is equal to the most significant bits, and is equal to the least significant bits of , where has the -bit precision. Therefore, in this case, the linear approximations are realized using only three units as shown in Fig. 3 (c) : the coefficients table for and ; the multiplier; and the adder. Note that this architecture realizes a uniform segmentation.
We use the architecture shown in Fig. 3 (b) to produce fast and compact NFGs. In Section 6, we will compare the performances of three different architectures.
Segment Index Encoder
A segment index encoder converts an input value into a segment index for . It realizes the segment index function shown in Fig. 4  (a) , where has -bit precision, , and denotes the number of segments. In [8] , to simplify the segment index encoder, the values of and are restricted. That is, the restrictive non-uniform segmentation is used for the segment index encoder. Such segmentation increases the number of segments and is unsuitable for automatic segmentation. Our synthesis system uses the LUT cascade [7, 11, 12] For any segment that is not completed, partition into two segments and , and iterate the same process for each new segment recursively. shown in Fig. 4 (b) to realize any . It can be designed by functional decomposition using BDDs (Binary Decision Diagrams) representing . That is, our synthesis system uses the nonrestrictive segmentation. This is suitable for automatic synthesis. In LUT cascades, the interconnecting lines between adjacent LUTs are called rails. The size of an LUT cascade depends on the number of rails. Thus, to produce a compact LUT cascade, a small number of rails is sought. The next theorem shows that the segment index functions are realized by compact LUT cascades. 
IMPLEMENTATION WITH FPGAS
Modern FPGAs consist of logic elements, synchronous memory blocks, multipliers (DSP units), etc. Our synthesis system efficiently generates NFGs using these components. Each unit for the NFG shown in Fig. 3 (b) is implemented by the following components in an FPGA: 1) Segment index encoder (LUT cascade) and coefficients table (ROM): by synchronous memory blocks; 2) Multiplier: by DSP units; and 3) Adder: by logic elements. Our synthesis system derives the optimum bit-width for each component by automatic error analysis [13] . 
Size Reduction of Multiplier
Although modern FPGAs have dedicated multipliers, large multipliers are slow. In our architecture, the multiplier often has the longest delay time among all the units. Thus, to generate a fast NFG, reducing the size of the multiplier is important. Since the size of multiplier depends on the number of bits for , we reduce the number of bits for .
First, we consider the case where the absolute value of is large. When is large, many bits are required to represent in binary fixed-point. To reduce the number of bits for such , we use the scaling method shown in [8] .
When is large, we represent as . Instead of the original value of , we store the values of and in the coefficients table. In this case, the product is computed using the multiplier for and the shifter for an -bit shift to the left. The increase of reduces the number of bits to represent the value of , while increasing the rounding error. Our synthesis system finds the optimum value of for each segment within the acceptable error [13] . When the optimum values of are for all the segments , no shifter is implemented, that is, is directly implemented with the multiplier.
Next, we consider the case where the range of includes negative values. In this case, our synthesis system stores the absolute value of and the sign bit for separately in the coefficients table, and first uses the unsigned multiplier to compute , and then a two's complementer to produce the signed value with the sign bit. When is positive for all segments , no two's complementer is implemented. That is, is directly implemented with an unsigned multiplier.
For simplicity, Fig. 3 omits the schemes for the scaling method and the two's complementer.
Pipeline Processing
To implement a high-throughput NFG, our synthesis system inserts pipeline registers between all the units in the architecture. Since all the units operate in parallel, and each unit has a short delay time, our NFGs achieves high throughput. Table 1 shows the units and the number of pipeline stages for them. Our NFGs may have from to pipeline stages, where is the number of LUTs for the LUT cascade. 6. EXPERIMENTAL RESULTS Table 2 shows the CPU time for the segmentation algorithm applied to 12 of the 14 functions used in [12] with various acceptable approximation errors. In this table, the Sigmoid and the Gaussian are defined as follows:
Computation Time for Segmentation Algorithm
Sigmoid Gaussian
The segmentation algorithm is recursive, and computation time depends on the number of segments. Smaller acceptable approximation error requires more segments and longer computation time. However, Table 2 shows that for all the functions in the table, the CPU times were smaller than seconds when the acceptable approximation error was . These results show that our segmentation algorithm generates non-uniform segments quickly.
Comparison of Three Architectures
This section compares the three architectures for NFGs shown in Fig. 3 . Let Arc A, Arc B, and Arc C denote the architectures shown in Fig. 3 (a), (b) , and (c), respectively. To compare these architectures for various functions, we implemented -bit precision NFGs on the same FPGA (Altera Stratix EP1S10F484C5), using an the acceptable approximation error of for each function. Table 3 compares the numbers of segments for the nonuniform and the uniform segmentations. It shows that, for all the functions, the number of non-uniform segments is less than the half that of uniform segments. Since Arc A and Arc B use non-uniform segmentation, they implement various numerical functions with small coefficients tables. On the other hand, Arc C uses uniform segmentation. Thus, although Arc C implements functions, such as trigonometric functions, with a relatively small coefficients table, it requires a large coefficients table for other functions. In this experiment, there were not enough memory blocks in the FPGA (EP1S10F484C5) to implement the non-trigonometric functions using Arc C. Tables 4 and 5 compare the amount of hardware and performances for the three architectures. These tables show that, for trigonometric functions, Arc C implements the shortest latency and most compact NFGs among the three, since Arc C requires no segment index encoder. Therefore, when the number of uniform segments is relatively small, Arc C is 8  107  20061  2  82  14848  2  136  19543  8  116  20169  2  67  15417  2  106  19355  8  116  20039  2  83  29696  2  153  172102  8  172  172119  2  112  278594  2  182  159826  8  183  160861  2  145  557119  2  191  43610  2  175  44359  2  195 1048576  0  226  164944  8  230  164957  2  206 1114112  0  The domains of functions  are the same as Table 3 . #LEs: Number of logic elements.
Memory : Memory size [bit]. #DSPs: Number of 9-bit 9-bit DSP units.
smaller and faster than Arc A and Arc B. However, Arc C cannot implement the square root or reciprocal functions using the FPGA due to the excessive size of the coefficients tables. In Table 5, for and , Arc C used large single look-up tables. From these results, we can see that Arc C is suitable only for trigonometric functions, and is unsuitable for square root, reciprocal, etc.
Arc B implements various functions with fewer DSP units than Arc A, because Arc B requires a smaller multiplier than Arc A. Note that the FPGA synthesis system uses more DSP units for a multiplier with more bits. Thus, Arc B offers a fast and compact implementation. In Arc A, for all the functions except for , the multiplier has the longest delay time among all units. On the other hand, in Arc B, for all the functions, the multiplier is not the slowest unit, and the coefficients table or the adder has the longest delay time among all units. For , Arc A that has a smaller coefficients table was faster than Arc B. From these results, we can conclude: 1) To implement a fast NFG with an FPGA, the size reduction of multiplier size is important. 2) Arc B is the most efficient for various numerical functions among three architectures.
Comparison with an Existing Method
To show the efficiency of our automatic synthesis system, we compare our NFGs with ones reported in [8] . NFGs in [8] are also based on non-uniform segmentation, while they are designed by hand. We generated the NFGs with the same precision as [8] . Table 6 shows that our NFGs have comparable performances to [8] . Our system generated -bit precision NFGs with the operating frequency of more than MHz for some functions in [12] . Due to the page limitation, the results are omitted.
CONCLUSION AND COMMENTS
We have proposed an architecture and a synthesis method for programmable NFGs for trigonometric functions, logarithm functions, square root, reciprocal, etc. Our architecture using an LUT cascade compactly realizes various numerical functions, and is suitable for automatic synthesis. Experimental results show that: 1) Our architecture efficiently implements NFGs for wide range of numerical functions; and 2) Our synthesis system generates the NFGs with comparable performance to those designed by hand.
Currently, we are working for the NFGs using the quadratic approximation algorithm to reduce the memory size. 
