We show the architecture and design of a numeric function generator that realizes, at high speed, arithmetic functions such as log x, sin x, and 1/x. The approach is general; different circuits are not needed for different functions. Further, composite functions, such as log(sin(1/x)), can be realized as easily as individual functions. A tutorial description of the method is presented, followed by a discussion of the design considerations involved. For example, we discuss how circuit complexity increases as the desired approximation error decreases. We also discuss enhancements of the basic numeric function generator approach, including higher-order polynomial approximations, floating-point representations, and multi-variable implementations.
INTRODUCTION
The realization of arithmetic functions like sin x, log x, and 1/x with high speed and accuracy has been an important problem since the beginning of computers. More than 150 years ago, Babbage devised a mechanical computer, his difference engine, for computing tables of logarithms and trigonometric functions. Although he never completed his machine, one was completed at the Science Museum in London, U.K., in 1991 using his plans. A second machine was completed and was on display at the Computer History Museum in Mountain View, CA [3]. In the time of Babbage, the critical application was navigation. It has been suggested that sailors lost their lives due to errors in the tables used for navigation [7], which, at that time, depended on human calculation.
Fifty years ago, Volder [16] introduced the CORDIC algorithm for computing logarithmic and trigonometric functions. In this iterative algorithm, successively more accurate bits are computed until the desired accuracy is achieved [1] . The advantage of CORDIC is the relatively modest amount of hardware needed [1] . Indeed, it has been used in hand calculators, beginning in 1972 with Hewlett-Packard's HP-35 [2] . The CORDIC algorithm was also used in Intel's 8087 numeric co-processor [13] .
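To make the iteration concrete, the following is a minimal sketch of rotation-mode CORDIC for sine and cosine. It is written in floating-point Python rather than in the fixed-point shift-and-add form used in hardware. The function name `cordic_sin_cos`, the default iteration count, and the input range |theta| ≤ π/2 are assumptions chosen for illustration; they are not taken from [16] or from any particular product.

```python
import math

def cordic_sin_cos(theta, iterations=32):
    """Rotation-mode CORDIC sketch for |theta| <= pi/2; returns (sin, cos)."""
    # Elementary rotation angles atan(2^-i); a hardware design would store these in a small table.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector by sqrt(1 + 2^-2i); pre-divide by the total gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0          # rotate toward zero residual angle
        # Multiplying by 2^-i is a right shift in fixed-point hardware.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x

sin_half, cos_half = cordic_sin_cos(0.5)
```

Each pass through the loop adds roughly one bit of accuracy, which is why a pipelined implementation needs about as many stages as result bits.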
By some measures, the CORDIC algorithm is still fast. It may be implemented as a pipeline, in which each stage quickly computes one bit of the result. Typically, however, the latency (the number of clocks needed to compute the entire result) is large because successively more accurate bits must be computed. If the system in which a CORDIC computation is embedded is itself a pipeline, this may be acceptable. In a hand calculator, computation speed need not be high because of the much slower speed at which a human can input digits.
Thus, CORDIC achieves high throughput, but has high latency. To achieve both low latency and high throughput, one can use a simple memory, as shown in Fig. 1. In this realization, a binary encoding of x is applied to the address inputs of the memory. The output is the value stored at this address; it is an encoding of the value of the realized function f(x). Table I shows the required memory as a function of the number of bits n used to represent x and f(x). For n = 8 and 16 bits, the memory size is modest, and the single-memory approach is a reasonable implementation. For n = 32 bits, 17 Gigabytes are needed, which is large. For n = 64 and 128 bits, the memory size exceeds today's technology capabilities by a large margin.
A PIECEWISE LINEAR APPROACH TO REALIZING NUMERIC FUNCTIONS
Fig. 2(a) shows the architecture of a numeric function generator that realizes a given numeric function as a piecewise linear approximation. This is based on a tabular approach to realizing numeric functions [6]. The input x drives a segment index encoder, which produces the index of the segment in which the value of x falls. Within this segment, the function is realized as a line $c_1 x + c_0$. The values $c_1$ and $c_0$ are outputs of the memory. They drive a circuit that realizes $f(x) = c_1 x + c_0$. Sasao, Butler, and Riedel [14] show that the segment index encoder is tractably realized as a look-up table (LUT) cascade.
Fig. 3 shows how the memory size depends on the approximation error for the sin(πx) function, where 0 ≤ x ≤ 1/2. Plotted vertically is $\log_2$ of the number of segments; plotted horizontally is $\log_2$ of the approximation error. Smaller approximation error values are on the left and larger approximation error values are on the right. The top line, labeled Constant (analytical), corresponds to a constant approximation, in which the approximating line is horizontal. It corresponds to a memory output for $c_1$ equal to 0. In this case, the multiplier, which is a source of much delay in the circuit, is not needed. Note, however, that a large number of segments is needed.
The next line, labeled Power of 2 Slope (analytical), shows the number of segments needed when $c_1$ is restricted to be a power of 2. In this case, the multiplier reduces to a shift operation. There is still some delay, but not as much as with a full multiplier. The number of segments is smaller, but still large.
The third line, labeled Douglas-Peucker (experimental), shows the number of segments associated with the circuit shown in Fig. 2(a) when the Douglas-Peucker algorithm [4] is used to determine the segments. This is a heuristic in which segments are determined iteratively. First, one line is used to approximate the function over the whole domain. Then, the point of maximum error is used to partition the domain into two parts, and each part is approximated by its own line. This process is repeated until the maximum error is not greater than the desired error over the whole domain.
The bottom line, labeled Unrestricted Slope (experimental), shows the number of segments when the segmentation is optimum. This shows that the Douglas-Peucker algorithm is close to optimum, while the constant and power of 2 slope approximations are far from optimum.
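The Douglas-Peucker heuristic and the datapath it feeds can be sketched as follows. This is an illustrative software model only. Several choices are assumptions rather than details taken from [4] or [14]: each segment is approximated by the chord through its endpoints, the error is checked on a sampled grid, a binary search (`bisect`) stands in for the hardware segment index encoder, and all function names are hypothetical.

```python
import bisect
import math

def worst_point(f, lo, hi, samples=256):
    """Return (x, err): the sampled point where the chord over [lo, hi] misses f the most."""
    slope = (f(hi) - f(lo)) / (hi - lo)
    worst_x, worst_err = lo, 0.0
    for i in range(1, samples):
        x = lo + (hi - lo) * i / samples
        err = abs(f(x) - (f(lo) + slope * (x - lo)))
        if err > worst_err:
            worst_x, worst_err = x, err
    return worst_x, worst_err

def douglas_peucker(f, lo, hi, eps):
    """Split [lo, hi] at the point of maximum error until every chord is within eps."""
    x_split, err = worst_point(f, lo, hi)
    if err <= eps:
        return [lo, hi]
    left = douglas_peucker(f, lo, x_split, eps)
    right = douglas_peucker(f, x_split, hi, eps)
    return left + right[1:]              # drop the duplicated split point

def build_memory(f, breakpoints):
    """Coefficient memory: one (c1, c0) pair per segment."""
    memory = []
    for lo, hi in zip(breakpoints, breakpoints[1:]):
        c1 = (f(hi) - f(lo)) / (hi - lo)
        memory.append((c1, f(lo) - c1 * lo))
    return memory

def nfg(x, breakpoints, memory):
    """Segment index encoder (here a binary search), memory read, then c1*x + c0."""
    index = min(bisect.bisect_right(breakpoints, x) - 1, len(memory) - 1)
    c1, c0 = memory[index]
    return c1 * x + c0

f = lambda x: math.sin(math.pi * x)
eps = 2.0 ** -10
breakpoints = douglas_peucker(f, 0.0, 0.5, eps)
memory = build_memory(f, breakpoints)
```

Because the split points follow the curvature of the function, the resulting segments are wide where sin(πx) is nearly straight and narrow where it bends, which is the non-uniform behavior the segment index encoder has to support.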
The circuit in Fig. 2(a) is said to realize a non-uniform segmentation. Fig. 2(b) shows the architecture of a numeric function generator that realizes a given numeric function as a piecewise linear approximation in which all of the segment widths are the same. This architecture realizes a uniform segmentation. Normally, a segment index encoder would also be used in this circuit. However, we choose the (uniform) width to be a power of 2. In this case, the segment index encoder can be omitted, and the most significant bits of x are applied directly to the memory address input. Since a linear approximation is still involved, the circuit realizing $c_1 x + c_0$ remains.
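A minimal software model of the uniform architecture is given below, assuming a 12-bit input word that encodes x in [0, 1/2) and a 64-entry coefficient memory addressed by the 6 most significant bits. The word lengths, the use of endpoint chords for the coefficients, and the names `build_uniform_memory` and `uniform_nfg` are illustrative assumptions, not parameters from the circuits described here.

```python
import math

N_BITS = 12      # input word length (hypothetical)
K_BITS = 6       # most significant bits used as the memory address (hypothetical)

def f(x):
    return math.sin(math.pi * x)

def build_uniform_memory(k_bits=K_BITS):
    """2**k_bits equal-width segments over [0, 1/2); each entry stores (c1, c0) of its chord."""
    width = 0.5 / (1 << k_bits)
    memory = []
    for i in range(1 << k_bits):
        lo, hi = i * width, (i + 1) * width
        c1 = (f(hi) - f(lo)) / width
        memory.append((c1, f(lo) - c1 * lo))
    return memory

def uniform_nfg(word, memory, n_bits=N_BITS, k_bits=K_BITS):
    """word is an unsigned n_bits-bit integer encoding x = word / 2**(n_bits + 1)."""
    address = word >> (n_bits - k_bits)   # top k bits select the segment; no encoder needed
    c1, c0 = memory[address]
    x = word / (1 << (n_bits + 1))
    return c1 * x + c0

memory = build_uniform_memory()
max_error = max(abs(uniform_nfg(w, memory) - f(w / (1 << (N_BITS + 1))))
                for w in range(1 << N_BITS))
```

The shift that forms `address` costs nothing in hardware, since it amounts to wiring the upper address lines; this is the saving obtained by choosing a power-of-2 segment width.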
NON-UNIFORM VERSUS UNIFORM SEGMENTATION
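It is shown that, for nonuniform segmentations,

$$s(\varepsilon) \sim \frac{c}{\sqrt{\varepsilon}} \quad (\varepsilon \to 0^{+}), \tag{1}$$

where

$$c = \frac{1}{4} \int_a^b \sqrt{|f''(x)|}\, dx. \tag{2}$$

Further, it is shown that, for uniform segmentations,

$$s(\varepsilon) \sim \frac{b - a}{4} \sqrt{\frac{|f''|_{\max}}{\varepsilon}} \quad (\varepsilon \to 0^{+}), \tag{3}$$
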
where $|f''|_{\max}$ is the maximum of the absolute value of $f''(x)$ over the domain $[a, b]$. For non-uniform segmentation, the number of segments $s(\varepsilon)$ depends on an integral involving the second derivative over the interval of approximation, which is similar to an average. The theorem requires that the function f(x) be three times differentiable; this implies that the second derivative is integrable. For uniform segmentation, the number of segments depends on the maximum value of the second derivative. These values can be quite different, depending on the function.
Table II shows the number of segments for 14 numeric functions, as computed from Theorems 1 and 2, for the two types of segmentation, non-uniform (1) and uniform (3), and for four precisions: 8, 16, 32, and 64 bits. For 64-bit precision, all functions require a very large memory size, while 32-bit precision yields feasible realizations, except for three functions. For example, for √x, the number of segments needed in a uniform segmentation is much larger than in a non-uniform segmentation. This is due to the large absolute value of the second derivative near x = 0. Indeed, for all four precisions, uniform segmentation requires many more segments than non-uniform segmentation. Similarly, −ln(x) and $-(x \log_2 x + (1 - x) \log_2 (1 - x))$ require many more segments with uniform segmentation than with non-uniform segmentation.
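The gap between the two segmentations can be illustrated numerically. The sketch below assumes that the asymptotic estimates take the forms (1/(4√ε)) ∫ √|f''(x)| dx for non-uniform segmentation and ((b − a)/4) √(|f''|max/ε) for uniform segmentation, and it evaluates them for √x on the hypothetical domain [1/256, 1] with ε = 2^−16. The domain, the error target, and the function names are assumptions chosen for illustration and are not the settings used for Table II.

```python
import math

def nonuniform_estimate(d2f, a, b, eps, steps=20_000):
    """(1 / (4*sqrt(eps))) * integral over [a, b] of sqrt(|f''(x)|), by the midpoint rule."""
    h = (b - a) / steps
    integral = h * sum(math.sqrt(abs(d2f(a + (i + 0.5) * h))) for i in range(steps))
    return integral / (4.0 * math.sqrt(eps))

def uniform_estimate(d2f_max, a, b, eps):
    """((b - a) / 4) * sqrt(|f''|_max / eps)."""
    return (b - a) / 4.0 * math.sqrt(d2f_max / eps)

def d2_sqrt(x):
    """Second derivative of sqrt(x)."""
    return -0.25 * x ** -1.5

a, b = 1.0 / 256.0, 1.0          # hypothetical domain that avoids the singularity at x = 0
eps = 2.0 ** -16
s_nonuniform = nonuniform_estimate(d2_sqrt, a, b, eps)
s_uniform = uniform_estimate(abs(d2_sqrt(a)), a, b, eps)   # |f''| is largest at the left endpoint
```

Under these assumptions the estimates come out to about 96 segments for the non-uniform case and 2040 for the uniform case, a factor of roughly 21. The uniform width is dictated by the worst-case curvature at the left end of the domain, even though most of the domain is far flatter.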
In comparing the two types of segmentation, it is necessary to account for the complexity of the segment index encoder. We know of no analytic way to measure its complexity. However, experimental results [15] show that, with uniform segmentation, the ln x, √x, and 1/x functions cannot be implemented on an Altera Stratix EP1S20F484C5 FPGA, while non-uniform implementations can.
EXTENSIONS OF THE BASIC NFG
Higher Order Approximating Polynomials
A function that is close to linear is efficiently approximated by a linear function, $c_1 x + c_0$. From Table II, $\frac{1}{1 + e^{-x}}$ can be seen to be close to linear because of the relatively small number of segments needed for both non-uniform and uniform approximations. However, other functions are highly non-linear. This suggests that there is an advantage to using quadratic, cubic, and higher-order polynomials. It is known that quadratic polynomial approximations can drastically reduce the number of segments, to as few as 4% of the segments needed in a linear approximation [8].
A disadvantage of higher-order polynomials is the need for additional multipliers to realize the higher powers of x. These use significant FPGA resources and increase the delay. Indeed, it is known [10] that linear and quadratic polynomials yield the highest-efficiency designs.
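The trade-off between polynomial degree and segment count can be explored with a small experiment. The sketch below is a simplified stand-in for the designs in [8] and [10], under stated assumptions: segments are uniform, the segment count is restricted to powers of 2, each segment's polynomial interpolates the function at Chebyshev nodes rather than being a true minimax fit, and the error is checked on a sampled grid. All function names are hypothetical.

```python
import math

def interpolation_error(f, lo, hi, degree, samples=64):
    """Max sampled error of the degree-`degree` interpolant of f at Chebyshev nodes on [lo, hi]."""
    mid, half = 0.5 * (lo + hi), 0.5 * (hi - lo)
    nodes = [mid + half * math.cos((2 * k + 1) * math.pi / (2 * (degree + 1)))
             for k in range(degree + 1)]
    values = [f(t) for t in nodes]

    def poly(x):
        # Lagrange form of the interpolating polynomial.
        total = 0.0
        for j, (tj, vj) in enumerate(zip(nodes, values)):
            term = vj
            for m, tm in enumerate(nodes):
                if m != j:
                    term *= (x - tm) / (tj - tm)
            total += term
        return total

    grid = (lo + (hi - lo) * i / samples for i in range(samples + 1))
    return max(abs(f(x) - poly(x)) for x in grid)

def segments_needed(f, a, b, eps, degree):
    """Smallest power-of-2 number of uniform segments whose interpolants all stay within eps."""
    count = 1
    while True:
        width = (b - a) / count
        if all(interpolation_error(f, a + i * width, a + (i + 1) * width, degree) <= eps
               for i in range(count)):
            return count
        count *= 2

f = lambda x: math.sin(math.pi * x)
eps = 2.0 ** -16
linear_segments = segments_needed(f, 0.0, 0.5, eps, degree=1)
quadratic_segments = segments_needed(f, 0.0, 0.5, eps, degree=2)
```

With these assumptions the sketch reports 128 segments for the linear fit and 16 for the quadratic fit. The memory saving has a price: a quadratic evaluated in Horner form, (c2·x + c1)·x + c0, needs two multiplications where the linear form needs one.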
Floating Point
So far, we have discussed only fixed-point representations. This restricts the domain, as well as the range of applications. Nagayama, Sasao, and Butler [12] have shown the use of edge-valued decision diagrams in the design of floating-point numeric function generators for monotone elementary functions.
Multi-Variable Functions
A multi-variable function depends on two or more variables. For example, the multi-variable function $f(x, y) = \sqrt{x^2 + y^2}$ is used in converting from Cartesian to polar coordinates. Such a function can be realized by combining three single-variable functions, two realizing $\alpha^2$ and one realizing $\sqrt{\beta}$. A more efficient approach is to realize it directly, using rectangles to approximate a surface [9], which is analogous to the approach described above for single-variable functions. This approach yields a 58% reduction in memory size and a 39% reduction in delay time compared with the approach in which a number of single-variable functions are used [9]. A further simplification can be achieved by observing that this function is symmetric, i.e., f(x, y) = f(y, x), so that, effectively, only one-half of the surface need be realized [11].
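The symmetry argument can be illustrated with a deliberately simple table. The sketch below is not the method of [9] or [11]: it assumes a piecewise-constant approximation over a square grid on the unit square, stores only the cells with column index not exceeding row index, and swaps the indices on lookup. The grid size and the names `build_symmetric_table` and `lookup` are hypothetical.

```python
import math

def build_symmetric_table(f, cells):
    """Store f at cell centers of the unit square, only for cells with i >= j."""
    width = 1.0 / cells
    return {(i, j): f((i + 0.5) * width, (j + 0.5) * width)
            for i in range(cells) for j in range(i + 1)}

def lookup(table, cells, x, y):
    """Piecewise-constant lookup for 0 <= x, y <= 1 that exploits f(x, y) = f(y, x)."""
    i = min(int(x * cells), cells - 1)
    j = min(int(y * cells), cells - 1)
    if i < j:
        i, j = j, i          # mirror across the diagonal instead of storing both halves
    return table[(i, j)]

CELLS = 64
table = build_symmetric_table(math.hypot, CELLS)
approx = lookup(table, CELLS, 0.3, 0.4)      # exact value is 0.5
```

The table holds CELLS·(CELLS + 1)/2 entries instead of CELLS², which is slightly more than half, at the cost of one comparison and a conditional swap on the address path.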
CONCLUDING REMARKS
There is a long history of realizing numeric functions, like sin x by computer. Today's FPGAs provide large amounts of flexible logic at reasonable cost. We propose the use of linear and higher-order
