Two-variable numerical functions are widely used in various applications, such as computer graphics and digital signal processing. Fast and compact hardware implementations are required. This paper introduces the bilinear interpolation method to produce fast and compact numerical function generators (NFGs) for two-variable functions. This paper also introduces a design method for symmetric two-variable functions. This method can reduce the memory size needed for symmetric functions by nearly half with small speed penalty. Experimental results show that the bilinear interpolation method can significantly reduce the memory size needed for two-variable functions, and the speed of NFGs based on the bilinear method is comparable to that of NFGs based on tangent plane approximation. For a complicated function, our NFG is faster and more compact than a circuit designed using a one-variable NFG.
INTRODUCTION
The availability of large quantities of logic in LSIs and the need for high-speed arithmetic function computation in modern data processing applications have created a unique research opportunity [7] . High-speed arithmetic function computation will almost certainly result in significant changes in the way engineers perform data processing tasks. For example, image recognition tasks will more likely be performed at the site of the image collection, such as onboard reconnaissance vehicles.
High-speed computation of one-variable arithmetic functions (e.g. sin(x) and log(x)) has been extensively studied [2, 6, 8, [10] [11] [12] . However, significantly less work has been done on the high-speed implementation of multi-variable functions (e.g.
x 2 + y 2 + z 2 and arctan(x/y)) [3, 4, 13] . The existing design approaches apply a different method for each function realized. As far as we know, no systematic design method for generic multi-variable functions has been presented.
A straightforward design method for an arbitrary multivariable function is to use a single memory in which the address is a combination of values of variables and the content of that address is the corresponding value of function. This method is fast, but requires a 2 mn -word memory to implement an m-variable function with n bits for each variable. Thus, unlike one-variable functions, this method is impractical even for low-precision applications.
To produce a practical implementation, multi-variable functions can be designed using a combination of onevariable function generators, multipliers, and adders [3, 4] . This design can reduce the required memory size. However, depending on the function, it can produce a slow implementation because of its complex architecture. Also, complex architecture makes error analysis harder.
This paper proposes a systematic design method for twovariable functions. Since our design method is based on a piecewise polynomial approximation, hardware architecture is simple even for complex functions. However, polynomial approximation methods tend to require large memory size. For multi-variable functions, using higher-order polynomials is not always effective to reduce the memory size. This is because, for multi-variable polynomials, higher polynomial order requires many more polynomial coefficients. Also, higher-order polynomials produce slower NFGs. Thus, for polynomial approximation methods, reducing memory size with a small speed penalty is a key issue. To accomplish this, this paper introduces the bilinear interpolation method. This paper also introduces a design method and an architecture for symmetric two-variable functions. Error analysis for our NFGs is omitted because it is almost the same as [11] . It is guaranteed that the maximum error of our fixed-point NFGs is smaller than 2 −m (i.e., m-bit accuracy NFGs), where m is the number of fractional bits for the inputs and the output.
POLYNOMIAL APPROXIMATION USING BILINEAR INTERPOLATION
Bilinear interpolation is an extension of linear interpolation. It interpolates two-variable functions f (X,Y ) using four points. Let the four points be (
, and (X 2 ,Y 2 ), and let
By expanding and rearranging this, we obtain the following form: g(X,Y ) = C xy XY + C x X + C y Y + C 0 , where
, and
To approximate a given two-variable function using bilinear interpolation, we first partition a given domain of the function into segments. For each segment, we approximate the function using bilinear interpolation. In this case, the memory size and speed of an NFG are strongly dependent on the efficiency of the segmentation algorithm. Thus, effective segmentation algorithms are needed to achieve fast and compact NFGs. We use a recursive planar segmentation algorithm [9] (based on bilinear interpolation).
The algorithm begins by computing a bilinear interpolation using the four corner points of the given domain. This is an initial approximation. If that approximation error is larger than the given acceptable error, then the domain is partitioned into four equal-sized square segments. For each segment, a bilinear interpolation is computed using the four corner points of the segment. The same process is recursively repeated until all segments have an acceptable approximation error. Note that this algorithm creates a segment of size w i × w i , where
in is the number of fractional bits for X and Y , and h i is an integer. That is, all the segmentation points P i and Q i are restricted to values such that the least significant h i bits are 0 (i.e.,
In this algorithm, to reduce the approximation error, the maximum positive error max f g and the maximum negative error min f g are equalized by a vertical shift of g(X,Y ) with correction value v = (max f g + min f g )/2. Thus, the approximation error is (max f g − min f g )/2, and the approximating polynomial is (1) where Fig. 1 shows architectures for two-variable NFGs realizing (1). The Segment Index Encoder converts values of X and Y into a segment number. This, in turn, is applied as the address input of the Coefficients Memory. The coefficients are applied to adders and multipliers to form the polynomial value g(X,Y ) + v. Note the use of bitwise ANDs in Fig. 1 to compute X − B x and Y − B y . In recursive segmentation, we can realize X − B x and Y − B y using AND gates driven on one side by B x and B y , respectively [8] .
ARCHITECTURE BASED ON BILINEAR INTERPOLATION
The segment index encoder realizes the segment index function: 
Segments
Index Fig. 2(a) , where X and Y have n bits, and k denotes the number of segments. We realize this function with the architecture shown in Fig. 2(b) . We use an edge-valued BDD (EVBDD) [5] to design this architecture. In this architecture, the interconnecting lines between adjacent LUT memories determine the position in the EVBDD (labeled rails), and the outputs from each LUT memory to the adders tally the function value (labeled Arails). Consider the design of the LUT cascade and adders in Fig. 2(b) , given the segmentation produced by the algorithm in [9] . We begin by representing the segment index function using a multi-terminal BDD (MTBDD) [1] . Then, we convert the MTBDD into an EVBDD. By decomposing the EVBDD, we obtain the architecture in Fig. 2(b) . For more detail on this architecture, see [8] .
In our architecture, the coefficients memory and the LUT memories of the segment index encoder are implemented by RAMs. Thus, by changing the data for the coefficients memory and the LUT memories, a wide class of two-variable functions can be realized by a single architecture. 
DESIGN METHOD FOR SYMMETRIC FUNCTIONS Definition 1 A two-variable function f (X,Y ) is symmetric if f (X,Y ) = f (Y, X).

Theorem 1 The segmentation of a symmetric function produced by the recursive planar segmentation algorithm is symmetric.
From Lemma 1 and Theorem 1, we can use only one bilinear interpolation to approximate the given symmetric function in symmetric segments. By assigning the same segment index to symmetric segments, we can reduce the size of coefficients memory by nearly half. Fig. 1(b) shows an architecture for symmetric functions. Here, the coefficients memory stores only data for segments such that X ≤ Y . For other segments, approximated values are computed using Lemma 1. Since the comparator and multiplexers operate in parallel with the segment index encoder, there is no speed penalty due to these additional circuits. Table 1 compares the number of segments needed for the bilinear interpolation method with that for the tangent plane approximation 1 [9] for various functions. For those functions that are symmetric, Table 1 shows the number of symmetric segments. In this table, WaveRings, Sombrero, and Gaussian are: Table 1 shows that, for all functions, the bilinear interpolation method requires fewer segments than the tangent plane approximation. And, the number of symmetric segments is much smaller. Thus, the bilinear interpolation method and the symmetric technique significantly reduce the number of words in the coefficients memory. For sin(πX) √ Y , the bilinear interpolation method is also superior to the tangent plane method. However, the number of segments needed in the bilinear interpolation method is only slightly smaller. Table 2 compares the total memory sizes needed for NFGs based on the three design methods. Note that our NFGs have two kinds of memories: coefficients memory and LUT memory; the memory size shown is the sum of the two. Table 2 shows that, for all functions except for sin(πX) √ Y , the bilinear method uses less memory than the tangent plane approximation, even though the bilinear interpolation requires more polynomial coefficients. That is, for many functions, the reduction in the number of segments due to bilinear interpolation is sufficient to compensate for the increase of polynomial coefficients. Especially for symmetric functions, using the symmetric technique shown in Section 4 reduces further the memory size.
EXPERIMENTAL RESULTS
Number of Segments and Memory Sizes
WaveRings = cos √ X 2 +Y 2 √ X 2 +Y 2 + 0.25Sombrero = sin √ X 2 +Y 2 √ X 2 +Y 2 , Gaussian = 1 Y √ 2π e − X 2 2Y 2
FPGA Implementation Results
We implemented 12-bit accuracy NFGs based on the three design methods using the Altera Stratix III FPGA. Since the FPGA has adaptive look-up tables (ALUTs) that can realize fast adders, synchronous memory blocks, and dedicated DSPs (multipliers), our NFGs are efficiently implemented by those hardware resources in the FPGA. Table 3 compares the FPGA implementation results for 12-bit accuracy. In this table, the columns "Delay" show the total delay time of each NFG from the input to the output, in nanoseconds.
The NFGs based on tangent plane approximation are faster because they require fewer multipliers and fewer polynomial coefficients. However, for the 1/ √ X 2 + Y 2 and Gaussian functions, the memory needed for NFGs based on tangent plane approximation is so large that they could not be implemented in the FPGA. On the other hand, NFGs based on the bilinear method require less memory, and their speed is comparable to the speed of the NFGs based on tangent plane approximation. Since the symmetric technique significantly reduces the memory size, it is easier to implement with an FPGA. Table 3 shows that the symmetric technique has some speed penalty. However, it is reasonable. To show this, we designed XY / √ X 2 + Y 2 using onevariable NFG for 1/ √ X, two squaring circuits, an adder, and two multipliers. The one-variable NFG was realized by the method shown in [8] , which is based on linear approximation and non-uniform segment lengths. Table 4 compares the results with our NFGs.
Our NFGs require fewer ALUTs and DSPs than the onevariable implementation, and have much shorter delay. Especially, the NFG designed by the symmetric method requires less memory and shorter delay than the one-variable NFG. This shows that the speed penalty of our methods is small.
CONCLUDING REMARKS
We have proposed a design method and a programmable architecture for two-variable numerical function generators using the bilinear interpolation. To realize a two-variable function in hardware, we partition the given domain of the function into segments, and approximate the given function using the bilinear interpolation in each segment. In this paper, we also presented a design method and an architecture for two-variable symmetric functions. Experimental results show that the proposed method can significantly reduce the memory size needed for two-variable functions with small speed penalty.
ACKNOWLEDGMENTS
