Shinobu Nagayama, †1 Tsutomu Sasao †2 and Jon T. Butler †3 This paper proposes programmable architectures and design methods for numeric function generators (NFGs) of two-variable functions. To realize a twovariable function in hardware, we partition a given domain of the function into segments, and approximate the function by a polynomial in each segment. This paper introduces two planar segmentation algorithms that efficiently partition a domain of a two-variable function. This paper also introduces a design method for symmetric two-variable functions (i.e. f (X, Y ) = f (Y, X)). This method can reduce the memory size needed for symmetric functions by nearly half with small speed penalty. The proposed architectures allow a systematic design of various two-variable functions. We compare our approach with one based on a one-variable NFG. FPGA implementation results show that, for a complicated function, our NFG achieves 57% of memory size and 60% of delay time of a circuit designed based on a one-variable NFG.
Introduction
The ability to compute numeric functions at a high speed is important in many applications 12) , including 3D computer graphics, hardware accelerators for technical computing packages, direct digital frequency synthesizers 4) , and digital signal processing. Various design methods for numeric function generators (NFGs) have been devised for numeric functions on one variable 5),10),14),15),18)- 20) . Only a few methods exist for multi-variable functions (e.g., √ X 2 + Y 2 + Z 2 and arctan(X/Y )) 6),7), 22) . However, these methods are function-specific; different functions require different methods. As far as we know, no systematic design †1 Hiroshima City University †2 Kyushu Institute of Technology †3 Naval Postgraduate School 1 This paper is an extension of two papers 16), 17).
method exists for generic multi-variable functions. A straightforward design method for arbitrary multi-variable function is to use a single memory in which the address is a combination of values of variables and the content of that address is the corresponding value of function. This method produces a fast implementation, but requires a 2 mn -word memory to implement an m-variable function with n bits for each variable. Thus, unlike one-variable functions, even for a computation with a small number of bits, this method is impractical because of large memory size.
To produce a practical implementation, multi-variable functions are often designed in a conventional (trivial) manner that uses a combination of one-variable function generators, multipliers, and adders 6), 7) . For example, the function √ X 2 + Y 2 + Z 2 can be realized using three circuits, each realizing a 2 , two adders, and a square root circuit. This design method may require small memory size. However, depending on the function implemented, it can produce a slow implementation because of long path delays. Also, such circuits make error analysis harder. That is, guaranteeing output accuracy becomes harder. Also, there are many multi-variable functions that cannot be decomposed into one-variable functions, such as probability distributions that are functions of the random variable and a parameter, like variance. This paper proposes a systematic design method for two-variable functions. Since our design method is based on a piecewise polynomial approximation, architectures are simple even for complicated functions. To approximate a given function using piecewise polynomials, we introduce two planar segmentation algorithms that efficiently partition a given domain of a two-variable function. We also introduce programmable architectures that can realize a wide range of two-variable functions.
The rest of this paper is organized as follows: Section 2 introduces the number representation and the decision diagrams used in this paper. Section 3 presents two planar segmentation algorithms and a polynomial approximation method using bilinear interpolation. Section 4 presents programmable architectures for two-variable functions. Section 5 presents an architecture and a design method for symmetric two-variable functions. Section 6 evaluates performance of our segmentation algorithms and architectures for two-variable functions. And, Sec-tion 7 concludes the paper. An error analysis for our NFGs is omitted because it is the almost same as Refs. 14), 19). 12) . In this paper, an m-bit accuracy NFG is an NFG with an m-bit fractional part of the inputs, an m-bit fractional part of the output, and a 1 ULP error.
Preliminaries

Number Representation and Errors
Decision Diagrams
The proposed design uses binary decision diagrams. Fig. 1 (a) . In Fig. 1 (b) 
Planar Segmentation Problem
To approximate a given two-variable function by piecewise polynomials, we partition a given domain of the function into segments, and approximate the function by a polynomial in each segment. By narrowing segments, and thus increasing the number of segments, we can decrease the approximation error to the desired value. In this case, the memory size and speed of an NFG are strongly dependent on segmentation of domain. Thus, to design fast and compact NFGs, we need to solve the following segmentation problem: Given a two-variable function, its domain, and acceptable approximation error, find an optimum segmentation. To find an optimum segmentation, we consider the following: ( 1 ) number of words in the coefficients memory, which is the number of segments, and ( 2 ) complexity of hardware to realize segmentation, called the segment index encoder, which maps values of X and Y to a segment number.
Fewer segments are preferred because the number of segments directly affects the size of the coefficients memory of the NFG. But, the complexity of the segment index encoder is important as well. Even if the number of segments is minimum, a large NFG is produced if the segment index encoder is very large.
For one-variable functions, since the domain is formed in one-dimension (line), any segmentation can be realized compactly. Thus, we considered only the number of segments to find an optimum segmentation 14), 19) . On the other hand, for two-variable functions, since the domain is formed in two-dimensions (plane), the segment index encoders tend to be much more complex than for one-variable functions. Thus, to find the optimum design of two-variable NFGs, it is necessary to carefully consider the complexity of the segment index encoder.
For one-variable functions, we have proposed linear segmentation algorithms 14) , 19) to find an optimum segmentation of a linear domain (an approximation with the fewest segments) efficiently. However, for two-variable functions, a planar segmentation algorithm is now required to find an optimum segmentation of a planar domain. In planar segmentations, we have a higher degree of freedom, and thus, finding an optimum segmentation becomes much more difficult than in linear segmentation. Because many segments may be involved in a practical design, the time needed to find an optimum segmentation can be very long. To produce an efficient planar segmentation in a short computation time, we focus on heuristic planar segmentation algorithms. The next subsection presents two heuristic planar segmentation algorithms that produce an efficient planar segmentation by regularly partitioning a given domain using squares.
Planar Segmentation Algorithms
We first present a recursive planar segmentation algorithm to reduce the hardware complexity of both the coefficients memory (the number of segments) and the segment index encoder. Figure 2 shows this algorithm. Inputs of the algorithm are a numeric function
} for X and Y , an accuracy m in of X and Y , and an acceptable approximation error ε a . This algorithm begins by computing an approximate polynomial g(X, Y ). This is an initial approximation. If that approximation error ε is larger than the given acceptable error ε a , then the domain is partitioned into four equal-sized square segments. For each segment, an approximate polynomial is computed again. The 
and correction
Compute the maximum positive error
Compute the maximum negative error
Compute approximation error ε = (max fg − min fg )/2 and correction values
Repeat Steps 1, 2, . . . , 6 for each new segment recursively, until the maximum approximation errors are smaller than εa in all segments. same process is recursively repeated until all segments have approximation errors smaller than ε a . Note that this algorithm creates a segment of size w i ×w i , where
That is, all the segmentation points P i and Q i are restricted to values such that the least significant h i bits are 0 (i.e.,
This restriction contributes to reduce the complexity of the segment index encoder. Next, we present the planar uniform segmentation algorithm. Since the recursive planar segmentation algorithm produces non-uniform segmentation, a segment index encoder is needed to compute a segment number from values of X and Y . However, in a uniform segmentation where the number of segments is a power of 2, a segment index encoder is not necessary because a segment number is obtained by the most significant bits of X and Y (see Fig. 3 (b) ). This eliminates the delay of the segment index encoder, and produces fast NFGs. To produce a uniform segmentation, we begin by finding the smallest square segment needed to achieve the acceptable approximation error using the recursive segmentation algorithm shown in Fig. 2 . Then, we partition a given domain into square segments all with the same size as the smallest segment.
Approximation Using Bilinear Interpolation Polynomials
For g(X, Y ) in Fig. 2 , we can use any approximating polynomial. In general, higher-order polynomials require fewer segments. However, for multi-variable functions, using higher-order polynomials is not always effective in reducing the memory size of NFGs. This is because, for multi-variable polynomials, higher polynomial order requires many more polynomial coefficients. Also, higher-order polynomials produce slower NFGs. Thus, for polynomial approximation methods, reducing memory size with a small speed penalty is a key issue. To accomplish this, we use the bilinear interpolation polynomials 21) . Bilinear interpolation is an extension of linear interpolation. It interpolates two-variable functions f (X, Y ) using four points. In Fig. 2 , to interpo-
, and f ee = f (E x , E y ). Then, the bilinear interpolation g(X, Y ) is given by:
By expanding and rearranging this, we obtain the following form:
, and
To reduce the approximation error, the maximum positive error max fg and the maximum negative error min fg are equalized by a vertical shift of g(X, Y ) with a correction value v = (max fg + min fg )/2. Thus, the approximation error is (max fg − min fg )/2, and the approximating polynomial is g(X, Y ) + v. 
where (b) Architecture based on uniform segmentation. Fig. 3 Architectures for two-variable NFGs using bilinear interpolation.
Programmable Architectures for Two-Variable NFGs
Architectures Based on Recursive and Uniform Segmentations
efficients memory. The coefficients are applied to adders and multipliers to form the polynomial value g(X, Y ) + v. Note that Fig. 3 (a) uses bitwise AND gates to compute X − B x and Y − B y . In recursive segmentation, we can realize X − B x and Y − B y using AND gates driven on one side by B x and B y , respectively 15) . Note that Fig. 3 (b) has neither a segment index encoder nor bitwise AND gates. In uniform segmentation, the segment index encoder and bitwise AND gates are not necessary. This is because a segment number is obtained by the most significant bits of X and Y , and X − B x and Y − B y , which are realized with bitwise AND gates in Fig. 3 (a) , are obtained by the least significant bits.
Architecture and Design Method for Segment Index Encoder
The segment index encoder realizes the segment index function: Fig. 4 (a) , where X and Y have n bits, and k denotes the number of segments. We realize this function with the architecture shown in Fig. 4 (b) . In this architecture, the values of interconnecting lines between adjacent LUT memories represent sub-functions in the EVBDD (labeled rails), and the outputs from each LUT memory to the adders tally the function value (labeled Arails). Consider the design of the LUT cascade and adders in Fig. 4 (b) , given the segmentation produced in Fig. 2 .
We begin by representing the segment index function using an MTBDD. Figure 5 illustrates the relationship between recursive segmentation and MTBDDs. Then, we convert the MTBDD into an EVBDD. By decomposing the EVBDD, as shown in Fig. 6 , we obtain the architecture in Fig. 4 (b) . In Fig. 6 , the column labeled as 'r i ' in the table of each LUT memory denotes the rails that represent sub-functions in the EVBDD. And, the column 'a i ' in Fig. 6 denotes the Arails that represent the sum of weights of edges. In the EVBDD, "(a i , r i )" assigned to edges that cut across the horizontal lines represents the sum of weights and sub-functions, respectively. For more detail on this architecture, see 15) . In this architecture, the size of LUT memories realizing the recursive segmentation depends on the number of segments. Specifically, Fig. 4 (b) with at most log 2 k rails and at most log 2 k Arails, where k is the number of segments.
Theorem 1 Let seg func(X, Y ) be a segment index function obtained by a recursive planar segmentation. The segment index function can be realized by the segment index encoder shown in
Proof: See Appendix. The segment index encoder satisfying Theorem 1 is obtained when the variable order for the EVBDD is x l−1 , y l−1 , x l−2 , y l−2 , . . . , x −m , y −m from the top to the bottom (see the proof in Appendix). We can also use the optimization techniques for multi-valued decision diagrams presented in 13) to optimize the variable order for EVBDD.
In our architectures, the coefficients memory and the LUT memories of the segment index encoder are implemented by RAMs. Thus, by changing the data for the coefficients memory and the LUT memories, a wide class of two-variable functions can be realized by a single architecture.
Design Method for Symmetric Functions
Symmetric functions are commonly found in practical applications of NFGs. For example, √ X 2 + Y 2 , which is used in converting from rectangular to polar coordinates, is symmetric. This section presents an architecture and a design method taking advantage of the function's symmetry.
Definition 9 A segmentation is symmetric if for every segment
, and E y1 = E x2 . Symmetric segments are a pair of such segments.
Lemma 1 Let f (X, Y ) be a symmetric function, and let g 1 (X, Y ) and
Proof: See Appendix.
Theorem 2 The segmentation of a symmetric function produced by the recursive planar segmentation algorithm is symmetric.
Proof: See Appendix. From Lemma 1 and Theorem 2, we can use only one bilinear interpolation to approximate the given symmetric function in symmetric segments. By assigning the same segment index to symmetric segments, we can reduce the size of the coefficients memory by nearly half. shows an architecture for symmetric functions. Here, the coefficients memory stores only data for segments such that X ≤ Y . For other segments, approximated values are computed using Lemma 1. Since the comparator and multiplexers operate in parallel with the segment index encoder, there is no speed penalty due to these additional circuits. Table 1 shows the number of segments produced by the two segmentation algorithms presented in Section 3, and their computation time for various functions 1) . This table also shows the number of symmetric segments for symmetric functions. In this table, W aveRings and Sombrero are Table 1 shows that, for all functions except sin(πXY ) and Sombrero, the recursive segmentation algorithm produces many fewer segments than the uniform segmentation algorithm. Especially, for higher accuracy, the number of segments needed in recursive segmentation is only a few percent of the number of segments needed in uniform segmentation. And, the number of symmetric segments is even smaller. Thus, the recursive segmentation algorithm and the symmetric technique significantly reduce the number of words in the coefficients memory. For sin(πXY ) and Sombrero, the additional segments needed in uniform segmentation are not so large even for higher accuracy. This means that, for these functions, the uniform segmentation method also produces NFGs with reasonable size.
Experimental Results
Number of Segments and Computation Time for Algorithms
In addition, Table 1 shows that both algorithms produce segments with small CPU time. Such quick segmentation is useful to reduce design time for NFGs. Table 2 compares total memory sizes needed for the three NFGs shown in Fig. 3 and Fig. 7 . Note that the NFGs based on recursive segmentation have two kinds of memories: coefficients memory and LUT memory. The memory size shown is the sum of the coefficients memory size and the LUT memory sizes. Table 2 shows that, for all functions, NFGs based on recursive segmentation require smaller memory size than NFGs based on uniform segmentation, even though NFGs based on recursive segmentation have a segment index encoder. For example, for f 4 (X, Y ) = XY / √ X 2 + Y 2 , the 12-bit accuracy NFG using recursive segmentation requires only 0.8% of memory required by uniform segmentation. Especially for symmetric functions, using the symmetric technique shown in Section 5 reduces the memory size significantly.
Memory Sizes Needed for Numeric Function Generators
To understand the relation between memory size and accuracy, we designed NFGs for XY / √ X 2 + Y 2 with various accuracies. ( 3 ) NFGs using recursive non-uniform segmentation, and ( 4 ) NFGs using the symmetric technique. Interestingly, for this function, the memory size of the NFGs using uniform segmentation increases in the same way as the memory size of a single look-up table.
On the other hand, the memory sizes of the NFGs using recursive segmentation and the NFGs using symmetric technique increase much more slowly than the other two. For 16-bit accuracy, the memory sizes of the NFG using recursive segmentation and the NFG using symmetric technique are only 0.06% and 0.03% of that of the NFG using uniform segmentation, respectively.
FPGA Implementation Results
To show the merits and demerits of the three proposed methods, we compare the performance of the NFGs designed by the three methods. We implemented 12-bit accuracy NFGs using the Altera Stratix III FPGA. Since the FPGA has adaptive look-up tables (ALUTs) that can realize fast adders, synchronous memory blocks, and dedicated multipliers, our NFGs are efficiently implemented by those hardware resources in the FPGA. Table 3 compares the FPGA implementation results of the NFGs. In this table, the three columns labeled "Delay" show the total delay time of each NFG from the input to the output, in nanoseconds.
The NFGs based on uniform segmentation require fewer pipeline stages and have shorter delay than the recursive segmentation because they have no segment index encoder. However, for six functions, the memory needed for NFGs based on uniform segmentation is so large that they could not be implemented on the FPGA. Note that NFGs that have only one pipeline stage in Table 3 are realized with a single look-up table due to the excessively many segments. On the other hand, for all functions except for f 3 (X, Y ) = 1/ √ X 2 + Y 2 , the NFGs based on recursive segmentation do not require excessive memory size and can be implemented on the FPGA. Further, the successful implementations achieve a high operating frequency. Since the symmetric technique significantly reduces memory size, even function f 3 can be implemented with the FPGA. But, the symmetric technique has some speed penalty because it produces a slightly more complex segment index encoder.
In this way, the three methods have different merits and demerits, and thus, we can use the three methods appropriately depending on applications and numeric functions.
Although the recursive segmentation and symmetric technique have some speed penalty as shown in Table 3 , the penalty is reasonable. To show that, we compare our NFGs with an NFG designed in a conventional (trivial) manner that uses a combination of one-variable NFGs and basic operations like addition and multiplication. We implemented f 4 (X, Y ) = XY / √ X 2 + Y 2 with the same FPGA using a one-variable NFG for 1/ √ X, two squaring circuits, an adder, and two multipliers. The one-variable NFG was realized by the method shown in 15), which is based on linear approximation and non-uniform segmentation. Table 4 compares the results with our NFGs. Our NFG based on recursive segmentation requires fewer ALUTs and DSPs than the implementation using one-variable NFG, and has much shorter delay. Especially, the NFG designed by the symmetric method achieves both less memory and shorter delay. This shows that the speed penalties caused by the recursive segmentation and the symmetric method are small enough.
From these results, we can see that by designing two-variable functions using one-variable NFGs, the required memory size can be reduced significantly. However, depending on the functions, it can produce a slow implementation because of additional logic, such as multipliers. Also, complicated architectures using one-variable NFGs make error analysis harder, and it is harder to guarantee output accuracy. This increases design time. On the other hand, for a large class of functions, we can automatically generate fast and compact NFGs whose output accuracy is guaranteed.
Concluding Remarks
We have proposed programmable architectures and design methods for numeric function generators of two-variable functions. To realize a two-variable function in hardware, we partition the given domain of the function into segments, and approximate the given function by a polynomial in each segment. In this paper, we presented two planar segmentation algorithms which partition a given domain of two-variable function efficiently. We also presented a design method for symmetric two-variable functions. To the best of our knowledge, these are the first systematic design methods based on piecewise polynomial approximation for two-variable functions. Experimental results showed that for a complicated function, our automatically generated NFG achieves higher performance than the NFG that is manually designed in a conventional manner.
In the proposed architectures, the coefficients memory and the LUT memories of the segment index encoder can be implemented by embedded RAMs in an FPGA (e.g., M4Ks in Altera FPGAs). Thus, by changing the data for the coefficients memory and the LUT memories, a wide class of two-variable functions can be realized by a single architecture. Since just changing the RAM data can switch functions, we can switch functions without reprogramming the FPGA.
The algorithms and architectures presented in this paper can be easily extended to functions with three or more variables.
