Abstract-This paper presents a method for evaluating functions based on piecewise polynomial approximations (splines) with a hierarchical segmentation scheme targeting hardware implementation. The methodology provides significant reduction in table size compared to traditional uniform segmentation approaches. The use of hierarchies involving uniform splines and splines with size varying by powers of two is particularly well suited for the coverage of nonlinear regions. The segmentation step is automated and supports user-supplied precision requirements and approximation method. Bit-widths of the coefficients and arithmetic operators are optimized to minimize circuit area and enable a guarantee of 1 unit in the last place (ulp) accuracy at the output. A coefficient transformation technique is also described, which significantly reduces the dynamic ranges of the fixed-point polynomial coefficients. The hierarchical segmentation method is illustrated using a set of functions including Index Terms-Circuit synthesis, design automation, digital systems, field-programmable gate arrays (FPGAs), piecewise polynomial approximation.
I. INTRODUCTION
T HE evaluation of mathematical functions is often central to numerous applications in communications, computer graphics, digital signal processing, and scientific computing. Examples of such functions include elementary functions such as and , and compound functions such as and . Software environments such as C or MATLAB usually provide libraries for evaluating functions in floating-point precision. However, software implementations on instruction processors are often too slow for numer-ically intensive and/or real-time applications. The performance of such applications depends on the design of a fast and accurate hardware function evaluation unit, usually implemented on a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
The evaluation of functions has received considerable interest in the research community. In particular, evaluation methods involving polynomials and splines have been extensively used in both hardware and software implementations. Spline approximations (piecewise polynomials) are often preferred over polynomial-only approximations due to the wide range of design tradeoffs they offer involving memory, computational complexity, and precision [1] . Traditionally, spline approximations use uniform segmentation, in which all splines cover segments of equal width and the total number of segments is constrained to a power of two. This has the advantage of simple coefficient address computation, but can be problematic for regions of the function in which the first or higher order derivatives have high absolute values.
The hierarchical segmentation method discussed in this paper employs hierarchies involving uniform splines and splines with size varying by powers of two. This segmentation technique is able to adapt to nonlinearities of a function, resulting in significant reduction in the number of splines compared to uniform segmentation. Each spline contains a set of polynomial coefficients corresponding to a particular region of a function. The polynomial is then evaluated in fixed-point arithmetic. One of the main challenges of such arithmetic structures lies in the determination of the required bit-widths of the coefficients and operators. For the work presented here, we use the MiniBit bitwidth optimization approach [2] . MiniBit enables computation of the required integer and fractional bits for each signal in an analytical manner, making it possible to guarantee faithful rounding [1 unit in the last place (ulp) accuracy] at the output.
In the work presented here, compound functions are evaluated directly. This contrasts with the traditional approach involving computation via a series of elementary function evaluations, and the associated delays and rounding error accumulation that occur during range reduction and reconstruction for each component elementary function [3] . Thus, the advantages of direct evaluation as discussed here increase significantly as compound functions become more complex.
In contrast with much of the previous work, in this paper we are targeting environments in which the delay introduced by the coefficient address logic must be kept to a minimum. Additionally, we enable a designer to specify an error tolerance and to automatically obtain a segmentation that: 1) meets this tolerance; 2) requires a small number of segments ; and 3) leads to efficient hardware implementation. The proposed segmentation approach can be applied to a wide range of approximation methods. Without loss of generality, it will be illustrated with spline approximations involving polynomials. Some earlier aspects of this work were presented in [4] . This paper includes a number of additions and improvements, including automated analysis of function characteristics and use of the results in choosing a segmentation method, bit-width optimization to guarantee 1 ulp precision, coefficient transformation to reduce dynamic range, and hardware implementation results that include measurements of combinatorial delay.
The rest of this paper is organized as follows. Section II discusses background of function evaluation and discusses previous work on uniform and nonuniform segmentation. Section III presents an overview of the methodology for function evaluation unit design with hierarchical segmentation. Section IV describes the hierarchical segmentation method. Section V presents the hardware architecture for evaluating functions based on hierarchical segmentation. Section VI presents hardware realization results for a Xilinx Virtex-4 FPGA device. Concluding remarks are given in Section VII.
II. BACKGROUND
Function evaluation methods can be classified into iterative methods and non-iterative methods. Iterative methods [5] , [6] successively refine the output precision and are suitable for applications where arbitrary precisions are desired. However, they usually involve high latencies and low throughputs, making them unsuitable for high-performance applications.
Non -iterative methods include direct table lookups, table  addition methods, polynomial approximations, and rational  approximations. Direct table lookups are widely used for computations involving low-precision inputs, but are impractical  for high-precision inputs due to high table size. Table addi tion methods [7] , [8] use two or more parallel table lookups followed by multi-operand addition. Although this approach gives significant improvements in table size and potential speed  improvements due to reduction in table access times over the direct table lookup approach, it still suffers from large table sizes for high precisions. Polynomial approximations [1] involve the evaluation of a polynomial over a given interval. The approximation accuracy can be controlled by varying the degree of the polynomial and choice of interval. Rational approximation [9] is a generalization of polynomial approximation in which the function is approximated using the ratio of two polynomials. For a given limit on numerator and denominator polynomial degree, it enables higher accuracy than polynomial approximation, but due to the divide operation, the circuit complexity is considerably higher. Non-iterative methods are often combined with segmentation in which the input range is split into multiple segments, each associated with a spline containing a particular set of coefficients.
By far the most common segmentation method is uniform, in which all segment lengths are equal [1] , [7] , [8] , [10] - [12] . In addition, the choice of segment numbers is typically limited to powers of two. While this leads to simple coefficient address computation, in contrast with nonuniform segmentation, it does not allow segment lengths to be customized to the local function characteristics. The benefits of such customization using nonuniform segmentation have been noted in the past, particularly in association with logarithmic number systems (LNS). LNS involves the approximation of highly nonlinear functions for performing addition and subtraction operations in the logarithmic domain. Two-level uniform segmentation has been described by Lewis [13] . The segmentation in both levels is uniform, but the density of the second level segmentation is customized in accordance with local function characteristics. Coleman et al. [14] describe a two-level approach in which the first segmentation has segments that vary by powers of two and the segmentation at the second level is uniform. References [13] and [14] consider a subset of the segmentation options considered here, and neither discusses automating the segmentation process.
Balanced error segmentation is a form of nonuniform segmentation in which the maximum approximation error in all segments is equal. This is desirable since it minimizes the number of segments needed to meet a given overall approximation error constraint , though as discussed further below it is not particularly conducive to efficient hardware mapping. Given a continuous function , interval , and number of segments , the goal in balanced error segmentation is to identify segment boundaries , where and associated degreepolynomials such that the absolute maximum approximation errors (1) are minimized and are equal for all segments [15] . Several algorithms have been described in the literature to address the balanced error segmentation problem [4] , [15] - [17] . However, the main challenge with implementing balanced error approaches is that the arbitrary variation in segment lengths complicates the circuitry needed to identify the segment associated with a particular input. This challenge has been recognized by Sasao et al. [18] - [20] , who present an algorithm for finding the balanced error segmentation and employ a large cascade of lookup tables (LUTs) for computing the coefficient address. This approach leads to the minimum number of segments, but may not always result in designs with minimized delay and/or memory requirements; we shall illustrate this point in Section VI. Fig. 1 gives an overview of the function evaluation unit design methodology based on hierarchical segmentation. The following input parameters must be supplied by the user: 1) function to be evaluated (e.g., , etc.); 2) input interval (e.g., , etc.); 3) input and output precisions (e.g., 16 bits, etc.); 4) approximation method (e.g., degree-1 splines, degree-2 splines, etc.). The "Hierarchical Segmentation" step partitions the splines over the interval of the function in an adaptive manner via a hierarchy of segmentation schemes, and produces segment boundaries and the corresponding polynomial coefficients. The details of the segmentation are described more fully in the following. Chebyshev coefficients [3] are used for the splines. Potentially, a minimax approximation could be performed on the Chebyshev coefficients to find a set of coefficients with slightly lower . Chebyshev is used in this paper instead of minimax due to its faster coefficient generation time, which was necessary due to the large collection of results presented in Section VI.
III. DESIGN FLOW OVERVIEW
The "Bit-Width Optimization" step applies bit-width optimization to the circuit and determines the fixed-point bit-width required for each signal. The bit-widths determined should guarantee overflow protection and faithful rounding at the output signal. The final step, "Hardware Generation," utilizes the segmentation, coefficient, and bit-width information, and produces synthesizable VHDL code suitable for FPGA or ASIC implementation. The entire process is fully automated and has been implemented using MATLAB.
IV. HIERARCHICAL SEGMENTATION METHOD

A. Framework
The segmentation utilizes a selection from a library of four schemes when subdividing an interval, denoted , and , respectively, as illustrated in Fig. 2 . In , segments are uniformly sized. In , the segment sizes increase by powers of two from the beginning of the input interval to the end of the interval, while in the segment sizes decrease by powers of two from start to end. In , segment sizes increase by powers of two until the midpoint of the interval and then decrease by powers of two until the end is reached. As can be seen in Fig. 2 , this range of options offers a way to match the segmentation to portions of a function that have singularities or narrow peaks. The hierarchical aspect arises because segmentation is applied recursively. In the first pass, the entire interval is subdivided using one of the above four methods into smaller segments. In the next pass, each segment can be further subdivided, again using any of the four methods. The decision on which method to use is made independently on each segment. This process can continue to the depth necessary to meet the design requirements. More formally, the segmentation at each level, and of each segment, is selected from . In principle, any of the segments produced in the segmentation at level can be further segmented using any of the four methods in level . In practice, however, it is often sufficient to use the same segmentation on all segments within the same segmentation level. This also simplifies the notation, allowing the full hierarchy to be expressed through , where is the number of levels in the hierarchy. When dealing with functions with multiple singularities distributed throughout the input interval, it may be beneficial to support arbitrary segmentation for each segment and each level. The methodologies presented here still apply in such cases, though the notation to describe the segmentation becomes more cumbersome. Alternatively, the input interval can be divided into multiple subintervals separated at the singularity regions and segmentation can be performed individually on these subintervals. In such cases, the singularities can potentially occur in arbitrary regions of the function, meaning that although the hardware-efficient segmentation described here can be applied within each subinterval, it cannot be applied to the outer most segmentation. Therefore, comparators or multi-valued decision diagrams [19] will need to be utilized for the outer most segmentation, which can add increased area and delay requirements.
This structure can be implemented via cascaded LUTs, where the output of one LUT is supplied as the address of the next (as illustrated in Fig. 8 which will be discussed further in Section V). Cascaded lookup tables are also used in the balanced error approach of Sasao et al. [18] - [20] . However, while Sasao et al. use them to realize multi-valued decision diagrams [21] , in hierarchical segmentation, each LUT is used to store information of a given hierarchy level. Thus, the -bit representation of an input is split into partitions denoted , where denotes the bit-width of the th partition . The number of addressable segments in the th partition is constrained as follows:
Segments within any given level of the hierarchy are indexed by , where . For uniform segmentation , it is clear that segments can be formed, since uniform segments are addressed with bits. However, for the three power of two schemes, , the constraints are not as intuitive. Consider the example of a two-level hierarchy in which , and the first partition is and is addressed using 5 bits; e.g.,
.
As noted earlier, it is assumed that the approximation interval is normalized to the range to (in the case where ), where the leading binary point is implicit. In this example, it is possible to construct a maximum of 10 segments in the first level of the hierarchy as illustrated in Table I . Note that with the exception of the initial and final segments, the segment lengths increase by powers of two until (end of location 4) and start decreasing by powers of two from the midpoint (beginning of location 5) to the end. Fewer segments can be obtained by omitting parts of the table. For , locations 0 to 5 are used with the ending value of location 5 modified to . Analogously, for , locations 4 to 9 are used with the beginning value of location 4 modified to . Computation of the segment address for a given partition is based on detecting the number of leading zeros for segments beginning with a zero, and on detecting the number of leading ones for segments beginning with a one. Specifically, the addresses (segment numbers) can be computed in the following manner:
if if (7) where , and return the number of leading zeros, number of leading ones, and the most significant bit of , respectively.
gives the number of bits used for indexing the segments in (5 in this case, shown in bold Table I ). Let denote the set of bits that remains constant within a given segment in (bits left side of the vertical partition lines Table I ). The th level uses the adjacent bits to the right of . For example, in Table I, for and , and for and . When uniform segmentation is used in the th level of the hierarchy, then , and bits of remain constant over a given segment, so . Therefore, the th level simply uses the bits immediately to the right of . However, if one of the three nonuniform segmentations is used,
, then the number of bits corresponding to depends on the value of , since determines the value of . Consider again the case when . Let denote the th bit from the least significant bit (LSB) of ; e.g., is the LSB of is the bit immediately to the left of the LSB, etc. In formal terms, which has bits, is given by the bits to the right of the following: • for and (segment numbers 0 and 9 in Table I ); • for to (segment numbers 1 to 4 in Table I ); • for to (segment numbers 5 to 8 in Table I ). With appropriate modifications to the asymmetry of the segmentation, an analogous procedure applies for determining in the case of and . The computation of can be performed as follows:
if or if and otherwise (11) In principle, it is possible to have any number of levels of nested segmentation steps , as long as . The more levels are used, the closer the total number of segments will be to the optimal. However as increases the partitioning problem becomes more complex, and the cascade of lookup tables gets deeper, increasing the delay involved in finding desired segment. Experiments show that the rate of reduction of decreases rapidly as increases, so that further increase in delay Fig. 3 . Illustration of a two-level segmentation that is uniform at both levels, e.g., US(US). In this example B = 3, leading to 8 equal-width outer segments, and B , ranges from 0 to 3, so that each outer segment is partitioned into 1, 2, 4, or 8 uniform inner segments. The total number of segments is M = 24.
The black and grey vertical lines indicate the boundaries for the outer segmentation 3 and inner segmentation 3 , respectively. Hardware suitable for such a scheme is illustrated in Fig. 8 , which is also discussed in Section V in detail.
arising from incrementing gives diminishing returns in terms of approximation accuracy.
In most cases, , which consists of an outer segmentation scheme and an inner segmentation scheme , gives an that is close to the optimal while retaining acceptable partitioning complexity and delay. Therefore, in the results presented here is used. For the first level of segmentation is used if the function has strong nonlinearities at both endpoints of the range and .
is used if the function has strong nonlinearities at but not , and is used if the opposite occurs.
is used if the function is nonlinear in regions away from the endpoints. For the inner segmentation is used for the results presented here, enabling straightforward and efficient control of the approximation error of an outer segment . Fig. 3 illustrates an example of , where , leading to 8 equal-width outer segments, and , ranges from 0 to 3, so that each outer segment is partitioned into 1, 2, 4, or 8 uniform inner segments. The total number of segments is .
B. Segmentation Algorithm
The segmentation is demonstrated with the following six elementary and compound functions: (17) where is an -bit number over of the form . The function occurs in image warping algorithms [22] and is used in the Box-Muller algorithm for the generation of Gaussian noise [23] . Functions , and are commonly used elementary functions [3] , while function is an example of a complex high-degree rational function. The determination of the appropriate segmentation hierarchy for a given function plays an important role. Choosing the wrong hierarchy for a given function can result in inefficient segmentation, leading to unnecessarily large numbers of segments. One way of finding the best hierarchy is to apply all possible segmentation hierarchies and pick the one that gives the smallest number of segments . Instead of using this brute-force approach, we first find the balanced error segmentation of the given function. Then for each of the four possible segmentation schemes , we obtain the variance of a histogram that holds the number of balanced error segmentation boundaries within each of the "partitions" created by the outer segmentation. The scheme that results in the smallest variance is chosen, since a small variance indicates a good match with the balanced error segmentation. This approach is illustrated in the pseudo-code shown in Algorithm 1. It can be applied recursively times to find the appropriate inner segmentation schemes. Another important issue is the determination of number of bits to assign for the outer segmentation . If is too small, there will be insufficient granularity in the outer segments follow the local nonlinearities of a function. On the other hand, if is too large, there will be too many outer segments and the total number of segments will be unnecessarily large. Fig. 4 gives an example of how varies with increasing for the approximation of functions and accurate to . The segmentation is used for , while is used for . In both cases there is a single minimum, occurring at 9 bits for and 5 bits for . Algorithm 2 gives the pseudo-code of the hierarchical segmentation method. As noted earlier, four input parameters are required: the input interval , the ulp of the input, the polynomial degree to be used for the splines, and the desired maximum absolute error at the output. The notation concatenates matrix to matrix (e.g., line 42). First, the appropriate segmentation hierarchy is found (line 4 to line 6). Next, hierarchical segmentation is performed while searching for the optimal that minimizes . Initially , which corresponds to uniform segmentation.
Algorithm 1 Select Best Hierarchy Scheme
is then incremented until the segmentation that gives the minimum number of segments is found (line 8 to line 65). The core of the algorithm lies in the for-loop from line 19 to line 51. For each segment in the outer segmentation, the Chebyshev coefficients for the approximating polynomial of appropriate degree are computed. If the approximation error is too high, the number of segments in the inner segmentation is incremented by successive powers of two until the of all inner segments are less or equal to the required error . This process is performed for all outer segments. The final output is the total number of inner segments , the vector containing the segment boundaries, ROM0 which is needed for ROM1 address computation, and ROM1 which holds the polynomial coefficients for each segment. ROM0 stores the and the offset corresponding to each first level segment. The offset is simply the number of rows in ROM1 prior to the row in ROM1 corresponding to the current first level segment. Table II shows the contents of ROM0 for the example in Fig. 3 . 
Algorithm 2 Hierarchical Segmentation Method
C. Segmentation Experiments
Applying the methods described above to the six functions gives , and , respectively. Fig. 5 illustrates the segmentations when Algorithm 2 is applied to the six functions using degree-2 splines and an error requirement of . In this example the number of levels is fixed at two and is used for the inner segmentation for all cases. The six functions require 19, 28, 64, 80, 6, and 7 segments, respectively. The optimal that lead to the minimal are found to be 9, 10, 11, 5, and 6 bits, respectively. The importance of selecting the right scheme for a given function is apparent if is used for instead of . While results in 19 segments, requires 63 segments to deliver the same precision. It can be seen that the hierarchical segmentation closely resembles the balanced error segmentation shown in Fig. 6 . The main difference is the expense of increased . Table III shows a comparison of the number of segments for uniform, hierarchical, and balanced error segmentation for several different error requirements. The balanced error results are obtained from the binary splitting algorithm [4] . It is clear that hierarchical segmentation greatly reduces when compared to uniform segmentation, particularly for highly nonlinear functions or stringent precision requirements. For all of the functions, the hierarchical segmentation requires only modestly more segments than the balanced error approach, and significantly fewer segments than uniform segmentation. In other words, it results in memory requirements that are relatively similar to what is needed in balanced error segmentation, while also delivering significant hardware savings as discussed earlier.
Interestingly, notable improvements are also achieved for relatively linear functions such as and . The approximation of the natural logarithm has been widely studied in the literature, mostly using uniform segmentation. Fig. 7 shows the variation of number of segments with error requirement for uniform and hierarchical segmentation for . Due to the rigid nature of uniform segmentation (i.e., in which the total number of segments must be a power of two), the curve exhibits step discontinuities. However, hierarchical segmentation is more flexible, giving a smoother curve. Depending on the error requirement, can be reduced by over a factor of two relative to uniform segmentation. Since is over . For example, under the error requirement of used in Fig. 5 , is uniformly segmented. For tighter error requirements, nonuniform segmentation becomes superior.
V. HARDWARE ARCHITECTURE Fig. 8 shows the hardware architecture for evaluating functions segmented with the hierarchical segmentation method. The unit performs the address decoding step [ (5)- (7)] and the computation (8)- (11). If , the unit is bypassed. The bit selection unit selects the appropriate bits for , and from the input in conjunction with ROM0. Let denote the set of consecutive bits from the th to th bit of from its LSB. For , the bit selection unit selects , and . A single barrel shifter is required for the selection of and . For other hierarchy schemes, the selection process is more complex, since the bit locations of and can overlap as shown in the example of Table I . For these cases, two barrel shifters are used. The bit selection unit is illustrated in Fig. 9 .
As described in Algorithm 2, ROM1 contains the polynomial coefficients to each segment. The depth of ROM0 is defined in (2)-(4), and the depth of ROM1 is the total number of segments. The size of the two ROMs are defined as follows:
In practice, ROM0 will be significantly smaller than ROM1, since its depth is bounded by which is generally small, and the entries and offset are also small. Horner's rule is used for the evaluation of the polynomials in the following form: (20) where is the input, is the degree, and are the polynomial coefficients. Alternative polynomial evaluation techniques, such as the single multiplication degree-2 method in [10] can potentially be used as well.
Since and are implicitly known for a given segment, is used instead of for the polynomial arithmetic to reduce the size of the operators. Let denote the set of bits corresponding to and , and denote the bit-width of this set.
is scaled occupy the range . If , this involves masking out the bits corresponding to , and shifting by to the left. One problem is the fact that the Chebyshev coefficients have originally been generated under the assumption that will be used for the polynomial evaluation. Let us consider a segment with a degree-2 spline. is given by (21) Rearranging the equation gives (22) Representing a degree-2 polynomial equation with the following coefficient labels: (23) in combination with (22) gives (24) Examining the second, first, and zeroth order terms, the newly transformed polynomial coefficients are (25) (26) (27) Note that the additions in (26) and (27) could lead to cancellation errors in floating-point arithmetic when the precisions are very high. In such cases it is desirable to use a multi-precision library.
Once the segmentation and coefficient generation have been completed, the next challenge is the determination of the signal bit-widths in the arithmetic data-path. Insufficient bit-widths can cause to overflows and error requirement violations, while excessive bit-widths can result in waste of valuable hardware resources. For the architecture in Fig. 8 , bit-widths to and to need to be determined. We use an adaptation of the MiniBit technique, optimized for polynomial-based function evaluation [2] .
There are three main sources of errors when evaluating functions in digital arithmetic: 1) the inherent error due to approximating the function with polynomials; 2) quantization error due to finite precision effects incurred when evaluating the polynomials; and 3) the error of the final output rounding step, which can cause a maximum error of ulp. In the worst case, and will contribute additively, so to achieve faithful rounding, their sum must be less than ulp. We allocate a maximum of ulp for and the rest for , which provides a good balance between the two error sources and will be discussed further in Section VI (see Fig. 10 ).
Quantization is usually performed in two modes: truncation which can cause a maximum error of (1 ulp), and round-to-nearest which can cause a maximum error of ( ulp). Round-to-nearest must be performed at the output signal to achieve faithful rounding, but either rounding mode can be used for the internal signals. Since truncation results in Fig. 8 ).
VI. HARDWARE IMPLEMENTATION RESULTS
A Xilinx Virtex-4 XC4VLX100-12 FPGA is used for experimental implementation. Synplicity Synplify Pro 7.7.1 is used for synthesis, and Xilinx ISE 7.1.04i is used for placement and routing. Implementation results for degree-1 and degree-2 splines using Horner's rule are given. The designs are fully combinatorial with no pipeline registers. For the leading zero/one detectors required for the unit, the design described by Oklobdzija [24] is adopted.
The primary building block of Xilinx Virtex series FPGA is the "slice," which consists of two four-input lookup tables (LUTs), two registers and two multiplexors, and some additional circuitry such as carry logic and AND/XOR gates. The four input LUT can also be used as 16 1 RAM or a 16-bit shift register. Although the Virtex-4 FPGA contains hardwired dedicated RAMs and multiply-and-add blocks, we consider implementations based on slices only, in order to obtain unbiased area and delay comparisons. The "precision" in the results refers to total number of bits (sum of number of integer and fractional bits) at the output. Table IV shows the processing times of the automated hierarchical segmentation tool (see Fig. 1 ) on an Intel Pentium-4 3.4 GHz PC with 2 GB RAM for degree-2 approximations to . The segmentation step described in Algorithm 2 takes the largest portion of the processing time followed by the bit-width optimization and VHDL generation steps. The main bottleneck of the segmentation algorithm is the sequential search for the optimal . If the optimal is used initially for 24-bit precision for instance, the segmentation time is reduced to 57 s. Fig. 10 explores the area variations with allocation discussed in Section V for 16-bit degree-2 approximations to the six functions. When the allocation is close to zero, the number of segments are excessive leading to large areas. On the other hand, when allocation is close to 1/2, errors allocated for quantization effects are too small, resulting larger operator and coefficient table sizes. As mentioned in Section V, we have chosen 1/4 for the allocation which provides a good balance between the two error sources for the case studies concerned in this work. Figs. 11 and 12 examine area and delay behavior of degree-2 approximations. The top (grey), middle (white), and bottom (grey) parts of each bar indicate the area portion for address decoding (top part of Fig. 8 for computing the address for ROM1), ROM1, and polynomial evaluation respectively. The area results indicate that the portion for address decoding is generally small in all cases. The address decoding for is slightly larger than , because employ power of two based outer segmentations which are more complex to decode than the uniform based outer segmentations employed in . The portion for ROM1 increases with precision due to increasing number of required segments (as shown in Table III ). For a given precision, the ROM1 portion varies across functions due to different requirements and the degree of logic minimization performed during synthesis. The portion for polynomial evaluation increases with precision due to increased bit-width requirements in the data path. While an exponential increase in area with precision was observed in Fig. 11 , the results in Fig. 12 indicate that delay increases in a linear fashion with precision. Compared to the single multiplication degree-2 method by Detrey and de Dinechin [10] , our results for indicate similar area characteristics but notably slower speeds. This suggests that the performance of the degree-2 designs shown here can be further improved by utilizing optimized polynomial evaluation architectures over the standard Horner's method.
Figs. 13 and 14 show area and delay comparisons between degree-1 and degree-2 approximations to , and . The area for degree-1 approximations increase more rapidly than the area degree-2 approximations. For , precisions exceeding 13 bits, degree-2 splines start to become more cost effective in terms of area than the degree-1 splines. This intersection point occurs at 11 bits for and at 17 bits for . The delay exhibits a linear increase with precision for all cases due to the increasing internal bit-widths. Degree-1 approximations are faster than degree-2 approximations due to their shallower polynomial computation chains.
Table V compares the area and delay results between uniform and hierarchical segmentation for degree-2 approximations to and . When hierarchical segmentation is used, significant area savings are obtained at the expense of moderate increase in delay. The area savings occur due to the fewer numbers of segments leading to smaller coefficient tables. The delay penalties are caused by the coefficient address decoding circuitry. For function , precisions beyond 16 bits could not be mapped on to the FPGA due to the rapid area increase (mainly memory) of uniform segmentation. Fig. 15 presents area comparisons of uniform and hierarchical segmentations for approximating . Earlier, Fig. 7 and Table III indicated that even for relatively linear functions such as , hierarchical segmentation always leads to smaller number of segments (or memory requirements) than uniform segmentation. However, when the overall hardware area is examined, the results in Fig. 15 show that hierarchical segmentation does not necessarily result in the smallest area; though memory reductions are achieved for ROM1, there are area overheads associated with address decoding. For degree-1 approximations, uniform segmentation is always more area efficient for precisions up to 20 bits. Beyond 20 bits, the most area efficient method depends on the precision. With degree-2 approximations however, there are relatively few segments required for the precisions shown (up to 24 bits). Thus, the reduction in ROM1 achieved with hierarchical segmentation has little impact on the overall hardware area. At higher precisions however, we suspect there will be an intersection point (where hierarchical segmentation can be more area efficient), analogous to the degree-1 case.
For an 8-bit degree-1 approximation to over on a Xilinx Virtex-II XC2V4000-6 device, the balanced error approach by Sasao et al. [19] results in a delay of 67 ns, while the hierarchical segmentation results in 33 ns. We expect that the delay gap between the two methods will be wider for approximations with larger due to the increased burden on the address decoding process. In other words, hierarchical segmentation leads to significant reduction in delay over the balanced error approach with modest increase in (the behavior of is similar to that of and in Table III ). It is also instructive to compare memory requirements between the balanced error approach in [18] and hierarchical segmentation. The two main sources of memory utilization are the tables for coefficient address computation and coefficient storage. In hierarchical segmentation the coefficient table (ROM1) is dominant and memory for coefficient address computation (ROM0) accounts for only a small fraction of the overall memory utilization. By contrast, in the balanced error approach the large number of cascaded LUTs that are used for address computation can account for over 90% of the total memory utilization [18] , with the remaining memory used for the coefficient tables. Thus, while the balanced error method requires fewer segments and thus less memory for coefficients, these savings can be more than consumed by the memory for the address computation.
These issues are illustrated in Table VI with specific reference to the functions over and over . Memory requirements for address computation and coefficient storage are given for hierarchical segmentation. For balanced error, the results presented in [20] for the sum of address computation and coefficient memory are incorporated. The table does not specify the contribution of the address computation memory to the overall memory for balanced error as this information is not directly available in [20] , though, as noted above, the cascaded LUTs for coefficient address computation in the balanced error implementation can consume significantly more memory than the corresponding structure in the hierarchical approach. It should also be noted that the memory utilization for balanced error method could likely be improved through the use of more memory-efficient address computation strategies.
VII. CONCLUSION
We have presented an efficient method for evaluating functions via splines with a hierarchical segmentation scheme. The use of segmentation hierarchies involving uniform splines and splines with size varying by powers of two enables efficient, high precision coverage of highly nonlinear function regions. The methodology is automated given user-specified precision requirements and choice of approximation method. The hierarchical segmentation method results in significant reduction in number of required spline segments for given precision requirements compared to the commonly-used uniform segmentation approach. Compared to balanced error segmentation, the method presented here gives significantly reduced delay and memory requirements. The bit-widths of the fixed-point coefficients and arithmetic operators are optimized via the MiniBit approach, enabling us to guarantee 1 ulp accuracy at the output Current and future work includes exploring the use of hierarchical segmentation for approximation methods other than piecewise polynomials, applying the proposed approach to a wide range of examples, and evaluating its effectiveness for various FPGA devices.
Wayne Luk (SM'06) received the M.A., M.Sc., and
