Abstract-High-speed hardware function generation using table look-up in ROM and high-speed multiplication is considered. The reduced interval of interest, [a, b], is split into several large partitions. Within each large partition the function f(x) is evaluated by piecewise polynomials of the same low degree whose coefficients are stored in ROM. Four basic architectures for such a scheme are considered. A nonlinear programming problem is solved for determining the optimal partitioning of the interval [a, b]. The objective function is the average number of multiplications, which takes into account the probability distribution r(x) = 1/(x ln β) for the mantissas of normalized floating-point numbers, where β is the radix of the number system. The constraint is the available number of ROM words. The particular case of f(x) = 1/x and β = 2 is considered in detail and results are presented, including an estimate of the number of ROM units required.
presenting an algorithm for performing division using multiplication as the basic operator, many schemes have been considered for implementing high-speed division using a high-speed multiplier [2], [3]. In addition, there have been numerous papers on array-like function generators. However, many of these require a different array for each type of function implemented and thus make inefficient use of resources. Considerable work has also been done in the area of digit-by-digit (serial) hardware function generation [4]-[6]. Until recently, sequential approaches were competitive with parallel methods because parallel multipliers were expensive [O(n^2) gates]. As a result, fully parallel multipliers were not available and the time required for sequential evaluation was comparable to that used for multiplication, O(n) [7], [8], since the parallel schemes, using Taylor or Chebyshev polynomials, required several multiplications. However, this situation has changed with the advent of high density integrated circuits which make compact, high-speed [O(log n)] parallel multipliers feasible [9]-[16]. High-speed parallel function generation algorithms using table look-up have been considered previously [2], [3], [17]-[20], but the high cost of storage has traditionally limited this approach. Again, the advances toward VLSI circuits have eliminated this problem by making large, low-cost read-only memories (ROM's) available. It is thus quite proper to consider methods for high-speed parallel function generation based on functional approximation and high-speed multiplication.

0018-9340/83/0200-0147$01.00 © 1983 IEEE
The method considered here uses piecewise polynomial approximation with high-speed parallel hardware multiplication as the basic operator. The multiplier may be dedicated to the function generator but need not be. The particular function generated depends on the coefficients stored in ROM. Normalized floating-point numbers with mantissas distributed according to r(x) = 1/(x ln β) (where β is the radix of the number system) are assumed and functions of a single variable are considered. Coefficient address generation is also implemented in ROM where necessary.
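As a rough illustration of the look-up-and-multiply idea, the following Python sketch evaluates f(x) = 1/x on [0.5, 1) from a small coefficient table. The number of leading bits, the 24-bit mantissa width, the use of chord (endpoint) interpolation for the coefficients, and the names `build_table` and `reciprocal` are all assumptions made for this example; the coefficients in the paper follow from its own error criterion, which is not reproduced here.

```python
L_BITS = 7            # assumed: leading mantissa bits examined (the first is always 1)
N_BITS = 24           # assumed mantissa width, for illustration only
H = 2.0 ** -L_BITS    # segment width: 2**(L_BITS - 1) equal segments cover [0.5, 1)

def build_table(f):
    """Emulate the coefficient ROM: one (c0, c1) pair per segment."""
    table = []
    for addr in range(1 << (L_BITS - 1)):
        x0 = 0.5 + addr * H
        c0 = f(x0)
        c1 = (f(x0 + H) - c0) / H   # slope of the chord through the segment endpoints
        table.append((c0, c1))
    return table

ROM = build_table(lambda x: 1.0 / x)

def reciprocal(x):
    """Approximate 1/x for 0.5 <= x < 1 with one look-up and one multiplication."""
    mant = int(x * (1 << N_BITS))                  # N_BITS-bit mantissa, leading bit set
    # Keep the L_BITS leading bits, then drop the leading 1 to form the address.
    addr = (mant >> (N_BITS - L_BITS)) & ((1 << (L_BITS - 1)) - 1)
    x0 = 0.5 + addr * H
    c0, c1 = ROM[addr]
    return c0 + c1 * (x - x0)                      # degree 1: a single multiply

print(reciprocal(0.75), 1.0 / 0.75)
```

With these assumed parameters the table holds 64 coefficient pairs; halving the segment width would double the table and cut the interpolation error by roughly a factor of four, which is the storage side of the tradeoff discussed in the text.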
There is clearly a tradeoff between speed (the number of multiplications required as determined by the approximating polynomial degree) and the amount of ROM used. The next section considers this tradeoff in general. In Section III, the four architectures that arise from this approach to function generation are considered. Section IV presents the method for solving the speed-memory tradeoff problem. Section V gives the solution for f(x) = 1/x and β = 2. Section VI presents a method for approximating the number of ROM's required for two of the architectures and Section VII provides conclusions.
II. GENERAL METHOD
The problem to be solved is to achieve the highest average speed using a given amount of ROM for a given precision [e.g., evaluate f(x) with error O(β^-n)]. In the extreme case no multiplications are required and the function value is looked up directly, but this is usually impractical for high precision evaluations. The amount of ROM can be reduced by increasing the degree of the approximating polynomial, but the number of multiplications then increases and the average speed decreases. There are two ways, however, by which the amount of ROM can be decreased without paying a penalty in speed or accuracy, and these are discussed below. First, the error criterion and the approximating polynomial must be specified.
The method discussed here assumes that the function f(x) is defined in a suitable interval [a, b] where x is normalized. Previous methods [3], [19] ... This requires that more coefficients be stored but allows lower degree polynomials to be used. It is also more difficult to determine the address of the coefficients in cases where the intervals I_i are of varying size. This will be discussed later. With speed as one of the goals here, such a segmented approximation will be used. The increase in storage is not a significant penalty when current or anticipated ROM's are used.
Finally, in computing the speed we will consider the average number of multiplications that are necessary. This being the case, it is possible to take advantage of the logarithmic distribution of the leading digits in floating-point numbers [23], [24]:

r(x) = 1/(x ln β),   1/β ≤ x < 1

where β is the radix of the number system. This implies that mantissas close to 1/β are more likely to occur than mantissas close to 1. If arguments near 1/β are more probable, the use of a lower degree polynomial near 1/β than near 1 is then, on the average, more efficient. By taking advantage of this fact the average speed of evaluation can be effectively increased by a judicious choice of the ROM coefficients.

Within a partition, the breakpoint spacing may be uniform or nonuniform; this is discussed in the next section.
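As a worked instance of the logarithmic distribution above, the probability that a mantissa falls in a subinterval [c, d) follows by direct integration. The endpoints c and d are notation introduced only for this illustration, and β = 2 is used because it is the case the paper treats in detail.

```latex
P(c \le x < d) = \int_c^d \frac{dx}{x \ln \beta} = \log_\beta \frac{d}{c},
\qquad
\beta = 2:\quad
P\!\left(\tfrac12 \le x < \tfrac34\right) = \log_2 \tfrac32 \approx 0.585,
\quad
P\!\left(\tfrac34 \le x < 1\right) = \log_2 \tfrac43 \approx 0.415 .
```

In radix 2 the lower half of [0.5, 1) therefore receives roughly 58 percent of the arguments, which is why the cost of evaluation near 1/β weighs more heavily in the average.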
It is now necessary to choose the joint locations which will give the best average speed for a given amount of ROM, a given set of polynomial degrees for each partition, and a given error tolerance E_0. Consideration of these choices leads to four possible architectures and these are discussed next.
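The quantity being optimized can be sketched in a few lines of Python. Under the density r(x), a partition [X_j, X_{j+1}) is hit with probability log_β(X_{j+1}/X_j), and the sketch assumes that a degree-m polynomial costs m multiplications. The function name `average_multiplications` and the example joints are hypothetical, chosen only to show the effect of the distribution.

```python
import math

def average_multiplications(joints, degrees, radix=2):
    """Expected multiplications per evaluation for mantissas with density 1/(x ln radix).

    joints  -- X_1 < X_2 < ... < X_{s+1}, with X_1 = 1/radix and X_{s+1} = 1
    degrees -- polynomial degree m_j used on [X_j, X_{j+1})
    """
    if len(joints) != len(degrees) + 1:
        raise ValueError("need exactly one more joint than degrees")
    total = 0.0
    for lo, hi, m in zip(joints, joints[1:], degrees):
        prob = math.log(hi / lo, radix)   # integral of r(x) over [lo, hi)
        total += m * prob                 # assumed: degree m costs m multiplications
    return total

equal_width = [0.5, 2.0 / 3.0, 5.0 / 6.0, 1.0]
low_first = average_multiplications(equal_width, [1, 2, 3])
high_first = average_multiplications(equal_width, [3, 2, 1])
print(low_first, high_first)
```

For these equal-width partitions, placing the low degree near 0.5 gives the smaller average. The sketch ignores ROM cost entirely, which is the other half of the tradeoff.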
III. THE FOUR ARCHITECTURES
As mentioned previously, it is possible to use polynomials of uniform degree m over the entire interval, or the interval can be partitioned such that variable degrees m_j are used. It is also possible to have uniform breakpoint spacing h or variable breakpoint spacing h_i. In the variable breakpoint spacing case, h_i is determined by the criterion C; h_i is allowed to be as large as possible and still satisfy C. The variation of h_i over [a, b] depends on the behavior of the (m_i + 1)th derivative of f. By always using the maximum spacing the number of breakpoints is minimized and the amount of ROM is reduced. The complexity of the coefficient address generation is increased, however, as discussed next.
The two choices above lead to four architectures:
1) EI, ED-equal intervals, equal degrees
2) VI, ED-variable intervals, equal degrees
3) EI, VD-equal intervals, variable degrees
4) VI, VD-variable intervals, variable degrees.
The basic method for function evaluation using any of these architectures is shown in Fig. 2.
A. EI, ED (Equal Intervals, Equal Degrees)
By examining the l most significant bits of the (normalized) argument x, coefficient address generation is simple and direct for this case. In addition, since in radix 2 the most significant bit is a one, only l - 1 bits need be considered. The EI, ED scheme is shown in Fig. 3. The degree is, of course, the same ...
B. VI, ED (Variable Intervals, Equal Degrees)
A ROM is then needed to map a given argument to the proper segment of the coefficient ROM. This is shown in Fig. 4. The degrees of the polynomials are still equal in this case.
C. EI, VD (Equal Intervals, Variable Degrees)
In this case [a, b] is split up into partitions and different degree polynomials are used in each partition but, within each partition, h_j is constant (see Fig. 5). For example, in [0.5, 1], degree 1 may be used near 0.5, degree 2 in the middle, and degree 3 near 1. The average number of multiplications depends on the probability distribution of the argument. Different ROM banks are required for the different partitions and the multiplier will be used m_j times for degree m_j. This scheme is shown in Fig. 5. A small ROM is used to select the ROM bank and control the number of multiplications. This is, perhaps, the most appealing approach because it uses less ROM
... (4)
A method for determining N(X_j, X_{j+1}) is given in [25].
Equations (3) and (4) ... To use any of the above methods it is necessary to determine the optimal partition giving the minimum average number of multiplications subject to a limit on the number of ROM words, R_0, and the error limit, E_0.
IV. SOLVING THE NLP PROBLEM
The problem is to determine the ... where X = (X_1, ..., X_{s+1})^T, X_1 = 0.5, and X_{s+1} = 1.

In order to obtain a solution to (5) it is necessary to specify E_0, the number of partitions s, the degrees, and the types of polynomials. Each solution to (5) ... with polynomials of degree 1, 2, and 3. Fig. 7 shows a plot of the objective function versus joint position when there is no constraint. Since X_3 > X_2, only the portion of the X_2 X_3 plane below the 45° line is relevant. Fig. 7 clearly shows how the average number of multiplications decreases as X_2 and/or X_3 are moved toward 1 and a larger and larger range of the function is approximated by degree one polynomials. Fig. 8 shows a corresponding rapid increase in the amount of ROM needed as the range of the degree one approximation increases. This is so because the breakpoints must be very close together for the degree one approximation. Note that the ROM increases more rapidly with increases in X_2 than it does with increases in X_3, since increasing X_2 controls degree one polynomials whereas increasing X_3 controls primarily degree two polynomials. Fig. 9 ... minimum moves toward larger values of X_2 and X_3 as R_0 is increased until, as in Fig. 7, the constraint is no longer active since R_0 is large enough to allow a degree one polynomial over the entire range [0.5, 1].

The optimization program was run for various combinations of R_0, partitions, and degrees assuming a VI, VD architecture and using the rms error criterion. Extensive results for two and three partitions and degrees ranging from 1 to 5 are given in [25]. The two partition results for degrees (1, 2) to (3, 4) and the three partition results for degrees from (1, 2, 3) to (1, 4, 5) are presented here in Figs. 12 and 13. The figures clearly show the tradeoff between storage and speed. For sufficiently large R_0, the lowest degree polynomial may be used over the entire interval.
Each point on the optimal curves was obtained as a solution to the NLP problem as described above.
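The shape of this constrained problem can be sketched for the simplest case: two partitions with degrees (1, 2) and f(x) = 1/x. The sketch below is a coarse grid search, not the paper's optimization program. It assumes a Chebyshev interpolation error bound in place of the rms criterion, assumes (m + 1) ROM words per segment, and ignores any address-mapping ROM. The names `rom_words` and `best_joint` are hypothetical.

```python
import math

def rom_words(x_lo, x_hi, m, e0):
    """Assumed ROM model for f(x) = 1/x: (m + 1) words per segment on [x_lo, x_hi].

    Segment widths are chosen greedily from the Chebyshev interpolation bound
    2 * (h / 4)**(m + 1) * max|f^(m+1)| / (m + 1)! <= e0, where the derivative
    maximum (m + 1)! / x**(m + 2) is taken at the left end of each segment.
    """
    x, segments = x_lo, 0
    while x < x_hi:
        h = 4.0 * (e0 * x ** (m + 2) / 2.0) ** (1.0 / (m + 1))
        x += h
        segments += 1
    return segments * (m + 1)

def best_joint(r0, e0, degrees=(1, 2), steps=200):
    """Grid search over the single joint X_2 of a two-partition scheme.

    Returns (average multiplications, X_2, ROM words), or None if no joint
    on the grid fits within r0 words.
    """
    m1, m2 = degrees
    best = None
    for k in range(steps + 1):
        x2 = 0.5 + 0.5 * k / steps
        words = rom_words(0.5, x2, m1, e0) + rom_words(x2, 1.0, m2, e0)
        if words > r0:
            continue                      # violates the ROM constraint
        avg = m1 * math.log2(x2 / 0.5) + m2 * math.log2(1.0 / x2)
        if best is None or avg < best[0]:
            best = (avg, x2, words)
    return best

for r0 in (60, 100, 1000):
    print(r0, best_joint(r0, e0=2.0 ** -16))
```

In this sketch the chosen joint moves toward 1 as the ROM budget grows, and once the budget is large enough the degree-one polynomial covers the whole interval and the average reaches one multiplication, which mirrors the behavior described for Figs. 7 through 13.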
In some cases the optimal curves cross for some value of R_0, such as (1, 2, 5) and (1, 3, 4) in Fig. 13. In other cases some curves are systematically below others. It is easy to generalize to a rule: "Implementation (i_1, j_1, k_1) is always faster or as fast for any R_0 than implementation (i_2, j_2, k_2) if i_1 < i_2, j_1 < j_2, and k_1 < k_2; or i_1 = i_2, j_1 < j_2, and k_1 < k_2; or i_1 = i_2, j_1 = j_2, and k_1 < k_2."
These relations define a partial ordering among all implementations (i, j, k) such that i < j < k. This partial ordering is illustrated by the directed graph of Figs. 12 and 13. Similar rules can be inferred for other functions. In some cases the designer may wish to use one implementation over part of the interval and another over the remainder. It is even possible to use several implementations but the control becomes increasingly complex.
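The ordering rule quoted above can be written as a small predicate. This is a sketch that follows the rule as it reads in the text, with strict inequalities; the equality between the middle degrees in the third clause is an assumption, and the name `always_as_fast` is hypothetical.

```python
def always_as_fast(a, b):
    """True if implementation a = (i1, j1, k1) is never slower than b = (i2, j2, k2)
    for any ROM budget, following the rule as quoted (strict inequalities).
    """
    (i1, j1, k1), (i2, j2, k2) = a, b
    return ((i1 < i2 and j1 < j2 and k1 < k2) or
            (i1 == i2 and j1 < j2 and k1 < k2) or
            (i1 == i2 and j1 == j2 and k1 < k2))   # assumed: j1 == j2 in this clause

print(always_as_fast((1, 2, 3), (1, 3, 4)))
print(always_as_fast((1, 2, 5), (1, 3, 4)), always_as_fast((1, 3, 4), (1, 2, 5)))
```

The pair (1, 2, 5) and (1, 3, 4) is ordered in neither direction by this predicate, which agrees with the observation that their optimal curves cross.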
The above results assume infinite resolution for h_i when, in reality, the breakpoints must be spaced such that h_i is a multiple of a negative power of 2. This has the effect of displacing the minima in the minimization curves toward more multiplications. Similarly, ...

The overall design procedure then is to first choose the desired architecture and error tolerance E_0. ... and that if (n + 2)/m log n = k is kept constant, R_EI,ED ... for k constant. This shows that for the most basic (EI, ED) case a maximum of O(n^2) IC's are needed by using polynomials of degree (n + 2)/k for an n-bit accuracy function evaluation, or O(n^(2+k)/log n) IC's by using polynomials of degree (n + 2)/log n.
B. VI, ED Case
Here each h_i must be computed by solving (7) recursively and the function N_VI,ED(a, x) so derived is illustrated in Fig. 15.
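The recursive construction of the breakpoints can be sketched as follows for f(x) = 1/x with degree-one segments. Equation (7) is not reproduced here, so the sketch substitutes an assumed error bound, h^2 max|f''|/16 <= E_0 on each segment, with the maximum of |f''| = 2/x^3 taken at the left end. The function names are hypothetical.

```python
import math

def vi_breakpoints(a, b, e0):
    """Greedy variable-interval breakpoints for degree-one segments of f(x) = 1/x."""
    points = [a]
    x = a
    while x < b:
        h = math.sqrt(8.0 * e0 * x ** 3)   # from h**2 * (2 / x**3) / 16 <= e0
        x = min(x + h, b)                  # the next breakpoint depends on the previous one
        points.append(x)
    return points

def ei_segments(a, b, e0):
    """Equal-interval count: the smallest (worst-case) spacing is used everywhere."""
    h = math.sqrt(8.0 * e0 * a ** 3)
    return math.ceil((b - a) / h)

e0 = 2.0 ** -20
n_vi = len(vi_breakpoints(0.5, 1.0, e0)) - 1
n_ei = ei_segments(0.5, 1.0, e0)
print(n_vi, n_ei, n_vi / n_ei)
```

Under this assumed bound the continuous limit of the ratio of segment counts is 2 - sqrt(2), about 0.586, so the variable-interval scheme needs noticeably fewer coefficient entries than the equal-interval one, at the price of an address-mapping ROM.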
An approximation of the derivative of N(a, x) can be obtained by noticing that N(a, x_{i+1}) - N(a, x_i) = ΔN ... For f(x) = 1/x between 0.5 and 1. P_m is a Chebyshev polynomial of degree 1. The ratio of the number of breakpoints in the VI, ED case to the number of breakpoints in the EI, ED case is then N_VI,ED(a, b) ... and is also the ratio of the number of required ROM circuits.
Notice that the EI, ED case has been improved by a multiplicative factor (in the limiting case). Equation (8) was evaluated for f(x) = 1/x using Chebyshev polynomials of degree one and compared to the computed value. The agreement is good for h < 2^-6 and the result is asymptotically a 40 percent saving in ROM by using nonuniform intervals. The computed ratios using (8) are given in Table I.

VII. CONCLUSIONS
A hardware function generation method using low-degree polynomial approximation and high-speed multiplication has been described. The scheme is based on coefficient look-up in ROM where the amount of ROM required has been greatly reduced in comparison to a direct look-up approach. The method takes advantage of the logarithmic distribution of normalized floating-point numbers and the behavior of the function being approximated. Four basic architectures result and these were described. A design procedure is outlined and the specific case of f(x) = 1/x (division) was considered in detail. The methods described are suitable for single chip, high-speed, hardware function generators. Fig. 16 shows a block diagram for such a chip. Several functions sharing the same address look-up scheme and the same multiplier can be generated on a single chip.