It is shown how a relatively simple device can evaluate exponentials, logarithms, ratios and square roots for fraction arguments, employing only shifts, adds, high-speed table lookups, and bit counting. The scheme is based on the cotransformation of a number pair ( x , y ) such that the F ( x , y ) = f ( x o ) is invariant; when x is driven towards a known value xu, y is driven towards the result. For an N-bit fraction about N / 4 iterations are required, each involving two or three adds; then a termination algorithm, based on an add and an abbreviated multiply, completes the process, for a total cost of about one conventional multiply time. Convergence, errors and simulation using APL are discussed.
Introduction
The present paper shows a hardware technique to evaluate we", w + In x , wlx and wlx;, where w and x are given numbers. Each function is evaluated in about one conventional multiply time, using a unified apparatus based on adds, shifts, bit counting and the use of a table of
In (1 + 2-"). This technique is the subject of a recently issued patent [ 1 1.
The scheme is based on the iterative cotransformation of a number pair (x,>>) such that a bivariate function F(x,y) remains invariant. In doing so, x is directed towards xw, where the corresponding yw is the desired result. When x is sufficiently close to x", a termination algorithm using an add and an abbreviated multiply completes the evaluation. The iteration aims to remove the leading one bit in the fraction lxk -xwl, and zero bits are automatically ignored.
Our divide algorithm resembles the IBM System/360 Model 9 1 scheme [2], which cotransforms the dividend and divisor by limited multiplications, until the divisor is close to unity. For In x and e", the use of decimal digitby-digit schemes dates back to Briggs three centuries ago [3]. Meggitt's unified study of pseudo division and pseudo multiplication methods [4] includes the mechanization of Brigg's scheme, square roots and trigonometric functions, each costing about three conventional multiplication times. A recent study by Sarkar and Krishnamurthy [5] showed that a further improvement in speed is possible. 
TIEN CHI CHEN
Walther [6] recently showed a unifying algorithm containing the CORDIC schemes first proposed by Volder. The elementary functions treated include ratios and square roots; included also are exponentials and logarithms, as inferred from hyperbolic functions. The scheme is based on stepwise rotations in any of three coordinate systems (circular, linear, and hyperbolic) and usually employs shift sequences that are ascending integers. For hyperbolic functions a basic shift sequence needs to be invoked repeatedly.
The unified automatic evaluation of logarithms and exponentials of binary numbers has been considered in 1962 by Cantor, Estrin, and Turn [7] , within the context of the "fixed-plus-variable" binary computer design. The method requires rather large tables (at least several hundred words for each function) to transform the argument into the vicinity of 1 and 0, respectively.
Specker [8] suggested the use of a table of In( 1 + 2-") for the evaluation of In x and e", and commented on the possible advantage of hardware implementation. The use of a given table entry here depends on the outcome of a comparison against /xk -x" 1.
Very recently, de Lugish [9] discussed the unified mechanization for multiply, divide, In x , e", square root and trigonometric functions. His technique is based on the redundancy recoding technique in fast division algorithms, and involves the systematic one-bit left shift of x or (1 -x) at every iteration. The shifted quantity, matched against two comparands, determines a value sk = -1, 0 or + l ; (1 + sk2-") is then used to transform (.xk, yk). There is no termination algorithm. T o guarantee convergence with the steady left shift of x, an initiation scheme is often invoked. The redundancy recoding does diminish the number of add-type operations (to about one for every three bit positions); and although one shift and two comparisons are required for every bit position, the de Lugish unified method can be quite advantageous when the add time is the overriding cost factor.
In contrast to the other techniques, the present method relies heavily on automatic bit-counting to avoid redundant actions. It also uses a terminal linear approximation which speeds up the computation and, incidentally, halves both the rounding errors and the table requirements.
Theory
Our algorithm to evaluate z,, =f(x,) is based conceptually on the introduction of a parameter y to form a two-variable function F(x,y) such that 1) There is a known starting value y = yo with F(x,,,y,,)
2 ) There are convenient, known mechanisms to cotransform the number pair (xk,yk), k 3 0, into (x, + l,y, + leaving F invariant, namely F(x, + ,,yk + = F(x,,y,).
3) The sequence of x-transformations tends to a known goal x = x , ; the corresponding y-transformations then tends to y = yo. F is so defined that F(x,,y,) = y,.
4) The iterative part of the algorithm is essentially the repeated application of the cotransformation procedure, carried out in the absence of detailed knowledge about the value 2,).
In terms of solid coordinate geometry, F is easily represented by a curve confined to the z = 2, plane and passing through P,,(x,,,y,,,z,,) as shown in Fig. 1 . We have
and the transformation invariance is merely the requirement that if the point P,(x,,y,,z,) is on the curve F , then so is Pk+I(xk+l,~k+l,zo). Clearly, then,
Requirement 3) means that F passes through the point Q(xwry,,z0). To ascertain this, we adopt the functional form The invariance, Eq. ( 2 ) , then guarantees that Q(x,,y,,z,) indeed lies on F , and
and the iterative transformations of the number pair ( x , y ) in 2) is just a methodology to move a point along the curve F , starting from P,,, through P , , P , , etc., towards Q. We shall limit the transformations to those based only on the current values of (x,y), thus x, + 1 = 4 (X,>Yn. 1 ( 5 4 and Yk + 1 = 4J (X,,Y,) .
(5b)
The choice of 4 is dictated mainly by cost and efficiency considerations. It is convenient to specify C#J to be a function of x, alone, operative for x, being in some closed interval [a,h] ; the x-transformation should yield an x,+, much closer to x, than x, is, and further should lie in the same interval, to facilitate further transformations.
Equation (5b) 
is a known function of (xk,yk), independent of the unknown z,. The point P,+, will be a step closer than P, to the destination
It will be seen in the next section that the actual choices of + and $ are quite straightforward. In any case, starting from (x,y) = (x,,,~,,), the repeated use of Eqs. (Sa) and (5b) should show a steady displacement along the curve F towards the point Q:
if the x-process converges to xw. The corresponding y , should then equal z,.
In reality, it may be difficult or impossible to reach xw exactly, and extrapolation by a Taylor series would be desirable, if F(x,y) is differentiable with respect to x near x, . Thus,
For an N-bit fraction, if \pi < 2FV, just the first term will usually be adequate. On the other hand, both the iteration cost and roundoff error can be halved by including the term linear in p, with 1pl 5 2-1VP2. Within this termination algorithm, the multiplication (if any) involves a half-precision number ( p ) as one of the operands, costing half of a standard multiply time; high-performance multipliers elsewhere in the same machine can now be exploited.
In what follows, we shall assume that linear extrapolation will be used when lpl < 2-,<i2 with N-N > > 1 .
Specific cases
The functions being considered here are ~w " , O 5 x < l n 2 = 0 . 6 9 3 1 4 . . . , with F = y e " , xw = 0 ; w + I n x , 1 / 2 5 x < I , w i t h F = y + I n x , x w = I ;
w/x for 1/2 5 x < 1,
Binary and hexadecimal floating-point arguments can easily be mapped onto the ranges given (the algorithms below actually work for 0 < x < I , but the convergence may be slower). We note that x i can be obtained by setting w = x in the inverse square root case.
TIEN CHI CHEN

Function:
F ( x , y ) = ye".
Termination:
Relative error:
z, = + lnx, = yo + lnx, = ( y , -I n a , ) + I n x , a , = y , + I n x , =~~~ = y , + I n ( l -p ) -y , -p .
Initiation:
y , = w .
Function:
F ( x , y ) = y + l n x .
Transformations:
Initiation:
y o = w .
Function:
Transformations: x,+, = x,a,, y,+] = y,a, .
Initiation:
Transformations: x,+l = x,ak2, ykil = y,a, .
Relativeerror:
Choice of ak
We select ak to be of the form ( 1 + 2-"), such that multiplications by a, can be replaced by a shift and an add. The value m is usually chosen as the position number* of the leading 1-bit in lxk -xwl ; but an increase by 1 is needed for the inverse square root.
1) For wes, m is the position number of the leading 1-bit in xk . We have thus X, = 2-" + p ,
Here p represents the bit pattern after the mth bit. The latter, among all bits in xk , is responsible for over half of the value of Ixk -xwl , and is the leading candidate for removal. With the help of a small table { T , = -In( 1 + 2-")} , we have
where the most objectionable bit is replaced by a secondorder disturbance, and p is largely intact. The y-transformation is done by a right shift of m places, and then by adding to the unshifted yk
The iteration requires a table lookup, a shift and two adds. With m varying from 1 to N/2 , one needs a total of N / 2 tabular entries; even if N = 60 , the table requirement is only 30 words.
2) For w + In x, m is selected to be the position number of the leading 1-bit in (1 -xk) ; here we have
and
3) For w/x, the x-transformation, hence the bit pattern preservation, is the same as in 2); the y-transformation is the same as in 1).
1
4)
For w/xZ, m is chosen as 1 + the position number of the leading 1-bit in (1 -xk) .
The y-transformation is the same as in 1), namely yk+l = Yk + 2-my, .
(1 3c)
The explicit algorithms are summarized in Table 1 .
Convergence and accuracy
Our simple choice uk = 1 + 2-" thus transforms the most significant 1-bit in /xj -xwl to zero, leaving the remaining bit pattern largely unaffected. The larger the value of m (i.e., the greater the position number of the most significant bit) the more negligible is the perturbation on p . Thus, the number of iterations required to convert the leading N 12 bits of x to zero is measured by the number of 1 bits in Ix, , -xwl .
Statistically, half of these N / 2 bits are zeros; hence, the average number of iterations is about N / 4 , and the average advance of m between two successive iterations is Am N 2 .
A convenient measure of convergence is the relative reduction of the distance to x" again the most objectionable bit is replaced by a secondorder disturbance.
Also,
The sequence {x,} converges if IrI 5 r,, < 1 for large k .
Using the results of the previous section, it is easy to using the same tabular values in 1).
show that 0 < r f r,, < 1 and thus the sequence con-*The kth bit after the binary point is said to have position number equaling k.
verges from the same side, with xk+l lying in (xk ,xo) .
This one-sided convergence is extremely desirable for the smooth applications of iterative transformations requiring {x,} to be suitably bounded. Here {x,} all lie in (x,, ,xw), the latter being a valid interval for the x-transformations.
From the previous section one easily obtains
Substituting the bounds 0 5 p < 2-" , and the average value ( p ) = 2-"" , we obtain
A detailed analysis in the Appendix shows that ( Y ) -1 -In 2 = 0.307. The truncation errors have been given in specific cases in the third section. Should these be deemed too large, an extra iteration could be taken. Better still, as these errors bear a definite sign, one could reduce roughly half of the maximum error by adding one bit to p. Thus,
, -.%2
For / E / 5 2-'"-' , N never needs to exceed N , and N = N will be assumed below.
A major source of inaccuracy is rounding error, which can be controlled by using longer intermediate results, rounded tabular values, rounded arithmetic, fewer arithmetic steps, or combinations of all of these. Using J guard bits in the fraction, the total absolute rounding error in computing y , is
where i = the number of iterations, j-2?-" = the rounding error due to each tabular or shifted operand, and k."'-''= the rounding error due to an add (or subtract) operation.
We have It turns out that i, expected to be about N / 4 , is never known to exceed N / 2 , and the choice J = 6 should enfor N 5 6 4 , using rounded operands and rounded operations, or for N 5 32 , using unrounded operands and unrounded operations. The counting of left-zeros being a regular activity in normalized floating-point arithmetic, the implementability of a "standard m-finder" is not an issue.
However, nonstandard integer choices, #I, may be desirable for added efficiency and/or algorithm simplicity. controls, one could persist in using q = 1 ; even for the unlikely event of p = 0 , where q = 0 would be an unusually good choice, with q = 1 rapid convergence is still guaranteed. A similar situation occurs for the inverse square root, with ( 1 + 4) in place of q.
A unified apparatus
The four iteration algorithms can be handled by one unified apparatus (see Fig. 2 ), consisting of an adder A, a shifting mechanism S, and an N/2-word fast memory M holding {-In ( algorithm.
Starting by putting xI, in X , w(= yo) in Y , the sequencing for the evaluations are:
wes":
wlx,,:
The apparatus invokes the termination algorithm by shipping C(X) , C ( Y ) away, to be processed by more conventional equipment. For details of the sequencing see Table 2 . If self-contained operation is desired, the iterations can be allowed to run until p 5 2"' . It would then be necessary to double the table size and add one more guard bit for accuracy. The adder, the shifter, and the sequential bit-counter are within the state of art. A small high-speed memory is needed for wes and (w + Inx) ; this is easily achieved using modem memory devices in an integrated circuit memory technology compatible with computation hardware. Also, this apparatus need not stand alone and can share equipment with other parts of the same system.
T o derive a rough timing estimate we equate memory access time with shift time, and note that a conventional multiplier takes time T = N shift-add times.
The expected N / 4 iterations involve 3 N / 4 shift-adds for the inverse square root and N / 2 for the other three functions; the time estimate is, therefore, 3T/4 and T / 2 , respectively. To these one has to add the timing cost of the terminal algorithm.
For logarithms the complement-add required adds little to the iteration cost. The total timing cost is still measured by T / 2 .
For exponentials, ratios, and inverse square roots a half-multiply is needed. If this is done with a conventional multiplier, the cost would again be T / 2 for a total cost of T (exponential, ratio) and 1.25T (inverse square root). However, if a powerful fast multiplier is invoked here, the half-multiply may cost little more time than an add.
APL simulation
The algorithms have been simulated by a nest of interlocking subroutines written in APL, which handles the bit-pattern manipulations with clarity and precision. The results of a run on the IBM APL terminal system is reported here. The results are summarized in Table 3 ; the number of iterations varies from three to ten, the mean being rather close to 6.3.
For two values of x o , the x-transformations are displayed in Fig. 3 . The automatic sequential bit-counting 386 mechanism is seen to be extremely worthwhile; it allows TIEN CHI CHEN (out of range) (out of range) (out of range) (out of range) (out of range) 6(4 x lo-') 7 (-2 x 1 0-') 7 (-2 x lo-") 5(-2 x 10-8)
both the skipping over of unimportant bits and, no less important, the use of the same m for more than one interation. In Fig. 3(c) , for example, we encounter a case with Am = 0 in one iteration, then Am = 8 in the next. Without the bit counting technique one would have to employ costly trial and error comparison schemes, and/or severe limitations on the arguments to guarantee Am 3 1 .
Conclusion
We have shown how the four functions of x can be evaluated by using shifts, adds, limited table look-ups, and sequential bit counting. The timing cost is low, being of the order of a conventional multiply time and sometimes less. The implementation by a unified piece of hardware is entirely within the state of the art. The possible drawbacks are a) the word-length incompatibility with the rest of the system because of the special guard bits required; and b) the variable timing. Short of actual implementation, the hardware can be emulated by a nest of interrelated microsubroutines for microprogrammable machines. The algorithms can also be written as conventional subroutines; but then the compactness of the table will not be a major advantage, and the manipulations may be difficult using conventional instructions.
Acknowledgments
The writer is indebted to John C. McPherson for suggesting the shift-add approach and for sharing his experience and insight on the problem. The author also benefited from discussions with colleagues, particularly Ralph A. Willoughby, Irving T. Ho, Shmuel Winograd, Gerald S. Shedler, and Andrew R. Heller.
Appendix. Convergence of the iterations
We shall study A m , the average advance of m , and give some detailed estimates for r .
Average advunce of m
We shall denote by Z(a) as the number of leading zero bits in a binary fraction a. Then
In terms of this notation we can write for all four algorithms
- ( Z ( P ) ) -( m -1 ) .
(A21
The average advance of m between two iterations is
is a step function of p : Hence we have for all 2m cases, assuming each to be equally probable: + 2-") .
The minimum convergence speed cannot be readily deduced from the upper bound of r , but x,/ (1 -x k ) monotonically increases with xk , and and r P I3/ 16 = 0.8 13 . As commented in the sixth section, "Nonstandard choices for m ," the range (1/4 , 4/9) can be mapped into (1/2 , 1 ) in one iteration, using rn = 1 . Using this device, the smallest x, becomes 1/2, henceforward r 5 7/ 16 .
