Abstract-A new algorithm is presented which significantly reduces the minimum amount of logic required to calculate sine, cosine, and square root. It is derived from an old method for computing certain inverse functions which was once considered for use in software, but then abandoned because of efficiency concerns. However, when reversed and combined with a restoring square root algorithm, a unique new design emerges which performs trigonometric calculations without the use of pre-stored constants or any internal operation more complex than binary subtraction. The design has been implemented with common TTL MSI logic and is found to be reasonably fast, very accurate, and to require considerably less hardware than a comparable CORDIC algorithm.
I. INTRODUCTION
Because of the speed demands of real-time microprocessor applications, designers often find that arithmetic computation must be moved from a software to a hardware implementation. Unfortunately, the numerical algorithms which have traditionally been used in software are not necessarily the most efficient when implemented in hardware. For example, using the Chebyshev series expansion to calculate trigonometric functions requires floating-point multiplication, argument pre-processing, and pre-stored constants, each of which adds a considerable amount of silicon to any hardware design [1].
In an effort to simplify such systems, certain "digit-by-digit" algorithms have been developed, many of which require no internal operation more complex than shifting and adding. While the convergence of these algorithms is generally only linear (with word size), each iteration is simple and may be performed very rapidly.
The CORDIC (Coordinate Rotation DIgital Computer) technique is one such method which has been found to be very useful for evaluating trigonometric functions [2], [3]. As with all methods, however, it has certain drawbacks which tend to limit its effectiveness. In particular, the fact that each iteration requires the previous result to be shifted a variable number of times means that either a considerable amount of time is spent clocking a shift register [4], or a large shifting network must be available [5]. Furthermore, storage must typically be provided for microcode and for pre-computed constants (one for each bit in the result).
In this paper we will develop a new method for calculating sine, cosine, and square root. When compared with a CORDIC implementation, it will be shown to be reasonably fast, very accurate, and to require substantially less hardware. Furthermore, it will require no pre-stored constants or microcode, and no internal operation more complex than binary subtraction. 
Manuscript received September

III. A NEW ALGORITHM
The arccosine algorithm can be reversed to produce a new algorithm for cosine using the bitwise exclusive-OR function $\oplus$.
Step 1: $Z_{n+1} := 1.0$
Step 2: for $i := n$ downto 0 do $Z_i := S_i\sqrt{(1 + Z_{i+1})/2}$ (with sign $S_i = \pm 1$)
Step 3: $Y := Z_0$
As an example, find $\cos(0.6875\pi)$ with $n = 4$. Clearly the most difficult step in this process is taking the square root at each iteration. But if we consider a system which already requires hardware for computing square roots, then adding this extra trigonometric functionality requires very little additional circuitry (as shown in Fig. 1).
The usefulness of this design now hinges on finding a fast and simple root evaluator, and on resolving accuracy concerns.
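The iteration can be tried out in software before committing to hardware. The following is a minimal floating-point sketch, not the circuit itself. It uses the recurrence $Z_i = S_i\sqrt{(1 + Z_{i+1})/2}$ quoted in Section V, and it assumes that the sign $S_i$ is negative exactly when adjacent angle bits differ, which is one reading of how the exclusive-OR is applied (it follows from the half-angle identity). The function name `cosine_by_half_angles` and the bit-list input format are illustrative choices.

```python
import math

def cosine_by_half_angles(angle_bits):
    """Return cos(pi * X) for X = 0.b1 b2 ... bn (binary), given [b1, ..., bn]."""
    n = len(angle_bits)
    # Pad with the leading zero (x_0) and a trailing zero (x_{n+1}).
    x = [0] + list(angle_bits) + [0]
    z = 1.0  # Step 1: Z_{n+1} := 1.0
    for i in range(n, -1, -1):  # Step 2: i = n downto 0
        # Assumed sign rule: S_i = -1 when bits x_i and x_{i+1} differ.
        sign = -1.0 if x[i] ^ x[i + 1] else 1.0
        z = sign * math.sqrt((1.0 + z) / 2.0)
    return z  # Step 3: Y := Z_0

# The example from the text: X = 0.1011 (binary) = 0.6875, n = 4.
y = cosine_by_half_angles([1, 0, 1, 1])
print(y, math.cos(0.6875 * math.pi))
```

Under this sign rule the sketch agrees with a library cosine for the example angle, which suggests the reading is at least consistent with the recurrence.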
IV. SQUARE ROOT
The algorithm proposed for evaluating the square root is a binary version of the well known but often overlooked longhand method. In decimal this is a rather tedious process which somewhat resembles restoring division, except that two digits are taken from the radicand for each single digit produced in the result. Fortunately, the binary version of the algorithm is very straightforward since each new bit in the result (0 or 1) is chosen by performing a simple comparison. The full algorithm is given below, and an example showing the longhand binary process for calculating $\sqrt{169}$ ($= 13$) appears in 
Algorithm [ROOT]
[Input]
$P$ : the radicand ($0 \le P < 1$)
[Output]
$Q_n$ : the first $n$ bits of the value $2^n\sqrt{P}$ in binary
[Algorithm]
Step 1: $R_0 := P$ ($R_i$ is the partial remainder)
$Q_0 := 0$
Step 2: for $i := 1$ to $n$ do begin
A detailed description and proof of this method appear in [12]. It has also been implemented in assembly language [13] and as a parallel TTL gate array [14]. In addition, it is a member of the class of algorithms originally described by Morrison [6] and Wensley [7], and it can be grouped with a class of "direct methods" which has been found to be better suited to realization in hardware than is the traditional Newton-Raphson method [15].
This process can be replicated in hardware using just three shift registers and a subtractor (Fig. 3).
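The longhand binary process can be sketched in a few lines. This is the textbook bit-by-bit square root, given as an illustration and not necessarily step for step the procedure of Algorithm ROOT. It assumes an integer radicand of $2n$ bits (that is, $2^{2n}P$ rather than the fraction $P$), and the name `restoring_sqrt` is hypothetical. A hardware version would subtract first and restore on a negative result; comparing before subtracting is equivalent.

```python
def restoring_sqrt(radicand, n):
    """Return (root, remainder) for a 2n-bit integer radicand, one root bit per step."""
    root = 0
    remainder = 0
    for i in range(n - 1, -1, -1):
        # Bring down the next two radicand bits, most significant pair first.
        pair = (radicand >> (2 * i)) & 0b11
        remainder = (remainder << 2) | pair
        trial = (root << 2) | 1   # 4 * root + 1
        root <<= 1
        if remainder >= trial:    # a simple comparison selects the new bit
            remainder -= trial
            root |= 1
    return root, remainder

# 169 = 10101001 (binary), processed as four bit pairs.
q, r = restoring_sqrt(169, 4)
print(q, r)
```

At every step the invariant root squared plus remainder equals the radicand bits consumed so far, which is why the remainder is available for free when the loop ends.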
V. ACCURACY
The square root circuit presented here exhibits no inaccuracy, in that it generates the first $n$ bits of the true result, where $n$ is limited by register and subtractor widths. (In this design the subtractor must be $n + 2$ bits, the remainder $n + 1$ bits, and the result register $n$ bits wide.)
Unfortunately, the cosine algorithm is not as well behaved. The problem occurs when the angle is such that there are many $S_i = +1$ in a row. This causes $Z_i = S_i\sqrt{(1 + Z_{i+1})/2}$ to rapidly approach 0.1111... = 1.0, after which point any further iterations are likely to produce meaningless values. In such cases, the internal precision may need to be as high as $2n$ bits for an $n$-bit angle and cosine. Therefore, for the circuit of Fig. 1 to behave consistently, the square root will have to accept and produce up to twice the number of bits originally anticipated. Fortunately, the design we have chosen admits a simple and effective solution to this problem.
As the square root diagram of Fig. 3 implies, two bits of the radicand are shifted out of the input register for every one bit shifted into the result register. Thus, the input word is exhausted halfway through the process, and the remaining $n$ bits which fill out the radicand are zeros. We can take advantage of this fact by doubling the size of the input register so that an $n$-bit root is now produced from a $2n$-bit radicand with no speed penalty! Extending the precision of the output of the square root is even easier, though less apparent. In the critical cases where values approach 0.1111... the needed extra bits of precision are not actually lost. In fact, they are immediately available in the remainder register! As demonstrated in the Appendix, the closer $Q$ is to 1.0, the more accurate the remainder can be in completing a double length output word. (In other cases the remainder does not make an accurate extension of the result, though it is somewhat better than simply appending zero bits.) Fig. 4 shows how the cosine interfaces with the square root to take advantage of the increased accuracy.
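The claim about the remainder can be checked numerically. The sketch below assumes the relation $Q_n^2 + D = 2^{2n}P$ from the Appendix and uses the first-order expansion $\sqrt{Q^2 + D} \approx Q + D/(2Q)$, which reduces to $Q + D/2^{n+1}$ when $Q$ is close to $2^n$. The bit alignment of the appended remainder, the helper names, and the two test radicands are assumptions made for illustration; `math.isqrt` stands in for the square root circuit.

```python
import math

def extended_root(p_int, n):
    """Approximate floor(2**(2n) * sqrt(P)), P = p_int / 2**(2n), by appending the remainder."""
    q = math.isqrt(p_int)        # n-bit root, as the circuit would produce
    d = p_int - q * q            # remainder, at most n + 1 bits
    return (q << n) + (d >> 1)   # assumed alignment: remainder weighted by 2**-(n+1)

def error_in_lsbs(p_int, n):
    """Error of the extended word, in units of the double-length LSB."""
    exact = math.isqrt(p_int << (2 * n))  # true 2n-bit root
    return abs(exact - extended_root(p_int, n))

n = 8
near_one = error_in_lsbs(0xFFF0, n)   # radicand close to 1.0
mid_range = error_in_lsbs(0x4100, n)  # radicand close to 0.25
print(near_one, mid_range)
```

For the radicand near 1.0 the extended word is essentially exact, while for the smaller radicand the extension is much less accurate, in line with the parenthetical remark above.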
VI. OPTIMIZATION
Several additional improvements may now be made. First, rather than using a parallel negator as in Figs. 1 and 4, two exclusive-OR gates placed at the serial outputs of the square root input register perform an effective one's complement negation. Unfortunately, this leaves only the absolute value of the result available once the iterations are complete. But we can modify the algorithm to calculate
$Y = \cos(\pi X_n/2)$
where $0 \le X_n < 1$ so that $0 \le Y < 1$ also.
Then range reduction and sign assignment are left to software or external circuitry rather than occupying bits now used for precision. This modification is effected by performing one additional cosine iteration, since $\cos(\pi X_n/2) = \sqrt{(1 + \cos(\pi X_n))/2}$. A change in the initialization procedure can reduce the total number of required steps. The four possible permutations of the first two iterations of the COSINE algorithm are as follows: series initialization value during the loading cycle, according to the lowest order bits of the angle input $X_n$.
It is also possible to calculate $\sin(\pi X_n/2)$. At the beginning of the last cosine iteration, the square root input registers contain $\cos^2(\pi X_n/2)$.
If this is inverted to make the one's complement $1 - \cos^2(\pi X_n/2)$, then the final result is $\sin(\pi X_n/2) = \sqrt{1 - \cos^2(\pi X_n/2)}$. Replacing the leading zero of the angle input with a SINE/(COSINE) signal has the desired effect, controlling the polarity of the final SIGN as it comes out of the exclusive-OR gate on the angle register. Finally, it is often desirable to compute both sine and cosine simultaneously, as when preparing for $\tan(x) = \sin(x)/\cos(x)$. Simply adding two inverters in a feedback loop around the root input registers allows the cofunction to be generated in a single extra step. This is due to the fact that the final radicand $\cos^2(\pi X_n/2)$ is replaced with $1 - \cos^2(\pi X_n/2) = \sin^2(\pi X_n/2)$ (or $\sin^2(\pi X_n/2)$ with $1 - \sin^2(\pi X_n/2) = \cos^2(\pi X_n/2)$) as these values shift out and around. Then executing a square root once more without reloading the radicand produces the cofunction very quickly.
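The cofunction step amounts to inverting every bit of the radicand and taking one more square root. A rough fixed-point sketch follows, assuming a 32-bit radicand and using `math.isqrt` in place of the root hardware. The function name `cofunction`, the width `bits`, and the sample angle are hypothetical. Note that the one's complement yields $1 - c - 2^{-\text{bits}}$ rather than exactly $1 - c$, an offset of one least significant bit.

```python
import math

def cofunction(radicand, bits):
    """Given cos^2 (or sin^2) as a `bits`-wide fixed-point fraction, return the cofunction root."""
    mask = (1 << bits) - 1
    complemented = radicand ^ mask    # one's complement of every radicand bit
    return math.isqrt(complemented)   # root carries bits // 2 fractional bits

bits = 32
angle = 0.3  # hypothetical X_n; the functions are of pi * X_n / 2
cos_sq = int(math.cos(math.pi * angle / 2) ** 2 * (1 << bits))
sine = cofunction(cos_sq, bits) / (1 << (bits // 2))
print(sine, math.sin(math.pi * angle / 2))
```

Applying `cofunction` a second time to the squared sine would return the cosine again, mirroring the feedback loop described above.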
These optimizations are brought together in Fig. 5 . Further implementation details can be found in [16] .
VII. BENCHMARKS
The square root, sine and cosine evaluator has been constructed using common 74F series TTL (SSI and MSI) circuits. In order to demonstrate simplicity, only 14 and 16 pin DIP IC's were used. No ROM's were employed. For a word size of 16 bits the chip count is 39, which includes a simple microprocessor interface. The entire assembly fits neatly on a 3.5 in x 7.5 in board. Maximum clock speed is approximately 25 MHz, and since loading and iterating require 2+16=18 cycles for square root and 2+16*(1+16)=274 cycles for sine or cosine, computation times are 720 ns and 11 μs, respectively.
Calculating the cofunction of an angle requires an additional 1+16=17 cycles=680 ns. The total error is at most 1.3 in $2^{16}$, and the square root exhibits no error beyond word size restrictions.
The CORDIC implementation presented in [4] was built with similar technology and may be useful for comparison. Though originally designed for coordinate rotation, it can be modified to calculate sine and cosine by rotating the unit vector (1,0). Then the ROM's used to divide the CORDIC magnification factor K [17] out of the result become unnecessary if the vector inputs are fixed at (1/K,0).
Such a circuit would compute sine and cosine simultaneously using 12-bit words in 4.25 μs, and would require in excess of 50 chips (the majority of which have 20 or 24 pins) on a 5 in x 10 in board.
(A measure of accuracy can be found in an older CORDIC design which places it at 2 in $2^{15}$ [18], [10].) Square root evaluation is not supported by this circuit.
If the precision of the new root-based design were also reduced to 12 bits, its computation times would fall to 6.3 μs for either the sine or cosine (2+12*(1+12)=158 cycles) and an additional 520 ns for the cofunction (1+12=13 cycles). Note that the change in speed with word size is somewhat similar for both implementations ($O(n^2)$).
(The CORDIC method requires an increasing number of shifts in each step.)
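The timing figures follow directly from the cycle counts and the clock rate. The short sketch below reproduces that arithmetic, assuming exactly 25 MHz (the text gives this only as an approximate maximum). The function names are illustrative, and the cycle formulas are those quoted in this section.

```python
CLOCK_HZ = 25e6  # approximate maximum clock speed quoted in the text

def sqrt_cycles(n):
    return 2 + n              # 2 + n, as quoted for square root

def trig_cycles(n):
    return 2 + n * (1 + n)    # 2 + n*(1 + n), as quoted for sine or cosine

def cofunction_cycles(n):
    return 1 + n              # 1 + n, as quoted for the cofunction

def nanoseconds(cycles, clock_hz=CLOCK_HZ):
    return cycles / clock_hz * 1e9

for n in (16, 12):
    print(n,
          sqrt_cycles(n), round(nanoseconds(sqrt_cycles(n))),
          trig_cycles(n), round(nanoseconds(trig_cycles(n))),
          cofunction_cycles(n), round(nanoseconds(cofunction_cycles(n))))
```

At 40 ns per cycle, 274 cycles come to 10.96 μs and 158 cycles to 6.32 μs, matching the rounded values of 11 μs and 6.3 μs given above.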
Based on these limited comparisons, the CORDIC circuit appears to be about 50% faster for sine and cosine, but it does not perform square roots, requires substantially more hardware, and is slightly less accurate.
APPENDIX: SQUARE ROOT ACCURACY
Demonstrate that appending the square root remainder can add precision to the result when the radicand is near 1.0:
Let $P$ be the radicand with up to $2n$ bits of precision ($0 \le P < 1$).
The algorithm produces result $Q_n$ and remainder $D$ of size $n$ and $n + 1$ bits, respectively, where $Q_n^2 + D = 2^{2n}P$.
