Abstract-A monolithic processor computes products, quotients, and several common transcendental functions. The algorithms are based on the well-known principles of "CORDIC," but recourse to a subtle novel corollary results in a scale factor of unity. Compared to older machines, the overhead burden is significantly reduced. Also, expansion of the functional repertoire beyond the circular domain, i.e., addition to the menu of hyperbolic and linear operations, is a relatively trivial matter, in terms of both hardware cost and execution time. A bulk CMOS technology with conservative layout rules is used for the sake of high reliability, low power consumption, and good cycle time.
INTRODUCTION
Though the concept of CORDIC arithmetic is said to be quite old [1], [4], its implementations and applications continue to evolve. The acronym comes from Volder's Coordinate Rotation Digital Computer [1], developed in 1959 for air navigation and control instrumentation. An avuncular idea, particularly effective in decimal radix computations, was presented by Meggitt in 1962 [2], under the label of "pseudodivision and pseudomultiplication." In 1971, Walther [3] elegantly generalized the mathematics of CORDIC, showing that the implementation of a wide range of transcendental functions can be fully represented by a single set of iterative equations. At about the same time, Cochran [4] benchmarked various algorithms and found that CORDIC techniques surpass alternative methods in scientific calculator applications.
The pertinent effort of the Naval Ocean Systems Center (NOSC) culminates in the CORDIC Arithmetic Processor Chip (CAP Chip) of Fig. 1 that simplifies the architecture, boosts the speed, and reduces the power consumption of monolithic arithmetic modules. All computations are based on the execution of either
$x_{i+1} = x_i \pm y_i 2^{-i}$ (1a)
or
$x_{(i+1),2} = (1 \pm 2^{-i})\, x_{(i+1),1}$. (1b)
While the first of these equations represents the regular CORDIC iterations, the second [5], [6] forces the scale factors of circular and hyperbolic functions to unity. ROM instructions govern the selection of either (1a) or (1b), but the ± option is executed by the sign bit of one of the operands.
This paper begins with multiplication and division, because the pertinent algorithm compares well, in its own right, with alternative techniques, especially in digital filter applications [7], [8], [17]. Moreover, the said algorithm is simple and transparent enough to project the feedback principle as the fundamental and common link of the CORDIC [1], [3]
Perform the following iterations:
$y_{i+1} = y_i + x_0 \delta_i 2^{-i} = y_0 + x_0 \sum_{j=1}^{i} \delta_j 2^{-j}$, for $i = 1$ through $n$. (3b)
A partial flow diagram of the above operation is given in Fig. 2(a), and a few steps of the $z_i$ iteration are developed in Fig. 2(b). Note that $|z_{i+1}| < 2^{-i}$, (5) although the magnitude of $z_{i+1}$ [...] Equations (6) and (7) demonstrate that the digital feedback algorithm, defined by (2)-(4), leads to practical implementations of functions germane to multiplication and division [3].
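The digital feedback iterations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule for choosing $\delta_i$ is not legible in the surviving text, so the sketch borrows the usual linear-mode CORDIC convention (drive $z$ toward zero for multiplication, drive $y$ toward zero for division), and the function names `feedback_multiply` and `feedback_divide` are hypothetical. Floating-point arithmetic stands in for the chip's 24-bit fixed-point words.

```python
def sign(v):
    """Return +1 for non-negative values and -1 otherwise (assumed convention)."""
    return 1 if v >= 0 else -1


def feedback_multiply(x0, z0, y0=0.0, n=24):
    """Approximate y0 + x0*z0 for |z0| < 1 by driving z toward zero."""
    y, z = y0, z0
    for i in range(1, n + 1):
        d = sign(z)                 # delta_i follows the sign of z_i
        y += d * x0 * 2.0 ** -i     # y_{i+1} = y_i + x0 * delta_i * 2^-i
        z -= d * 2.0 ** -i          # z_{i+1} = z_i - delta_i * 2^-i
    return y


def feedback_divide(y0, x0, z0=0.0, n=24):
    """Approximate z0 + y0/x0 for |y0/x0| < 1 by driving y toward zero."""
    y, z = y0, z0
    for i in range(1, n + 1):
        d = -sign(y) * sign(x0)     # pick delta_i so that |y| shrinks
        y += d * x0 * 2.0 ** -i     # same y recursion as in multiplication
        z -= d * 2.0 ** -i          # z accumulates the quotient bits
    return z
```

Both routines share the same two recursions and differ only in which variable supplies the sign, which is the sense in which the feedback principle links multiplication and division. With $n = 24$ the residual error is on the order of $2^{-24}$.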
Compared to alternative techniques [9] , digital feedback looks good in division and, as we shall soon see, it becomes even more attractive when circular and hyperbolic functions are considered. The 24-bit data are processed in two 12-bit steps. The lower byte of the word held by the accumulator is released into the adder-subtractor by the local clock, an intermediate step of addition or subtraction is performed, and the result is returned to the accumulator. The upper byte is subjected to similar treatment, beginning with the release of data that now includes the carry generated by the lower byte, and terminating with the acceptance of the result by the accumulator.
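The two-step processing of a 24-bit word can be mimicked as follows. This is a minimal sketch, not the chip's circuit: the function name `add24_two_step` is hypothetical, and the way subtraction is obtained (one's complement plus a carry-in of one) is a standard two's complement technique assumed here, since the text does not spell it out.

```python
HALF_BITS = 12
HALF_MASK = (1 << HALF_BITS) - 1
WORD_MASK = (1 << (2 * HALF_BITS)) - 1


def add24_two_step(a, b, subtract=False):
    """Add or subtract two 24-bit two's complement words in two 12-bit steps."""
    if subtract:
        b = ~b & WORD_MASK          # one's complement; the +1 enters as carry-in
    carry = 1 if subtract else 0

    # Step 1: lower halves, producing a carry for the upper halves.
    low = (a & HALF_MASK) + (b & HALF_MASK) + carry
    carry = low >> HALF_BITS

    # Step 2: upper halves plus the carry generated by the lower halves.
    high = ((a >> HALF_BITS) & HALF_MASK) + ((b >> HALF_BITS) & HALF_MASK) + carry

    # Reassemble the 24-bit result; overflow out of the top bit is dropped.
    return ((high & HALF_MASK) << HALF_BITS) | (low & HALF_MASK)
```

For example, `add24_two_step(0x000FFF, 0x000001)` carries out of the lower half and yields `0x001000`.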
The scaler takes up a large part of the chip's surface and a sizable fraction of the cycle time. This is both understandable and acceptable considering its function, namely, the two's complement multiplication of every $\delta_i$ and every $x_0\delta_i$ by $2^{-i}$.
The scaler is indeed the centerpiece of CORDIC hardware; the present implementation is distinctly faster than its shift-register counterparts. The circuitry is really quite simple, owing to the highly efficacious transmission gates of the CMOS technology [12]. A matrix of such gates, arranged as shown in Fig. 3, propagates the sign bit while it shifts the data by "i" bits. The signal flow matrix of the scaler is square (Fig. 4) with 24 columns for bit locations and as many rows for cycle numbers. However, since there is some redundancy in Fig. 4, the physical matrix need only be half as large as its model, and that is why we have in Fig. 3 [...] quadrant, while outputs B select one out of the 12 pertinent columns. Only regular CORDIC cycles are counted by the sequencer. Clock signals which pace the "double cycle" and the "scale factor" operations are inhibited by status bits output by the instruction ROM (Fig. 10).
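Functionally, the scaler performs a sign-propagating (arithmetic) right shift. The sketch below models that behavior on a 24-bit two's complement word; it says nothing about the transmission-gate matrix itself, and the name `scale` is hypothetical.

```python
WIDTH = 24
MASK = (1 << WIDTH) - 1
SIGN_BIT = 1 << (WIDTH - 1)


def scale(word, i):
    """Multiply a 24-bit two's complement word by 2^-i, propagating the sign bit."""
    # Reinterpret the raw bit pattern as a signed Python integer.
    signed = word - (1 << WIDTH) if word & SIGN_BIT else word
    # Python's >> on negative integers is an arithmetic shift, so the
    # sign is replicated into the vacated high-order positions.
    return (signed >> i) & MASK
```

A negative word keeps its leading ones after the shift: `scale(0x800000, 1)` gives `0xC00000`, while the positive word `0x400000` shifted by two places gives `0x100000`.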
The adder has a configuration which resembles conventional look-ahead logic, but its circuitry is unique. Selected fragments of our "dynamic CMOS" circuits are shown in Fig. 6 [...] Fig. 6(b). 10 000 µm² of surface area are consumed if metal-gate bulk-CMOS with 8 µm spacing is employed.
The flow diagram of the generalized instruction cycle is given in Fig. 7. Equations (10) and (11) represent the "double cycle" and the "scaling factor K" operations, respectively. The raison d'être of these operations is explained below.
Let us focus our attention on the variable "i" in (9) [...] Prominent among the functions used in servo control is the resolver operation defined by (17) [10]. The search for "solid-state" resolver hardware is lively and likely to continue for some time to come. Mathematically, however, one deals with the old and commonplace rotation of axes depicted in Fig. 8. When a pair of rectangular axes is rotated anticlockwise by an angle $\theta$, then the coordinates of a point P transform [...] (9) and (20). These discrepancies will now be eliminated as much as possible, for the sake of hardware simplicity.
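First off, one can factorize $\cos \delta_i\theta_i$ in (20a) and (20b):
$x_{i+1} = \cos \delta_i\theta_i \,(x_i + y_i \tan \delta_i\theta_i)$ (21a)
$= \cos \theta_i \,(x_i + \delta_i y_i \tan \theta_i)$ (21b)
and
$y_{i+1} = \cos \theta_i \,(y_i - \delta_i x_i \tan \theta_i)$.
Next, one can make the arbitrary, but highly convenient, substitution
$\theta_i = \arctan(2^{-i})$
in order to arrive at
$x_{i+1} = \cos \theta_i \,(x_i + \delta_i 2^{-i} y_i)$
$y_{i+1} = \cos \theta_i \,(y_i - \delta_i 2^{-i} x_i)$
[...] $\delta_i \arctan(2^{-i})$.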
Since it depends on "n" only, the scale factor $K_n$ is a machine constant. Consequently, given x' and y', one can realize x* and y* by many simple methods, including ROM look-up tables and regular combinatorial logic, but the scaling factor K technique, spelled out in (24) and Fig. 7, is particularly attractive because it offers a host of advantages such as speed, real estate economy, and conceptual simplicity.
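The role of the machine constant can be illustrated with a short floating-point sketch of the circular rotation. Several assumptions are made and should be read as such: the iterations run from $i = 1$ to $n$ with $\theta_i = \arctan(2^{-i})$, the $\cos \theta_i$ factors are omitted inside the loop, $\delta_i$ is taken from the sign of the residual angle, and the accumulated factor is restored at the end by one ordinary multiplication. That last step is a stand-in for, not a model of, the chip's scaling factor K cycles. The names `cordic_rotate` and `k` are hypothetical.

```python
import math


def cordic_rotate(x, y, theta, n=24):
    """Rotate the axes anticlockwise by theta (|theta| below about 0.95 rad)."""
    z = theta
    for i in range(1, n + 1):
        d = 1 if z >= 0 else -1          # delta_i from the residual angle
        t = 2.0 ** -i                    # tan(theta_i) = 2^-i
        x, y = x + d * t * y, y - d * t * x
        z -= d * math.atan(t)            # residual angle shrinks toward zero

    # Product of the cos(theta_i) factors left out of the loop above.
    # It depends on n only, so a real machine would store it as a constant.
    k = 1.0
    for i in range(1, n + 1):
        k *= math.cos(math.atan(2.0 ** -i))
    return k * x, k * y
```

Rotating the axes by 0.5 rad with the point at (1, 0) should give approximately (cos 0.5, -sin 0.5), which is the anticlockwise axis rotation described in the text. Larger angles would first need the range extension discussed under INITIALIZATION.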
INITIALIZATION
While the inverse tangent of $2^{-1}$ is only 26°, the processor must accommodate angles as large as ±180°. This does not present any great difficulty, but for the sake of compatibility with other functions, it is convenient to implement the range extension in two special "initialization" cycles (Fig. 9). The first shifts $\theta$ by 90°: [...] Figs. 10 and 11 show a complete set-up for the execution of rotation and vectoring operations in the linear, circular, and hyperbolic modes. There are six modules, namely, three CAP chips, a 16 × 512 ROM, an ROM ADDRESS counter, a clock, and an I/O box. The I/O box loads the data into the processor and returns the results to the bus; it also sets the two most significant bits of the ROM address. These two bits (A9 and A8) select one of the three sectors of the ROM beginning at address zero for linear operations, address 128 for circular functions, and 256 for hyperbolics. The counter, which generates the other 7 address bits, is first reset and then allowed to advance one count per clock cycle. The operation of the CAP chips is under the control of the instruction section of the ROM: status bits from the ROM cause the execution of either a regular CORDIC cycle, or a "double" cycle, or a scaling factor K operation; they also signal arrival at the "last" cycle, that is, completion of the computation. The sign bit of either y or z controls the add/subtract options of all three chips.
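The ROM addressing scheme amounts to a sector base plus a 7-bit step count. The following sketch shows that arithmetic only; the dictionary keys, the name `rom_address`, and the range check are illustrative assumptions and do not describe the actual wiring of the address lines.

```python
# Sector start addresses as quoted in the text.
SECTOR_BASE = {"linear": 0, "circular": 128, "hyperbolic": 256}

COUNTER_BITS = 7  # the counter supplies the low-order 7 address bits


def rom_address(mode, count):
    """Combine the sector selected by the I/O box with the counter value."""
    if not 0 <= count < (1 << COUNTER_BITS):
        raise ValueError("counter value does not fit in 7 bits")
    return SECTOR_BASE[mode] + count
```

For instance, the sixth instruction of the circular routine (count 5) would be fetched from address 133 under these assumptions.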
The external instruction which activates the processor must include three function selection bits: one for either rotation or vectoring and two for either linear or trigonometric or hyperbolic functions. These three bits take care of the two decisions which start off the signal flow diagram of Fig. 11. Once the function to be executed has been identified, the operation of the calculator is paced along by the local clock. Linear processing is purely "CORDIC," but the trigonometric routines add scaling factor K cycles to the menu, while hyperbolic algorithms use both the double cycle and scaling factor K supplements. The execution time of circular functions is slightly longer than that of linear functions, and the execution time of hyperbolic functions is longer still, but the implementation of all three classes of functions is equally simple. Simplicity, of both architecture and circuitry, may indeed be the most striking and important feature of the CAP chip implementation of the CORDIC concept. Where reliability is at a premium, nothing scores higher than well-founded simplicity.
CONCLUSION
Whereas the performance of a chip depends on its architecture, circuit design, and processing, one may want to separate processing from the other two factors when attempting to assess the quality of a device. It is understood that performance is always technology limited. A faster technology will invariably bring about higher speed and, possibly, reduce power dissipation at the same time. For a given circuit schematic, conversion from conservatively laid out metal-gate CMOS to tightly spaced poly-gate SOS will produce spectacular improvement. For that reason, speed alone is hardly a satisfactory measure of circuit design quality; to compare different embodiments of an idea, one must speak of minimum cycle times, expressed in multiples (n) of "typical" gate delays.
The propagation delay of a gate sums up the quality of the technology, while "n" gives an estimate of the combined quality of the architecture and the circuit design. Naturally, one needs a definition of the "typical" gate. Physical dimensions present no difficulty (one simply picks a "minimum size" device), but the typical configuration may be open to dispute. We use an inverter with a fan-out of three. SPICE analysis [15] of the CAP chip suggests n = 13 as the minimum cycle time of a two-byte (24-bit) operation. The "gate" delay is roughly 100 ns. More than half of the overall delay is attributable to the $2^{-i}$ scaler. This is understandable, considering the size of the structure in Fig. 3. Next in order of nuisance ratings comes the carry circuit C12, whose layout is shown in Fig. 6(b). Taken together, the scaler and the carry determine, just about, the effective "n" of the system. Further improvements in "n" will have to come either from modifications of these elements, or from conversion to single-byte operation. The former approach must await inventive contributions, but the latter is feasible right now. The entire system can be implemented in single-byte format by recourse to three-micron layout rules; an "n" of 7.5 can thus be realized without changes in circuitry. Furthermore, even an early vintage edition of submicron CLOSED COSMOS [14] will accommodate a complete single-byte system on just one chip. What we have then, in addition to a system which executes transcendental functions in 40 µs, is a good candidate for submicron phototyping.
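As a rough consistency check on these figures, and assuming that the quoted execution time of about 40 µs refers to the two-byte configuration with n = 13 and a gate delay of roughly 100 ns, the cycle time and the implied number of cycles per function work out as follows:

$$t_{\text{cycle}} = n \cdot t_{\text{gate}} \approx 13 \times 100\ \text{ns} = 1.3\ \mu\text{s}, \qquad \frac{40\ \mu\text{s}}{1.3\ \mu\text{s}} \approx 31\ \text{cycles}.$$

A count of about 31 cycles is plausible for 24 regular iterations plus initialization, double cycle, and scaling factor K steps, but the exact breakdown is not given in the text and is not asserted here.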
