INTRODUCTION

THOUGH the concept of CORDIC Arithmetic is said to be quite old [1], [4], its implementations and applications continue to evolve. The acronym comes from Volder's Coordinate Rotation Digital Computer [1], developed in 1959 for air navigation and control instrumentation.
An avuncular idea, particularly effective in decimal radix computations, was presented by Meggitt in 1962 [2], under the label of "pseudodivision and pseudomultiplication." In 1971, Walther [3] elegantly generalized the mathematics of CORDIC, showing that the implementation of a wide range of transcendental functions can be fully represented by a single set of iterative equations.
Cochran [4] benchmarked, about the same time, various algorithms and found that CORDIC techniques surpass alternative methods in scientific calculator applications.
The pertinent effort of the Naval Ocean Systems Center (NOSC) culminates in the CORDIC Arithmetic Processor Chip (CAP Chip) of Fig. 1 that simplifies the architecture, boosts the speed, and reduces the power consumption of monolithic arithmetic modules. All computations are based on the execution of either 
While the first of these equations represents the regular CORDIC iterations, the second [5], [6] forces the scale factors of circular and hyperbolic functions to unity. ROM instructions govern the selection of either (1a) or (1b), but the ± option is executed by the sign bit of one of the operands. This paper begins with multiplication and division, because the pertinent algorithm compares well, in its own right, with alternative techniques, especially in digital filter applications [7], [8], [17]. Moreover, the said algorithm is simple and transparent enough to project the feedback principle as the fundamental and common link of the CORDIC techniques. Also, once established, it can be easily expanded to trigonometric and hyperbolic functions.

Manuscript received January 4, 1979; revised June 26, 1979. The authors are with the Microelectronic Circuit Design Branch, Naval Ocean Systems Center, San Diego, CA 92152.
A DIGITAL FEEDBACK LOOP
Take three numbers, $x_0$, $y_0$, and $z_0$, $z_0$ being restricted to the range $0 < z_0 < 1$.
Perform the following iterations:
and
To the $\delta_i$ operator assign the values of either plus one or minus one, depending on the polarity of $z_i$. In other words, let

$$\delta_i = \operatorname{sgn}(z_i).$$
A partial flow diagram of the above operation is given in Fig. 2(a), and a few steps of the $z_i$ iteration are developed in Fig. 2(b). Note that

$$|z_{i+1}| < 2^{-i}, \tag{5}$$

although the magnitude of $z_{i+1}$ is not necessarily smaller than that of $z_i$. The absolute value of z is gradually reduced towards zero, but the reduction may proceed in zigzag fashion. What we have here is an arrangement which amounts to an autonomous feedback loop of attractive simplicity. The zero-seeking mechanism of the loop is controlled entirely by the sign bit of z; the sign bit determines $\delta_i$ and that operator implements, in turn, the crucial add-subtract option of (3) and (4). Equation (5) implies that a sufficiently high i, say i = n, will justify the approximation

$$z_{n+1} = z' = 0 \tag{6a}$$

and will, therefore, lead to the expression

$$z_0 = \sum_{j=1}^{n} \delta_j 2^{-j} \tag{6b}$$

and hence, by substitution into (3b), to the product equation

$$y' = y_0 + x_0 z_0.$$
It goes without saying that we could have reduced to zero the operand y rather than z. Exercising that option, one arrives at
and, therefore, at the quotient equation
Equations (6) and (7) demonstrate that the digital feedback algorithm, defined by (2)-(4), leads to practical implementations of functions germane to multiplication and division [3]. Compared to alternative techniques [9], digital feedback looks good in division and, as we shall soon see, it becomes even more attractive when circular and hyperbolic functions are considered.
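The feedback loop described above can be illustrated with a short floating-point sketch. It assumes the iterations take the usual linear CORDIC form, $y_{i+1} = y_i + \delta_i x_0 2^{-i}$ and $z_{i+1} = z_i - \delta_i 2^{-i}$ for i = 1, ..., n, with the roles of y and z exchanged for division. The function names and the sign conventions are illustrative assumptions, not the chip's microcode.

```python
def feedback_multiply(x0, y0, z0, n=24):
    """Drive z toward zero; y then approaches y0 + x0*z0 (assumes |z0| < 1)."""
    y, z = y0, z0
    for i in range(1, n + 1):
        delta = 1 if z >= 0 else -1      # sign bit of z selects add or subtract
        y += delta * x0 * 2.0 ** -i      # accumulate the scaled multiplicand
        z -= delta * 2.0 ** -i           # feedback: push z toward zero
    return y, z


def feedback_divide(x0, y0, z0, n=24):
    """Drive y toward zero; z then approaches z0 + y0/x0 (assumes x0 > 0, |y0| < x0)."""
    y, z = y0, z0
    for i in range(1, n + 1):
        delta = 1 if y >= 0 else -1      # sign bit of y selects add or subtract
        y -= delta * x0 * 2.0 ** -i      # feedback: push y toward zero
        z += delta * 2.0 ** -i           # accumulate the quotient bits
    return z, y
```

With n = 24 the residual left in the zeroed operand is bounded by $2^{-24}$ (times $x_0$ in the division case), which is the sense in which (5) justifies the approximation (6a).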
THE CHIP
Block diagram particulars and layout details of the CAP chip are shown in Fig. 1(a) and (b), respectively.
There are but three major circuit blocks: the all-important $2^{-i}$ scaler [11], a 12-bit two's complement adder, and a 24-bit accumulator of the shift-register variety. The narrow block at the top of the chip is the "i" counter, called the "sequencer."
The I/O buffers are distributed around the periphery of the chip, but all multiplexers are merged with the appertaining functional blocks.
The 24-bit data are processed in two 12-bit steps. The lower byte of the word held by the accumulator is released into the adder-subtracter by the local clock, an intermediate step of addition or subtraction is performed, and the result is returned to the accumulator.
The upper byte is subjected to similar treatment, beginning with the release of data that now includes the carry generated by the lower byte, and terminating with the acceptance of the result by the accumulator.
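The two-step handling of a 24-bit word can be modeled behaviorally as follows. This is a sketch only: the names are hypothetical, and forming a subtraction as "complement plus carry-in" is an assumption about the adder-subtracter, which the text does not spell out.

```python
MASK12 = 0xFFF
MASK24 = 0xFFFFFF


def add24_two_step(a, b, subtract=False):
    """Add or subtract two 24-bit two's complement words in two 12-bit steps."""
    a &= MASK24
    b &= MASK24
    if subtract:
        b = ~b & MASK24                  # one's complement; the +1 enters as carry-in
    carry = 1 if subtract else 0
    # Step 1: lower 12-bit byte.
    lo = (a & MASK12) + (b & MASK12) + carry
    carry = lo >> 12                     # carry generated by the lower byte
    # Step 2: upper 12-bit byte, including the carry from step 1.
    hi = (a >> 12) + (b >> 12) + carry
    return ((hi & MASK12) << 12) | (lo & MASK12)


def to_signed24(word):
    """Interpret a 24-bit pattern as a signed integer."""
    return word - (1 << 24) if word & 0x800000 else word
```

For example, `to_signed24(add24_two_step(5, 7, subtract=True))` evaluates to -2, with the borrow resolved through the carry passed between the two byte steps.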
The scaler takes up a large part of the chip's surface and a sizable fraction of the cycle time. This is both understandable and acceptable considering its function, namely, the two's complement multiplication of every $\delta_i$ and every $x_i\delta_i$ by $2^{-i}$. The scaler is indeed the centerpiece of CORDIC hardware; the present implementation is distinctly faster than its shift-register counterparts.
The circuitry is really quite simple, owing to the highly efficacious transmission gates of the CMOS technology [12]. A matrix of such gates, arranged as shown in Fig. 3, propagates the sign bit while it shifts the data by "i" bits. The signal flow matrix of the scaler is square (Fig. 4) with 24 columns for bit locations and as many rows for cycle numbers. However, since there is some redundancy in Fig. 4, the physical matrix need only be half as large as its model, and that is why we have in Fig. 3 a matrix of 12 rows for 12 exponents and 24 columns for as many bits.
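Functionally, the scaler performs an arithmetic right shift: the word moves right by i places and the vacated high-order positions are filled with copies of the sign bit. The sketch below models that behavior only; the function name is hypothetical, and the single-step gate matrix of Fig. 3 is replaced by ordinary integer operations.

```python
def scale_by_2_pow_minus_i(word, i, width=24):
    """Arithmetic right shift of a width-bit two's complement word by i places."""
    if not 0 <= i < width:
        raise ValueError("shift count must satisfy 0 <= i < width")
    mask = (1 << width) - 1
    word &= mask
    sign = word >> (width - 1)           # most significant bit
    shifted = word >> i                  # logical shift: zeros enter at the top
    if sign:
        # Propagate the sign bit into the i vacated high-order positions.
        shifted |= (mask << (width - i)) & mask
    return shifted
```

As a check, the pattern 0xFFFFF8 (which is -8) shifted by two places gives 0xFFFFFE (which is -2), as expected of a multiplication by $2^{-2}$.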
The transmission gates are driven by a sequencer with outputs A, B, and C (Fig. 5). Output A enables either the upper or the lower byte and output C picks the first or the second quadrant, while outputs B select one out of the 12 pertinent columns. Only regular CORDIC cycles are counted by the sequencer. Clock signals which pace the "double cycle" and the "scale factor" operations are inhibited by status bits outputted by the instruction ROM (Fig. 10).
The adder has a configuration which resembles conventional look-ahead logic, but its circuitry is unique. Selected fragments of our "dynamic CMOS" circuits are shown in Fig. 6 ... area are most welcome in the CARRY module, which has a total of 98 ports in the C12 gate. The precharge clocks $\phi_1$ and $\phi_2$ are, of course, synchronized with the clock which controls the timing of the lower and upper byte add-subtract operations.
The adder gate logic utilizes a combination of a XOR-AND-OR element and a HALF-ADDER. Various gate configurations, including the conventional CMOS NOR and the Floating XOR [13], are employed, but the whole thing adds up to only 26 transistors. The chip layout for this section of logic is shown in Fig. 6(b). 10 000 μm² of surface area are consumed if metal-gate bulk CMOS with 8 μm spacing is employed.
GENERAL CORDIC EQUATIONS
Written in conventional format, the CORDIC equations look as follows:
The CAP chip executes the above and two supplementary sets of equations:
and (ha)
The flow diagram of the generalized instruction cycle is given in Fig. 7. Equations (10) and (11) represent the "double cycle" and the "scaling factor K" operations, respectively. The raison d'être of these operations is explained below. Let us focus our attention on the variable $\theta_i$ in (9). We have already come across the relationship

$$\theta_i = 2^{-i} \tag{12a}$$

and will yet tackle the functions

$$\theta_i = \arctan(2^{-i}) \tag{12b}$$

and

$$\theta_i = \operatorname{arctanh}(2^{-i}). \tag{12c}$$
Implied in (6), as well as in the concept of feedback itself, is the convergence relationship: 
This inequality is fulfilled by (12a) and (12b) 
This is why we need the "double pass" operation when hyperbolic functions are being processed. Tables I-III illustrate the point at issue for the specific case of n = 24, showing that 4, 7, 12, 13, 18, 19, and 21.
So much for the double pass capability. Simply put, some CORDIC operations are run twice, in order to comply with inequality (13).
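The point can be checked numerically. The sketch below assumes that the convergence requirement has the form usually quoted for CORDIC, $\theta_i - \sum_{j=i+1}^{n} \theta_j \le \theta_n$, which may differ in detail from inequality (13) as printed. The function name is hypothetical.

```python
import math


def fails_convergence(theta, n):
    """Return the indices i in 1..n-1 where theta(i) exceeds the sum of all
    later angles by more than theta(n), i.e. where the assumed condition fails."""
    t = [theta(i) for i in range(1, n + 1)]      # t[k] holds theta(k + 1)
    bad = []
    for i in range(1, n):
        residual = t[i - 1] - sum(t[i:])         # theta_i minus the later angles
        if residual > t[-1]:
            bad.append(i)
    return bad


linear = fails_convergence(lambda i: 2.0 ** -i, 12)
circular = fails_convergence(lambda i: math.atan(2.0 ** -i), 12)
hyperbolic = fails_convergence(lambda i: math.atanh(2.0 ** -i), 12)
```

Under this assumed form, `linear` and `circular` come out empty, while `hyperbolic` lists every index: each arctanh step is slightly too large for the remaining steps to cover, and repeating selected iterations makes up the shortfall.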
The supplementary operations called out in (11) force the scale factor K to converge toward unity. While the regular iterations cross-link x and y, the scale factor K is adjusted by separate, though identical, manipulations of x and y. For example, setting the $\gamma_i$ in (11) to plus 1 for i = 2, 4, one multiplies both output variables by 1.32812:

$$x^* = (1 + 2^{-2})(1 + 2^{-4})\,x = 1.32812\,x$$

and

$$y^* = (1 + 2^{-2})(1 + 2^{-4})\,y = 1.32812\,y.$$
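The arithmetic of this example is easy to reproduce. The sketch below applies a sequence of shift-and-add factors $(1 + \gamma_i 2^{-i})$ to one variable; the function name and the argument layout are illustrative assumptions.

```python
def apply_scale_factor(value, exponents, gammas=None):
    """Multiply value by the product of (1 + gamma_i * 2**-i), one shift-and-add per i."""
    for k, i in enumerate(exponents):
        gamma = 1 if gammas is None else gammas[k]   # +1 or -1 for each step
        value = value + gamma * value * 2.0 ** -i    # value plus (or minus) its shifted copy
    return value


factor = apply_scale_factor(1.0, [2, 4])   # (1 + 2**-2) * (1 + 2**-4)
```

Here `factor` is 1.328125, which the text rounds to 1.32812. The same routine, applied identically to x and to y, models the "separate, though identical" manipulations mentioned above.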
The role of the scale factors in the realization of circular and hyperbolic functions will be discussed in a later section.
CIRCULAR FUNCTIONS
Prominent among the functions used in servo control is the resolver operation defined by (17) [10].
The search for "solid-state" resolver hardware is lively and likely to continue for some time to come. Mathematically, however, one deals with the old and commonplace rotation of axes depicted in Fig. 8. When a pair of rectangular axes is rotated anticlockwise by an angle $\theta$, then the coordinates of a point P transform
Interpretation of this result in terms of multiple fragmentations leads to a set of recursive formulas, which read as follows:
and

$$z_{i+1} = z_i - \delta_i \theta_i. \tag{20c}$$
There are no restrictions on the various $\theta$'s, other than those considered in (10)-(14), in connection with the double cycle operation. For that matter, (20c) is exactly the same as (9c), though there are significant discrepancies between the other members of sets (9) and (20). These discrepancies will now be eliminated as much as possible, for the sake of hardware simplicity. First off, one can factorize $\cos \delta_i\theta_i$ in (20a) and (20b):

$$x_{i+1} = \cos\theta_i\,(x_i + \delta_i y_i \tan\theta_i)$$

and

$$y_{i+1} = \cos\theta_i\,(y_i - \delta_i x_i \tan\theta_i).$$
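Next, one can make the arbitrary, but highly convenient, substitution

$$\theta_i = \arctan(2^{-i})$$

in order to arrive at

$$x_{i+1} = \cos(\arctan 2^{-i})\,(x_i + \delta_i y_i 2^{-i}),$$

$$y_{i+1} = \cos(\arctan 2^{-i})\,(y_i - \delta_i x_i 2^{-i}),$$

$$z_{i+1} = z_i - \delta_i \arctan(2^{-i}).$$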
Finally, one can compare the end results (x*, y*, z*) of iterations (23) with the end results (x', y', z') of iterations (9) and conclude that

$$x^* = x' \prod_i \cos(\arctan 2^{-i}).$$

That spells out the overall dependence between the two sets of numbers as

$$x^* = K_n x' \tag{24d}$$

and

$$y^* = K_n y' \tag{24e}$$

but

$$z^* = z'. \tag{24f}$$
Since it depends on "n" only, the scale factor Kn is a machine constant. Consequently, given x' and y', one can realize x* and y* by many simple methods, including ROM look-up tables and regular combinatorial logic, but the scaling factor K technique, spelled out in (24) and Fig. 7 , is particularly attractive because it offers a host of advantages such as speed, real estate economy, and conceptual simplicity.
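The relationship between the unscaled iterations and the scale factor can be sketched in floating point. The code below assumes shift-and-add iterations of the form $x_{i+1} = x_i + \delta_i y_i 2^{-i}$, $y_{i+1} = y_i - \delta_i x_i 2^{-i}$, $z_{i+1} = z_i - \delta_i \arctan 2^{-i}$ for i = 1, ..., n, and applies $K_n$ as a single multiplication at the end. The function name is hypothetical, and the chip realizes $K_n$ with shift-and-add scaling cycles rather than a multiplier.

```python
import math


def cordic_rotate(x0, y0, theta, n=24):
    """Rotate the axes anticlockwise by theta (radians, |theta| below about 0.95).

    Returns (x*, y*, z*) with x* ~ x0*cos(theta) + y0*sin(theta)
    and y* ~ y0*cos(theta) - x0*sin(theta).
    """
    x, y, z = x0, y0, theta
    k_n = 1.0
    for i in range(1, n + 1):
        delta = 1 if z >= 0 else -1              # sign bit of z
        t = 2.0 ** -i
        x, y = x + delta * y * t, y - delta * x * t   # unscaled shift-and-add step
        z -= delta * math.atan(t)                # drive the angle toward zero
        k_n *= math.cos(math.atan(t))            # machine constant: depends on n only
    return k_n * x, k_n * y, z
```

Because `k_n` is independent of the operands, it could equally be computed once and stored, which is what makes it a machine constant.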
INITIALIZATION
While the inverse tangent of $2^{-1}$ is only 26°, the processor must accommodate angles as large as ±180°. This does not present any great difficulty, but for the sake of compatibility with other functions, it is convenient to implement the range extension in two special "initialization" cycles (Fig. 9). The first shifts $\theta$ by 90°:
and $z = z - \delta(45^\circ)$.
The $1/\sqrt{2}$ multiplier in (26) had actually been anticipated and incorporated into the geometric $K_n$, when (24c) was written as
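A possible reading of the two initialization cycles is sketched below. The exact form of the 90° and 45° steps is an assumption: the first is taken as an exchange of coordinates with a sign change, the second as one unscaled add-subtract step whose $1/\sqrt{2}$ factor is folded into the overall constant. The function name is hypothetical and angles are in degrees.

```python
import math


def rotate_full_range(x0, y0, theta_deg, n=24):
    """Axis rotation for |theta| up to 180 degrees via two initialization cycles."""
    x, y, z = x0, y0, theta_deg
    # Initialization cycle 1: shift the angle by 90 degrees (no scaling needed).
    delta = 1 if z >= 0 else -1
    x, y = delta * y, -delta * x
    z -= delta * 90.0
    # Initialization cycle 2: shift by 45 degrees; this step stretches the vector by sqrt(2).
    delta = 1 if z >= 0 else -1
    x, y = x + delta * y, y - delta * x
    z -= delta * 45.0
    k = 1.0 / math.sqrt(2.0)                     # compensates the 45 degree cycle
    # Regular cycles, i = 1..n, now operating on a residual angle within +/-45 degrees.
    for i in range(1, n + 1):
        delta = 1 if z >= 0 else -1
        t = 2.0 ** -i
        x, y = x + delta * y * t, y - delta * x * t
        z -= delta * math.degrees(math.atan(t))
        k *= math.cos(math.atan(t))
    return k * x, k * y, z
```

After the two cycles the residual angle lies within ±45°, comfortably inside the range of roughly ±55° that the regular iterations starting at i = 1 can absorb.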
Equation ( 
The coefficient matrix is orthogonal, and so are the three germane matrices shown below: 
Any one of these matrices can be used in expressions equivalent to (29) but, naturally enough, a unique geometrical interpretation must be associated with any particular matrix. For example, taking A~and relating it to the imaginary angle
one gets the hyperbolic relationship:
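$$\begin{bmatrix} x^* \\ y^* \end{bmatrix} = \begin{bmatrix} \cos j\varphi & -j\sin j\varphi \\ -j\sin j\varphi & \cos j\varphi \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} = \begin{bmatrix} \cosh\varphi & \sinh\varphi \\ \sinh\varphi & \cosh\varphi \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}.$$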
The last of these equations leads directly to the iterative formulas
and to the scale factor
The hyperbolic routine does not require initialization, but it does call for the "double cycle" operations of (10). Tables II and III, drawn
as well as perspicuous mutations of these functions.
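A floating-point sketch of a hyperbolic routine with repeated iterations follows. It assumes iterations of the form $x_{i+1} = x_i + \delta_i y_i 2^{-i}$, $y_{i+1} = y_i + \delta_i x_i 2^{-i}$, $z_{i+1} = z_i - \delta_i \operatorname{arctanh} 2^{-i}$. The repetition schedule (i = 4 and 13) is the one commonly quoted for hyperbolic CORDIC; it is used here as a stand-in and need not coincide with the double-cycle schedule of the CAP chip. The function name is hypothetical.

```python
import math


def cordic_hyperbolic(x0, y0, phi, n=24, repeats=(4, 13)):
    """Hyperbolic rotation for |phi| below about 1.1.

    Returns (x*, y*, z*) with x* ~ x0*cosh(phi) + y0*sinh(phi)
    and y* ~ x0*sinh(phi) + y0*cosh(phi).
    """
    x, y, z = x0, y0, phi
    k = 1.0
    for i in range(1, n + 1):
        passes = 2 if i in repeats else 1        # "double cycle" on selected indices
        for _ in range(passes):
            delta = 1 if z >= 0 else -1          # sign bit of z
            t = 2.0 ** -i
            x, y = x + delta * y * t, y + delta * x * t
            z -= delta * math.atanh(t)
            k *= math.cosh(math.atanh(t))        # scale factor, repeats included
    return k * x, k * y, z
```

Starting from (1, 0), the routine returns approximations to cosh φ and sinh φ; their sum gives exp φ, one of the derived functions such a processor can offer.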
OVERVIEW
Figs. 10 and 11 show a complete set-up for the execution of rotation and vectoring operations in the linear, circular, and hyperbolic modes. There are six modules, namely, three CAP chips, a 16 × 512 ROM, a ROM ADDRESS counter, a clock, and an I/O box. The I/O box loads the data into the processor and returns the results to the bus; it also sets the two most significant bits of the ROM address. These two bits (A9 and A8) select one of the three sectors of the ROM, beginning at address zero for linear operations, address 128 for circular functions, and 256 for hyperbolic.
The counter, which generates the other 7 address bits, is first reset and then allowed to advance one bit per clock cycle. The operation of the CAP chips is under the control of the instruction section of the ROM: status bits from the ROM cause the execution of either a regular CORDIC cycle, or a "double" cycle, or a scaling factor K operation; they also signal arrival at the "last" cycle, that is, completion of the computation.
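The addressing scheme can be summarized in a few lines. The sketch below combines the sector base selected by the two most significant address bits with the 7-bit count; the dictionary and function names are hypothetical, and the content of the ROM words is not modeled.

```python
# Sector base addresses quoted in the text for the three modes.
SECTOR_BASE = {"linear": 0, "circular": 128, "hyperbolic": 256}


def rom_address(mode, count):
    """Combine the sector base (two MSBs) with the 7-bit counter value."""
    if not 0 <= count < 128:
        raise ValueError("the counter supplies 7 address bits (0..127)")
    return SECTOR_BASE[mode] | count


def instruction_addresses(mode, cycles):
    """Addresses visited when the counter is reset and then advanced once per clock."""
    return [rom_address(mode, c) for c in range(cycles)]
```

For instance, the sixth instruction of the circular routine sits at address 128 + 5 = 133.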
The sign bit of either y or z controls the add/subtract options of all three chips.
The external instruction which activates the processor must include three function selection bits: one for either rotation or vectoring and two for either linear or trigonometric or hyperbolic functions.
These three bits take care of the two decisions which start off the signal flow diagram of Fig. 11. Once the function to be executed has been identified, the operation of the calculator is paced along by the local clock. Linear processing is purely "CORDIC," but the trigonometric routines add scaling factor K cycles to the menu, while hyperbolic algorithms use both the double cycle and scaling factor K supplements. The execution time of circular functions is slightly longer than that of linear functions, and the execution time of hyperbolic functions is longer still, but the implementation of all three classes of functions is equally simple. Simplicity, of both architecture and circuitry, may indeed be the most striking and important feature of the CAP chip implementation of the CORDIC concept.
Where reliability is at a premium, nothing scores higher than well-founded simplicity.
CONCLUSION
Whereas the performance of a chip depends on its architecture, circuit design, and processing, one may want to separate processing from the other two factors when attempting to assess the quality of a device. It is understood that performance is always technology limited. A faster technology will invariably bring about higher speed and, possibly, reduce power dissipation at the same time. For a given circuit schematic, conversion from conservatively laid out metal-gate CMOS to tightly spaced poly-gate SOS will produce spectacular improvement.
For that reason, speed alone is hardly a satisfactory measure of circuit design quality; to compare different embodiments of an idea, one must speak of minimum cycle times, expressed in multiples (n) of "typical" gate delays. The propagation delay of a gate sums up the quality of the technology, while "n" gives an estimate of the combined quality of the architecture and the circuit design. Naturally, one needs a definition of the "typical" gate. Physical dimensions present no difficulty (one simply picks a "minimum size" device), but the typical configuration may be open to dispute. We use an inverter with a fan-out of three.
SPICE analysis [15] of the CAP chip suggests n = 13 as the minimum cycle time of a two-byte (24-bit) operation. The "gate" delay is roughly 100 ns. More than half of the overall delay is attributable to the $2^{-i}$ scaler. This is understandable, considering the size of the structure in Fig. 3. Next in order of nuisance ratings comes the carry circuit C12, whose layout is shown in Fig. 6(b). Taken together, the scaler and the carry determine, just about, the effective "n" of the system. Further improvements in "n" will have to come either from modifications of these elements, or from conversion to single-byte operation. The former approach must await inventive contributions, but the latter is feasible right now. The entire system can be implemented in single-byte format by recourse to three-micron layout rules; an "n" of 7.5 can thus be realized without changes in circuitry. Furthermore, even an early vintage edition of submicron CLOSED COSMOS [14] will accommodate a complete single-byte system on just one chip. What we have then, in addition to a system which executes transcendental functions in 40 μs, is a good candidate for submicron prototyping.
