Abstract
Introduction
Logarithmic number system (LNS) representation has been the subject of considerable theoretical interest since its introduction [10], and a number of implementations have been described, e.g. [7]. Arnold recently described arithmetic transformations for efficient software implementations, and pointed out the advantages of complex valued LNS (CLNS) [11]. CLNS is potentially attractive in areas such as FFTs, where the powers of unity have exact representation, and complex multiplications can be easily performed using fixed point additions. However, previous VLSI implementations of LNS rely on interpolation of a function of a single variable and do not extend to CLNS. Compared to floating point representation, where a complex multiply requires 4 FP multiplies and 2 FP adds, a CLNS multiply requires only 2 fixed point adds. Consequently, if the cost of a CLNS add can be reduced below 4 FP multiplies and 4 FP adds, the total cost of a CLNS multiply-add will be less than that of the FP equivalent.
CORDIC algorithms have long been advocated for trigonometric functions as well as complex valued exponentials and logs [1] [2] [3]. Most efforts in CORDIC have focused on real numbers, and used low-radix (radix-2 or radix-4) algorithms. Recently, BKM, a low-radix redundant CORDIC algorithm, was described and used for trigonometric functions and complex arithmetic using a linear representation [8]. BKM, like most other CORDIC algorithms, is a low-radix method and takes many steps to perform an operation. The simplicity of the hardware implementation of CORDIC is attractive, and a number of successful hardware implementations of CORDIC have also been described [4] [5] [6] [9]; however, these typically take a large number of stages. A few high-radix methods have been described. Baker [12] described high-radix CORDIC-based algorithms, later extended to carry-save representation by Antelo et al. [15]. Ahmed [13] [14] introduced a convergence method that generalized Chen's [3], useful for describing algorithms as transformations on numbers that maintain some invariant. Ahmed described CORDIC algorithms using a single high-radix step to begin, and also using linear interpolation for the latter half of the algorithm.
This paper is most closely related to the high-radix CORDIC algorithm of Antelo et al. [15]. It applies that algorithm to CLNS and modifies it, and introduces optimizations specific to CLNS that approximately halve the cost of the algorithm. Some specific points of comparison to [15] are: (1) this paper shows how optimizations specific to CLNS can eliminate approximately half the CORDIC stages; (2) this paper advocates exact calculation of the minimal usable radix, instead of using a fixed radix; (3) this paper extends the high-radix algorithms to include logarithm algorithms, similar to CORDIC vectoring, which require more complex digit selection and a different sequence of operations.
The remainder of this paper describes the number representation assumed for CLNS, and the transformations that can be performed on these numbers. Hardware structures for high-radix operations are described for complex exponentiation and logarithm, together with bounds on the values at each stage. An example processor has been designed and verified down to the gate level, and its verification is described.
Number Representation
A complex valued number is represented in CLNS by its logarithm, , such that , where is the base of the system. Both and are fixed point numbers and can be represented CLNS addition is considerably more difficult. To compute the representation of , it is necessary to compute and .
(1)
The functions and are implicitly defined in terms of , as
We will assume that so that the argument to and lies in the right hand half plane. Subtraction is accomplished by adding to the appropriate operand.
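The representation described above can be illustrated with a short floating-point sketch. It is only a model under stated assumptions, not the fixed-point hardware: the base is taken as 2, a word is stored as a Python complex number whose real part is the log-magnitude and whose imaginary part is the angle divided by ln b, and the names `encode`, `decode`, `clns_mul`, `clns_add`, `clns_neg` and `s_b` are hypothetical. Addition uses the identity log_b(X + Y) = x + log_b(1 + b^(y - x)), with the operands ordered so that the argument of the logarithm lies in the right hand half plane.

```python
import cmath
import math

BASE = 2.0                 # assumed base b; the text leaves the base general
LN_B = math.log(BASE)

def encode(value: complex) -> complex:
    """CLNS word log_b(value): real part = log magnitude, imag part = angle / ln b."""
    return cmath.log(value) / LN_B

def decode(word: complex) -> complex:
    """Inverse of encode: b ** word."""
    return cmath.exp(word * LN_B)

def clns_mul(x: complex, y: complex) -> complex:
    """Multiplication is the addition of the two real components of the words."""
    return x + y

def s_b(z: complex) -> complex:
    """Addition function log_b(1 + b**z), evaluated directly (no CORDIC here)."""
    return cmath.log(1 + cmath.exp(z * LN_B)) / LN_B

def clns_add(x: complex, y: complex) -> complex:
    """log_b(X + Y); operands are swapped so that |b**(y - x)| <= 1."""
    if x.real < y.real:
        x, y = y, x
    return x + s_b(y - x)   # undefined when X == -Y (logarithm of zero)

def clns_neg(x: complex) -> complex:
    """Negation adds i*pi (here scaled by 1/ln b) to the word; used for subtraction."""
    return x + 1j * math.pi / LN_B
```

For example, `decode(clns_add(encode(3 + 4j), clns_neg(encode(-1 + 2j))))` recovers the difference of the two operands to floating-point accuracy.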
Transformations on Complex Number Representations
In order to compute as defined in (5), CORDIC-based algorithms can be applied to the computation of the complex exponential and logarithm functions. As in previous descriptions of convergence methods, we define a function that maintains a constant value through each stage in the transformation. Our 4-tuple contains a Cartesian representation of a point using the pair of real values and , and a polar logarithmic representation using the two real values and . The value represented is . Two transformations, scaling and rotation, are defined such that the value of is kept constant. The scale transformation performs a linear scaling of the Cartesian value by a factor of , and a compensating reduction in the logarithmic value of :
(8)
The rotation transformation performs a rotation of the Cartesian values based on some value and a compensating change in the angle of the polar logarithmic representation:
The angle of the rotation is given by . The rotation lengthens the vector by a factor of , and the corresponding change in the polar logarithm magnitude is given by . Both the scale transformation and the rotation transformation preserve the invariant .
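The two transformations can be checked numerically with a minimal sketch. The symbol names are assumptions, since they are not given here: the 4-tuple is written `(x, y, L, theta)`, the represented value is taken to be (x + iy) * b^L * e^(i*theta), the scale factor is `k`, and the rotation parameter is `d`, so that the rotation multiplies the Cartesian part by (1 + i*d). Under those assumptions the rotation angle is arctan(d), the vector is lengthened by sqrt(1 + d^2), and the compensating change in log-magnitude is (1/2) log_b(1 + d^2).

```python
import cmath
import math

BASE = 2.0  # assumed base b

def value(x, y, L, theta):
    """Assumed invariant: (x + iy) * b**L * exp(i*theta)."""
    return complex(x, y) * BASE ** L * cmath.exp(1j * theta)

def scale(state, k):
    """Scale the Cartesian part by k and reduce the log-magnitude by log_b(k)."""
    x, y, L, theta = state
    return (k * x, k * y, L - math.log(k, BASE), theta)

def rotate(state, d):
    """Multiply the Cartesian part by (1 + i*d) and compensate in L and theta."""
    x, y, L, theta = state
    return (x - d * y,
            y + d * x,
            L - 0.5 * math.log(1 + d * d, BASE),   # vector grew by sqrt(1 + d^2)
            theta - math.atan(d))                  # vector turned by arctan(d)
```

Applying `scale` or `rotate` with any positive `k` or any real `d` leaves `value(*state)` unchanged up to floating-point rounding.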
Complex exponentiation and complex logarithm are implemented using a sequence of rotation and scaling transformations. In each, a series of stages are cascaded, each of which may be a rotation or scaling transformation according to the design of the algorithm. The inputs are , , , and , and the outputs are , , , and . In each algorithm, some of the inputs and outputs are constrained to be constants. Thus, the difference in the operation of the algorithms is the way that the values and are determined as a function of the inputs.
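As an illustration of how the two algorithms differ only in the way the stage parameters are chosen, the following floating-point sketch cascades rotation and scaling stages for both directions. It is a simplified model under assumptions that are not taken from the paper: symbol names `x, y, L, theta, k, d` are hypothetical, the represented value is assumed to be (x + iy) * b^L * e^(i*theta), each stage adds `digit_bits` bits of precision to its parameter by round-to-nearest, the parameters are computed from full-precision (non-redundant) values rather than truncated carry-save values, and none of the CLNS-specific stage eliminations are applied.

```python
import math

def quantize(v, bits):
    """Round v to the nearest multiple of 2**-bits."""
    return round(v * 2 ** bits) / 2 ** bits

def cordic_exp(L, theta, stages=4, digit_bits=6, base=2.0):
    """Drive (L, theta) toward (0, 0) from x + iy = 1; x + iy tends to b**L * e^(i*theta).
    Intended for modest inputs, e.g. |L| <= 1 and |theta| <= pi/4."""
    x, y = 1.0, 0.0
    for j in range(1, stages + 1):
        p = j * digit_bits
        d = quantize(math.tan(theta), p)            # rotation parameter from theta only
        x, y = x - d * y, y + d * x
        theta -= math.atan(d)
        L -= 0.5 * math.log(1 + d * d, base)
        k = quantize(base ** L, p)                  # scale factor from L only
        x, y = k * x, k * y
        L -= math.log(k, base)
    return complex(x, y), L, theta

def cordic_log(x, y, stages=4, digit_bits=6, base=2.0):
    """Drive x + iy toward 1 from (L, theta) = (0, 0); returns (log_b magnitude, angle).
    Intended for inputs in the right half plane with x not far from 1."""
    L, theta = 0.0, 0.0
    for j in range(1, stages + 1):
        p = j * digit_bits
        d = quantize(-y / x, p)                     # rotation that cancels y
        x, y = x - d * y, y + d * x
        theta -= math.atan(d)
        L -= 0.5 * math.log(1 + d * d, base)
        k = quantize(1 / x, p)                      # scaling that brings x to 1
        x, y = k * x, k * y
        L -= math.log(k, base)
    return L, theta
```

With the default four stages of six bits each, both functions agree with the directly computed exponential and logarithm to roughly seven decimal digits for inputs in the stated ranges.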
Complex Exponentiation Algorithm
In complex exponentiation, the values of and are set to constants, such as 1 and 0 respectively, while and are bounded by some intervals. A series of transformations is performed such that and are constants regardless of the inputs and . From the invariance of 
CORDIC for CLNS Addition
The CLNS addition function can be constructed using a CORDIC exponentiation, adding one, followed by a CORDIC log, as illustrated in the left side of Fig. 1 
For the case that , the multiply disappears, and the circuit uses , , , and as inputs to the logarithm stages.
This provides a strong incentive to use a base of . Similar improvements apply to the logarithm stage. Consider some stage , and the Taylor series approximation for and
Approximately half the stages can be eliminated for both exponential and logarithm, as shown in the right side of Fig. 1 . Consequently, a CLNS addition can be performed with cost comparable to approximately one CORDIC operation.
Hardware Implementation
To understand the calculation of the values of and , as well as the bounds on the values and precision in each stage of the computation, it is useful at this point to introduce the redundant computation in terms of the hardware implementation. The algorithms allow redundant representation of all the quantities involved in the computation; however, the multiplications of and eliminate any advantage to using a redundant representation of these quantities. Instead, only and are represented using a carry-save form, and the non-redundant value is calculated to the accuracy required. Truncation is explicitly represented in the hardware designs shown below with a truncation operator , and a separate symbol for the truncated value. The precision of any variable in the algorithm, for example some variable , will be expressed as . The precision of a variable is the negative of the position in the binary representation of the least significant bit, i.e., if , then .
There are four related hardware blocks, corresponding to the scaling and rotation stages for both the exponentiation and logarithm algorithms. For both algorithms, the values , , , and are assumed to have bits precision. Internally, each value is represented with an additional guard bits, for a total of bits. In carry-save form, and are represented by the pairs and , and by and respectively, all of which have precision .

In the exponential scaling stage, shown in Figure 2, a reduced precision approximation of is calculated by truncating and , which are added to calculate the non-redundant, but lower accuracy approximation , which is input to the digit calculation block. The purpose of this lower precision approximation is to reduce the amount of hardware required for the adder, but more importantly, to reduce the number of bits that the digit selection block must examine, and consequently reduce its associated hardware and increase its speed.

All stages described here contain digit selection blocks. Before describing the specific functions implemented in them, a brief description of their logic structure is useful. A digit selection block implements some monotonic function of a single input, and is a piecewise constant approximation to some continuous function. Using the specific example of a digit selection block with as an input and as output, with for some , the function can be expressed in the form (27), where the digit is and is the piecewise constant approximation to the function over some interval .
The value of may take any value in the range to , where the bounds are chosen to include the entire range of inputs to the function. The number of distinct values that can be produced is and is referred to as the radix of the value . Expressing the function at this level of detail explicitly provides the range over which each input value produces some output value, and makes it straightforward to bound the result of a calculation. The hardware implementation of this will be discussed later, but it is clear that the two primary factors involved are the number of bits in the input value which need to be examined, and the number of distinct output values that can be produced, or equivalently, the radix.
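The structure of a digit selection block can be modelled as a lookup table indexed by the truncated input. The sketch below is illustrative only: the function names are hypothetical, the digit is stored as an integer multiple of 2^(-out_bits), and each table entry is assumed to be the continuous function evaluated at the midpoint of its input interval and rounded to nearest, which is one reasonable choice rather than the rule used in the design. The radix is then the number of distinct digits in the table.

```python
import math
from fractions import Fraction

def truncate(v, bits):
    """Truncate v toward minus infinity to `bits` fractional bits (exact Fraction)."""
    return Fraction(math.floor(v * 2 ** bits), 2 ** bits)

def build_digit_table(func, lo, hi, in_bits, out_bits):
    """Map every truncated input in [lo, hi) to an integer digit m,
    where m * 2**-out_bits approximates func over that input interval."""
    step = Fraction(1, 2 ** in_bits)
    table = {}
    t = truncate(lo, in_bits)
    while t < hi:
        midpoint = float(t + step / 2)                   # assumed evaluation point
        table[t] = round(func(midpoint) * 2 ** out_bits)
        t += step
    return table

def select(table, v, in_bits):
    """Digit produced for a full-precision input v: only in_bits bits are examined."""
    return table[truncate(v, in_bits)]

def radix(table):
    """Number of distinct digit values the block can produce."""
    return len(set(table.values()))
```

For instance, `build_digit_table(math.tan, -0.5, 0.5, in_bits=7, out_bits=6)` tabulates a rotation-style digit over 128 input intervals, and `radix` of that table gives the size of the digit set actually needed for that input range.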
The digit selection block in the scaling exponentiation stage produces and . Assuming that the goal is , we desire that to the precision possible given the fixed number of bits in the representation of . Note that is expressed to the full precision of the datapath, while it is the limited number of bits in that restrict the set of possible values of . This is due to the desire to reduce the size of the digit estimation logic for .
The values of and are also truncated to a lower precision forming and , and a pair of multiply-adders is used to calculate the results specified in Eqns. (6) and (7). The multiplier output is left in carry-save form, which avoids a carry-propagate addition inside it. This is advantageous as the result of the multiplier is later added using a CSA and CPA. The multiplier result is also truncated to the precision of and . Eqn. (8) is calculated using a CSA because of the redundant representation of .
Bounds on Intermediate Results
Antelo et al. advocate high-radix CORDIC using a maximally redundant digit set. They suggest choosing a radix, and show that the algorithm will converge using the maximally redundant digit set. In this paper, we suggest explicit calculation of the minimal possible redundant digit set, and exact digit selection based on the truncated data, for two reasons. First, explicitly calculating the radix means that the

The approach in this section is constructive. We define the bounds of the operands at each stage, and determine the relationship between the bound on the input and output for each type of transformation. For each possible digit , the range of inputs that use the digit is specified, and the resulting bound on the output is determined. Given the overall bound on the range of inputs, and the precision of the digits and , it is possible to determine the set of values required to span the entire input range, and consequently to determine the radix of the digit set. By computing the union of all output bounds for every possible digit in a stage, an overall bound on the output of a stage can be calculated. It is also necessary to specify the precision of the truncated quantities input to the digit selection logic. It will be useful to have a concise notation for the truncation or rounding of quantities to various fixed point precisions. We use to mean the value of truncated down to bits precision, so the definition is given as:
Similar notation for rounding up, and round to nearest are also used:
It is also necessary to bound the variables at each stage. We introduce and subscripts such that any variable is bounded by its and as in .
Scaling Stage for Exponential
As mentioned, it is desirable that , or equivalently, . Ideally, the digit selection function would implement
Transforming this into the piecewise-constant expression leads to the exact computation of the digit : 
Note that in these equations the rounding of to precision is not specified; however, whatever method is used must be applied consistently across all of these equations to obtain correct results. The definition of is then substituted into (8) and (10) to produce bounds for each given value of :
It is possible to attempt to construct an explicit bound for all possible by substituting in the appropriate and , but it is difficult to guarantee an exact bound in the presence of multiple roundings to various precisions. Instead, we simply iterate across all values, taking the minimum and maximum of these bounds to determine overall bounds on
It is useful to ignore the redundancy and take a first-order Taylor series approximation to (37) and (38) in order to obtain insight into the operation of the algorithm, although the exact form of the equations must be used for computing bounds.
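The procedure of iterating across all digit values and taking the minimum and maximum of the per-digit bounds can be sketched as follows for a scaling stage of the exponential. This is a simplified model, not the exact expressions of the paper: the names `L`, `k`, `sel_bits` and `digit_bits` are assumptions, the input to digit selection is a plain truncation of a non-redundant value (carry-save truncation error is ignored), and the digit is assumed to be b^L evaluated at the midpoint of each truncated-input interval and rounded to nearest. The important point it illustrates is that the bound computation reuses exactly the same digit selection function as the stage itself.

```python
import math

def select_digit(l_trunc, step, digit_bits, base=2.0):
    """Scale factor k ~ base**L for one truncated-input interval (assumed rule)."""
    return round(base ** (l_trunc + step / 2) * 2 ** digit_bits) / 2 ** digit_bits

def scale_stage(l, sel_bits, digit_bits, base=2.0):
    """One scaling stage acting on the log value only: L' = L - log_b(k)."""
    step = 2.0 ** -sel_bits
    l_trunc = math.floor(l / step) * step
    return l - math.log(select_digit(l_trunc, step, digit_bits, base), base)

def stage_bounds(l_min, l_max, sel_bits, digit_bits, base=2.0):
    """Union of the output intervals over every digit, plus the radix needed."""
    step = 2.0 ** -sel_bits
    lo_out, hi_out = math.inf, -math.inf
    digits = set()
    for n in range(math.floor(l_min / step), math.ceil(l_max / step)):
        l_trunc = n * step
        k = select_digit(l_trunc, step, digit_bits, base)
        digits.add(k)
        shift = math.log(k, base)
        # Inputs selecting this digit lie in [l_trunc, l_trunc + step), clipped to the range.
        lo_out = min(lo_out, max(l_trunc, l_min) - shift)
        hi_out = max(hi_out, min(l_trunc + step, l_max) - shift)
    return lo_out, hi_out, len(digits)
```

Calling `stage_bounds(-0.5, 0.5, sel_bits=7, digit_bits=6)` returns an output interval much narrower than the input interval, together with the number of distinct scale factors required to cover it.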
Rotation Stage for Exponentiation
A similar approach can be taken for the rotation stage, with the notable difference that the rotation is computed as a function of , but affects both and . In the rotation stage, we desire .
The piecewise constant approximation is given as
Exact bounds on can be determined by the previous approach and will not be presented.
Bounds on can be found by a similar approach to the scaling stage. This leads to the following bounds on , where the union of all such bounds must be taken for the entire range of
For small , ignoring redundancy and taking a first-order Taylor series approximation shows that is bounded by approximately half an ULP of .
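The half-ULP behaviour can be observed with a small numerical check. The sketch assumes hypothetical names (`theta` for the angle, `d` for the rotation digit), a non-redundant input, and a digit chosen as the tangent of the angle rounded to nearest at `digit_bits` fractional bits. Under these assumptions the residual angle is theta - arctan(d), and because the arctangent has slope at most one, the residual cannot exceed the rounding error of the digit.

```python
import math

def rotation_residual(theta, digit_bits):
    """Return (residual angle, digit) for one rotation stage of the exponential."""
    d = round(math.tan(theta) * 2 ** digit_bits) / 2 ** digit_bits
    return theta - math.atan(d), d
```

With `digit_bits = 6`, the residual stays within 2^-7, which is half an ULP of the digit, for every angle in the interval tested.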
The choice of is independent of , so the bound on can only be given by subtracting the minimum and 
Because may be signed, it is not possible to express these in a form only using one of or in each equation.
Scaling Stage for Logarithm
The logarithm stages are more complicated to understand as both scaling and rotation affect both and , but each stage must perform a transformation based on only one value. Recall that we desire . In the piecewise constant form, this can be achieved by ,
This establishes the following bound on . Although will give the largest range, the calculation should be performed to determine the union of all ranges for all values of using exactly the same rounding as the hardware in order to obtain precise bounds.
(47)
The value of is scaled by the same factor, so can be bounded by the union of the intervals for all values of
(49)
Rotation Stage for Logarithm
The rotation stage for logarithm presents the most difficulty, as the goal depends on two values. We simplify this to a function of a single value by using the bound on and taking the midpoint of this range as an approximation
distinct values of , and that apply to a rectangular region in and . The figure shows the rotation and stretching of two of these regions, illustrating the center point of each, together with the bounding box for the rotated and stretched rectangle. Because each of the intervals is rotated, the tightest possible bounds of the resulting values and do not form a rectangle; however, for simplicity of analysis, the bound is taken to be the smallest rectangle that encloses all of the rotated and stretched intervals, and is illustrated with the dotted rectangle in Fig. 7. The figure is not to scale, and typically the rectangle bounding the would be much smaller than the input bounds.
The exact definition of is
(50)
This leads to the following bounds on and , where, as usual, all values of must be considered:
Example for 32-bit Complex Numbers
The exponential stages calculate both digits based on a single operand, but the value of affects the value of as well. It is optional whether to interleave rotation and scaling stages, or to perform all of the scaling at the end, as in [9]. In the logarithm stage, both and depend on and . It is necessary to alternate scaling and rotation stages in order to tighten bounds on both simultaneously.

As a demonstration of the feasibility of this approach, we have designed a high-radix multiply-add CLNS arithmetic unit with precision comparable to IEEE-754 single precision. The unit performs complex multiplication using two fixed point adders, and uses high-radix exponentiation and logarithm to perform complex addition using the function defined in (4). The design uses and , and some other minor changes to the constants assumed in the derivation above. The number representation uses a mixed base for the representation of the numbers to simplify range reduction. A number X is represented by its complex logarithm , where is a 32-bit 2's complement fixed point number, and is a 27-bit unsigned fixed point number, both of which have 24 fractional bits. The details of the algorithm were designed with the assistance of a program that has as input an architecture description file containing all of the precisions of each variable, and performs exact bit-level modeling of the architecture.
Our design uses a total of 10 stages to perform an exponentiation and a logarithm as required by the logarithmic addition function. Datapath widths were based on 6-bit digits, requiring two stages each of rotation and scaling for exponentiation, and three of each for logarithm. Antelo et al. [9] would require 8 stages using 7-bit multipliers to perform a rotation to the same precision; thus, our architecture requires little more hardware to perform a CLNS addition. Beyond this, only two more fixed point adders are required to perform a CLNS multiply-add. Table 1 shows the key parameters of the design. The digit precision refers to the digit generated, either or , and the digit selection precision refers to the input to the digit selection block, such as or other values. Fig. 8 illustrates the simulator's real and imaginary error histogram in ULPs for pseudo-random tests. A bias is clear, and is due to the use of truncation rather than rounding in the datapath. Mean error and bias are each less than 0.4 ULP, but worst case error is 1.5 ULP.
Conclusions
This paper has demonstrated high-radix CORDIC algorithms adapted for CLNS addition. A design example producing six bits per stage shows that a CLNS addition can be performed for approximately the same cost as a conventional high-radix CORDIC rotation. Since a CLNS multiply is inexpensive, this allows a CLNS multiply-accumulate to be performed for approximately the cost of a single CORDIC operation.
