Abstract-A novel on-line Mixed-Scaling-Rotation CORDIC (MSR-CORDIC) VLSI architecture is proposed. This architecture not only maintains the scaling-free property of the original MSR-CORDIC, but also achieves the target of on-line angle computation. Compared with other existing CORDIC solutions, the proposed architecture is faster and more cost-efficient, especially for QRD-RLS filtering systems. Moreover, this on-line MSR-CORDIC can also be adopted by other rotation-based DSP applications.
I. Introduction
The COordinate Rotational DIgital Computer (CORDIC) algorithm [1] is a well-known hardware-efficient iterative algorithm for rotation-based arithmetic functions, such as the fast Fourier transform (FFT), QRD-RLS filtering, eigenvalue decomposition (EVD), and singular value decomposition (SVD) [2]. Since CORDIC is computed by only a sequence of shift-and-add operations, it is often preferable to multiplier-based rotators. In the literature, many approaches have been proposed to further improve the speed and reduce the hardware cost of the conventional CORDIC [3] [4] [5]. Among these techniques, the MSR-CORDIC [5] has been shown to be the most efficient vector rotational technique due to its low latency. It eliminates the extra scaling iterations by merging them into the rotation iterations. It also has good numerical accuracy with the smallest iteration number for known rotation angles. Therefore, the MSR-CORDIC suits applications in which the rotation angles are known in advance, e.g., the twiddle factors of the FFT, where it offers high speed and accuracy at lower hardware complexity.
However, the MSR-CORDIC sacrifices an essential feature of the conventional CORDIC: angle computation (vectoring mode). The reason is that the MSR-CORDIC originally proposed in [5] searches over all possible elementary angles, either to shorten the iteration sequence for computational speed or to eliminate the scaling stages, and such a search presupposes a known target angle. As a result, the work of [5] cannot be adopted in applications that require angle computation, such as QRD-RLS filtering [2].
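For reference, the vectoring mode of the conventional CORDIC can be sketched as follows. This is a minimal floating-point illustration, not the hardware of any cited work: the multiplications by `2.0**-i` stand in for right shifts, the function name `cordic_vectoring` and its default of 16 iterations are assumptions chosen for the example, and the input is assumed to have x > 0.

```python
import math

def cordic_vectoring(x, y, n_iter=16):
    """Conventional CORDIC, vectoring mode: drive y to zero and accumulate the angle.

    Illustrative sketch only; assumes x > 0 so the iteration converges.
    """
    z = 0.0
    for i in range(n_iter):
        d = -1.0 if y >= 0 else 1.0            # rotate toward the x-axis
        x, y = x - d * y * 2.0**-i, y + d * x * 2.0**-i
        z -= d * math.atan(2.0**-i)            # accumulate the applied angle
    # Every micro-rotation enlarges the norm by sqrt(1 + 2^-2i).
    gain = math.prod(math.sqrt(1.0 + 4.0**-i) for i in range(n_iter))
    return x, y, z, gain

x_out, y_out, angle, gain = cordic_vectoring(3.0, 4.0)
```

Here `angle` approximates atan2(4, 3), while `x_out` equals the vector norm multiplied by `gain`, so the conventional algorithm needs a separate scaling compensation that the MSR-CORDIC aims to avoid.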
In this paper, we present a novel on-line MSR-CORDIC, which is capable of operating in vectoring mode. Combined with the ordinary MSR-CORDIC, which can only execute rotation mode, the proposed on-line MSR-CORDIC can be applied to a QRD-RLS filter. Under the same convergence performance, the proposed architecture can save more than 30% of the hardware cost compared to the conventional CORDIC. It also requires very few iterations, which leads to a very short latency. Compared with other existing solutions, the proposed on-line MSR-CORDIC is more cost-efficient and more suitable for high-speed applications.
II. Review
In Table I, z(n) is the summation of the micro-rotations and P is the scaling factor.
Aiming at accelerating the conventional CORDIC for rotation mode, Lin and Wu [5] proposed the Mixed-Scaling-Rotation CORDIC. Firstly, the authors expand the signed power-of-two (SPT) terms in Eq. (2)
Calculate elementary angle $\theta_n$
Amplifying factor in the n-th rotation: $p_n = \sqrt{\left(\sum_{i=1}^{I} \mu_i(n)\, 2^{-s_i(n)}\right)^2 + \left(\sum_{j=1}^{J} \nu_j(n)\, 2^{-t_j(n)}\right)^2}$
Product of the amplifying factors in the n-th rotation: $P_n = P_{n-1} \times p_n$
End
In Table II, n denotes the n-th iteration and N denotes the total number of iterations; $\mu_i(n), \nu_j(n) \in \{-1, 0, 1\}$; $s_i(n), t_j(n) \in \{0, 1, \ldots, S\}$, and S denotes the maximum shift amount. I and J denote the numbers of signed power-of-two (SPT) terms of $\sin\theta_n$ and $\cos\theta_n$, respectively; $\theta_n$ is the n-th elementary angle; z(n) is the accumulated angle, and z(0) is 0; $P_n$ denotes the product of the amplifying factors accumulated through the n-th iteration. The initial value $P_0$ is 1, and P denotes the scaling factor.
The principle of the MSR-CORDIC algorithm is that it performs the rotation and scaling operations at the same time. In the conventional and other existing CORDIC algorithms, the norm of a vector is always enlarged after a micro-rotation. In contrast, Eq. (10) shows that the factor $p_n$ can be either greater or less than 1 in the MSR-CORDIC algorithm. Therefore, the final scaling factor, P, can be made unity with a proper combination of micro-rotations. As a result, the additional scaling compensation can be eliminated. Moreover, Eq. (8) guarantees a much larger elementary angle set than other CORDIC algorithms. Therefore, the MSR-CORDIC can approximate an arbitrary angle in a very limited number of micro-rotations.
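The mixed-scaling property can be checked numerically with a small sketch. It assumes the usual SPT form of a micro-rotation, in which the sine-like and cosine-like coefficients are sums of signed powers of two, the elementary angle is their arctangent, and the amplifying factor is their Euclidean norm. The helper names and the two parameter sets below are hypothetical examples with I = J = 2, not values taken from [5].

```python
import math

def spt_sum(signs, shifts):
    """Value of a signed power-of-two (SPT) expansion: sum of sign * 2^-shift."""
    return sum(sgn * 2.0**-sh for sgn, sh in zip(signs, shifts))

def micro_rotation(mu, s, nu, t):
    """Return (elementary angle, amplifying factor) of one assumed SPT micro-rotation."""
    sin_part = spt_sum(mu, s)                  # plays the role of p_n * sin(theta_n)
    cos_part = spt_sum(nu, t)                  # plays the role of p_n * cos(theta_n)
    theta = math.atan2(sin_part, cos_part)
    p = math.hypot(sin_part, cos_part)
    return theta, p

# Hypothetical parameter sets:
theta_a, p_a = micro_rotation(mu=(1, 0), s=(1, 0), nu=(1, -1), t=(0, 3))  # 0.5 and 0.875
theta_b, p_b = micro_rotation(mu=(1, 1), s=(1, 3), nu=(1, -1), t=(0, 2))  # 0.625 and 0.75
```

The first set gives an amplifying factor slightly above 1 and the second slightly below 1, which is the freedom that allows a sequence of micro-rotations to be chosen with an overall scaling factor close to unity.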
By combining the enlarged angle set with the mixed-scaling mechanism, the MSR-CORDIC can dramatically reduce the iteration number and thus achieve an extremely high speed.
Nevertheless, the original MSR-CORDIC is dedicated to rotation mode. It is not suitable for applications requiring on-line angle computation, i.e., vectoring mode. In this paper, we overcome this limitation and propose a novel on-line MSR-CORDIC.
III. Proposed On-line MSR-CORDIC
A. The Algorithm of On-line MSR-CORDIC
The concept of the proposed on-line MSR-CORDIC can be illustrated in Fig. 1.
The angle estimation error of the proposed on-line MSR-CORDIC comes from two sources. The first is the difference between the target angle and the total angle rotated. This kind of error is, as mentioned above, limited by the fundamental angle of the last iteration, $\theta_f(K)$.
The second source of error is that, with a limited number of adders, we cannot obtain all the desired elementary angles exactly. Both kinds of error can be reduced by increasing the hardware cost. Depending on the application under consideration, the design goal should be to minimize the hardware cost while meeting the system requirement.
Moreover, if we merely acquire the elementary angles without considering the scaling factors, the design will be infeasible, because the overall scaling factor cannot be controlled and the final norm would have to be compensated dynamically. Therefore, two design constraints, as shown in the following equations, must be satisfied.
Eqs. (16) and (17) together ensure that no matter which elementary angle is selected in each iteration, the overall scaling factor is unity. Hence the proposed on-line MSR-CORDIC is scaling-free and needs no scaling factor compensation.
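The first error source can be illustrated with a short sketch of digit-by-digit angle estimation. Several things here are assumptions made only for illustration: the target angle is taken to lie in [0, π/2), each iteration divides the previous fundamental angle by the base, and ideal rotations from `math.sin` and `math.cos` replace the SPT approximations, so the second error source and any scaling effects are absent. The function name `online_vectoring` is hypothetical.

```python
import math

def online_vectoring(x, y, base=16, num_iter=2, span=math.pi / 2):
    """Estimate the angle of (x, y), one base-`base` digit per iteration.

    Assumptions (not from the paper): the angle lies in [0, span) and the
    rotations are ideal, so every micro-rotation has unity scaling.
    """
    z = 0.0
    theta_f = span
    for _ in range(num_iter):
        theta_f /= base                        # fundamental angle of this iteration
        digit = 0
        for m in range(1, base):               # back-rotate by each candidate angle
            a = m * theta_f
            if y * math.cos(a) - x * math.sin(a) >= 0:
                digit = m                      # largest candidate that keeps y >= 0
        a = digit * theta_f
        x, y = (x * math.cos(a) + y * math.sin(a),
                y * math.cos(a) - x * math.sin(a))
        z += a                                 # accumulate the rotated angle
    return z, theta_f

phi = 1.0
z_est, theta_last = online_vectoring(math.cos(phi), math.sin(phi))
```

With two base-16 iterations the residual `phi - z_est` stays below the last fundamental angle `theta_last`, which under these assumptions is π/512.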
B. Hardware Architectures
The iterative hardware architecture of the proposed on-line MSR-CORDIC in vectoring mode is shown in Fig. 2. Specifically, it serves as part of a boundary cell of a QRD-RLS adaptive filter, which is described in Section IV-A. Here we assume I=2, J=2, and base=16. As in the conventional CORDIC, the objective is to rotate the vector $[x(n), y(n)]^T$ back to the x-axis so that the angle between them can be accumulated; for this purpose only the sign of y(n+1) matters. However, since the base is now 16, there are 15 elementary angles to be back-rotated in order to determine the division to which the target vector belongs. If the concern is to minimize the latency, a parallel architecture with 15 × 3 = 45 adders is needed, where each 3-adder group obtains the sign of y(n+1) when back-rotating by one of the 15 elementary angles. On the other hand, if the concern is to minimize the hardware area, a serial architecture with 4 × 3 = 12 adders is needed, where the four 3-adder groups are cascaded. Fig. 2 shows an architecture in between, which is the one compared in Section IV-C.
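The trade-off between the parallel and serial options can be sketched in software as a choice between testing all 15 candidates at once and running a four-step binary search over the 16 divisions. The sketch assumes ideal back-rotations and a fundamental angle of π/32 chosen only for the example; each sign test corresponds to one 3-adder group in the text, and all function names are hypothetical.

```python
import math

def y_sign_after_back_rotation(x, y, angle):
    """True if y(n+1) >= 0 after rotating (x, y) back by `angle` (ideal rotation)."""
    return y * math.cos(angle) - x * math.sin(angle) >= 0

def division_parallel(x, y, theta_f, base=16):
    """Test all base-1 candidate angles; returns (division index, number of sign tests)."""
    tests = [y_sign_after_back_rotation(x, y, m * theta_f) for m in range(1, base)]
    return sum(tests), len(tests)

def division_serial(x, y, theta_f, base=16):
    """Binary search over the divisions; returns (division index, number of sign tests)."""
    lo, hi, n_tests = 0, base, 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        n_tests += 1
        if y_sign_after_back_rotation(x, y, mid * theta_f):
            lo = mid                           # target is at or beyond mid * theta_f
        else:
            hi = mid                           # target lies below mid * theta_f
    return lo, n_tests

theta_f = math.pi / 32                         # assumed fundamental angle
result_parallel = division_parallel(math.cos(0.8), math.sin(0.8), theta_f)
result_serial = division_serial(math.cos(0.8), math.sin(0.8), theta_f)
```

Both functions return the same division index, using 15 sign tests in the parallel case and 4 in the serial case, which mirrors the 45-adder and 12-adder figures quoted above.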
The iterative hardware architecture of the proposed on-line MSR-CORDIC in rotation mode is shown in Fig. 3. Specifically, this architecture serves as part of an internal cell of the QRD-RLS adaptive filter. Here we also assume that the design parameters are I=2, J=2, and base=16. The barrel shifter arrays and adders take the parameters from the boundary cell to rotate the vector $[x(n), y(n)]^T$ by one elementary angle out of sixteen candidates. Note that as the base increases, the number of adders does not increase as long as the parameters I and J are fixed. This is advantageous, since a larger base lets us approach the target angle faster. Although the cost of the boundary cell will increase, in the QRD-RLS filtering system the number of boundary cells is only O(M) while the number of internal cells is $O(M^2)$, where M is the tap number of the filter. That is to say, internal cells dominate the hardware cost of a QRD-RLS filter.
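A rotation-mode micro-rotation of this kind can be sketched with integer shifts and additions. The sketch assumes the SPT form x' = c·x − s·y, y' = s·x + c·y, where c and s are sums of J and I signed powers of two; the parameter values are hypothetical, and the adder count is a simple term count rather than a figure taken from Fig. 3.

```python
def spt_apply(value, signs, shifts):
    """Multiply an integer by an SPT coefficient using only shifts and additions."""
    return sum(sgn * (value >> sh) for sgn, sh in zip(signs, shifts))

def rotate_once(x, y, mu, s, nu, t):
    """One assumed micro-rotation: (mu, s) encode the sine-like coefficient,
    (nu, t) the cosine-like coefficient."""
    x_next = spt_apply(x, nu, t) - spt_apply(y, mu, s)
    y_next = spt_apply(y, nu, t) + spt_apply(x, mu, s)
    return x_next, y_next

def adders_per_iteration(I, J):
    """Each output sums I + J shifted terms, i.e. I + J - 1 two-input adders."""
    return 2 * (I + J - 1)

# Hypothetical parameters with I = J = 2: cosine-like 1 - 2^-3, sine-like 2^-1.
x1, y1 = rotate_once(1024, 0, mu=(1, 0), s=(1, 0), nu=(1, -1), t=(0, 3))
```

Changing the base only changes which shift and sign parameters are supplied, not the number of terms, so under this count the datapath stays at 6 adders for I = J = 2, consistent with the rotation-mode figure quoted in Section IV-C.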
IV. Simulation Results
A. Simulation Environment
Fig. 4 shows the block diagram of the simulation environment [7]. The random signal applied to the channel input consists of a Bernoulli sequence. The impulse response of the channel is described by the raised cosine,
$h_n = \begin{cases} \frac{1}{2}\left[1 + \cos\left(\frac{2\pi}{W}(n-2)\right)\right], & n = 1, 2, 3 \\ 0, & \text{otherwise} \end{cases}$ (18)
where the parameter W controls the eigenvalue spread of the correlation matrix of the tap inputs in the equalizer. The AWGN block produces white Gaussian noise with zero mean and variance 0.001. The equalizer is a QRD-RLS adaptive filter containing a systolic array, as shown in Fig. 5, for coefficient calculation. Usually, an on-line vectoring-mode CORDIC is adopted as a boundary cell, and a rotation-mode CORDIC is adopted as an internal cell in the systolic array. In the following simulations, the tap number is set to 11.
Multipliers/Dividers
Fig. 6 shows the learning curves with the internal and boundary cells of the equalizer constructed of double-precision multipliers/dividers. For each eigenvalue spread, an approximation to the ensemble-average learning curve of the adaptive equalizer is obtained by averaging the instantaneous squared-error curve over 200 independent trials of the computer experiment. The steady-state value of the average squared error with W=3.5 is 0.0044, which can be viewed as a reference value. When the internal and boundary cells are replaced with CORDIC processors, we compare the minimum hardware costs at which each design approximates this reference value.
Fig. 7 shows the learning curves when we replace the internal and boundary cells with conventional CORDIC processing elements. The iteration number, N, of the CORDIC operation ranges from 5 to 8. As the figure shows, when the iteration number equals 5 or 6, the filter does not converge to a steady state because of the large error induced by the processing elements. On the other hand, for N=7 and N=8, the steady-state values of the average squared error are 0.0064 and 0.0046, respectively.
Also, the convergence rates of both cases are similar to the case in simulation I. That is to say, conventional CORDIC processors with 8 iterations are sufficient under this simulation environment. The hardware cost of the conventional CORDIC with N=8 is compared in the next section.
MSR-CORDIC processing elements. Here the iteration number K equals 2 in the processing element. The base in each of the two iterations is 16. The parameters I and J are both 2. As can be seen, the convergence rates of the learning curves are similar to the case in simulation I. The steady-state value of the average squared error with W=3.5 is 0.0045, which is also similar to the case in simulation I. The hardware cost of the on-line MSR-CORDIC with the specified design parameters is compared with other works in the next section.
C. Comparison results
Based on the same convergence performance in the 11-tap QRD-RLS filter, the proposed on-line MSR-CORDIC is compared with the conventional CORDIC and another scaling-free solution [6]. To meet the high-speed requirement, expanded architectures are selected instead of iterative ones in the comparisons. Table III shows the comparison of hardware complexity. In each iteration, the conventional CORDIC and [6] require only 2 adders, whereas MSR-CORDIC requires 18 and 6 adders in vectoring and rotation mode, respectively. Nevertheless, MSR-CORDIC requires the smallest total number of adders because it needs so few iterations. Moreover, when the tap number increases, e.g., to 30, MSR-CORDIC can save more than 30% of the adders.
Table IV shows the comparison of latency. Due to its extremely few iterations, the total latency of MSR-CORDIC is the shortest among the three schemes.
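For readers who wish to reproduce the test signal of Section IV-A, the channel and noise model can be sketched as below. The sketch assumes the textbook raised-cosine form h_n = 0.5·[1 + cos(2π(n − 2)/W)] for n = 1, 2, 3 and a Bernoulli sequence taking the values ±1; both are assumptions about details the text does not spell out, and the function names and the seed are hypothetical. The QRD-RLS equalizer itself is not included.

```python
import math
import random

def channel_impulse_response(W):
    """Assumed raised-cosine taps h_1, h_2, h_3 for a given parameter W."""
    return [0.5 * (1.0 + math.cos(2.0 * math.pi / W * (n - 2))) for n in (1, 2, 3)]

def equalizer_input(num_samples, W, noise_var=0.001, seed=0):
    """Pass an assumed +/-1 Bernoulli sequence through the channel and add AWGN."""
    rng = random.Random(seed)
    h = channel_impulse_response(W)
    a = [rng.choice((-1.0, 1.0)) for _ in range(num_samples)]
    u = []
    for n in range(num_samples):
        # Convolve with the three channel taps (zero initial conditions).
        clean = sum(h[k] * a[n - k] for k in range(3) if n - k >= 0)
        u.append(clean + rng.gauss(0.0, math.sqrt(noise_var)))
    return a, u

h = channel_impulse_response(3.5)
symbols, received = equalizer_input(500, W=3.5)
```

A learning curve of the kind described in the simulations would then be obtained by running an adaptive equalizer on `received` and averaging the squared error over independent trials with different seeds.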
