Abstract-The CORDIC algorithm is a well-known iterative method for the computation of vector rotation. Its major disadvantage, however, is its relatively slow computational speed. For applications that require only forward rotation (or vector rotation), we propose a new scheme, the modified vector rotational CORDIC (MVR-CORDIC) algorithm, to improve the speed of the CORDIC algorithm. The basic idea is to reduce the iteration number directly while maintaining the SQNR performance. This is achieved by modifying the basic microrotation procedure of the CORDIC algorithm. Three searching algorithms are suggested to find the corresponding directional and rotational sequences so as to obtain the best SQNR performance. Three SQNR refinement schemes are also suggested: the selective prerotation scheme, the selective scaling scheme, and the iteration-tradeoff scheme. They reduce and balance the quantization errors encountered in both the microrotation and scaling phases so as to further improve the overall SQNR performance. By combining these three refinement schemes, we provide a systematic design flow as well as an optimization procedure for applying the MVR-CORDIC algorithm. Finally, we present two VLSI architectures for the MVR-CORDIC algorithm. Compared with the conventional CORDIC scheme, the proposed MVR-CORDIC algorithm saves 50% of the execution time in the iterative CORDIC structure, or 50% of the hardware complexity in the parallel CORDIC structure.
The CORDIC operation is limited by the large iteration number, $N$, which is generally equal to the internal wordlength, $W$. At the algorithmic level, one trivial solution is to reduce the iteration number directly; however, the signal is then seriously distorted by the approximation and quantization noise in practical implementations [11]. At the circuit level, the CORDIC operation can be accelerated by introducing a redundant number representation into the internal data processing, which eliminates the carry propagation encountered in addition/subtraction operations [8], [12].
In this paper, we propose an algorithmic-level improvement, called the modified vector rotational CORDIC (MVR-CORDIC) algorithm. It is well suited to applications that use the CORDIC algorithm only in forward rotation mode (also known as vector rotation mode), i.e., applications in which the rotation angles are fixed and known in advance, such as digital lattice filters [6], [7], [13] and discrete linear transformations [10], [14]-[16]. The major feature of these applications is that the directional sequence, $\mu_i$, which controls the rotation direction of each elementary angle in the microrotation phase, can be computed in advance. By reformatting and searching for new sequences, we can reduce the iteration number significantly without increasing the quantization noise level. This is achieved by modifying the basic microrotation procedure of the conventional CORDIC algorithm, and it improves the speed of the conventional CORDIC algorithm.
Similar work has been reported by Hu and Naganathan [17]. In [17], the directional sequence, $\mu_i$, of the angle rotated by the CORDIC algorithm is recoded with the aid of a greedy search. However, the length of the resulting recoded sequence, which determines the iteration number in the microrotation phase, is not fixed but varies with the rotation angle. In certain cases, the length of the recoded sequence can be large and quite close to the upper bound of , where $N$ is the number of elementary angles [17]. In such a situation, in synchronous circuit design, the overall speed of the CORDIC-based arithmetic operation is limited by those angles. Besides, the nonuniform iteration number is not suitable for modular design in VLSI implementation. To avoid this drawback, in our work the design parameters are computed under a fixed iteration number. To solve the resulting constrained optimization problem, we propose three searching algorithms for the MVR-CORDIC algorithm. They provide a tradeoff between computational complexity and signal-to-quantization-noise ratio (SQNR) performance.
Moreover, we propose three SQNR refinement schemes for the MVR-CORDIC algorithm. The first SQNR refinement scheme introduces the concept of selective prerotation. It can carry out the microrotation phase of MVR-CORDIC algorithm with a reduced angle approximation error compared with existing approaches. The second scheme, which is employed in the scaling phase of MVR-CORDIC algorithm, is used to reduce the quantization error in the approximation of scaling factor. This can be achieved by using a selective scaling operation that combines two existing scaling techniques. The third refinement technique is called the iteration-tradeoff scheme. With this scheme, we can make tradeoff on the iteration number between the microrotation and scaling phase of the MVR-CORDIC algorithm. It can balance the quantization errors encountered in these two phases so as to further improve the overall SQNR performance.
Next, with the aid of the proposed refinement schemes as well as the SQNR analysis developed in the Appendix, we provide a systematic design flow and optimization procedure to facilitate the design process of the MVR-CORDIC algorithm. The corresponding VLSI architectures show that we can save at least 50% of the execution time in the iterative CORDIC structure, and 50% of the hardware complexity in the parallel CORDIC structure, compared with the conventional CORDIC algorithm. Hence, low-power/high-speed CORDIC VLSI architectures become feasible.
The rest of the paper is organized as follows. In Section II, the conventional CORDIC algorithm is briefly reviewed. Then, we will discuss the strategies of MVR-CORDIC algorithm to accelerate the CORDIC rotation in vector rotational mode. In Section III, three searching algorithms are proposed to solve the problem under the constraints set by MVR-CORDIC algorithm. Then, we compare the computational complexity as well as the error performance of these three searching algorithms. In Section IV, computer simulations are performed to illustrate the relationship between error performance and design parameters. From Sections V-VII, we discuss three SQNR performance refinement schemes. The design flow is also addressed in detail. Finally, two corresponding VLSI architectures of MVR-CORDIC algorithm are derived in Section VIII, followed by the conclusions in Section IX.
II. CORDIC AND MVR-CORDIC ALGORITHM

A. CORDIC Algorithm
The CORDIC algorithm decomposes the rotation angle, $\theta$, into predefined elementary angles, i.e.,

$$\theta = \sum_{i=0}^{N-1} \mu_i\, a(i) + \epsilon \tag{1}$$

where
$N$: number of elementary angles;
$\mu_i \in \{-1, +1\}$: rotation sequence, which determines the rotation angle $\theta$;
$a(i) = \tan^{-1}(2^{-i})$: $i$-th elementary angle;
$\epsilon$: residue angle.

Based on (1), the CORDIC recurrence equations can be written as

$$x(i+1) = x(i) - \mu_i\, 2^{-i}\, y(i), \qquad y(i+1) = y(i) + \mu_i\, 2^{-i}\, x(i) \tag{2}$$

for $i = 0, 1, \ldots, N-1$. Due to the nature of the recurrence relation above, for data of $W$ bits wordlength, no more than $W$ iterations need be performed, i.e., $N \le W$. In addition, the final values, $x(N)$ and $y(N)$, need to be scaled by an accumulated scaling factor

$$P = \prod_{i=0}^{N-1} \frac{1}{\sqrt{1 + 2^{-2i}}} \tag{3}$$

to retain the norm of the initial vector $[x(0)\;\; y(0)]^T$. Several CORDIC-like iteration schemes have been proposed to perform the scaling operation [18]-[20]. In Table I, we summarize the basic iteration procedure of the CORDIC algorithm in the circular mode. It consists of two major phases: the microrotation phase and the scaling phase.
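The two phases summarized above can be sketched in a few lines of Python. This is only a floating-point illustration under stated assumptions: the function name `cordic_rotate`, the default of 16 iterations, and the on-the-fly choice of the direction from the sign of the remaining angle are not taken from the paper, and a hardware realization would use $W$-bit fixed-point arithmetic with shift-and-add scaling instead of an exact multiplication.

```python
import math

def cordic_rotate(x, y, theta, n_iter=16):
    """Rotate (x, y) by theta using the conventional circular-mode CORDIC.

    Floating-point sketch (assumption): hardware would use W-bit fixed point.
    Valid for |theta| up to roughly 1.74 rad (the sum of all elementary angles).
    """
    z = theta          # remaining angle still to be rotated
    scale = 1.0        # accumulated scaling factor of (3)
    for i in range(n_iter):
        mu = 1 if z >= 0 else -1                 # direction of the i-th microrotation
        # recurrence (2): shift-and-add update of both coordinates
        x, y = x - mu * y * 2.0**-i, y + mu * x * 2.0**-i
        z -= mu * math.atan(2.0**-i)             # subtract the i-th elementary angle
        scale /= math.sqrt(1.0 + 2.0**(-2 * i))
    return x * scale, y * scale, z               # z is the residue angle


xr, yr, residue = cordic_rotate(1.0, 0.0, math.pi / 6)
print(xr, yr, residue)   # close to cos(pi/6), sin(pi/6), with a small residue
```

Every elementary angle is used exactly once here, which is the restriction that the modified scheme later relaxes.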
B. SQNR and Performance Indices
Before deriving the MVR-CORDIC algorithm, we first introduce the residue angle error as the performance index for the rotational results. The residue angle error is defined as the difference between the target angle and the angle that can be represented by the CORDIC (or MVR-CORDIC) algorithm. That is,

$$\xi_m \triangleq \left| \theta - \hat{\theta} \right| \tag{4}$$

where $\theta$ is the target angle and $\hat{\theta}$ is the angle actually realized by the microrotations. Additionally, another performance index, the signal-to-quantization-noise ratio (SQNR), is also employed. The SQNR gives a more direct view of the signal quality in practical implementations. The SQNR and its relationship with $\xi_m$ are discussed in detail in the Appendix.
C. The Proposed MVR-CORDIC Algorithm
In the development of the MVR-CORDIC algorithm, we make the following modifications on the microrotation procedure of the conventional CORDIC algorithm.
• Skip Some Microrotation Angles: The conventional CORDIC is forced to execute every microrotation, whereas the modified scheme is allowed to skip some of them. By doing so, we can not only reduce the iteration number but also improve the error performance for certain angles. For example, we can rotate the angle of by performing the microrotation of elementary angle once, and skipping all remaining microrotations. Then, the residue angle error . On the contrary, the conventional CORDIC has to go through all the microrotations with sequence of , while the residue angle error is for .
• Repeat Some Microrotation Angles: In the conventional CORDIC of Table I, each microrotation angle, $a(i)$, is allowed to be used only once. In our modified scheme, we make it more flexible so that each microrotation can be performed repeatedly. By doing so, for a rotation angle that is times of one microrotation angle, i.e., , we can simply execute the microrotation of by times instead of performing the microrotations in the conventional way. Therefore, better error performance can be obtained with a reduced iteration number. For example, we can obtain $\xi_m = 0$ for the rotation angle $\theta = \pi/2$ by simply rotating by the elementary angle $a(0) = \pi/4$ twice (see footnote 3), whereas the conventional CORDIC uses the sequence and the residue angle error is as large as .
• Confine the Number of Microrotations to $R_m$: In Table I, all $N$ microrotations need to be executed sequentially to complete the CORDIC rotation. To obtain the best performance, $N$ is often set equal to the internal wordlength, $W$, which is the upper bound of $N$ in practical implementations [11]. In our modified algorithm, however, we confine the iteration number in the microrotation phase to $R_m$ ( ). As we will see, this leads to a significant speed improvement in the CORDIC operation.

Putting all of these modifications together, we can rewrite (1) as

$$\theta = \sum_{i=0}^{R_m-1} \alpha(i)\, a(s(i)) + \epsilon \tag{5}$$

where $s(i) \in \{0, 1, \ldots, N-1\}$ is the rotational sequence that determines the microrotation angle in the $i$th iteration, $\alpha(i) \in \{-1, 0, 1\}$ is the directional sequence that controls the direction of the $i$th microrotation of $a(s(i))$, and $\epsilon$ is the residue angle. To see the effects of the above modifications, we show the constellation of reachable angles of MVR-CORDIC in Fig. 1(b). The wordlength, $W$, is set to 4, and the restricted iteration number, $R_m$, is 3. The reachable angles of the conventional CORDIC for iteration number $N = 3$ are also shown in Fig. 1(a) for comparison. Note that the comparison is made under the condition of an equal iteration number in the microrotation phase. As we can see in Fig. 1, the number of reachable angles of the proposed MVR-CORDIC is much larger than that of the conventional CORDIC. This implies that, given the same iteration number, the MVR-CORDIC will outperform the conventional CORDIC in terms of residue angle error. Equivalently, given a target residue angle error, the MVR-CORDIC rotation requires fewer iterations than the conventional approach. Consequently, the speed of CORDIC-based arithmetic operations can be greatly improved.

Footnote 3: The example of $\theta = \pi/2$ is used to emphasize the impact of repeating microrotation angles. In practical implementations, however, a rotation by $\pi/2$ can be accomplished simply by setting $x = -y(0)$ and $y = x(0)$, without going through the CORDIC rotation.
In Table II , we summarize the microrotation procedure as well as the scaling operation (to be discussed in Section VI) of the MVR-CORDIC algorithm. The main feature of the proposed MVR-CORDIC algorithm can be stated as follows.
Given a rotation angle , the MVR-CORDIC attempts to accomplish the rotation in a more flexible way. It takes fewer iterations than the conventional CORDIC algorithm, while the SQNR performance is still maintained.
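To make the effect of skipping and repeating microrotations concrete, the following Python sketch applies a given pair of sequences and counts reachable angles for the setting used in Fig. 1 (wordlength 4, three microrotations). It is an illustration under assumptions: the names `mvr_rotate`, `reachable_conventional`, and `reachable_mvr` are hypothetical, a direction value of 0 is used to denote a skipped microrotation, the number of elementary angles is taken equal to the wordlength, and scaling is done with an exact floating-point factor rather than the quantized scaling discussed later.

```python
import math
from itertools import product

def mvr_rotate(x, y, s_seq, alpha_seq):
    """Apply MVR-CORDIC microrotations, then an exact (unquantized) scaling."""
    scale = 1.0
    for s, alpha in zip(s_seq, alpha_seq):
        if alpha == 0:
            continue                              # skipped microrotation
        x, y = x - alpha * y * 2.0**-s, y + alpha * x * 2.0**-s
        scale /= math.sqrt(1.0 + 2.0**(-2 * s))
    return x * scale, y * scale

def reachable_conventional(n_iter):
    """Angles reachable when each elementary angle i < n_iter is used exactly once."""
    return {round(sum(mu * math.atan(2.0**-i) for i, mu in enumerate(mus)), 12)
            for mus in product((-1, 1), repeat=n_iter)}

def reachable_mvr(n_angles, r_m):
    """Angles reachable with r_m microrotations that may be repeated or skipped."""
    moves = [0.0] + [a * math.atan(2.0**-s)
                     for s in range(n_angles) for a in (-1, 1)]
    return {round(sum(combo), 12) for combo in product(moves, repeat=r_m)}

# pi/2 realized by repeating the elementary angle arctan(1) = pi/4 twice
print(mvr_rotate(1.0, 0.0, [0, 0], [1, 1]))
# number of distinct reachable angles for the same iteration number
print(len(reachable_conventional(3)), len(reachable_mvr(4, 3)))
```

The second print compares the two constellations under an equal iteration number, which is the comparison the text draws from Fig. 1.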
III. SEARCHING ALGORITHMS AND COMPARISON
A. Searching Algorithms
Consider (5) in the previous section. The key issue in the MVR-CORDIC algorithm is now to find the sequences $s(i)$ and $\alpha(i)$ that minimize $\xi_m$, subject to the constraint that the total iteration number is confined to $R_m$. To solve this constrained problem, we consider the following three searching algorithms.
1) Greedy Algorithm:
In the greedy algorithm, we approach the target rotation angle, $\theta$, step by step. That is, in each step, decisions are made on $s(i)$ and $\alpha(i)$ by choosing the combination that brings the accumulated rotation closest to the target. Specifically, $s(i)$ and $\alpha(i)$ are determined such that the error function $\left|\theta_i - \alpha(i)\, a(s(i))\right|$ is minimized, where $\theta_i$ is the residue angle in the $i$th step, defined as

$$\theta_0 = \theta, \qquad \theta_{i+1} = \theta_i - \alpha(i)\, a(s(i)). \tag{6}$$

The search terminates if no further improvement can be found, i.e., $|\theta_{i+1}| \ge |\theta_i|$, or when $s(R_m-1)$ and $\alpha(R_m-1)$ have been determined at the end of the searching algorithm. The detailed flowchart of the greedy algorithm is shown in Fig. 2(a). This approach is similar to the one used in [17]. The major difference is that in [17] the greedy algorithm terminates only when the residue angle error cannot be further reduced, so the total iteration number is not fixed.
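The greedy procedure described above can be sketched as follows. This is a minimal illustration, not the flowchart of Fig. 2(a): the function name `greedy_search`, the floating-point angle table, and the convention of padding unused iterations with a direction of 0 (a skipped microrotation) are assumptions made for the example.

```python
import math

def greedy_search(theta, n_angles, r_m):
    """Greedy choice of the rotational (s) and directional (alpha) sequences."""
    angles = [math.atan(2.0**-i) for i in range(n_angles)]
    s_seq, alpha_seq = [], []
    residue = theta
    for _ in range(r_m):
        # best single microrotation for the current residue angle
        err, s, alpha = min((abs(residue - a * angles[s]), s, a)
                            for s in range(n_angles) for a in (-1, 1))
        if err >= abs(residue):
            break                       # no further improvement possible
        s_seq.append(s)
        alpha_seq.append(alpha)
        residue -= alpha * angles[s]
    # unused iterations are recorded as skipped microrotations (alpha = 0)
    pad = r_m - len(s_seq)
    return s_seq + [0] * pad, alpha_seq + [0] * pad, abs(residue)


# pi/2 is matched by using the first elementary angle twice
print(greedy_search(math.pi / 2, n_angles=8, r_m=4))
```

Each step costs one pass over all elementary angles and both directions, which is why this search is the cheapest of the three.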
2) Exhaustive Algorithm:
The second approach to solving the constrained problem is the exhaustive algorithm. The idea is to search the entire solution space, i.e., all possible combinations of $s(i)$ and $\alpha(i)$, in one single step. Decisions for $s(i)$ and $\alpha(i)$, for $0 \le i \le R_m - 1$, are made such that the error function (7) is minimized. Obviously, the exhaustive searching algorithm produces the globally optimal solution. The flowchart of the algorithm is depicted in Fig. 2(b).
3) Semigreedy Algorithm:
The last searching algorithm is the semigreedy algorithm, which can be viewed as a combination of the greedy and exhaustive algorithms. The whole searching space of $s(i)$ and $\alpha(i)$ for $0 \le i \le R_m - 1$ is divided into several sections, each containing $S$ iterations. We call such a segment a block, and $S$ is termed the block length. The segmentation scheme is illustrated in Fig. 3. In the semigreedy algorithm, the exhaustive search is performed within each isolated block, and the connection between consecutive blocks is determined in the greedy manner. Specifically, in the th block (corresponds to th step in performing the searching algorithm), the decisions of and for (8) where (9) is the residue angle in the th step. Fig. 2(c) illustrates the detailed flowchart of the proposed semigreedy algorithm.
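A compact way to see how the three searches relate is to implement the block-wise search once and vary the block length. The sketch below is an assumption-laden illustration: `semigreedy_search` and its parameter names are hypothetical, a skipped microrotation is modeled as an extra "move" of zero angle, and the final block is simply shortened when the block length does not divide the iteration number.

```python
import math
from itertools import product

def semigreedy_search(theta, n_angles, r_m, block_len):
    """Exhaustive search inside each block, greedy hand-over between blocks."""
    angles = [math.atan(2.0**-i) for i in range(n_angles)]
    # a move is (s, alpha); (0, 0) stands for a skipped microrotation
    moves = [(0, 0)] + [(s, a) for s in range(n_angles) for a in (-1, 1)]
    s_seq, alpha_seq = [], []
    residue = theta
    remaining = r_m
    while remaining > 0:
        blk = min(block_len, remaining)
        # try every combination of blk moves and keep the closest one
        best = min(product(moves, repeat=blk),
                   key=lambda c: abs(residue - sum(a * angles[s] for s, a in c)))
        for s, a in best:
            s_seq.append(s)
            alpha_seq.append(a)
        residue -= sum(a * angles[s] for s, a in best)
        remaining -= blk
    return s_seq, alpha_seq, abs(residue)


theta = 0.6
for block_len in (1, 2, 4):   # 1 acts greedily, 4 = r_m is the exhaustive search
    print(block_len, semigreedy_search(theta, n_angles=6, r_m=4, block_len=block_len)[2])
```

With a block length of 1 the search makes one locally best decision per iteration, and with a block length equal to the iteration number it enumerates the full solution space, so its residue can never be worse than the greedy one.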
B. Comparison of Computational Complexity and Error Performance
Next, we compare the computational complexity and error performance of the three searching algorithms developed above. The comparison results are shown in Table III. The three parameters $N$, $R_m$, and $S$ are the number of elementary angles, the restricted iteration number, and the searching block length of the semigreedy algorithm, respectively. The conventional CORDIC algorithm is also included for comparison. Note that the complexity is expressed in terms of the loop number. We use it to approximate the execution time, since the number of loops dominates the computational complexity of these searching algorithms.
In addition, in Table IV, we show the numerical results of the loop number by setting and . The Normalized Loop Number is defined as

$$\text{Normalized Loop Number} = \frac{\text{Loop number of the proposed searching algorithm}}{\text{Loop number of the conventional CORDIC algorithm}} \tag{10}$$

It can be used to illustrate the complexity gaps between these searching algorithms. In the last row, we show the averaged residue angle error for and . The ensemble average was carried out over 65 equally spaced angles from 0 to , i.e., . Based on the results in Tables III and IV, we can make the following observations:
• The greedy algorithm requires the least computational complexity among the three algorithms, but it generates the sequences with the worst error performance . This algorithm can be used to give designers a quick but rough indication of the performance of a specified application using the MVR-CORDIC algorithm. It can also be applied to situations where iterative design is often involved, such as lattice filter design. In this case, the designer may have to go through many design iterations to determine the restricted iteration number, $R_m$, and the wordlength, $W$, so as to meet the filter specification. With the greedy search, the designer can choose these design parameters within a short design period.
• The exhaustive algorithm consumes the longest execution time while resulting in the best error performance . It can be applied to angles that are often employed in DSP applications, such as the twiddle factors, , in FFT/IFFT. In such applications, for a given angle, we only have to run the algorithm once to find the rotational sequence and the directional sequence. This algorithm can also be applied to applications where performance is of the utmost importance and there is no restriction on execution time.
• The exhaustive algorithm can provide the best SQNR performance. However, for large or , it becomes practically infeasible due to its heavy computational complexity. In such a situation, the semigreedy algorithm can be employed instead. The semigreedy algorithm provides a tradeoff between the other two algorithms described above. Its parameter $S$ controls the algorithm so as to generate good error performance within a moderate execution time . We can treat the semigreedy algorithm as the generalized version of the three proposed searching algorithms, i.e., the greedy algorithm is the semigreedy algorithm with $S = 1$, and the exhaustive algorithm is the semigreedy algorithm with $S = R_m$.
IV. RELATIONSHIP BETWEEN ERROR PERFORMANCE AND DESIGN PARAMETERS
In this section, three experiments are conducted to illustrate the relationship between error performance and the parameters used in the three searching algorithms. As with the previous experiments, the averaged residue angle error, , is obtained based on ensemble averaging of 65 angles from 0 to with equal space.
• Error Performance Versus Number of Elementary Angles ($N$): In Fig. 4, the averaged residue angle error is plotted versus the number of elementary angles, $N$, for a fixed iteration number . Note that, to obtain the best error performance in the conventional CORDIC algorithm, the number of elementary angles is set to be . In Fig. 4, for all the searching algorithms, the error performance improves as the number of elementary angles increases. This is because we have more choices in approximating the rotation angle, $\theta$, which results in a smaller residue angle error. However, the error curves gradually saturate once $N$ exceeds a certain value. The reason is that the error performance cannot be improved without limit by increasing $N$ alone when the restricted iteration number is kept fixed. The saturation of the error performance suggests that we can run the searching algorithms with a smaller number of elementary angles, (say in this case), instead of using directly. By doing so, we can reduce the computational complexity of the searching algorithms while retaining comparable error performance.
• Error Performance Versus Restricted Iteration Number ($R_m$): Fig. 5 shows the relationship between the error performance and the restricted iteration number, $R_m$, for the algorithms of interest. Similar to Fig. 4, the results in Fig. 5 show that increasing $R_m$ improves the error performance. The error curves also saturate gradually for large $R_m$. This phenomenon can be explained in the same way as for Fig. 4.
One important observation is as follows. We use the horizontal dashed line to represent the averaged noise level of the conventional CORDIC algorithm with . With the greedy search, the MVR-CORDIC algorithm performs as well as the conventional CORDIC algorithm with only 4 iterations . For the semigreedy and the exhaustive search, even fewer iterations ( in this case) are needed to achieve comparable error performance. Recall that the conventional CORDIC algorithm has to go through 8 ( ) microrotations to reach such a noise level. In addition, computer simulations also indicate that by using the semigreedy algorithm with moderate $S$, the MVR-CORDIC algorithm requires only iterations (microrotations), on average, to achieve comparable (or even better) error performance compared with the conventional CORDIC algorithm.
• Error Performance Versus Searching Block Length ($S$): Fig. 6 depicts the residue angle error versus the searching block length ($S$). As can be seen, we obtain better performance by increasing the searching block length of the semigreedy algorithm. The results confirm our argument that, for larger $S$, the semigreedy algorithm behaves like the exhaustive algorithm. On the other hand, when $S$ is small, it behaves like the greedy algorithm due to the confined searching space. Moreover, the saturation phenomenon suggests that we can use the semigreedy algorithm with a moderate value of $S$ (say in this case) to obtain near-optimum error performance without going through the exhaustive search. Meanwhile, the saving in computational complexity is significant. For example, in this experiment with and , the computational complexity of the semigreedy search is only about 0.0326% of that of the exhaustive search, and the performance difference is below 60 dB.
V. SELECTIVE PREROTATION SCHEME FOR MVR-CORDIC ALGORITHM
A. Conventional Prerotation Scheme
From (5), we can easily verify that all the reachable angles of the MVR-CORDIC algorithm are confined to the range of with the setting and , for . Hence, to perform vector rotation of an arbitrary angle, i.e., , directly, the MVR-CORDIC requires at least 4 iterations ( ) to cover all possible rotation angles. However, the residue angle error will be increased approximately with the value of for . This can be easily explained by observing the constellation of the reachable angles in Fig. 1(a): the distribution is sparse for of large value, which results in a large residue angle error. In general, the error can be suppressed by dividing the rotation of a large (i.e., ) into two steps (assume ):
1) prerotation by an angle of ( );
2) MVR-CORDIC rotation of the remaining angle.
In step 1), the prerotation can be easily accomplished without going through the CORDIC algorithm. In step 2), we continue with the MVR-CORDIC rotation of angle ( ), which is a smaller angle compared with the original . By doing so, we can keep the rotation angle in step 2) below , and hence prevent the MVR-CORDIC from rotating a large angle directly. Consequently, better error performance can be obtained.
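The two-step idea can be sketched as below. The specific prerotation angles are not recoverable from the text here, so this example assumes, purely for illustration, that the prerotation is a multiple of $\pi/2$ chosen so that the remaining angle falls in $[0, \pi/2)$; the names `split_conventional` and `prerotate_quadrants` are hypothetical. A rotation by a multiple of $\pi/2$ needs only coordinate swaps and sign changes, so it costs no CORDIC iterations.

```python
import math

def split_conventional(theta):
    """Split theta into k quarter turns plus a remainder in [0, pi/2).

    Assumption: the prerotation is the largest multiple of pi/2 not above theta.
    """
    k = math.floor(theta / (math.pi / 2))
    return k, theta - k * (math.pi / 2)

def prerotate_quadrants(x, y, k):
    """Rotate (x, y) by k*pi/2 using only swaps and negations."""
    for _ in range(k % 4):
        x, y = -y, x          # one counterclockwise quarter turn
    return x, y


k, remainder = split_conventional(5.0)
print(k, remainder)                        # quarter turns and the angle left for MVR-CORDIC
print(prerotate_quadrants(1.0, 2.0, k))    # vector after the prerotation step
```

Only the remainder is then passed to the MVR-CORDIC search, which keeps the searched angle in the region where reachable angles are denser.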
B. Selective Prerotation Scheme
Based on the above design concept, we develop an improved scheme, called the Selective Prerotation Scheme, for the MVR-CORDIC rotation of an arbitrary angle. The main concept of the new scheme is that we attempt to approach the rotation angle in either the clockwise or the counterclockwise direction. The bidirectional rotation scheme is achieved by introducing two different prerotation angles, where one prerotation angle is greater than $\theta$ and the other is smaller than $\theta$. In general, these two types of rotation with different prerotation angles behave differently in terms of $\xi_m$. Then, of the two candidates, the one with the smaller $\xi_m$ is selected to perform the MVR-CORDIC rotation of $\theta$, hence the name of the method.
To be more specific, in the proposed scheme, we first divide the MVR-CORDIC rotation into 4 groups based on the quadrant of the rotation angle in the complex plane. Then, in each group, two rotation types with different prerotation angles are suggested to carry out the rotation. To obtain better SQNR performance, we evaluate the respective $\xi_m$ for each type and choose the better of the two candidates. Similar to the conventional prerotation scheme, two steps are required to complete each MVR-CORDIC rotation: one step for the rotation by the prerotation angle, and the other step for the rotation of the remaining angle with the MVR-CORDIC algorithm. We summarize the proposed selective prerotation scheme in Table V. The proposed scheme provides better error performance than the conventional approach without increasing the hardware complexity.
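The selection step can be illustrated with a short Python sketch. Table V is not reproduced here, so the two candidate prerotation angles are assumed to be the multiples of $\pi/2$ just below and just above the target angle; a simple greedy search stands in for whichever searching algorithm is actually used. The names `greedy_residue` and `selective_prerotation` are hypothetical.

```python
import math

def greedy_residue(theta, n_angles, r_m):
    """Residue angle error left by a greedy MVR-CORDIC search (stand-in search)."""
    angles = [math.atan(2.0**-i) for i in range(n_angles)]
    residue = theta
    for _ in range(r_m):
        step = min((a * ang for ang in angles for a in (-1, 1)),
                   key=lambda d: abs(residue - d))
        if abs(residue - step) >= abs(residue):
            break                                  # no further improvement
        residue -= step
    return abs(residue)

def selective_prerotation(theta, n_angles=8, r_m=3):
    """Evaluate both candidate prerotations and keep the one with smaller error.

    Assumption: candidates are k*pi/2 just below and just above theta.
    Returns (residue error, k, remaining angle handed to MVR-CORDIC).
    """
    k_low = math.floor(theta / (math.pi / 2))
    candidates = []
    for k in (k_low, k_low + 1):
        remainder = theta - k * (math.pi / 2)      # positive or negative rotation
        candidates.append((greedy_residue(remainder, n_angles, r_m), k, remainder))
    return min(candidates)


print(selective_prerotation(2.0))
```

Because the candidate set always contains the conventional choice, the selected residue can never be larger than the conventional one under this model, and both candidates use the same hardware.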
C. Design Example and Simulation
To illustrate the modified scheme developed earlier, we consider the example of . First, the rotation angle belongs to the second group, i.e., . According to the selective prerotation scheme in Table V, the following two types of rotation procedure may be adopted:
• Type I: Prerotate by an angle of , followed by the MVR-CORDIC rotation of .
• Type II: Prerotate by an angle of , followed by the MVR-CORDIC rotation of .
Here, the semigreedy algorithm (with parameters of , and ) is used to search for the directional sequence, $\alpha(i)$, as well as the rotational sequence, $s(i)$, for these two angles. The results for these two rotation types are listed in Table VI, where the directional and rotational sequences are represented in vector form as
The corresponding SQNR values are calculated based on (A.1) with (see footnote 4). It can be seen clearly that, in this case, we obtain better error performance by using the Type-I rotation. As a result, we can perform the rotation of with the MVR-CORDIC rotation procedure of Type-I in Group-II.
Moreover, in Fig. 7 , the residue angle error, , is plotted versus 65 angles distributed from 0 to with equal space for three rotation schemes discussed in this section; namely, the direct rotation (no prerotation phase), the conventional scheme, and the proposed modified scheme. The experiment is carried out based on the semigreedy algorithm with parameters of , and . Based on the simulation results in Fig. 7 , we can make the following observations.
• Unlike the direct rotation approach, the error performance of the selective prerotation scheme will not degrade as increases.
• The selective prerotation scheme can provide apparent improvement compared with the conventional prerotation scheme for . The reason is that, for , we can still perform the MVR-CORDIC rotation of that is smaller than . However, the conventional scheme has to perform the rotation with an angle that is greater than .
• The proposed scheme consistently performs best among the three schemes for all the rotation angles. The averaged residue angle errors of the 65 angles for the direct rotation scheme, conventional scheme, and proposed scheme are , and , respectively.
VI. SELECTIVE SCALING SCHEME FOR MVR-CORDIC ALGORITHM
A. Scaling Operation
In this section, we consider an implementation issue: the scaling phase of the MVR-CORDIC algorithm. The scaling operation is intended to preserve the norm of the vector, $[x(0)\;\; y(0)]^T$, after the sequence of microrotations. For convenience of representation, the scaling factor, $P$, of the MVR-CORDIC algorithm in Table II is represented as [1]-[3]

$$P = \prod_{i=0}^{R_m-1} \left[ 1 + \alpha^2(i)\, 2^{-2s(i)} \right]^{-1/2}. \tag{11}$$

To save hardware complexity, in practical implementations the scaling operation is performed by quantizing the scaling factor, $P$, in two forms [18]-[20], i.e.,

$$\text{Type I:}\quad \tilde{P} = \sum_{j=0}^{R_s-1} k_1(j)\, 2^{-q_1(j)} \tag{12}$$

$$\text{Type II:}\quad \tilde{P} = \prod_{j=0}^{R_s-1} \left[ 1 + k_2(j)\, 2^{-q_2(j)} \right] \tag{13}$$

where
$\tilde{P}$: quantized value of $P$;
$R_s$: restricted iteration number in the scaling phase;
$k_1(j), k_2(j) \in \{-1, 0, 1\}$; and $q_1(j), q_2(j) \in \{0, 1, \ldots, W-1\}$.
The corresponding iteration procedures for these two scaling types are as follows: with , and , . By doing so, we can approximate the multiplication by $P$ with only shift-and-add operations, and the scaling operation can share the same circuits with the MVR-CORDIC microrotation module (to be discussed in Section VIII), which eliminates the significant overhead of scaling multipliers.

Footnote 4: These SQNR values are obtained without considering the effects of the scaling operation, which will be discussed in the next section.
As one can expect, this process introduces some quantization noise, and the noise increases as $R_s$ decreases. Similar to the microrotation phase described in Section II-B, we introduce another performance index, $\xi_s$, to describe the amount of error introduced by the approximation process of (12) and (13). The scaling approximation error, $\xi_s$, is defined as

$$\xi_s \triangleq \left| P - \tilde{P} \right|. \tag{16}$$

The relationship between $\xi_s$ and the SQNR performance is discussed in the Appendix.
B. Selective Scaling Scheme
In Fig. 8(a) and (b), we illustrate the distribution of the values that can be represented by (12) and (13) with and . It is interesting to see that the distributions for these two types of representation are quite different, i.e., the distribution of Type-I is dense in the region of ; on the contrary, the distribution of Type-II is concentrated in the different region of . Based on this observation, and applying an idea similar to the one used in Section V-B, we propose a novel scaling operation, called the Selective Scaling Scheme, for the MVR-CORDIC algorithm.
The basic idea of the selective scaling scheme is to combine the two types of scaling operation in (12) and (13). That is, for a given scaling factor, a better strategy to quantize $P$ is to find the smallest $\xi_s$ (the $\tilde{P}$ closest to $P$) with respect to each of these two types of representation. We then choose the one with the smaller $\xi_s$ from the two candidate scaling types. Hence, we can carry out the scaling operation with better SQNR performance.
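The selection between the two scaling types can be sketched as follows. The sketch assumes that a Type-I approximation is a sum of signed powers of two and a Type-II approximation is a product of factors of the form one plus a signed power of two, each with a fixed number of terms, and it finds the best candidate of each type by brute force. The function names, the brute-force search, and the use of the absolute difference as the error measure are assumptions for illustration only.

```python
import math
from itertools import product

def _closest(values, p):
    """Return (error, value) for the candidate value closest to p."""
    return min((abs(p - v), v) for v in values)

def quantize_scaling(p, r_s, w):
    """Best Type-I (sum) and Type-II (product) approximations of p.

    Each uses r_s terms; a term is 0 or a signed power of two 2^-q, q < w.
    """
    terms = [0.0] + [k * 2.0**-q for q in range(w) for k in (-1, 1)]
    type1 = _closest((sum(c) for c in product(terms, repeat=r_s)), p)
    type2 = _closest((math.prod(1.0 + t for t in c)
                      for c in product(terms, repeat=r_s)), p)
    return type1, type2

def selective_scaling(p, r_s, w):
    """Pick the scaling type with the smaller approximation error."""
    type1, type2 = quantize_scaling(p, r_s, w)
    return ("I", *type1) if type1[0] <= type2[0] else ("II", *type2)


# scaling factor after microrotations with shifts 1 and 2 (both executed)
p = 1.0 / math.sqrt(1 + 2.0**-2) / math.sqrt(1 + 2.0**-4)
print(quantize_scaling(p, r_s=3, w=8))
print(selective_scaling(p, r_s=3, w=8))
```

Since both types map onto shift-and-add steps, choosing between them per angle only changes the stored control sequence, not the datapath.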
C. Design Example
We use the example of to demonstrate the proposed scaling procedure. First, by substituting into (11), we obtain Assume that . We summarize the results for these two scaling types in Table VII, where and for are represented in vector form as respectively. The SQNR before performing such a quantized scaling operation (i.e., assuming floating-point scaling) is 60.76 dB. In Type-I scaling, the SQNR drops to 47.93 dB due to the introduced quantization error of . On the other hand, Type-II scaling has a relatively low quantization noise of , and the SQNR degradation is below 0.01 dB. Thus, by carefully choosing the scaling type in the proposed scheme, we can achieve better SQNR performance for the MVR-CORDIC algorithm.
VII. ITERATION-TRADEOFF SCHEME AND DESIGN FLOW
Because the microrotation phase and the scaling phase are similar in nature and in their operations, one question arises in the application of the MVR-CORDIC algorithm: how should $R_m$ and $R_s$ be determined in an optimal sense? In this section, we provide a systematic design flow to determine these two important design parameters for the MVR-CORDIC algorithm.
A. Iteration-Tradeoff Scheme for $R_m$ and $R_s$
From the previous discussions, we know that to carry out the complete MVR-CORDIC algorithm in Table II, we have to go through two separate phases with a total of $R$ iterations. We define $R$ as

$$R = R_m + R_s \tag{17}$$

where $R_m$ and $R_s$ are the restricted iteration numbers in the microrotation phase and the scaling phase, respectively. When the total iteration number is of major concern, (17) implies that we can trade off $R_m$ against $R_s$. That is, we can change $R_m$ and $R_s$ to $R_m'$ and $R_s'$, respectively, subject to the constraint

$$R_m' + R_s' = R. \tag{18}$$

We are justified in doing so since the basic iterative operations in these two phases are almost the same. The modification can help to further improve the SQNR performance for certain rotation angles. In the following, we develop two tradeoff schemes depending on the characteristics of the rotation angle in the MVR-CORDIC algorithm.
• Case I: Trading $R_s$ for $R_m$. In this case, we attempt to gain additional rotation accuracy in the microrotation phase at the expense of degraded precision in the scaling operation. This can be applied to the MVR-CORDIC rotation when the residue angle error, $\xi_m$, dominates the overall SQNR performance ( ). It is suitable for the situation in which the rotation angle lies in a region with a relatively sparse distribution [see Fig. 1(a)], while the corresponding scaling factor, $P$, can be well represented with iterations. Of course, the trade is only meaningful if the extra SQNR gained in the microrotation phase compensates for the SQNR loss in the scaling phase. An example of Case I is presented in Table VIII.
• Case II: Trading $R_m$ for $R_s$. On the other hand, when the overall SQNR performance is dominated by the quantization error introduced in the scaling operation, ( ), we attempt to obtain a more accurate representation of the scaling factor at the cost of angle resolution in the microrotation phase. It is suitable for the situation in which the scaling factor lies in a region with a relatively sparse distribution (see Fig. 8), while the rotation angle, $\theta$, can be well represented with elementary angles. An example of such a tradeoff case is presented in Table IX.
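The tradeoff between the two phases can be explored numerically by trying every split of a fixed iteration budget. The sketch below is a rough illustration under several assumptions: greedy searches stand in for both phases, the scaling approximation is a sum of signed powers of two, and the sum of the squared angle error and squared scaling error is used as a stand-in cost because the SQNR expression of the Appendix is not reproduced here. All function names are hypothetical.

```python
import math

def greedy_angle(theta, n_angles, r_m):
    """Greedy MVR search; returns (residue angle error, exact scaling factor)."""
    residue, p = theta, 1.0
    for _ in range(r_m):
        s, a = min(((s, a) for s in range(n_angles) for a in (-1, 1)),
                   key=lambda m: abs(residue - m[1] * math.atan(2.0**-m[0])))
        step = a * math.atan(2.0**-s)
        if abs(residue - step) >= abs(residue):
            break                                   # remaining iterations skipped
        residue -= step
        p /= math.sqrt(1.0 + 2.0**(-2 * s))         # each executed step shrinks P
    return abs(residue), p

def greedy_scale_error(p, r_s, w):
    """Greedy sum-of-signed-powers-of-two approximation of p; returns its error."""
    approx = 0.0
    for _ in range(r_s):
        term = min((k * 2.0**-q for q in range(w) for k in (-1, 1)),
                   key=lambda t: abs(p - approx - t))
        if abs(p - approx - term) >= abs(p - approx):
            break
        approx += term
    return abs(p - approx)

def tradeoff(theta, r_total, n_angles=8, w=8):
    """Try every split r_m + r_s = r_total; return (cost, r_m, r_s) of the best."""
    results = []
    for r_m in range(1, r_total):
        xi_m, p = greedy_angle(theta, n_angles, r_m)
        xi_s = greedy_scale_error(p, r_total - r_m, w)
        results.append((xi_m**2 + xi_s**2, r_m, r_total - r_m))  # stand-in cost
    return min(results)


print(tradeoff(0.3, r_total=6))
```

Splits that give more iterations to the microrotation phase correspond to Case I and the opposite direction to Case II; which one wins depends on which error term dominates for the given angle.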
B. Design Flow for the MVR-CORDIC Algorithm
Based on the above iteration-tradeoff scheme, we derive the design flow as well as the optimization procedure to determine the optimal and for the MVR-CORDIC algorithm.
• Step 1: Determine . In practical implementation, must be determined according to the system requirement, such as speed, power consumption, silicon area, and, of course, the SQNR performance.
• Step 2: Initialize and . Once is determined, we have to allocate and for the microrotation and scaling phases, respectively. As a rule of thumb, we initially set and (assuming is an even integer). With this initial setting, the design flow is likely to converge to the optimum solution within the fewest design iterations, in an average sense.
• Step 3: Perform the Searching Algorithm with and . With and , we can apply the semigreedy algorithm to compute , , , and , as well as and in these two phases. For an iterative design process, it is better to use a moderately small (say , 2, or 3) to accelerate the computation.
• Step 4: Estimate the SQNR Performance.
In general, the estimation of SQNR is a time-consuming process because we have to go through extensive computer simulation to obtain a reliable SQNR value. Fortunately, with the SQNR analysis developed in the Appendix, i.e., SQNR(dB) (19) we can accurately estimate the SQNR value by simply substituting and into (19). Moreover, (19) indicates that the SQNR reaches its maximum value when and are of the same magnitude, owing to the dependency between these two error indices. That is, decreasing may have the effect of increasing , and vice versa. We can use this property as the design guideline in determining the optimal and . Based on this observation as well as the iteration-tradeoff scheme developed earlier in this section, we are able to derive an optimization procedure, as described in Steps 5 to 7.
• Step 5: Apply the Iteration-Tradeoff Scheme
The selection of the tradeoff-type depends on the quantities of the errors, and , in these two phases. Value of . The final step of the design procedure is to perform the semigreedy algorithm with larger value of as well as the optimized and , which are obtained in the iteration optimization procedure. In Fig. 9 , we illustrate the corresponding design flowchart of the proposed design methodology and optimization procedure.
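The optimization loop of the design flow can be sketched as follows. This is a hedged illustration under stated assumptions: `evaluate` is a hypothetical callback standing in for the searching algorithm (it returns the two error indices for a given split), and `estimate_sqnr_db` is a stand-in estimator that treats the two errors as independent noise terms whose powers add; it is not a reproduction of (19).

```python
import math


def estimate_sqnr_db(angle_err, scale_err):
    """Stand-in SQNR estimate (assumption, not the paper's (19)):
    the two error terms are treated as independent noise powers."""
    return -10.0 * math.log10(angle_err ** 2 + scale_err ** 2)


def optimize_split(total, evaluate):
    """Start from an even split and shift one iteration at a time toward
    the phase with the larger error until the estimate stops improving.

    `evaluate(micro, scale)` must return (angle_err, scale_err).
    """
    micro = total // 2
    scale = total - micro
    angle_err, scale_err = evaluate(micro, scale)
    best = estimate_sqnr_db(angle_err, scale_err)
    while True:
        if angle_err > scale_err and scale > 1:
            cand = (micro + 1, scale - 1)   # Case I tradeoff
        elif scale_err > angle_err and micro > 1:
            cand = (micro - 1, scale + 1)   # Case II tradeoff
        else:
            break                           # errors balanced or no room left
        cand_errs = evaluate(*cand)
        cand_sqnr = estimate_sqnr_db(*cand_errs)
        if cand_sqnr <= best:
            break                           # no improvement: terminate
        (micro, scale), (angle_err, scale_err), best = cand, cand_errs, cand_sqnr
    return micro, scale, best


def toy_errors(micro, scale):
    """Toy error model for illustration only (not measured data)."""
    return 2.0 ** (-2 * micro), 2.0 ** (-3 * scale)
```

With the toy model and a budget of 10 iterations, the loop moves from the even split (5, 5) to (6, 4), where the two error terms are equal, and then stops.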
C. Design Example
Consider the rotation angle in the application of the MVR-CORDIC algorithm. Assume and the wordlength is . We initially set , and apply the semigreedy algorithm with a small ( in this case). Based on the SQNR refinement schemes developed in Sections V and VI, it can be found that and for Type-I rotation and Type-II scaling operation, respectively. The SQNR value is 70.93 dB. Next, by applying the Case-II tradeoff on and in (20), we obtain the improved results of , , and an SQNR value of 83.44 dB with Type-I rotation and Type-II scaling operation. The iteration numbers in the microrotation and scaling phases are now and , respectively. Since no further improvement is possible, the optimization procedure is terminated. Then, we can apply the semigreedy algorithm with , , and a large value of ( in this case). The resultant SQNR value can be further improved to 87.39 dB. Fig. 10 shows all the possible combinations of and and their corresponding SQNR results under the constraint . Based on the results, we can make the following observations.
1) The proposed design procedure can provide the optimal solution in the determination of and . That is, the resultant and computed by our design flow are the same as the optimal ones in Fig. 10. The important point is that, with the aid of the proposed design flow, we can obtain the optimal solution within only one design iteration in this case, instead of exhaustively checking all possible combinations of and .
2) As can be seen from Fig. 10(b), the theoretical SQNR values, which are obtained by simply using (19), coincide exactly with the simulated results. The simulated SQNR values are calculated by ensemble-averaging 10 000 output SQNR values generated by MVR-CORDIC rotation of 10 000 random input vectors . The results confirm the validity of (19) for fast estimation of the SQNR value.
3) The optimal SQNR (83.44 dB) occurs when and are of about the same magnitude. This result is consistent with our earlier argument for (19).
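The ensemble-averaging measurement described in observation 2) can be sketched as below. This is a minimal illustration, not the authors' simulation code: `average_sqnr_db` and its arguments are hypothetical names, the input vectors are drawn uniformly from a square (the distribution used in the paper is not stated here), and `approx_rotate` stands for whatever approximate rotation is under test.

```python
import math
import random


def rotate(x, y, theta):
    """Ideal rotation of the vector (x, y) by the angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return c * x - s * y, s * x + c * y


def average_sqnr_db(approx_rotate, theta, trials=10000, seed=0):
    """Average the per-vector SQNR (in dB) of `approx_rotate` against the
    ideal rotation over `trials` random input vectors."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(trials):
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        rx, ry = rotate(x, y, theta)
        ax, ay = approx_rotate(x, y)
        signal = rx * rx + ry * ry
        noise = (rx - ax) ** 2 + (ry - ay) ** 2
        if signal == 0.0 or noise == 0.0:
            continue  # skip degenerate samples (undefined SQNR)
        total += 10.0 * math.log10(signal / noise)
        count += 1
    return total / count
```

As a sanity check, an approximate rotation whose only defect is a small angle error `d` yields the same SQNR of `-20*log10(2*sin(d/2))` for every input vector, so the average equals that value regardless of the inputs.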
VIII. VLSI IMPLEMENTATION OF MVR-CORDIC
A. Iterative MVR-CORDIC Structure
In Fig. 11, we illustrate the iterative structure for the proposed MVR-CORDIC algorithm. It is similar to the conventional iterative CORDIC structure [3]. The major difference between these two implementations lies in their control units. As shown in Fig. 11, two separate phases are performed to complete a single MVR-CORDIC rotation, i.e., the microrotation phase (marked by the solid line) and the scaling phase (marked by the dashed line). In each phase, three kinds of control signals are used to control the operations:
• in microrotation phase and in scaling phase: it controls the number of bits to be shifted by barrel shifters.
• in microrotation phase and in scaling phase: it determines the operations of adder/subtracter.
• Control signal, : it governs the phase switching of the iterative MVR-CORDIC structure and the scaling type (Type-I or Type-II) in scaling phase.
All the control signals can be generated by the proposed searching algorithm in advance, and are stored in ROM.
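A behavioral sketch of the two-phase iterative datapath may help make the role of the control signals concrete. Everything here is an assumption for illustration: the function name `mvr_cordic_iterative`, the use of plain integers as fixed-point values, the microrotation form (a shift-and-add step whose direction may be -1, 0, or +1), and the scaling form (a product of factors of the form 1 + k * 2^(-shift)). It does not model the Type-I/Type-II selection or the ROM.

```python
def mvr_cordic_iterative(x, y, micro_ctrl, scale_ctrl):
    """Run one rotation on integer (fixed-point) inputs x and y.

    micro_ctrl: list of (shift, direction) pairs for the microrotation
                phase; direction is -1, 0, or +1 (0 skips the step).
    scale_ctrl: list of (shift, sign) pairs for the scaling phase; each
                step multiplies both outputs by (1 + sign * 2**-shift).
    Both lists play the role of the precomputed control words.
    """
    # Microrotation phase: barrel shift, then add or subtract.
    for shift, direction in micro_ctrl:
        x, y = x - direction * (y >> shift), y + direction * (x >> shift)
    # Scaling phase: the same shift-and-add datapath, applied per component.
    for shift, sign in scale_ctrl:
        x, y = x + sign * (x >> shift), y + sign * (y >> shift)
    return x, y
```

For example, with the input (16384, 0), one microrotation `(0, +1)` gives (16384, 16384), i.e., a 45 degree rotation with gain sqrt(2); the scaling steps `(2, -1)` and `(4, -1)` then multiply by 0.75 * 0.9375 = 0.703125, a coarse approximation of 1/sqrt(2).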
To evaluate the speed performance of the iterative structure, we assume that denotes the execution time needed to carry out a single iteration of microrotation (or scaling). For the MVR-CORDIC algorithm, the total execution time is . On the other hand, the conventional CORDIC algorithm takes iterations in the microrotation phase and another iterations in the scaling phase. Thus, it requires a total of to complete one CORDIC rotation. To make a fair comparison between these two approaches, we compare these two numbers, and , under the condition of equal SQNR performance. From Section IV, we know that by using the semigreedy search, the MVR-CORDIC algorithm requires an average of iterations to reach an error performance comparable to that of the conventional CORDIC in the microrotation phase. In the scaling phase, Hu [3] has reported that on the average. A similar result of can also be obtained for the MVR-CORDIC algorithm. Hence, and . That is, we can save about 50% of the execution time in the iterative implementation of the MVR-CORDIC algorithm compared with the conventional CORDIC algorithm.
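The comparison above is simple arithmetic: each scheme costs the sum of its two phase lengths times the per-iteration time. The sketch below uses hypothetical helper names, and the iteration counts in the usage note are placeholders chosen only to illustrate a halving; they are not figures taken from the paper.

```python
def execution_time(micro_iters, scale_iters, t_iter=1.0):
    """Total time of one rotation on the iterative structure: every
    microrotation or scaling iteration costs the same t_iter."""
    return (micro_iters + scale_iters) * t_iter


def relative_saving(t_new, t_ref):
    """Fraction of the reference execution time that is saved."""
    return 1.0 - t_new / t_ref
```

For instance, if a conventional design needed 16 + 4 iterations and the restricted design 8 + 2 (placeholder values), `relative_saving(execution_time(8, 2), execution_time(16, 4))` evaluates to 0.5.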
Note that the execution time is different from the runtime mentioned in Section III. They are two different design issues in the proposed MVR-CORDIC algorithm. Given a target angle , the runtime denotes the time needed to determine the design parameters of the MVR-CORDIC. The execution time is defined as the hardware execution time needed to perform the vector rotation in VLSI circuits.
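To make the runtime side of this distinction concrete, the following sketch shows an offline search that produces a control sequence for a given target angle. It is a plain greedy search, not the semigreedy algorithm of Section IV, and it is given only as a simplified stand-in. The function name, the candidate set of elementary angles atan(2^-s) for s below the wordlength, and the option of skipping an iteration are assumptions.

```python
import math


def greedy_angle_search(theta, num_iters, wordlength):
    """Choose, at each iteration, the signed elementary angle atan(2**-s)
    that leaves the smallest residue; skip the iteration if none helps.

    Returns the (shift, direction) sequence and the final residue angle.
    """
    residue = theta
    sequence = []
    for _ in range(num_iters):
        best_shift, best_dir, best_abs = 0, 0, abs(residue)  # direction 0 = skip
        for shift in range(wordlength):
            elem = math.atan(2.0 ** -shift)
            for direction in (-1, 1):
                cand = abs(residue - direction * elem)
                if cand < best_abs:
                    best_shift, best_dir, best_abs = shift, direction, cand
        residue -= best_dir * math.atan(2.0 ** -best_shift)
        sequence.append((best_shift, best_dir))
    return sequence, residue
```

The cost of this loop is paid once per target angle at design time; the resulting sequence is what the hardware replays on every rotation.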
B. Parallel and Pipelined MVR-CORDIC Structure
By unfolding the iterative implementation of Fig. 11, we can obtain the parallel MVR-CORDIC structure depicted in Fig. 12(a). The structure is composed of basic MVR-CORDIC processors connected in cascade form, in which the leading processors perform the microrotations and the following processors execute the scaling operations. Each basic MVR-CORDIC processor performs one iteration as specified in Fig. 11. Moreover, when the parallel structure is dedicated to rotation by one particular angle, the operation of each processor is fixed. We can thus easily reduce the hardware complexity by replacing all the control circuits, barrel shifters, and multiplexers with wire routing only.
To achieve a higher data throughput rate, we can further insert pipeline stages (latches) between successive processors of parallel structure, which results in the pipelined MVR-CORDIC structure in Fig. 12(b) . The pipelined structure is very suitable for real-time applications at high data bandwidth.
Due to the reduced iteration number, the parallel MVR-CORDIC structure can save about 50% of the silicon area compared with the conventional parallel CORDIC structure. The reason is that the silicon area is directly proportional to the number of basic processors, and the numbers of processors for MVR-CORDIC and conventional CORDIC are and , respectively. For the same reason, the critical path of the parallel structure in Fig. 12(a) is only 50% of that of the conventional pipelined CORDIC structure. The latency introduced by the pipelined structure can also be halved by employing the proposed MVR-CORDIC algorithm.
IX. CONCLUSION
In this paper, we present a new modified CORDIC algorithm, called the MVR-CORDIC algorithm, to accelerate the CORDIC operation. It can be applied to DSP applications where the rotation angles are known in advance, such as digital lattice filters and discrete orthogonal transformations. In addition, by applying the three SQNR refinement techniques developed in the paper, we can save at least 50% execution time in the iterative CORDIC structure, and 50% hardware complexity in the parallel CORDIC structure compared with the conventional CORDIC algorithm. Hence, low-power/high-speed CORDIC-based VLSI architectures for high-performance DSP applications become achievable.
APPENDIX
We can identify three error sources of the quantization noise: 1) the residue angle error, , in the microrotation phase, 2) the scaling approximation error, , in the scaling phase, and 3) the rounding error of the fixed-point arithmetic operations. In the following, we consider their impacts on the overall SQNR performance by modeling these errors as additive white noises.
Assuming that all the noise sources are independent of each other, the variance of the combined noise sources of the output vector after MVR-CORDIC rotation can be written as (A.7) In most applications, the first two noise terms dominate the overall quantization noise because is several orders of magnitude greater than . Putting all of these together, we can relate the SQNR to the performance indicators, and , as SQNR(dB) (A.8) Equation (A.8) plays an important role in determining the optimal iteration numbers and in Section VII-A.
