Abstract: Vector rotation is a key operation employed extensively in many digital signal processing applications. In this paper, we introduce a new design concept called Angle Quantization (AQ). It can be used as a design index for vector rotational operations in which the rotational angle is known in advance. Based on the AQ process, we establish a unified design framework for cost-effective low-latency rotational algorithms and architectures. Several existing works, such as the conventional COordinate Rotational DIgital Computer (CORDIC), AR-CORDIC, MVR-CORDIC, and EEAS-based CORDIC, can be fitted into the design framework, forming a Vector Rotational CORDIC Family. Moreover, we address four searching algorithms to solve the optimization problem encountered in the proposed vector rotational CORDIC family. The corresponding scaling operations of the CORDIC family are also discussed. Based on the new design framework, we can realize high-speed/low-complexity rotational VLSI circuits without degrading the precision performance in fixed-point implementations.
I. INTRODUCTION

VECTOR rotation plays an important role in many digital signal processing (DSP) applications. It is extensively employed as the processing kernel in discrete orthogonal transformations (DCT, DST, and DFT) [1]-[4], lattice-based (rotation-based) digital filtering [5], [6], sine/cosine generation [7], [8], and digital modulation/demodulation in communication systems [9], [10]. Let $[x_{in}\; y_{in}]^T$ and $[x_{out}\; y_{out}]^T$ denote the input and output vectors, respectively. Vector rotation of the input vector by a rotational angle $\theta$ can be formulated as

$$\begin{bmatrix} x_{out} \\ y_{out} \end{bmatrix} = R(\theta) \begin{bmatrix} x_{in} \\ y_{in} \end{bmatrix} \quad (1)$$

where the rotational matrix is defined as

$$R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}. \quad (2)$$

Fig. 1 shows the direct implementation of (2). As one can see, the direct implementation calls for four multipliers and two adders, and the critical path is the total delay of one multiplier and one adder. The direct implementation is therefore area-consuming and slow when rotational operations are heavily utilized in VLSI circuits.
In practical implementations, to increase the operational speed of rotation, one effective way is to employ high-speed arithmetic operators, such as Carry-lookahead Adder and Recoded Multipliers [11] , i.e., to reduce the delay of critical path. In general, the speed benefit is gained at the expense of advanced arithmetic operators. We call them Arithmetic-based approaches.
On the other hand, we can achieve high-speed/low-complexity rotational circuits by sacrificing rotation precision in fixed-point implementations. For example, shortening the assigned wordlength significantly reduces the hardware complexity, and hence the critical path. Also, Signed Power-of-Two (SPT) [12], [13] coding of the coefficient parameters ($\cos\theta$ and $\sin\theta$) is an alternative way to reduce the complexity of the direct implementation. Both approaches introduce quantization error through the approximation of the rotational matrix $R(\theta)$, which degrades the signal quality in practical implementations. The amount of quantization error depends on the assigned wordlength and the number of nonzero digits employed. We call such schemes Approximation-based approaches.
Following the concept of approximation-based approaches, in this paper we propose a novel framework to design high-speed/low-cost vector rotational VLSI circuits. Instead of performing quantization on the coefficient parameters ($\cos\theta$ and $\sin\theta$), the proposed design framework originates from the concept of Angle Quantization (AQ). The AQ derives its name from the fact that we perform the quantization process on the rotational angle, $\theta$, directly. That is, we decompose the original rotational angle into several sub-angles, $\tilde{\theta}_i$'s. Then, we try to sum up those sub-angles to approximate the original angle as closely as possible; or equivalently, we try to minimize the magnitude of the angle quantization error

$$\xi_m \triangleq \theta - \sum_{i=0}^{N_A-1} \tilde{\theta}_i \quad (3)$$

where $N_A$ denotes the number of sub-angles. The AQ process is demonstrated in Fig. 2. Based on the AQ process, the vector rotation operation can be realized as shown in Fig. 3. Each rotation module is dedicated to performing a particular rotation of sub-angle $\tilde{\theta}_i$. Then, the rotation of $\theta$ can be accomplished by cascading these rotation modules. In the AQ process, there are two key design issues: 1) First, we need to determine (or construct) the sub-angles, and each needs to be easy to implement in practical VLSI circuits. 2) Second, we have to find out how to select and combine these sub-angles such that the angle quantization error can be suppressed.

In fact, the well-known COordinate Rotational DIgital Computer (CORDIC) algorithm [14]-[16] can be considered as an approach to perform the AQ process. Recall that in the CORDIC algorithm, the rotation of angle $\theta$ is performed by sequentially rotating by the elementary angles $a(i) = \tan^{-1}(2^{-i})$, for $i = 0, 1, \ldots, W-1$, where $W$ denotes the wordlength. The advantageous feature of the elementary angle $a(i)$ is that a rotation by $a(i)$ requires only two shift-and-add operators. This easy-to-implement feature of $a(i)$ conforms to the requirements of the aforementioned AQ process. In addition, the sequential rotation by the $a(i)$'s is the way the sub-angles are selected and combined in the conventional CORDIC.
Next, we can link the AQ process with several existing vector rotation schemes, such as the Angle Recoding (AR) technique [17], the Modified Vector Rotational CORDIC (MVR-CORDIC) algorithm [18], and the Extended Elementary Angle Set (EEAS) scheme [19]. We explore their relationship with the proposed AQ process and then derive a unified framework for all these vector rotational operations. That is, all previous schemes can be considered as subsets of the proposed framework. The unified operations and AQ process of these algorithms suggest a family of rotation algorithms. We call it the Vector Rotational CORDIC Family.
The rest of the paper is organized as follows. In Section II, we discuss the basic concept of angle quantization and explore the relationship between the AQ process and the conventional CORDIC algorithm. In Section III, we address several existing vector rotation schemes as well as their relationship with the proposed AQ process, and then derive the proposed vector rotational framework. In Section IV, we discuss four searching algorithms to solve the optimization problem encountered in the AQ process. The searching algorithms help to select and combine the aforementioned sub-angles so as to achieve the desired precision performance. In Section V, we focus on the scaling operation of the vector rotational CORDIC family. In Sections VI and VII, the performance comparisons and VLSI architectures of the Vector Rotational CORDIC Family are addressed, respectively. Finally, we conclude our work in Section VIII.
II. ANGLE QUANTIZATION AND CONVENTIONAL CORDIC ALGORITHM
A. Angle Quantization Process and CSD
Angle Quantization (AQ) is conceptually similar to the Canonical Signed Digit (CSD) Quantization scheme [11] - [13] . In many digital filter designs, to save hardware complexity, the coefficient is recoded in the form of sum of Signed-Power-of-Two (SPT) terms. For example, coefficient of can be recoded as for wordlength with three nonzero digits, where represents 1. With such manipulations, multiplication can be easily realized with only three shift-and-add operators, as shown in Fig. 4 . Hence, the hardware complexity and the critical path can be significantly reduced.
Conceptually, the CSD approach is performed through the following two steps. 1) First, CSD quantization decomposes the coefficient into several SPT terms, i.e., sub-coefficients. The operation of each sub-coefficient can be easily realized. 2) Second, the multiplication by the coefficient is re-formed through the combination of these SPT sub-coefficients. By following the concept of CSD, we quantize the rotation angle $\theta$ as demonstrated in Fig. 2. We first decompose the rotation angle into several sub-angles, $\tilde{\theta}_i$'s. Next, as in the aforementioned CSD scheme, the design criterion for all sub-angles is that the rotational operation of each $\tilde{\theta}_i$ should be easily realized (just like the SPT terms in filter design). Suppose that each $\tilde{\theta}_i$ can also be realized using only shift-and-add operations. Then, with the help of easy-to-implement sub-angles, the rotation of $\theta$ can be performed through successive applications of sub-angle rotations in a cost-effective way. In Table I, we summarize the correspondences between the CSD quantization scheme and the proposed AQ process.
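The two-step SPT idea above can be sketched in a few lines of Python. This is a minimal illustration, not the recoding procedure of [11]-[13]: it uses a simple greedy choice of signed power-of-two terms rather than a true canonical recoding, and the function names `spt_approx` and `spt_multiply` are hypothetical.

```python
def spt_approx(value, wordlength, max_nonzero):
    """Greedily approximate `value` by at most `max_nonzero` terms sign * 2**-shift."""
    terms = []          # list of (sign, shift) pairs, i.e., the sub-coefficients
    residue = value
    for _ in range(max_nonzero):
        if residue == 0:
            break
        # Pick the power of two closest in magnitude to the current residue.
        shift = min(range(wordlength), key=lambda s: abs(abs(residue) - 2.0 ** -s))
        sign = 1 if residue > 0 else -1
        if abs(residue - sign * 2.0 ** -shift) >= abs(residue):
            break       # no further improvement is possible
        terms.append((sign, shift))
        residue -= sign * 2.0 ** -shift
    return terms


def spt_multiply(x, terms):
    """Multiply x by the SPT-coded coefficient using one shifted add per term."""
    return sum(sign * x * 2.0 ** -shift for sign, shift in terms)


if __name__ == "__main__":
    terms = spt_approx(0.875, wordlength=8, max_nonzero=3)
    print(terms, spt_multiply(16, terms))
```

In hardware, each `(sign, shift)` pair corresponds to one shift-and-add operator, which is the property the AQ process asks of its sub-angles.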
B. Operations of Conventional CORDIC Algorithm
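In the conventional CORDIC algorithm, the elementary angles, $a(i)$, are defined as [14]-[16]

$$a(i) \triangleq \tan^{-1}(2^{-i}), \quad i = 0, 1, \ldots, W-1. \quad (4)$$

By substituting $\theta = a(i)$ into (2) and factoring out $\cos a(i)$, the $i$th micro-rotation of CORDIC can be represented as

$$\begin{bmatrix} x(i+1) \\ y(i+1) \end{bmatrix} = \begin{bmatrix} 1 & -2^{-i} \\ 2^{-i} & 1 \end{bmatrix} \begin{bmatrix} x(i) \\ y(i) \end{bmatrix} = \frac{1}{\cos a(i)} \begin{bmatrix} \hat{x}(i+1) \\ \hat{y}(i+1) \end{bmatrix} \quad (5)$$

where $[\hat{x}(i+1)\; \hat{y}(i+1)]^T$ denotes the ideal rotated vector. The operation is demonstrated in Fig. 5. As we can see from (5), the rotation of $a(i)$ requires only two shift-and-add operators, which are easy to realize in VLSI circuits. This easy-to-implement feature of $a(i)$ conforms to the requirement of sub-angles in the AQ process. Based on the elementary angles, the conventional CORDIC algorithm can be rewritten as [14]-[16]

$$\theta = \sum_{i=0}^{N-1} \mu(i)\, a(i) + \xi_m \quad (6)$$

where $N$ denotes the number of elementary angles and $\mu(i) \in \{-1, +1\}$ is the rotation sequence, which determines the direction of the $i$th rotational angle $a(i)$. In general, for data of $W$-bit wordlength, the iteration number is no more than $W$, i.e., $N \le W$. Basically, the CORDIC tries to decompose the rotation angle, $\theta$, into a combination of the $a(i)$, for $0 \le i \le N-1$. The angle quantization error of the CORDIC algorithm

$$\xi_m \triangleq \theta - \sum_{i=0}^{N-1} \mu(i)\, a(i) \quad (7)$$

represents the residue angle beyond the resolution of the CORDIC algorithm. In Table II, we summarize the basic iteration procedure of the CORDIC algorithm in the circular mode.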
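As an illustration of the circular-mode iteration described above, the following Python sketch rotates a vector by sequentially applying all elementary angles $\tan^{-1}(2^{-i})$. It is a floating-point model only, assuming the usual sign rule (rotate toward the remaining residue angle); the function name `cordic_rotate` and the final division by the accumulated gain are choices made for this sketch, not the fixed-point procedure of Table II.

```python
import math


def cordic_rotate(x, y, theta, n_iter):
    """Rotate (x, y) by theta using n_iter conventional CORDIC micro-rotations."""
    residue = theta
    for i in range(n_iter):
        mu = 1 if residue >= 0 else -1           # rotation sequence in {-1, +1}
        # One micro-rotation: two shift-and-add operations in hardware.
        x, y = x - mu * y * 2.0 ** -i, y + mu * x * 2.0 ** -i
        residue -= mu * math.atan(2.0 ** -i)     # remaining angle still to rotate
    # Each micro-rotation enlarges the norm by sqrt(1 + 2^(-2i)); undo it here.
    gain = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(n_iter))
    return x / gain, y / gain, residue           # residue is the AQ error


if __name__ == "__main__":
    print(cordic_rotate(1.0, 0.0, math.pi / 6, 16))
```

The returned residue plays the role of the angle quantization error: it is the part of the target angle that the fixed sequence of elementary angles could not represent.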
C. Link AQ Process With Conventional CORDIC Algorithm
Next, we define the Elementary Angle Set (EAS) for the derivation of the proposed vector rotational framework. Basically, the EAS consists of all elementary angles used in the rotation algorithm. In the conventional CORDIC algorithm, the EAS comprises all $a(i)$, for $0 \le i \le N-1$, and can be defined as

$$\{\, \tan^{-1}(2^{-i}) : i = 0, 1, \ldots, N-1 \,\}. \quad (8)$$

Table III illustrates an example EAS of the CORDIC. With the help of the EAS, we can say that the CORDIC algorithm essentially performs angle quantization. This can be observed from (6). Given a target rotation angle $\theta$, the CORDIC algorithm determines the first rotation sequence $\mu(0)$ for the most significant elementary angle (the topmost elementary angle in Table III), followed by the determination of $\mu(i)$ for the subsequent elementary angles. The process is repeated until the last elementary angle is applied. That is, the CORDIC algorithm tries to perform the rotation by sequentially applying micro-rotations of all elementary angles.
Referring to Table I , now we can relate AQ to CORDIC algorithm as follows:
1) The sub-angle $\tilde{\theta}_i$ in AQ now becomes $\mu(i)\, a(i)$ (9) in the CORDIC algorithm.
2) The number of sub-angles $N_A$ in AQ is set to $N$ in the CORDIC algorithm.
3) The CORDIC algorithm sequentially applies all $a(i)$, for $0 \le i \le N-1$, to approximate the target angle $\theta$.
III. DESIGN FRAMEWORK FOR VECTOR ROTATIONAL OPERATIONS
In the previous section, we have shown that the CORDIC algorithm tries to perform AQ based on a limited set of elementary angles applied in a fixed sequence. However, in applications where the rotation angles are known in advance, it would be advantageous to relax the sequential constraint on the sub-angles (or micro-rotations), or to extend the EAS range. Based on this idea, a new design framework can be developed, and several previous designs of [17]-[19] can be fitted into this design framework.
A. AR Technique [17]
In the conventional CORDIC algorithm, the micro-rotations of all elementary angles are performed in a sequential way. By contrast, in the Angle Recoding (AR) technique proposed by Hu and Naganathan [17], certain micro-rotations can be skipped depending on the target rotational angle. Specifically, the modification is done by extending the set of $\mu(i)$ from $\{-1, +1\}$ to $\{-1, 0, +1\}$. By substituting $\mu(i) = 0$ into the recurrence equation in Table II, we obtain

$$x(i+1) = x(i), \qquad y(i+1) = y(i). \quad (10)$$

Equation (10) represents a null operation, which can be ignored in practical implementation. Hence, by setting $\mu(i) = 0$, one can skip the micro-rotation of the elementary angle $a(i)$. This helps to reduce the iteration number, implying a speedup of the CORDIC algorithm.
With the modification of (10), now the angle quantization error of the AR technique, , can be represented as (11) Basically, (11) is identical to (7), except for the extended . In practice, for certain target rotational angle , the AR technique can not only reduce the iteration number, but also suppress the angle quantization error. Take one extreme case for example. Consider the target angle of . The conventional CORDIC approach has to go through all micro-rotations with , where denotes the set . The angle quantization error of for . However, with the AR technique, we can rotate by performing only one micro-rotation of elementary angle , while skipping all remaining micro-rotations. The resultant rotation sequence is , and it takes only one iteration with . In addition to minimizing , it would be desirable to minimize the effective iteration number . Hence, the iteration stages in implementation can be reduced. In summary, the AR optimization problem can be stated as:
Given a rotation angle , find rotation sequence for , such that the angle quantization error , and the effective iteration number is minimized. In [17] , the Greedy algorithm is proposed to solve the optimization problem. We will discuss the issue in Section IV.
1) AQ Process in the AR Technique: Elementary Angle Set (EAS):
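To make the AR technique fit into our design framework, we reformulate (11) of the AR technique by removing the null operations of $\mu(i) = 0$ and applying the equality $\mu(i) \tan^{-1}(2^{-i}) = \tan^{-1}(\mu(i)\, 2^{-i})$ for $\mu(i) = \pm 1$. After changing the variables and index, we can rewrite (11) in a compact form as

$$\xi_m = \theta - \sum_{j=0}^{N'-1} \tilde{\theta}(j) \quad (12)$$

where
- $N'$ is the effective iteration number;
- $j$ is the iteration index;
- $s(j)$ is the rotational sequence that determines the micro-rotation angle in the $j$th iteration;
- $\alpha(j)$ is the directional sequence that controls the direction of the $j$th micro-rotation of $a(s(j))$;
- $\tilde{\theta}(j)$ is the $j$th micro-rotation angle, defined as $\tan^{-1}(\alpha(j)\, 2^{-s(j)})$.

As we can see from (12), the AR technique essentially tries to approximate $\theta$ with a combination of angle elements selected from a pre-defined elementary angle set (EAS). The EAS consists of all possible values of the $\tilde{\theta}(j)$'s, and the EAS used in the AR technique can be represented as

$$S_1 = \{\, \tan^{-1}(\alpha \cdot 2^{-s}) : \alpha \in \{-1, 0, 1\},\ s \in \{0, 1, \ldots, N-1\} \,\}. \quad (13)$$

The use of the subscript 1 will become apparent later in this section.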
Here, we use an example to demonstrate the EAS $S_1$. The angle elements (entries) of the set are listed in Table IV in ascending order of their values. Note that $S_1$ covers not only all CORDIC angles in Table III; each negative elementary angle, $-a(i)$, is also treated as an individual entry of the EAS table. In other words, the EAS used in the conventional CORDIC algorithm is only a subset of $S_1$.
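A table of this kind can be generated mechanically. The sketch below builds the set of single-SPT elementary angles in ascending order, assuming that the direction takes values in {-1, 0, 1} (so the null rotation 0 is included) and that shifts range over 0..W-1; both ranges, and the name `eas1`, are assumptions of this example.

```python
import math


def eas1(wordlength):
    """Return the single-SPT elementary angle set, sorted in ascending order."""
    angles = {
        math.atan(alpha * 2.0 ** -s)      # arctangent of one signed power of two
        for alpha in (-1, 0, 1)           # assumed direction values (0 = null rotation)
        for s in range(wordlength)        # assumed shift range 0..W-1
    }
    return sorted(angles)


if __name__ == "__main__":
    for k, angle in enumerate(eas1(4)):
        print(k, round(math.degrees(angle), 4))
```

Under these assumptions the set has 2W + 1 entries: the W positive CORDIC angles, their negatives, and zero.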
With the EAS in hand, now we can easily link AR technique to the AQ process. By comparing (12) with the AQ approximation equation of (3), we find that AR technique indeed performs the angle quantization of target angle : The sub-angle now corresponds to and is set to be . Optimization Problem: We can also reconsider the optimization problem of AR technique from EAS point of view. It can be restated as:
Given , find the combination of elementary angles from EAS , such that the angle quantization error and is minimized.
In summary, the AR technique intends to perform the angle quantization in a more flexible way than conventional CORDIC algorithm by relaxing the constraint of micro-rotation procedure. Hence, a faster and more precise rotation operation can be obtained.
B. MVR-CORDIC Algorithm [18]
Based on the AR technique, in the Modified Vector Rotational CORDIC (MVR-CORDIC) algorithm [18] , two more modifications are proposed. These modifications also try to break the sequential rotation procedure of the conventional CORDIC algorithm so as to facilitate the operation of angle quantization. They are stated as follows.
1) Repeat of Elementary Angles:
Referring to (11), in the AR technique each micro-rotation angle $a(i)$ is allowed to be used only once. However, in the MVR-CORDIC algorithm, each micro-rotation of an elementary angle can be performed repeatedly. The relaxed operation results in more possible combinations of elementary angles; hence, a smaller $\xi_m$ can be expected. For example, a target angle can be rotated by performing the micro-rotation of one elementary angle once and that of another elementary angle twice, a combination that cannot be represented by (11) of the AR technique.
2) Confines of Total Micro-Rotation Number:
From (12), we can see that the effective iteration number in the AR technique is not fixed. It will be varied with the target rotational angle . For certain cases, is large and very close to the upper bound of [17] . In synchronous circuit design, the overall speed performance of the CORDIC-based arithmetic operation will be limited by those special angles with large . Besides, the nonuniform feature of the iteration number is not suitable for modular design in VLSI implementation.
In the MVR-CORDIC algorithm, we confine the iteration number in the micro-rotation phase to ( ). The role of is quite similar to the number of nonzero digits, , used in CSD recoding scheme; it will dominate the precision performance and the complexity (see Table I ).
1) Constellation of Reachable Angles:
To see the effectiveness of the above modifications, we show the constellation of reachable angles of MVR-CORDIC algorithm in Fig. 6(b) . The wordlength, , is assigned to be 4, and the restricted iteration number, , is 3. The reachable angles of conventional CORDIC for iteration number are also shown in Fig. 6 (a) for comparison purpose. In fact, given and , the numbers of angles that can be represented by conventional CORDIC and MVR-CORDIC algorithm are in the order of and , respectively. Note that the comparison is made under the condition of equal iteration number in the micro-rotation phase.
As we can see in Fig. 6, the number of reachable angles of the MVR-CORDIC is much larger than that of the conventional CORDIC algorithm. This implies that, given the same iteration number, the MVR-CORDIC algorithm will outperform the conventional CORDIC in terms of angle quantization error. Equivalently, given a target angle quantization error, the MVR-CORDIC rotation requires fewer iterations than the conventional approach. Consequently, the speed performance of CORDIC-based arithmetic operations can be greatly improved.
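The difference in constellation size can be checked numerically. The sketch below enumerates the angles reachable in three iterations by the conventional CORDIC (fixed sequence, directions in {-1, +1}) and by an MVR-style scheme that may pick any single-SPT angle, including the null rotation, in each iteration. The wordlength 4 and iteration number 3 follow the text; the set definition and function names are assumptions of this example.

```python
import math
from itertools import product


def cordic_reachable(n_iter):
    """Angles reachable by applying a(0)..a(n_iter-1) once each, with sign +/-1."""
    angles = [math.atan(2.0 ** -i) for i in range(n_iter)]
    return {
        round(sum(m * a for m, a in zip(mu, angles)), 12)
        for mu in product((-1, 1), repeat=n_iter)
    }


def mvr_reachable(wordlength, r_m):
    """Angles reachable by any r_m picks (with repetition) from the single-SPT set."""
    eas = [d * math.atan(2.0 ** -s) for d in (-1, 1) for s in range(wordlength)]
    eas.append(0.0)  # null rotation: allows using fewer than r_m micro-rotations
    return {round(sum(combo), 12) for combo in product(eas, repeat=r_m)}


if __name__ == "__main__":
    print(len(cordic_reachable(3)), len(mvr_reachable(4, 3)))
```

Every conventional CORDIC angle is also reachable by the relaxed scheme, so the relaxed constellation can only be denser.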
2) AQ Process in MVR-CORDIC Algorithm:
Elementary Angle Set (EAS): Next, we explore the EAS in the MVR-CORDIC algorithm. Putting all the aforementioned modifications together and ignoring the null operations, we can represent the angle quantization error of the MVR-CORDIC algorithm as

$$\xi_m \triangleq \theta - \sum_{i=0}^{R_m-1} \tan^{-1}\!\left(\alpha(i)\, 2^{-s(i)}\right) \quad (14)$$

where
1) $s(i)$ is the rotational sequence that determines the micro-rotation angle in the $i$th iteration.
2) $\alpha(i)$ is the directional sequence that controls the direction of the $i$th micro-rotation of $\tan^{-1}(2^{-s(i)})$.

The sub-angle in (14) has exactly the same form as the micro-rotation angle defined in (12). Hence, the EAS formed by the MVR-CORDIC algorithm is the same as that of the AR technique. Based on (14), it is clear that the MVR-CORDIC algorithm also performs the AQ process. The major differences are:
1) The total number of sub-angles in Table I (i.e., the total iteration number in the micro-rotation phase) is now kept fixed to a pre-defined value $R_m$.
2) The sub-angle corresponds to $\tan^{-1}(\alpha(i)\, 2^{-s(i)})$ in the MVR-CORDIC algorithm.

Optimization Problem: In the application of the MVR-CORDIC algorithm, the optimization problem can be stated as follows.
Given $\theta$, find the sequences $s(i)$ and $\alpha(i)$ such that $|\xi_m|$ is minimized, subject to the constraint that the total iteration number is confined to $R_m$. Similarly, the optimization problem can also be stated from the EAS point of view: given $\theta$, find the combination of $R_m$ elementary angles from the EAS such that the angle quantization error is minimized. In Table V, we summarize the micro-rotation procedure and the scaling operation of the MVR-CORDIC algorithm. In summary, the MVR-CORDIC algorithm is similar to the conventional CORDIC algorithm, except that the rotational constraint is greatly relaxed. It is informative to compare Table II with Table V in detail.

C. Extended EAS-Based CORDIC Algorithm [19]

In the Extended Elementary Angle Set (EEAS)-based CORDIC algorithm [19], in addition to applying the relaxations above, we also relax the constraint on the elementary angles by extending the EAS. Then, we have more choices (elementary angles) in approximating the target angle $\theta$. The angle quantization error can be expected to decrease correspondingly.

Extended EAS (EEAS): By observing (13), we can see that the EAS consists of arctangents of a single signed-power-of-two (SPT) term, i.e., $\tan^{-1}(\alpha \cdot 2^{-s})$. In SPT-based digital filter design, one effective way to increase the coefficient resolution (hence the filter performance) is to employ more SPT terms to represent the filter coefficients [12], [13]. Motivated by this, we can extend the set by representing the elementary angles as the arctangent of the sum of two SPT terms [19]. That is,

$$S_2 = \{\, \tan^{-1}(\alpha_0 \cdot 2^{-s_0} + \alpha_1 \cdot 2^{-s_1}) : \alpha_0, \alpha_1 \in \{-1, 0, 1\},\ s_0, s_1 \in \{0, 1, \ldots, W-1\} \,\}. \quad (15)$$

We call it the Extended Elementary-Angle Set (EEAS). The subscript 2 denotes the number of SPT terms. An example of the extended set is listed in Table VI (compare with Table IV).
Constellation of Reachable Angles in EEAS-Based CORDIC:
Based on the EEAS developed in (15), the sub-angle in Table I now can be represented as (16) and the number of sub-angles is set to be . To see the effectiveness of the EEAS scheme, we show the constellation of all reachable elementary angles of and in Fig. 7(a) and (b), respectively. As we can see, the number of reachable angles of is much larger than that of . This implies that EEAS can yield better precision performance (smaller ) than under a fixed number of micro-rotations of sub-angles (iterations).
Optimization Problem: With the derived EEAS , now the optimization problem of the EEAS-based CORDIC algorithm can be stated as Given and , find the parameters of , , and (i.e., the combination of elementary angles from EEAS ), such that the angle quantization error (17) can be minimized. In Table VII , we summarize the micro-rotation procedure and the scaling operation of the EEAS-based CORDIC scheme. Note that now four additions are required to complete each micro-rotation of elementary angle from , which is twice as many as in the conventional CORDIC approaches. Hence, the relaxation on the set of elementary angles is obtained at the expense of the doubled hardware/computational complexity. Fortunately, as will be shown in Section VI, the increased complexity can be compensated by the halved maximum iteration number.
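To make the "four additions per micro-rotation" point concrete, here is a small Python sketch of one EEAS-style micro-rotation, where the tangent of the elementary angle is the sum of two signed power-of-two terms. The parameter names (`a0`, `s0`, `a1`, `s1`) and the function name are illustrative assumptions; the sketch models the arithmetic only, not the hardware of Table VII.

```python
import math


def eeas_micro_rotation(x, y, a0, s0, a1, s1):
    """Apply one micro-rotation by atan(a0 * 2^-s0 + a1 * 2^-s1), without scaling."""
    # Two shifted copies of y are combined with x, and two shifted copies of x
    # are combined with y: four shift-and-add operations in total.
    x_next = x - a0 * y * 2.0 ** -s0 - a1 * y * 2.0 ** -s1
    y_next = y + a0 * x * 2.0 ** -s0 + a1 * x * 2.0 ** -s1
    return x_next, y_next


if __name__ == "__main__":
    x, y = eeas_micro_rotation(1.0, 0.0, 1, 1, 1, 2)   # tangent = 0.5 + 0.25
    print(x, y, math.degrees(math.atan2(y, x)))
```

Setting `a1 = 0` reduces the step to the two-addition micro-rotation of the single-SPT schemes, which is why those schemes appear as special cases.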
D. Generalized EEAS Scheme
Following the same idea as the EEAS scheme, it is straightforward to insert more SPT terms into the representation of the elementary angles, so that the size of the EEAS can be increased further. With more than two SPT terms, we call such an extension the Generalized EEAS Scheme. Specifically, the generalized EEAS with $d$ SPT terms can be represented as

$$S_d = \Big\{\, \tan^{-1}\Big(\sum_{j=0}^{d-1} \alpha_j \cdot 2^{-s_j}\Big) : \alpha_j \in \{-1, 0, 1\},\ s_j \in \{0, 1, \ldots, W-1\} \,\Big\}. \quad (18)$$

As one can expect, the size of the EEAS increases exponentially as $d$ increases. Consequently, with properly chosen design parameters, we can achieve higher precision performance in the AQ process.
E. Family of Vector Rotational CORDIC Algorithm
So far, we have linked the AQ process with several existing vector rotation approaches, including CORDIC algorithm, Angle Recoding technique, MVR-CORDIC algorithm, EEAS scheme, and Generalized EEAS scheme. All algorithms intend to realize the AQ process with various EAS and suitable combinations of sub-angles. That is, they try to decompose the target rotational angle into several easy-to-implement sub-angles, while minimizing the angle quantization error to obtain the best precision performance.
Based on our discussion, we can now link all these rotation algorithms together under a unified design framework, from the AQ point of view. They form a family of vector rotational CORDIC algorithms, called the Vector Rotational CORDIC Family. They all conform to the AQ process, but each rotational algorithm uses a different AQ setting, as summarized in Table VIII. Note that the EEAS scheme covers the MVR-CORDIC algorithm and the AR technique, because MVR-CORDIC and AR employ the EAS $S_1$ as a searching space, which is a subset of the EEAS $S_2$. Moreover, the MVR-CORDIC algorithm can also be treated as a subset of the AR technique due to the fact that we impose one constraint on the total iteration number.
IV. SEARCHING ALGORITHMS FOR THE VECTOR ROTATIONAL CORDIC FAMILY
In applications of the vector rotational CORDIC family, optimization of the angle quantization error is always encountered. The problem is conceptually analogous to finding the closest codeword for a coefficient under a given number of nonzero digits in the CSD application. In this section, we address four searching algorithms: the Greedy algorithm, the Exhaustive Searching algorithm, the Semi-greedy algorithm, and the Trellis-based Searching algorithm. The target optimization problem is that of the EEAS-based CORDIC algorithm derived in Section III-C [see (17)]. We choose it since it covers most other rotational schemes, as illustrated in Fig. 8.
A. Greedy Algorithm
In [17], the Greedy algorithm was proposed to solve the AR optimization problem. In running the Greedy algorithm, we approach the target rotation angle, $\theta$, step by step. That is, in each searching step, without looking ahead, a decision is made on the micro-rotation angle of that step (i.e., on its direction and shift parameters) so that the accumulated sum of micro-rotation angles is the best approximation of the target angle $\theta$. Specifically, the micro-rotation angle of the $i$th step is determined such that its distance from the residue angle at the $i$th step, defined in (19), is minimized. The searching algorithm terminates if no further improvement can be found, or when the micro-rotation angle of the last allowed step has been determined. Most of the time, the Greedy algorithm gives only a locally optimal solution.
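The greedy search can be sketched as follows. This is a minimal Python illustration under stated assumptions: the search space is passed in as a plain list of candidate elementary angles (the example uses single-SPT angles), the residue is the target angle minus the angles chosen so far, and the names `greedy_aq` and `single_spt_angles` are hypothetical.

```python
import math


def single_spt_angles(wordlength):
    """Candidate elementary angles +/- atan(2^-s), s = 0..wordlength-1 (assumed set)."""
    return [d * math.atan(2.0 ** -s) for d in (-1, 1) for s in range(wordlength)]


def greedy_aq(theta, angle_set, max_iter):
    """Pick, step by step, the elementary angle closest to the current residue."""
    chosen, residue = [], theta
    for _ in range(max_iter):
        best = min(angle_set, key=lambda a: abs(residue - a))
        if abs(residue - best) >= abs(residue):
            break                       # no further improvement: stop early
        chosen.append(best)
        residue -= best                 # residue angle for the next step
    return chosen, residue


if __name__ == "__main__":
    angles = single_spt_angles(8)
    print(greedy_aq(math.atan(0.5), angles, 8))   # one micro-rotation suffices
    print(greedy_aq(math.pi / 5, angles, 4))
```

Because each decision ignores later steps, the result is generally a local optimum, which is what motivates the other three searching algorithms.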
B. Exhaustive Searching Algorithm
The idea of the Exhaustive searching algorithm is to search the entire solution space, i.e., all possible combinations of micro-rotation angles. Decisions on the micro-rotation angles of all iterations are made jointly such that the error function (20) is minimized. Obviously, the Exhaustive searching algorithm produces the globally optimal solution, but finding the design parameters in this way is too time-consuming.
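For small problem sizes the exhaustive search is easy to write down. The sketch below assumes the candidate set contains the null rotation 0, so that combinations with fewer than `r_m` effective micro-rotations are covered, and it enumerates multisets because the order of micro-rotations does not change the accumulated angle. The helper and function names are hypothetical.

```python
import math
from itertools import combinations_with_replacement


def single_spt_angles_with_zero(wordlength):
    """Assumed candidate set: 0 and +/- atan(2^-s) for s = 0..wordlength-1."""
    angles = [d * math.atan(2.0 ** -s) for d in (-1, 1) for s in range(wordlength)]
    return [0.0] + angles


def exhaustive_aq(theta, angle_set, r_m):
    """Return the r_m-angle combination whose sum is closest to theta."""
    return min(
        combinations_with_replacement(angle_set, r_m),
        key=lambda combo: abs(theta - sum(combo)),
    )


if __name__ == "__main__":
    combo = exhaustive_aq(0.3, single_spt_angles_with_zero(4), 3)
    print(combo, 0.3 - sum(combo))
```

The number of combinations grows quickly with the set size and the iteration number, which is the cost the text refers to.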
C. Semi-Greedy Algorithm [18]
Basically, we can treat the Semi-greedy algorithm as a combination of the Greedy and Exhaustive searching algorithms. The whole searching space of micro-rotation angles is divided into several segments, each containing a fixed number of iterations. We call such a segment a block, and the number of iterations it contains the block length. The concept of the segmentation scheme is illustrated in Fig. 9. In the Semi-greedy algorithm, the exhaustive search is performed within each isolated block, and the connection between consecutive blocks is determined in the greedy manner. Specifically, in each block (corresponding to one step of the searching algorithm), the decisions on the micro-rotation angles of that block are made to minimize the error function (21), where (22) is the residue angle at that step. For simplicity of representation, we assume that the total iteration number is a multiple of the block length in (21) and (22).
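The block-wise procedure can be sketched by combining the two previous ideas: an exhaustive search inside each block, with the residue angle handed greedily from one block to the next. As before, the candidate set (including the null rotation 0) and all names are assumptions of this example, and `r_m` is assumed to be a multiple of `block_len`.

```python
import math
from itertools import combinations_with_replacement


def single_spt_angles_with_zero(wordlength):
    """Assumed candidate set: 0 and +/- atan(2^-s) for s = 0..wordlength-1."""
    angles = [d * math.atan(2.0 ** -s) for d in (-1, 1) for s in range(wordlength)]
    return [0.0] + angles


def semi_greedy_aq(theta, angle_set, r_m, block_len):
    """Exhaustive search within each block; greedy hand-over between blocks."""
    chosen, residue = [], theta
    for _ in range(r_m // block_len):
        block = min(
            combinations_with_replacement(angle_set, block_len),
            key=lambda combo: abs(residue - sum(combo)),
        )
        chosen.extend(block)
        residue -= sum(block)           # residue angle passed to the next block
    return chosen, residue


if __name__ == "__main__":
    angles = single_spt_angles_with_zero(4)
    for block_len in (1, 2, 4):
        print(block_len, semi_greedy_aq(0.3, angles, 4, block_len)[1])
```

With `block_len = 1` the search is purely greedy, and with `block_len = r_m` it coincides with the exhaustive search, so the block length trades search time against precision.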
D. Trellis-Based Searching (TBS) Algorithm
The trellis-based searching (TBS) algorithm provides an alternative way to solve the complicated optimization problem [20] . Here, the TBS algorithm is applied to solve the optimization problem encountered in the EEAS-based CORDIC algorithm.
Step 1) Initialization First of all, let denote the number of the elementary angles in the extended set , and each distinct elementary angle in the set is expressed as , for , i.e., . In the TBS algorithm, there are states in each step. For th state ( ) of th search step, we use the Cumulative Angle, , to denote the best approximation of angle in the th state up to the th step. The TBS algorithm is performed column-wise from left to right. Initially, we start the TBS algorithm by setting all as the corresponding elementary angles, i.e., for all .
Step 2) Accumulation A path in the trellis, which leaves the th state at th step and enters the th state at th step, corresponds to an operation of adding by . Then, the appended angle of becomes the candidate for . Moreover, from a given state at step , the paths can diverge to all the states at the next search step . Namely, there are paths, carrying the corresponding appended angles of for all , enters the th state at th step. Then, those appended angles form the candidate set for the cumulative angle of .
Step 3) Comparison and Selection
Conceptually, the whole process is similar to the trellis decoding of convolutional code [21] : The TBS algorithm involves calculating and minimizing the difference between the target angle and for all at each search step . To be specific, is determined such that (23) Then, the selected path is denoted as the surviving path. Note that we have to calculate all the cumulative angles for all (thus their corresponding surviving paths) before moving to the th step. Continuing in this manner, we can successively advance deeper into the trellis (set ), until the maximum iteration number is reached ( ).
Step 4) Determination of the Global Result and Trace Back
After calculating all the cumulative angles at the last search step, the next procedure for the TBS algorithm is to determine the global result, . Similar to the determination of surviving path, we decide as follows:
Next, we can determine all the micro-rotations by tracing from the state, whose corresponding is best approximation of , along its surviving path backward.
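Steps 1) to 4) can be sketched compactly in Python. The sketch keeps one survivor per state (one state per elementary angle) and, for simplicity, stores each survivor's full history instead of back-pointers, so no separate trace-back pass is needed. The candidate set in the example and the name `tbs_aq` are assumptions of this illustration.

```python
import math


def tbs_aq(theta, angle_set, r_m):
    """Trellis-based search: one survivor path per elementary angle per step."""
    # Step 1: initialise each state with its own elementary angle.
    paths = [(a, [a]) for a in angle_set]          # (cumulative angle, history)
    for _ in range(1, r_m):
        new_paths = []
        for a in angle_set:
            # Steps 2-3: among all paths entering this state, keep the one whose
            # appended cumulative angle is closest to the target.
            cum, hist = min(paths, key=lambda p: abs(theta - (p[0] + a)))
            new_paths.append((cum + a, hist + [a]))
        paths = new_paths
    # Step 4: the global result is the best survivor at the last step.
    return min(paths, key=lambda p: abs(theta - p[0]))


if __name__ == "__main__":
    angles = [0.0] + [d * math.atan(2.0 ** -s) for d in (-1, 1) for s in range(4)]
    target = math.atan(0.5) + math.atan(0.25)
    print(tbs_aq(target, angles, 3))
```

The work per step is proportional to the square of the number of states, so the search time grows linearly with the iteration number rather than exponentially as in the exhaustive search.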
V. SCALING OPERATION FOR THE VECTOR ROTATIONAL CORDIC FAMILY
In this section, we consider an indispensable phase in the practical implementation of the vector rotational CORDIC family: the scaling phase. As mentioned in Section II, each micro-rotation (rotation of a sub-angle) enlarges the norm of the vector by a factor of $\sqrt{1 + \tan^2 \tilde{\theta}(i)}$. The factor depends on which algorithm in the vector rotational CORDIC family is employed. For example, it is $\sqrt{1 + 2^{-2 s(i)}}$ in the AR technique and the MVR-CORDIC algorithm, and $\sqrt{1 + (\alpha_0(i)\, 2^{-s_0(i)} + \alpha_1(i)\, 2^{-s_1(i)})^2}$ in the EEAS-based CORDIC scheme. After applying a sequence of micro-rotations of sub-angles, the norm of the rotated vector is enlarged by the product of these factors. Consequently, to preserve the norm of the input vector, we have to scale the rotated vector by the scaling factor

$$P = \frac{1}{\prod_{i} \sqrt{1 + \tan^2 \tilde{\theta}(i)}} \quad (25)$$

where the product runs over all applied micro-rotations.
A. Scaling Operations
In practical implementations, several scaling operations have been proposed to reduce the hardware complexity; they are the Type-I and Type-II scaling operations [14]-[16] and the Extended Type-II (ET-II) scaling operation [19]. These scaling operations are performed by quantizing the floating-point scaling factor in the following fixed-point forms:
ET-II
Here, denotes the quantized value of , , , and , .
is the counterpart of in the micro-rotation phase, determining the number of iterations in the scaling phase. The corresponding iteration procedures for those scaling operations are listed as follows: 
for , with and . Here, and are used in the scaling phase to distinguish them from the ones in micro-rotation phase.
After the scaling iterations, the scaled rotational vector is obtained. By doing so, we can approximate the scaling factor by using only shift-and-add operations for the Type-I and Type-II scaling operations as well as for the ET-II scaling operation. Moreover, it is advantageous to employ the Type-I or Type-II scaling operation when the AR technique or the MVR-CORDIC algorithm is employed as the vector rotation kernel. By doing this, the scaling operation can share the same VLSI circuits with the micro-rotation module, which saves the significant hardware overhead of scaling multipliers. Similarly, for the case of the EEAS-based CORDIC scheme, the ET-II scaling operation is preferred due to the hardware sharing feature.
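As a rough illustration of approximating a scaling factor with shift-and-add stages only, the sketch below assumes a product form in which each stage multiplies by (1 + k * 2^-q) with k in {-1, +1}, and chooses the stages greedily. The product form, the greedy choice, the parameter ranges, and the name `product_scaling_terms` are all assumptions of this example and are not taken from the scaling procedures of [14]-[16], [19].

```python
import math


def product_scaling_terms(p, wordlength, r_s):
    """Greedily approximate p by a product of at most r_s factors (1 + k * 2^-q)."""
    candidates = [(k, q) for k in (-1, 1) for q in range(1, wordlength)]
    terms, approx = [], 1.0
    for _ in range(r_s):
        k, q = min(
            candidates,
            key=lambda kq: abs(p - approx * (1 + kq[0] * 2.0 ** -kq[1])),
        )
        candidate = approx * (1 + k * 2.0 ** -q)
        if abs(p - candidate) >= abs(p - approx):
            break                        # another stage would not help
        terms.append((k, q))
        approx = candidate               # one shift-and-add per vector component
    return terms, approx


if __name__ == "__main__":
    p = 1.0 / math.sqrt(1.0 + 0.5 ** 2)  # scaling factor after one rotation by atan(1/2)
    print(product_scaling_terms(p, 16, 4), p)
```

Each stage costs one shift and one addition per vector component, so the number of stages trades scaling accuracy against latency in the same way the micro-rotation count does.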
B. Scaling Approximation Error and SQNR Performance
With the approximated scaling factor, the scaling operations introduce some quantization noise, which increases as the number of scaling iterations decreases. Similar to the micro-rotation phase, we introduce another performance index to describe the amount of error introduced by the approximation process of (26). The use of SQNR gives designers a more straightforward view of the signal quality in practical fixed-point implementations. It has been proven that the two error indices (the angle quantization error and the scaling approximation error) can be related to the SQNR performance, in dB, as in [18] (34).
The close-form equation links both error indices in a very concise way.
VI. PERFORMANCE COMPARISON AND DESIGN EXAMPLE
In this section, we compare the precision performance of the aforementioned vector rotational algorithms and searching algorithms.
A. EEAS Scheme versus MVR-CORDIC Algorithm
Essentially, the comparison between the EEAS scheme and the MVR-CORDIC algorithm is a comparison between EEAS and EAS . In Section VII, we will show that it takes four full adders (FAs) to perform each rotation of a sub-angle in the EEAS scheme, as opposed to two FAs in the MVR-CORDIC algorithm. Hence, to make a fair comparison, the performance of these two approaches must be evaluated under the condition that the maximum number of FAs is the same. Denote the maximum iteration numbers of the EEAS scheme and the MVR-CORDIC algorithm as and , respectively. They need to satisfy the constraint of (35). In the first simulation, 4097 uniformly spaced rotation angles in the region from 0 to , i.e., , , , , , are performed. The Greedy algorithm is adopted to solve the optimization problem, and the wordlength . The simulation results are shown in Fig. 10, where and denote the ensemble-averaged angle quantization error of the MVR-CORDIC algorithm and the EEAS scheme, respectively.
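The experiment described above can be sketched in a few lines of Python. The sketch assumes that the EAS consists of the angles atan(a·2^(-s)) with a in {-1, 0, 1}, and that the EEAS consists of the angles atan(a0·2^(-s0) + a1·2^(-s1)). The word length, the upper end of the angle range (pi/4), and the reduced number of test angles (257 instead of 4097) are also assumptions chosen to keep the run short. The iteration counts 4 and 2 reflect the equal-FA condition: two FAs per MVR-CORDIC iteration against four per EEAS iteration.

```python
import math
from itertools import product


def eas(wordlength):
    """Elementary angles atan(a * 2**-s), a in {-1, 0, 1} (assumed EAS form)."""
    return sorted({math.atan(a * 2.0 ** -s)
                   for a in (-1, 0, 1) for s in range(wordlength)})


def eeas(wordlength):
    """Angles atan(a0 * 2**-s0 + a1 * 2**-s1) (assumed EEAS form)."""
    terms = [a * 2.0 ** -s for a in (-1, 0, 1) for s in range(wordlength)]
    return sorted({math.atan(t0 + t1) for t0, t1 in product(terms, repeat=2)})


def greedy_error(theta, angles, max_iter):
    """Residual |theta - sum of chosen angles| after max_iter greedy picks."""
    residual = theta
    for _ in range(max_iter):
        # Pick the elementary angle closest to what is still left to rotate.
        residual -= min(angles, key=lambda a: abs(residual - a))
    return abs(residual)


def average_error(angles, max_iter, n_points=257, theta_max=math.pi / 4):
    """Ensemble-averaged error over uniformly spaced test angles."""
    thetas = [theta_max * i / (n_points - 1) for i in range(n_points)]
    return sum(greedy_error(t, angles, max_iter) for t in thetas) / n_points


W = 8  # assumed word length for this sketch
print("MVR-CORDIC, 4 iterations:", average_error(eas(W), 4))
print("EEAS,       2 iterations:", average_error(eeas(W), 2))
```

Since the zero angle belongs to both sets, an extra greedy iteration can never increase the residual. Since every EAS angle is also an EEAS angle, a single EEAS step is never worse than a single EAS step.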
From Fig. 10, we can make the following observations: 1) Increasing the maximum iteration number ( or ) improves the error performance. This is consistent with the conventional CORDIC algorithm.
2) The results show that the EEAS scheme outperforms the MVR-CORDIC algorithm when the comparison criterion (35) is satisfied. This can be explained by the fact that the EAS in (13) is only a subset of in (15): any possible combination of elementary angles from can also be constructed by using set .
B. Error Performance versus Searching Block Length,
Fig. 11 depicts the angle quantization error versus the searching block length in the Semi-greedy algorithm. The comparison is intended to highlight the precision performance for various block lengths. From the results shown in Fig. 11, we can make the following observations: 1) We can obtain better performance by increasing the searching block length of the Semi-greedy algorithm. The results confirm our argument in Section IV-C that, for larger , the Semi-greedy algorithm behaves like the Exhaustive searching algorithm. On the other hand, when is small, the behavior of the Greedy algorithm emerges because of the confined searching space.
2) The saturation phenomenon suggests that we can use the Semi-greedy algorithm with a moderate value of (say in this case) to obtain near-optimal error performance without going through an exhaustive search. Meanwhile, the saving in computational complexity is significant.
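The trade-off between block length and precision can be illustrated with a small Python sketch. The Semi-greedy algorithm itself is defined in Section IV-C. The code below is only one plausible reading of it, assuming that the iterations are processed in blocks and that each block is searched exhaustively, so that a block length of 1 reduces to the Greedy algorithm and a block length equal to the iteration count reduces to the exhaustive search. The EAS form, the word length, and the test angle are assumptions as well.

```python
import math
from itertools import combinations_with_replacement


def eas(wordlength):
    """Elementary angles atan(a * 2**-s), a in {-1, 0, 1} (assumed EAS form)."""
    return sorted({math.atan(a * 2.0 ** -s)
                   for a in (-1, 0, 1) for s in range(wordlength)})


def semi_greedy_error(theta, angles, max_iter, block):
    """Search `block` iterations at a time and return the final residual."""
    residual, remaining = theta, max_iter
    while remaining > 0:
        d = min(block, remaining)
        # The order of sub-angles does not change their sum, so searching
        # combinations (not permutations) covers the whole block.
        best = min(combinations_with_replacement(angles, d),
                   key=lambda combo: abs(residual - sum(combo)))
        residual -= sum(best)
        remaining -= d
    return abs(residual)


angles = eas(6)             # small assumed word length keeps the search cheap
theta = math.radians(27.0)  # hypothetical test angle
for block in (1, 2, 4):
    print(block, semi_greedy_error(theta, angles, max_iter=4, block=block))
```

The cost of one block grows combinatorially with its length, which is why a moderate block length is attractive when it already brings the error close to the exhaustive result.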
C. Trellis-Based Searching (TBS) Algorithm versus Greedy Algorithm
In this simulation, both the Greedy and TBS algorithms are applied to solve the optimization problem of the MVR-CORDIC algorithm. Let and denote the angle quantization error generated by the Greedy and TBS algorithms, respectively. The simulation results are shown in Fig. 12.
Based on Fig. 12, we can see the following. 1) For any , the error performance of the proposed TBS algorithm is at least as good as that of the Greedy algorithm. In fact, we have proved in [18] that the precision performance of the TBS algorithm is never worse than that of the Greedy algorithm. For the case of , it can be easily shown that the operations of the TBS algorithm are identical to those of the Greedy algorithm. Hence, the error performance is the same.
2) The improvement in the error performance of the TBS algorithm becomes more significant as the maximum iteration number, , increases. The reason is that as increases, the TBS algorithm can find more possible combinations of the elementary angles than the Greedy algorithm.
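To make the contrast with the Greedy algorithm concrete, the Python sketch below implements a generic Viterbi-style survivor search over the elementary angles. It is an assumption about what a trellis-based search could look like, not the TBS algorithm of Section IV or [18]: each elementary angle is treated as a trellis state, and at every stage each state keeps only the accumulated angle that brings it closest to the target. The EAS form, the word length, and the test angle are assumptions as well.

```python
import math


def eas(wordlength):
    """Elementary angles atan(a * 2**-s), a in {-1, 0, 1} (assumed EAS form)."""
    return sorted({math.atan(a * 2.0 ** -s)
                   for a in (-1, 0, 1) for s in range(wordlength)})


def greedy_error(theta, angles, max_iter):
    """Residual after max_iter greedy picks (single path)."""
    residual = theta
    for _ in range(max_iter):
        residual -= min(angles, key=lambda a: abs(residual - a))
    return abs(residual)


def trellis_error(theta, angles, max_iter):
    """Keep one survivor (accumulated angle) per elementary angle per stage."""
    survivors = list(angles)  # stage 1: each state holds its own angle
    for _ in range(max_iter - 1):
        # For every state, extend the predecessor that lands closest to theta.
        survivors = [min((acc + a for acc in survivors),
                         key=lambda total: abs(theta - total))
                     for a in angles]
    return min(abs(theta - acc) for acc in survivors)


angles = eas(8)             # assumed word length
theta = math.radians(27.0)  # hypothetical test angle
for r in (1, 2, 3, 4):
    print(r, greedy_error(theta, angles, r), trellis_error(theta, angles, r))
```

With a single iteration the survivor search and the greedy search choose the same angle, which mirrors the single-iteration behavior one would expect from any such pair of searches. For more iterations the survivor search keeps several candidate paths alive instead of committing to one.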
D. Design Example
In the design example, we consider the rotation angle of . All algorithms in the vector rotational CORDIC family derived in Section III are applied to perform the rotation of . Meanwhile, the searching algorithms in Section IV are used to solve the optimization problems. In particular, we consider the following combinations:
• Conventional CORDIC algorithm [14]-[16]
• AR technique + Greedy algorithm [17]
• MVR-CORDIC algorithm + Greedy algorithm with
• MVR-CORDIC algorithm + Semi-greedy algorithm with
• MVR-CORDIC algorithm + TBS algorithm
• EEAS scheme + Greedy algorithm
• EEAS scheme + TBS algorithm.
In Table IX, we summarize the results for these combinations, including the rotational parameters ( , , and ) and the angle approximation error . Based on the results, we can make the following observations: 1) With the AR technique, it takes 6 iterations to complete the rotation, as opposed to 16 iterations in the conventional CORDIC algorithm. Moreover, the angle quantization error of the AR technique is smaller than that of the CORDIC algorithm.
2) By comparing combinations 3 and 5 as well as combinations 6 and 7, we find that the TBS algorithm achieves better precision performance than the Greedy algorithm. By comparing combinations 3 and 4, the Semi-greedy algorithm is also found to be better than the Greedy algorithm.
3) By comparing combinations 3 and 6 as well as combinations 5 and 7, we find that better precision performance can be obtained using EEAS than EAS . Note that the comparison is made under the criterion that .
4) Combination 8 has better performance than combination 2. The comparison is made under the condition of an equal number of FAs: it takes 6 iterations to perform the rotation of in combination 2, so 12 FAs are required, which is equal to the number required by the EEAS scheme with .
As to the scaling phase, we take the combination of the EEAS scheme with and the TBS algorithm (combination 7) as the example. By substituting the rotation parameters of , , , and into (25), we can calculate the scaling factor as . As mentioned earlier, when the EEAS-based CORDIC algorithm is employed in the micro-rotation phase, it is advantageous to use the ET-II scaling operation in the scaling phase. That is, we try to represent in the fixed-point form of (28). In fact, the parameters of , , and can be determined with the help of the TBS algorithm because of the similarity between the ET-II scaling operation and the EEAS scheme [19]. In this example, assuming that the iteration number in the scaling phase is , we derive the parameters of the ET-II scaling operation as , , , and . The resulting scaling approximation error is . Next, we substitute and into (34), and the SQNR value of the scaled EEAS-based CORDIC algorithm is about 95 dB.
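The step from rotation parameters to a scaling factor can be sketched as follows. Equation (25) is not reproduced in this section, so the Python code relies on a standard fact instead: an un-normalized micro-rotation with matrix [[1, -t], [t, 1]] stretches a vector by sqrt(1 + t^2), and the scaling factor is the reciprocal of the accumulated stretch. The parameter tuples below are hypothetical and are not the values of Table IX.

```python
import math


def eeas_scaling_factor(params):
    """params: list of (a0, s0, a1, s1), one tuple per micro-rotation."""
    gain = 1.0
    for a0, s0, a1, s1 in params:
        t = a0 * 2.0 ** -s0 + a1 * 2.0 ** -s1
        gain *= math.sqrt(1.0 + t * t)  # norm growth of [[1, -t], [t, 1]]
    return 1.0 / gain


def rotated_angle(params):
    """Total angle realized by the sequence of micro-rotations."""
    return sum(math.atan(a0 * 2.0 ** -s0 + a1 * 2.0 ** -s1)
               for a0, s0, a1, s1 in params)


# Hypothetical parameters, not the values of Table IX.
params = [(1, 1, 1, 3), (1, 4, -1, 7)]
print("angle (rad):   ", rotated_angle(params))
print("scaling factor:", eeas_scaling_factor(params))
```

The resulting factor is the quantity that a scaling phase would then approximate in fixed-point form.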
VII. VLSI IMPLEMENTATION OF VECTOR ROTATIONAL CORDIC FAMILY
In this section, we illustrate the VLSI structure (also known as the CORDIC processor) for the proposed vector rotational CORDIC family. It is derived from the conventional CORDIC VLSI architecture in [14].
A. Iterative Architecture
In Fig. 13, we develop the iterative architecture for the EEAS-based CORDIC algorithm. It consists of 4 barrel shifters (BS), 4 multiplexers (MUX), and 4 adders/subtractors. Here, we employ the EEAS scheme as the architectural design basis. The reason is that we can easily modify the iterative VLSI architecture of Fig. 13 into an AR-based or MVR-CORDIC-based architecture by removing one BS, one MUX, and two adders/subtractors. Of course, the control signals as well as the signal paths must be modified correspondingly. On the other hand, we can insert more BSs, MUXs, and adders/subtractors to realize the Generalized EEAS scheme.
As shown in Fig. 13, two separate phases are performed to complete a single vector rotation, i.e., the micro-rotation phase (marked by solid lines) and the scaling phase (marked by dashed lines). In each phase, three kinds of control signals are used to control the operations.
• and in the micro-rotation phase, as well as , in the scaling phase: they control the number of bits to be shifted by the barrel shifters.
• and in the micro-rotation phase, as well as , in the scaling phase: they determine the operations of the adders/subtractors.
• Control signal : it determines the data path by controlling the multiplexers, thereby governing the phase switching of the iterative CORDIC architecture.
All the control signals can be generated by searching algorithms in advance, and are stored in ROM.
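A bit-level view of the iterative datapath can be sketched in Python with integer shifts and additions. The control-word layout, the function names, and the ROM contents are hypothetical. The scaling step assumes an ET-II form in which each coordinate is multiplied by (1 + k0·2^(-q0) + k1·2^(-q1)), which is an assumption about the fixed-point form rather than a quotation of (28).

```python
def eeas_step(x, y, a0, s0, a1, s1):
    """Micro-rotation: four barrel shifts and four add/subtract operations."""
    x_new = x - a0 * (y >> s0) - a1 * (y >> s1)
    y_new = y + a0 * (x >> s0) + a1 * (x >> s1)
    return x_new, y_new


def scale_step(x, y, k0, q0, k1, q1):
    """Scaling (assumed ET-II form): the same shifters and adders are reused,
    but each coordinate is now combined with shifted copies of itself."""
    return (x + k0 * (x >> q0) + k1 * (x >> q1),
            y + k0 * (y >> q0) + k1 * (y >> q1))


def run(x, y, rom):
    """Apply precomputed control words; `phase` plays the role of the
    multiplexer control that switches between the two data paths."""
    for phase, c0, n0, c1, n1 in rom:
        step = eeas_step if phase == "rotate" else scale_step
        x, y = step(x, y, c0, n0, c1, n1)
    return x, y


# Hypothetical ROM contents: rotate by atan(1), then scale by 1 - 2**-2 - 2**-5.
rom = [("rotate", 1, 0, 0, 0),
       ("scale", -1, 2, -1, 5)]
print(run(1024, 0, rom))
```

In this example the ideal result is about (724, 724). The difference from the printed value comes from using a single, coarse scaling step.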
B. Parallel and Pipelined Architecture
By unfolding the iterative implementation of Fig. 13, we obtain the parallel structure depicted in Fig. 14(a). The structure is composed of EEAS-based CORDIC processors connected in cascade form, in which the leading processors perform the micro-rotations and the following processors execute the scaling operations. Each basic processor performs one iteration as specified in Fig. 13. Moreover, when the parallel structure is dedicated to a given rotation angle, the operation of each processor can be kept fixed. We can thus reduce the hardware complexity by replacing all the control circuits, barrel shifters, and multiplexers with simple wire routing.
To achieve a higher data throughput rate, we can further insert pipeline stages (latches) between successive processors of parallel structure, which results in the pipelined architecture in Fig. 14(b) . Due to the reduced critical path, the pipelined structure is very suitable for real-time applications at high data bandwidth.
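The difference between the unfolded and the pipelined structures can be mimicked with a small cycle-level model in Python. The stage parameters are hypothetical, only micro-rotation stages are modeled (scaling stages would be added in the same way), and the register model is a simplification of real latches.

```python
def make_stage(a0, s0, a1, s1):
    """A hardwired stage: shifts and signs are constants, i.e. wiring only."""
    def stage(v):
        x, y = v
        return (x - a0 * (y >> s0) - a1 * (y >> s1),
                y + a0 * (x >> s0) + a1 * (x >> s1))
    return stage


def unfolded(stages, v):
    """Parallel structure: the input ripples through all stages at once."""
    for stage in stages:
        v = stage(v)
    return v


def pipeline(stages, inputs):
    """Pipelined structure: one register after every stage, one input per cycle."""
    regs = [None] * len(stages)
    outputs = []
    for v in list(inputs) + [None] * len(stages):  # extra cycles flush the pipe
        if regs[-1] is not None:
            outputs.append(regs[-1])
        # Update from the last stage backwards so that each register reads
        # the value its predecessor held during the previous cycle.
        for i in range(len(stages) - 1, 0, -1):
            regs[i] = stages[i](regs[i - 1]) if regs[i - 1] is not None else None
        regs[0] = stages[0](v) if v is not None else None
    return outputs


# Hypothetical fixed-angle configuration with two stages.
stages = [make_stage(1, 0, 0, 0), make_stage(1, 1, 0, 0)]
vectors = [(1024, 0), (0, 1024), (512, 512)]
print(pipeline(stages, vectors))
```

Both structures produce the same results. The pipelined one accepts a new vector every cycle, and its cycle time is set by a single stage rather than by the whole cascade.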
VIII. CONCLUSION
In this paper, we introduce a new design index, called Angle Quantization. Following the new index, we propose a unified design framework for several existing vector rotational algorithms. With the versatile feature of the design framework, we can identify more design parameters. Hence, designers can explore a bigger design space in deriving low-cost/high-performance rotational circuits. As illustrated in [6] , most popular DSP algorithms can be realized via rotational circuits. The new framework proposed in this paper can be employed to design the processing kernel of the DSP engine in [6] .
