In this paper, we propose a new CORDIC algorithm and architectures that can easily generate close-to-optimum rotation sequences with small lookup tables. The new design is particularly suitable for adjustable-length FFT applications. In all, the required number of shift-and-add operations for micro-rotations and scale-factor compensations is only n/2, where n is the output precision. For design verification, we synthesized both serial and pipelined architectures using Synopsys Design Compiler with UMC 0.18 µm 1P6M CMOS technology. The synthesized 16-bit pipelined FFT PE runs at 222 MHz, with a total gate count of 89263 and a low power consumption of 26.75 mW. It meets the FFT speed requirements of most OFDM-based communication systems, including DAB, DVB, 802.16 and VDSL. Compared with a conventional multiplier-based FFT PE and the existing CORDIC-based FFT PEs, the proposed design performs better in terms of area, speed and power consumption.
INTRODUCTION
The DFT is one of the most important computations in engineering applications. It is also a key component in OFDM communication systems. Cooley-Tukey FFT algorithms [1] are efficient realizations of DFT operations. They are composed of a sequence of unit butterfly operations. A butterfly operation involves additions and a multiplication by twiddle factors $W_N^k = e^{-j 2\pi k / N}$,
where N is the FFT length. In the literature, a butterfly unit is generally realized by complex multipliers and adders, together with the required twiddle factors stored in memory. This approach has the advantages of design simplicity and high-speed operation. However, the required complex multipliers and the storage for twiddle factors incur high area overhead. In this work, in order to reduce these two overheads, we propose a new CORDIC algorithm and architectures specifically tailored for FFT computations.
The proposed design consumes less area than conventional multiplier-based FFT units and FFT processing elements based on general CORDIC algorithms. The CORDIC algorithm [2] is a well-known and efficient algorithm for computing vector rotations, vector angles and magnitudes. Since the CORDIC algorithm only needs a sequence of micro-rotations based on simple shift-and-add operations, it is efficient in hardware realization.
Since a twiddle factor multiplication is equivalent to the rotation of a 2-D vector by the twiddle factor's phase, CORDIC algorithms are very suitable for twiddle factor multiplications. An additional significant benefit of applying CORDIC to FFT computations is that there is no need to store the twiddle factors, which are required in conventional multiplier-based FFT designs. In the literature, most CORDIC algorithms are applied directly to FFT computations without optimization. As a result, they are not efficient in terms of operation counts and area complexity.
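The equivalence between a twiddle factor multiplication and a vector rotation can be illustrated with a minimal Python sketch of the conventional (unoptimized) CORDIC rotation. The function names `cordic_rotate` and `twiddle_multiply`, the floating-point arithmetic, and the quadrant pre-rotation are illustrative assumptions, not part of the proposed design.

```python
import math

def cordic_rotate(x, y, angle, n_iter=16):
    """Rotate (x, y) by `angle` radians with conventional CORDIC.

    Assumes |angle| is within the CORDIC convergence range (about 1.74 rad).
    Floating point stands in for the fixed-point shift-and-add datapath.
    """
    z = angle          # residue angle still to be rotated
    gain = 1.0         # accumulated scale factor of all micro-rotations
    for i in range(n_iter):
        d = 1.0 if z >= 0 else -1.0
        shift = 2.0 ** -i
        x, y = x - d * y * shift, y + d * x * shift   # one micro-rotation
        z -= d * math.atan(shift)
        gain *= math.sqrt(1.0 + shift * shift)
    return x / gain, y / gain   # constant scale factor compensation

def twiddle_multiply(x, y, k, N, n_iter=16):
    """Multiply x + jy by W_N^k = exp(-j*2*pi*k/N) using only rotations."""
    angle = -2.0 * math.pi * k / N
    q = round(angle / (math.pi / 2))          # nearest multiple of 90 degrees
    residual = angle - q * (math.pi / 2)      # now within [-pi/4, pi/4]
    for _ in range(q % 4):                    # exact 90-degree pre-rotations
        x, y = -y, x
    return cordic_rotate(x, y, residual, n_iter)
```

No twiddle factor is stored here: only the index k and the FFT length N enter the computation. Every one of the n_iter micro-rotations is executed regardless of the angle, which is the redundancy that optimized rotation sequences aim to remove.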
There are only a few optimized CORDIC-based FFT designs (that reduce redundant iterations as much as possible) [3] [4] [5]. However, some of the optimized designs require considerable memory overhead [4, 5] to generate optimized rotation sequences with low iteration counts. Besides, those optimized rotation sequences introduce the serious problem of variable scale factors. Generating the scale factors on-line for those optimized rotation sequences involves considerable overhead. In previous work, we proposed an efficient on-line generation and compensation scheme [3] for variable scale factors. In this work, we alleviate the problem by applying a similar idea to the new designs.
Besides, when applying CORDIC techniques to FFT computations, we can take advantage of special properties of FFT operations for more efficient designs. For example, for a radix-2 FFT algorithm, only N twiddle factors are requested for computations (in the order of $W_N^0, W_N^1, W_N^2, \ldots$ for the first stage). In our proposed design, we utilize this property for efficient generation of rotation sequences, as will be discussed later. The result is that the proposed designs can generate close-to-optimum rotation sequences at the cost of little hardware overhead and a very small table size. Notice that the whole computation is mapped into a sequence of shift-and-add operations. To reduce the iteration counts and speed up the whole operation, in our previous works we proposed high-radix CORDIC algorithms such as the radix-4 [6] and radix-16 [7] algorithms. Although those designs are efficient, as usual they require hardware overhead for obtaining residue angles. Next, we introduce a new CORDIC design for FFT computations that takes advantage of FFT properties. The required rotation sequences can be easily obtained by looking up a small table and performing a few simple operations in one shot, and there is no need to update the residue angles in each iteration (as detailed below).
THE NEW CORDIC ALGORITHM FOR FFT

The Basic Design Idea
Our design is based on the idea of efficient angle decomposition. Since we focus on FFT computation, only N twiddle factors (corresponding to only N rotation angles) are involved in the CORDIC operations, so we can decompose the rotation angle of a twiddle factor into a coarse angle component and a fine angle component. The fine angle component is small enough to satisfy the well-known linear mapping property $\tan^{-1}(2^{-i}) \approx 2^{-i}$, while the coarse component is relatively large and does not satisfy the condition. For example, an input angle that is too large to satisfy this condition is said to have a coarse angle component. As a result, for the fine angle component, its rotation sequence can be readily obtained by inspection, while its corresponding scale factor can be easily obtained (as will be detailed later). Further, for the coarse angle component, we can store its optimized rotation sequences and scale factor sequences in a lookup table. Then, whenever we want to do a twiddle factor multiplication (i.e., a CORDIC rotation), we directly decompose the input angle into these two components, and at the same time obtain all the required optimized rotation sequences and scale factors. Doing so, we have a very low shift-and-add operation count, and we do not need to compute residue angles iteratively. Hence, the normally required 
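As a rough illustration of the coarse/fine split, the Python sketch below finds the first micro-rotation index at which the linear mapping property tan^-1(2^-i) ≈ 2^-i holds to n-bit precision, and then splits a fixed-point angle at that bit position. The precision criterion (error below 2^-n), the bit-mask split, and the names `first_fine_index` and `split_angle` are assumptions made for illustration only; the paper's own decomposition rule is based on the stage and butterfly counters and may differ.

```python
import math

def first_fine_index(n):
    """Smallest i with |atan(2^-i) - 2^-i| < 2^-n (assumed criterion)."""
    i = 0
    while abs(math.atan(2.0 ** -i) - 2.0 ** -i) >= 2.0 ** -n:
        i += 1
    return i

def split_angle(angle_bits, n, fine_start):
    """Split an angle held as an integer with n fractional bits.

    Bit p of angle_bits has weight 2^(p - n) radians, so the weights
    2^-fine_start ... 2^-n occupy the low (n - fine_start + 1) bits.
    Returns (coarse, fine) with coarse + fine == angle_bits.
    """
    fine_width = n - fine_start + 1
    fine = angle_bits & ((1 << fine_width) - 1)
    coarse = angle_bits - fine
    return coarse, fine

n = 16
start = first_fine_index(n)                    # 5 for n = 16 under this criterion
coarse, fine = split_angle(round(0.6 * 2 ** n), n, start)
```

Under this assumed criterion the fine part of a 16-bit angle covers the micro-rotation indices 5 to 16, and only the few remaining high-order bits need table-driven treatment.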
Generations of Twiddle Factor Angles
The twiddle factors (and the corresponding rotation angles) of a radix-$2^n$ FFT algorithm come in a particular order, in accordance with the order of butterfly operations. For example, for a radix-2 FFT algorithm, the twiddle factors can be requested in the order of increasing index l within each stage k, where l is the index for the twiddle factors (and also the butterflies) and k is the FFT stage number. As such, the corresponding twiddle factor angles come in successively and incrementally, as l multiples of the "base rotation angle" of that stage.
Those ordered twiddle factor angles can therefore be generated by successively accumulating the base rotation angle. A base angle can be either a "base fine angle" or a "base coarse angle". Hence, we need to store all the optimized rotation sequences and their corresponding scale factors in a lookup table for those base rotation angles. This requires a small memory size of 
Generation of fine rotation sequences
As mentioned above, a radix-2 FFT performs twiddle factor multiplications by successive multiples of a base rotation angle. When the base rotation angle is small enough, it and its initially accumulated angles are all fine angles. For those fine rotation angles, the corresponding fine rotation sequences are exactly the same as their binary angle representations at the accumulator output. In the proposed design, by using the contents of the FFT stage counter and the butterfly counter as the address lines, we can look up the base fine rotation sequence corresponding to the base fine rotation angle from a table. The base fine rotation sequence is then sent to an accumulator to generate the other rotation sequences. Next, those sequences are converted to CSD (canonical signed digit) format, which guarantees the minimum number of micro-rotations.
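The accumulate-then-recode flow described above can be sketched in Python as follows. The base angle is taken here as 2π/N quantized to n fractional bits, which is an assumption standing in for the table lookup; `to_csd` and `fine_rotation_sequences` are hypothetical helper names. Only angle magnitudes are handled, and the quantization error of the base angle accumulates in this naive form.

```python
import math

def to_csd(value):
    """Canonical signed-digit recoding of a non-negative integer.

    Returns digits in {-1, 0, +1}, least significant first, with no two
    adjacent nonzero digits (the minimum-weight signed-digit form).
    """
    digits = []
    while value:
        d = 0
        if value & 1:
            d = 2 - (value & 3)   # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= d
        digits.append(d)
        value >>= 1
    return digits

def fine_rotation_sequences(fft_len, n_bits, count):
    """Accumulate the base angle and recode each accumulated angle to CSD.

    Returns a list of (binary_angle, csd_digits) pairs. Digit p of a
    sequence selects the micro-rotation with index i = n_bits - p.
    """
    base = round(2.0 * math.pi / fft_len * (1 << n_bits))  # assumed base angle
    acc, out = 0, []
    for _ in range(count):
        out.append((acc, to_csd(acc)))
        acc += base
    return out
```

For example, `to_csd(7)` yields the digits of 8 - 1, so an angle word with three consecutive ones costs two micro-rotations instead of three.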
Generation of coarse rotation sequences
Coarse angles arise under two different conditions. The first is that accumulations of a base fine rotation sequence may end up with output angles that have coarse angle components. The second is that a base rotation angle already contains a coarse angle component. As a result, all its subsequent accumulated values also contain coarse angle components. In those cases, we have to decompose the twiddle factor angles into coarse and fine angle components. Then the corresponding fine rotation sequences (in binary format) can be easily obtained. On the other hand, the optimized coarse rotation sequences (in CSD format for their MSB parts and in binary format for their LSB parts) and their corresponding scale-factor sequences can be obtained from a lookup table, as discussed before.
In fact, differentiation and decomposition of a twiddle factor angle into a coarse angle component and a fine angle component can be easily done from the contents of the stage and butterfly counters.
Combined rotation sequences
Since there are overlaps in the micro-rotation angles between the coarse and fine rotation sequences, we can combine them to further reduce the number of shift-and-add operations. Specifically, one can combine the LSB portions of coarse rotation sequences with fine rotation sequences by simply adding them up, because both satisfy the property $\tan^{-1}(2^{-i}) \approx 2^{-i}$. The combined rotation sequences are then converted to CSD format, which corresponds to the minimum number of shift-and-add operations. Finally, the CSD signals are sent to the rotator unit of a CORDIC-based FFT PE.
Generations and compensations of scale factors
In the proposed design, we skip many redundant micro-rotations. Therefore, the scale factors will not be constant. Here, taking into account the pre-stored scale factor sequences (for coarse rotation angles), we modify our previous work [3] and propose a low-complexity generation and compensation scheme for variable scale factors as follows. Based on the following approximation of a basic scale factor: As such:
3. When $i \gtrsim n/2$, then $\cos\theta_i \approx 1$ and no compensation is required.
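The no-compensation case can be checked numerically. The Python sketch below assumes the standard per-micro-rotation scale factor cos θ_i = 1/sqrt(1 + 2^-2i) and its first-order expansion 1 - 2^-(2i+1), and treats "no compensation required" as meaning that 1 - cos θ_i is below 2^-n. Both the expansion and the criterion are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def scale_factor(i):
    """Exact scale factor cos(theta_i) of the micro-rotation with tan(theta_i) = 2^-i."""
    return 1.0 / math.sqrt(1.0 + 4.0 ** -i)

def scale_factor_approx(i):
    """First-order shift-and-subtract approximation: 1 - 2^-(2i+1)."""
    return 1.0 - 2.0 ** -(2 * i + 1)

def first_index_without_compensation(n):
    """Smallest i whose scale factor differs from 1 by less than 2^-n."""
    i = 0
    while 1.0 - scale_factor(i) >= 2.0 ** -n:
        i += 1
    return i

# For a 16-bit word length this criterion gives i = 8, i.e. about n/2.
threshold = first_index_without_compensation(16)
```

Under these assumptions, micro-rotations in roughly the lower half of the index range leave the vector magnitude unchanged to within the output precision.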
Overall description of the new CORDIC algorithm
The new CORDIC algorithm can be summarized in steps as follows.
Step 0: Obtain the table index based on the contents of the FFT butterfly counter and stage counter, then decide the numbers ($L_C$, $L_S$ and $L_F$) of iterations for coarse rotation, scale factor compensation, and fine rotation, respectively.
Step 1: Obtain the coarse rotation sequence and the scale factor compensation sequence from the sequence table using the table index from Step 0, and generate the fine rotation sequences by accumulating the base fine rotation sequence.
Step 2: Add the LSB portion of the coarse rotation sequence to the fine rotation sequence, and convert the combined sequence to CSD format.
Step 3: For data rotation, we first perform the coarse rotation according to the MSB portion of the coarse rotation sequence for $L_C$ iterations. Next, we perform the scale factor compensation according to the scale factor compensation sequence for $L_S$ iterations. Finally, we perform the fine rotation according to the combined rotation sequence for $L_F$ iterations.
From 16-bit simulation, the new CORDIC algorithm needs 5.03 shift-and-add operations on average for data rotation, which is very close to the optimum of 4.14 iterations (obtained by computer full search).
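The three-phase data rotation of Step 3 can be sketched in Python under several loudly stated assumptions: a greedy angle recoding stands in for the optimized coarse sequence table, an exact floating-point division stands in for the shift-and-add scale factor compensation, and the coarse/fine boundary `FINE_START = 5` is an assumed value for a 16-bit word. All names are hypothetical, and the sketch is not expected to reproduce the operation counts reported above.

```python
import math

N_BITS = 16       # word length n
FINE_START = 5    # assumed first micro-rotation index treated as fine

def to_csd(value):
    """CSD digits of a non-negative integer, least significant first."""
    digits = []
    while value:
        d = 0
        if value & 1:
            d = 2 - (value & 3)
            value -= d
        digits.append(d)
        value >>= 1
    return digits

def micro_rotate(x, y, d, i):
    """One shift-and-add micro-rotation by d * atan(2^-i), d in {-1, +1}."""
    s = d * 2.0 ** -i
    return x - s * y, y + s * x

def rotate(x, y, theta):
    """Rotate (x, y) by theta, |theta| <= pi/4. Returns (x, y, rotation_ops)."""
    # Sequence generation: greedy coarse recoding (stand-in for the table).
    coarse, r = [], theta
    for i in range(FINE_START):
        a = math.atan(2.0 ** -i)
        if abs(r) > a / 2:
            d = 1 if r > 0 else -1
            coarse.append((d, i))
            r -= d * a
    # The remaining small angle is read off its CSD-coded binary word.
    sign = 1 if r >= 0 else -1
    word = round(abs(r) * 2 ** N_BITS)
    fine = [(sign * d, N_BITS - p) for p, d in enumerate(to_csd(word)) if d]
    # Phase 1: coarse rotation.
    for d, i in coarse:
        x, y = micro_rotate(x, y, d, i)
    # Phase 2: scale factor compensation (exact division as a stand-in).
    gain = 1.0
    for _, i in coarse + fine:
        gain *= math.sqrt(1.0 + 4.0 ** -i)
    x, y = x / gain, y / gain
    # Phase 3: fine rotation.
    for d, i in fine:
        x, y = micro_rotate(x, y, d, i)
    return x, y, len(coarse) + len(fine)
```

The residue angle is computed only while the sequences are being generated, never inside the data rotation loop, which mirrors the separation between sequence generation (Steps 0 to 2) and data rotation (Step 3). The returned count covers micro-rotations only and excludes the compensation phase.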
REALIZATION OF THE NEW CORDIC-BASED FFT PROCESSING ELEMENT
We apply the new CORDIC algorithm to the design of multi-standard, multi-mode FFT computations for several mainstream OFDM communication systems, including DAB, DVB, 802.16, ADSL and VDSL systems. Specifically, we design an adjustable-length FFT processing element (PE) which can process FFT lengths up to 8192. Fig. 1 shows the block diagram of the designed 16-bit pipelined CORDIC-based FFT PE. The design meets the speed specifications of all those OFDM systems.
We realize the pipeline architecture with two different unit rotator cells, i.e., the Rotator_4_2 cell and the Rotator_2_1 cell. The Rotator_4_2 cell can handle 4 rotation digits and process up to two micro-rotations at a time, while the Rotator_2_1 cell can only handle 2 rotation digits and process up to one micro-rotation at a time. One can design a unit rotator cell adjusted for the desired speed and area specifications. We also realize the new CORDIC algorithm based on a serial single-rotator architecture, for applications with lower data rates such as DAB and DVB.
SIMULATION RESULTS AND COMPARISONS
Table 1 shows the synthesized areas and power consumption of the proposed designs and a general multiplier-based processing element. They are synthesized with Synopsys Design Compiler based on UMC 0.18 µm 1P6M CMOS technology. Table 2 shows the maximum synthesized clock rates of the proposed architectures and the multiplier-based design. Table 3 compares the required table size with the twiddle factor storage of a conventional multiplier-based FFT architecture. The sequence table of our design is only about 1% of the size of the twiddle factor ROM table. As shown, the proposed FFT PE based on the pipelined Rotator_4_2 structure performs better than the conventional multiplier-based PE. Table 4 compares the new design with some CORDIC-based FFT designs. Although the simulation assumes the 16-bit case, we also conduct simulations with other word lengths. Simulations show that on average n/2 iterations are achieved with the new CORDIC algorithm.
CONCLUSION
The CORDIC algorithms and architectures proposed in this work are specifically designed for FFT operations. They exploit FFT properties effectively and achieve close-to-optimum numbers of shift-and-add iterations, with a small lookup table and low hardware complexity. The new designs are advantageous over the existing designs in terms of both speed and area. The designed pipelined multi-mode FFT PE can meet the speed specifications of most OFDM communication systems, including VDSL, 802.16, DAB and DVB.
