Introduction
Discrete Fourier Transform (DFT) is one of the core operations in digital signal processing and communication systems. Many fundamental algorithms can be realized by DFT, such as convolution, spectrum estimation, and correlation. Furthermore, DFT is widely used in standard embedded system applications such as wireless communication protocols requiring Orthogonal Frequency Division Multiplexing (Wey et al., 2007), and radar image processing using Synthetic Aperture Radar (Fanucci et al., 1999). In practice, DFT is difficult to implement directly due to its computational complexity. To reduce the amount of computation, Cooley and Tukey proposed the well-known Fast Fourier Transform (FFT) algorithm, which reduces the calculation of an N-point DFT from $O(N^2)$ to $O((N/2)\log_2 N)$ (Proakis & Manolakis, 2006). Nevertheless, for embedded systems, in particular portable devices, efficient hardware realization of FFT with small area, low power dissipation and real-time computation is a significant challenge. The challenge is even more pronounced when FFTs with large transform lengths (>1024 points) need to be realized in embedded hardware. Therefore, the objective of this research is to investigate hardware efficient FFT architectures, emphasizing compact, low-power embedded realizations.
As VLSI technology evolves, different architectures have been proposed for improving the performance and efficiency of the FFT hardware. Pipelined architectures are widely used in FFT realization (Li & Wanhammar, 1999; He & Torkelson, 1996; Hopkinson & Butler, 1992; Yang et al., 2006) due to their speed advantages. Higher radix (Hopkinson & Butler, 1992; Yang et al., 2006) and multi-butterfly (Bouguezel et al., 2004; X. Li et al., 2007) structures can also improve the performance of the FFT processor significantly, but these structures require substantially more hardware resources.
Alternatively, shared memory based schemes with a single butterfly calculation unit (Cohen, 1976; Ma, 1994, 1999; Ma & Wanhammar, 2000; Wang et al., 2007) are preferred in many embedded FFT processors since they require the least amount of hardware resources. Furthermore, the "in-place" addressing strategy is a practical choice to minimize the amount of data memory. With the "in-place" strategy, the two outputs of the butterfly unit are written back to the same memory locations as the two inputs, replacing the old data. For in-place FFT processing, two data read and two data write operations occur at every clock cycle. Multiple memory banks and conflict-free addressing logic are required to realize four data accesses in one clock cycle. Consequently, a typical FFT processor is composed of three major components: i) butterfly calculation units, ii) conflict-free address generators for both data and coefficient accesses and iii) multi-bank memory units.
In this study, several techniques are developed for reducing the hardware logic and power requirements for these three components:
1. In order to optimize the conflict-free addressing logic, a modified butterfly structure with input/output exchange circuits is presented in Section 2.
2. CORDIC based FFT algorithms are presented for multiplier-less and coefficient memory-less implementation of the butterfly unit in Section 3.
3. Memory bank partitioning and bitline segmentation techniques are presented for dynamic power reduction of data memory accesses. Furthermore, a special coefficient memory addressing logic which reduces the switching activity is proposed in Section 4.
Case studies with ASIC and FPGA synthesis results demonstrate the performance gains and feasibility of these FFT implementations on embedded systems.
Hardware efficient realization of fast Fourier transform
There is an ongoing interest in hardware efficient FFT architectures. Cohen (Cohen, 1976) introduced a simplified control logic for FFT address generation, which is composed of parity checks, barrel shifters and counters, based on the fact that the two data addresses of every butterfly operation differ in their parity. Ma (Ma, 1999) proposed a method to realize the radix-2 addressing logic which reduces the address generation delay by avoiding the parity check (XOR operations), but barrel shifters are still needed. Furthermore, Ma's approach is not "in-place", so more registers and related control logic are needed to buffer the interim data to avoid memory conflicts. Yang (Yang et al., 2006) proposed a locally pipelined radix-16 FFT realized by two radix-2 deep feedback (R2SD$^2$F) butterflies. This architecture can improve the throughput of the FFT processing and reduce the complex multipliers and adders compared to other pipelined methods, but it needs extra memory and there are significantly more coefficient accesses due to the radix-16 implementation. Li (X. Li et al., 2007) proposed a mixed radix FFT architecture, which contains one radix-2 butterfly and one radix-4 butterfly. The two butterflies share the multipliers, which reduces the hardware consumption, but the address generation is based on XOR logic, similar to Cohen's design. The next section describes in detail addressing schemes that emphasize reduced hardware.
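The parity property behind Cohen's scheme can be checked with a few lines of code. The following is a minimal sketch, assuming the usual radix-2 DIF index pattern in which the two operands of a butterfly differ in exactly one address bit; the function names are illustrative and do not come from the cited work.

```python
def parity(value):
    """Return the XOR of all bits of value (0 or 1)."""
    return bin(value).count("1") & 1


def butterfly_pairs(n_points):
    """Yield (stage, p, q) operand indices of a radix-2 DIF FFT.

    Assumes n_points is a power of two; stage 0 pairs indices that are
    n_points/2 apart, the last stage pairs neighbours.
    """
    stages = n_points.bit_length() - 1
    for stage in range(stages):
        half = n_points >> (stage + 1)
        for block in range(0, n_points, 2 * half):
            for j in range(half):
                yield stage, block + j, block + j + half


def banks_conflict_free(n_points):
    """True if assigning bank = parity(address) separates every operand pair."""
    return all(parity(p) != parity(q) for _, p, q in butterfly_pairs(n_points))
```

Because the two operand indices differ in a single bit, their parities always differ, so a parity-based bank select places them in different banks at every stage.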
Conflict-free addressing for FFT
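The N-point discrete Fourier transform is defined by

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1, \qquad W_N^{nk} = e^{-j 2\pi nk/N} \tag{1}$$

Fig. 1 shows the signal flow graph of a 16-point decimation-in-frequency (DIF) radix-2 FFT (Proakis & Manolakis, 2006). The FFT algorithm is composed of butterfly calculation units: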
Equations (2) and (3) describe the radix-2 butterfly calculation at Stage m, as shown in Fig. 2. Parallel and "in-place" butterfly operation using two memory banks of two-port memory units requires that the two inputs of any butterfly are read from different memory banks and the two outputs are written to the same address locations as the inputs. As shown in Fig. 1, in the conventional FFT addressing scheme, only the butterflies in the first stage satisfy this requirement. In all other stages, the two inputs and two outputs of a butterfly originate from and sink to the same memory bank. Therefore, a special addressing scheme is required to prevent conflicting addresses. Cohen (Cohen, 1976) used a parity check to separate the data into two memory banks. Fig. 3 is the signal flow graph of Cohen's approach, and it shows that the inputs and outputs of any butterfly stage utilize separate memory banks. The addresses of butterfly operations are located "in-place". The drawback of Cohen's method is the address generation delay. In order to reduce this delay, Ma (Ma, 1999) proposed an alternative addressing scheme which avoids the parity check. The signal flow graph of Ma's scheme is shown in Fig. 4. In Ma's scheme, the two inputs of a butterfly unit originate from two separate memory banks, but the two outputs of the butterfly unit utilize the same memory bank. The inputs and outputs of a butterfly unit are therefore not "in-place", and extra registers and related control logic are needed to buffer the outputs of the butterfly until the next butterfly calculation is finished in order to realize the "in-place" operation. Compared to Cohen's approach, which uses both a parity check and barrel shifters, Ma's method needs only barrel shifters, resulting in a reduced address generation delay. However, Ma's approach consumes more hardware resources to realize the "in-place" operation.
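To make the "in-place" requirement concrete, the following is a minimal software sketch of a radix-2 DIF FFT in which each butterfly overwrites its own two operands. It assumes the textbook DIF butterfly (sum on the upper leg, twiddled difference on the lower leg) and a single flat array rather than the two-bank hardware memory; the function names are illustrative.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
        for k in range(n)
    ]


def fft_dif_inplace(x):
    """Radix-2 DIF FFT; len(x) is assumed to be a power of two."""
    n = len(x)
    data = list(x)
    half = n // 2
    while half >= 1:
        step = n // (2 * half)  # twiddle index stride for this stage
        for block in range(0, n, 2 * half):
            for j in range(half):
                p, q = block + j, block + j + half
                w = cmath.exp(-2j * cmath.pi * (j * step) / n)
                a, b = data[p], data[q]
                data[p] = a + b          # written back to the input location
                data[q] = (a - b) * w    # written back to the input location
        half //= 2
    # DIF leaves the spectrum in bit-reversed order; undo it for comparison.
    bits = n.bit_length() - 1
    return [data[int(format(k, f"0{bits}b")[::-1], 2)] for k in range(n)]
```

In this flat-array model every stage after the first pairs addresses that a naive split into lower and upper halves would place in the same bank, which is the conflict the addressing schemes in this section are designed to remove.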
In the following section, a hardware efficient FFT engine with reduced critical path delay is proposed. Addressing logic is reduced by using a butterfly structure which modifies the conventional one by adding exchange circuits at the input and output of the butterfly (Xiao et al., 2008). With this butterfly structure, the two inputs and two outputs of any butterfly can be exchanged; hence all data addresses in FFT processing can be reordered. Using this flexible input and output ordering, the addressing logic is designed to be "in-place" and does not need barrel shifters.
Fig. 4. Signal flow graph of 16-point FFT using Ma's method (Ma, 1999) (figure labels: Memory Bank0, Memory Bank1)
Reduced address generation logic with the modified butterfly FFT (mbFFT)
This addressing scheme is based on a modified butterfly FFT (mbFFT) structure, which is shown in Fig. 5 . The main difference between the modified butterfly structure and the conventional one is the addition of two exchange circuits that are placed at both the input and the output of the butterfly unit. Each exchange circuit is composed of two (2:1) multiplexers; when the exchange control signal C1 or C2 is 1, the data will be exchanged, otherwise they keep their locations. 
Based on this butterfly structure, all data within the FFT processing can be reordered by setting different values of the exchange control signals C1 and C2. The control signals are chosen such that the input data always originate from two separate memory banks and the output data are written back to the same memory locations in order to achieve in-place operation.
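The behaviour of the exchange circuits can be modelled in software as follows. This is a minimal sketch, assuming a DIF-style butterfly core (sum and twiddled difference); the names `exchange` and `modified_butterfly` are hypothetical and the model ignores word widths and timing.

```python
def exchange(a, b, control):
    """Two 2:1 multiplexers: swap the pair when control is 1."""
    return (b, a) if control else (a, b)


def modified_butterfly(in0, in1, w, c1, c2):
    """Butterfly with an input exchange (c1) and an output exchange (c2)."""
    a, b = exchange(in0, in1, c1)
    upper = a + b
    lower = (a - b) * w
    return exchange(upper, lower, c2)
```

For example, if the operands arrive from the banks in swapped order, setting C1 = 1 restores the order expected by the butterfly core, and C2 = 1 routes the two results back to the banks they came from.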
16-point mbFFT implementation
For 16-point mbFFT, the signal flow graph is shown in Fig. 6. In the figure, the butterfly inputs or outputs indicated by broken lines denote that the data have been exchanged. Fig. 7 shows the complete address generation architecture and components for the 16-point FFT implementation. The address generation logic is composed of a 5-bit counter D, three inverters, a 3-bit shifter, three (2:1) multiplexers, two (4:1) multiplexers, four multi-bit (2:1) multiplexers and delay elements. Stage Counter S indicates which stage of the FFT is currently in progress and controls the two (4:1) multiplexers to generate the correct exchange control signals C1 and C2 for the butterfly operation. The 3-bit shifter shifts one bit at each stage and controls the three (2:1) multiplexers to generate the correct M1 address. Since this technique is "in-place", the addresses for read and write are the same, with the exception of a delay introduced to compensate for the butterfly computation time.

Table I
Counter B   Inverted B   Stage 0      Stage 1      Stage 2      Stage 3
                         M0    M1     M0    M1     M0    M1     M0    M1
000         111          000   000    000   100    000   110    000   111
001         110          001   001    001   101    001   111    001   110
010         101          010   010    010   110    010   100    010   101
011         100          011   011    011   111    011   101    011   100
100         011          100   100    100   000    100   010    100   011
101         010          101   101    101   001    101   011    101   010
110         001          110   110    110   010    110   000    110   001
111         000          111   111    111   011    111   001    111   000
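The address values listed in Table I follow a simple pattern that can be reproduced in software. The sketch below assumes that the first address of each pair is the Bank 0 (M0) address and the second is the Bank 1 (M1) address, and that the shifter supplies a mask of 000, 100, 110 and 111 in stages 0 to 3; both are readings of the table values and the component list above, not statements taken from the original design files.

```python
def mbfft16_addresses(stage, b):
    """Return (m0, m1) 3-bit addresses for stage 0..3 and butterfly counter b.

    The mask models the 3-bit shifter output; where a mask bit is 1 the
    corresponding multiplexer selects the inverted counter bit.
    """
    n_bits = 3
    all_ones = (1 << n_bits) - 1
    mask = (((1 << stage) - 1) << (n_bits - stage)) & all_ones
    inverted = ~b & all_ones
    m0 = b
    m1 = (b & ~mask & all_ones) | (inverted & mask)
    return m0, m1
```

A call such as `mbfft16_addresses(1, 0b001)` models the counter value 001 in the second stage, where only the most significant address bit is taken from the inverter.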
N-point mbFFT implementation
In order to generalize the addressing scheme for an $N = 2^n$-point FFT, the necessary circuit components of the addressing and control logic can be listed as follows:
Two memory banks, Bank 0 (M0) and Bank 1 (M1). In practice, Stage Counter S and Butterfly Counter B can be combined to a single counter D, where B is the least significant (n-1) bits of counter D, and S is the most significant
VLSI synthesis results
The mbFFT architecture is synthesized using TSMC CMOS 0.18 µm technology. Synthesis is performed with Cadence Build Gates and Encounter tools. The synthesis results for a 16-point FFT with 32-bit complex number input show a maximum clock frequency of 280 MHz with 0.665 mm$^2$ area and 0.645 mW total power consumption for the complete FFT operation, including the butterfly unit, address generation unit, and memory circuits. In order to compare different FFT addressing methods, the logic complexity can be evaluated based on gate counts, similar to (Ma, 1999). The sizes of some basic circuits and gates are listed in Table 2. The estimated gate count comparison for a 1024-point FFT of 32-bit complex data (16 bits each for the real part and imaginary part) is shown in Table 3. In terms of area, the mbFFT scheme requires 24% fewer transistors. This reduction is mainly due to the difference in logic complexity between multiplexers and barrel shifters. Based on the gate counts in Table 2 (and confirmed by synthesis results), an r-input (r:1) multiplexer is approximately 4 times smaller than an (r-1)-bit barrel shifter in terms of area. The delay of address generation for both read and write operations in the mbFFT addressing scheme is determined by two stages of multiplexers, where the first stage uses an r-input (r:1) multiplexer and the second stage uses a 2-input (2:1) multiplexer for a $2^r$-point FFT operation (see Fig. 7). In (Ma, 1999), the worst-case address generation delay is dominated by an (r-1)-bit barrel shifter and a (2:1) multiplexer. An (r-1)-bit barrel shifter requires $\lceil \log_2 (r-1) \rceil$ stages of (2:1) multiplexers in the critical path. Cohen's address generation method (Cohen, 1976) uses an r-bit parity check unit, an (r-1)-bit barrel shifter, and two (2:1) multiplexers in the critical path.
Standard cell synthesis results in Table 4 show that the proposed mbFFT address generation scheme is faster than (Cohen, 1976) and (Ma, 1999) for large FFTs, because it avoids the complex wiring and parasitic capacitances of barrel shifters and eliminates the parity-check operation. Compared to a pipelined FFT architecture such as R2SD$^2$F given in (Yang et al., 2006), shared memory architectures such as mbFFT offer significantly reduced hardware cost and power consumption at the expense of lower throughput. R2SD$^2$F requires $\log_4 N - 1$ multipliers, $2\log_4 N$ adders and $10\log_4 N$ multiplexers for the butterfly operations in an N-point FFT. In contrast, only one multiplier, two adders and four multiplexers are used in the mbFFT architecture datapath. The latency (total clock cycles) of a pipelined FFT architecture is shorter by a factor of $\frac{1}{2}\log_2 N$. However, its maximum achievable clock frequency would be lower than that of the mbFFT design due to the increased complexity of the R2SD$^2$F datapath and address generation. Hence, for embedded applications, the proposed reduced logic, shared memory FFT approach with modified butterfly units presents a more viable solution.
Types of Gates and Circuits    No. of Transistors
2-Input XOR                    10
2-1 Multiplexer                6
10-
Multiplierless FFT architectures using CORDIC algorithm
In FFT processors, butterfly operation is the most computationally demanding stage. Traditionally, a butterfly unit is composed of complex adders and multipliers. A complex multiplier can be very large and it is usually the speed bottleneck in the pipeline of the FFT processor. The Coordinate Rotation Digital Computer (CORDIC) (Volder, 1959) algorithm is an alternative method to realize the butterfly operation without using any dedicated multiplier hardware. CORDIC algorithm is versatile and hardware efficient since it requires only add and shift operations, making it suitable for the butterfly operations in FFT (Despain, 1974) . Instead of storing actual twiddle factors in a ROM, the CORDIC-based FFT processor needs to store only the twiddle factor angles in a ROM for the butterfly operation.
In recent years, several CORDIC-based FFT designs have been proposed for different applications (Abdullah et al., 2009; Lin & Wu, 2005; Jiang, 2007; Garrido & Grajal, 2007). In (Abdullah et al., 2009), a non-recursive CORDIC-based FFT was proposed by replacing the twiddle factors in the FFT architecture with non-iterative CORDIC micro-rotations. This reduces the ROM size but does not eliminate it completely. (Lin & Wu, 2005) proposed a "mixed-scaling-rotation" CORDIC algorithm to reduce the total number of iterations, but it increases the hardware complexity. (Jiang, 2007) introduced Distributed Arithmetic (DA) to the CORDIC-based FFT algorithms, but the DA look-up tables are costly to implement. (Garrido & Grajal, 2007) proposed a memory-less CORDIC algorithm to reduce the memory requirements of a CORDIC-based FFT processor by using only shift operations for multiplication.
Conventionally, a CORDIC-based FFT processor needs a dedicated memory bank to store the necessary twiddle factor angles for the rotation. In our earlier work (Xiao et al., 2010), a modified CORDIC algorithm for FFT processors was proposed which eliminates the need for storing the twiddle factor angles. The algorithm generates the twiddle factor angles successively with an accumulator. With this approach, the memory requirements of an FFT processor can be reduced by more than 20%, and the memory reduction improves with increasing radix size. Furthermore, the angle generation circuit consumes less power than angle memory accesses. Hence, the dynamic power consumption of the FFT processor can be reduced by as much as 15%. Since the critical path is not modified by the CORDIC angle calculation, system throughput does not change.
In the following sections, CORDIC algorithm fundamentals and the design of the proposed memory efficient CORDIC-based FFT processor are described.
CORDIC algorithm
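The CORDIC algorithm was proposed by J.E. Volder (Volder, 1959). It is an iterative algorithm to calculate the rotation of a vector by using only additions and shifts. Fig. 8 shows an example for the rotation of a vector $V_i$. Rotating $V_i = (x_i, y_i)$ by an angle $\phi$ gives

$$x_{i+1} = x_i\cos\phi - y_i\sin\phi = \cos\phi\,(x_i - y_i\tan\phi)$$
$$y_{i+1} = y_i\cos\phi + x_i\sin\phi = \cos\phi\,(y_i + x_i\tan\phi)$$

If the rotation angles are restricted so that $\tan\phi = \pm 2^{-i}$, the multiplication by $\tan\phi$ becomes a shift, and $\cos\phi$ can be simplified to a constant with a fixed number of iterations:

$$x_{i+1} = K_i\,(x_i - d_i\, y_i\, 2^{-i}) \tag{9}$$
$$y_{i+1} = K_i\,(y_i + d_i\, x_i\, 2^{-i}) \tag{10}$$

where $K_i = \cos(\arctan(2^{-i}))$ and $d_i = \pm 1$. The product of the $K_i$'s can be represented by the K factor, which can be applied as a single constant multiplication either at the beginning or at the end of the iterations. Then, (9) and (10) can be simplified to:

$$x_{i+1} = x_i - d_i\, y_i\, 2^{-i}$$
$$y_{i+1} = y_i + d_i\, x_i\, 2^{-i}$$

The direction of each rotation is defined by $d_i$, and the sequence of all $d_i$'s determines the final vector. $d_i$ is given as:

$$d_i = \begin{cases} -1, & z_i < 0 \\ +1, & \text{otherwise} \end{cases}$$

where $z_i$ is called the angle accumulator and is given by

$$z_{i+1} = z_i - d_i \arctan(2^{-i})$$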
All operations described through equations (10)-(13) can be realized with only additions and shifts; therefore, CORDIC algorithm does not require dedicated multipliers. CORDIC algorithm is often realized by pipeline structures, leading to high processing speed. Fig. 9 shows the basic structure of a pipelined CORDIC unit.
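The iteration can be illustrated with a short floating-point model. This is a minimal sketch of rotation-mode CORDIC in its standard form (at step i the vector is rotated by plus or minus arctan(2^-i) according to the sign of the remaining angle, and the constant gain is corrected once at the end); it uses floating-point multiplications by 2^-i in place of hardware shifts, and the function name and default iteration count are assumptions made for the example.

```python
import math


def cordic_rotate(x, y, angle, iterations=16):
    """Rotate (x, y) by angle in radians using CORDIC micro-rotations.

    The basic iteration converges only for |angle| up to about 1.74 rad.
    """
    z = angle  # angle accumulator: remaining rotation
    for i in range(iterations):
        d = 1 if z >= 0 else -1
        shift = 2.0 ** -i  # a right shift by i bits in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)
    # Constant scale factor K = product of cos(arctan(2^-i)).
    k = 1.0
    for i in range(iterations):
        k *= math.cos(math.atan(2.0 ** -i))
    return x * k, y * k
```

In a pipelined hardware unit each loop iteration would correspond to one pipeline stage, with the arctan values hard-wired as constants.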
As shown in equation (1), the key operation of FFT is $x(n) \cdot W_N^{nk}$ (where $W_N^{nk} = e^{-j 2\pi nk/N}$). This is equivalent to the "rotate $x(n)$ by angle $-2\pi nk/N$" operation, which can be realized easily by the CORDIC algorithm. Without any complex multiplications, a CORDIC-based butterfly can be fast. A conventional FFT processor needs to store the twiddle factors in memory. A CORDIC-based FFT does not have twiddle factors but needs a memory bank to store the rotation angles. For a radix-2, N-point, m-bit FFT, $mN/2$ bits of memory are needed to store $N/2$ angles. In the next section, a new CORDIC based FFT design which does not require any twiddle factor or angle memory units is presented. This design uses a single accumulator for generating all the necessary angles instantly and does not have any precision loss.
Reduced memory CORDIC based FFT
Although several multi-bank addressing schemes have been used to realize parallel and pipelined FFT processing (Ma, 1999; Xiao et al., 2008), these methods are not suitable for the reduced memory CORDIC FFT. In these schemes, the twiddle factor angles are not in regular increasing order (see Table 5), resulting in a more complex design for angle generators. As shown in Table 6, using a special addressing scheme first proposed in (Xiao et al., 2009), the twiddle factor angles follow a regular, increasing order, which can be generated by a simple accumulator. Table 6 shows the address generation table of the 16-point radix-2 FFT. It can be seen that the twiddle factor angles are sequentially increasing, and every angle is a multiple of the basic angle $2\pi/N$, which is $\pi/8$ for a 16-point FFT. For different FFT stages, the angles always increase one step per clock cycle. Hence, an angle generator circuit composed of an accumulator and an output latch can realize this function, as shown in Fig. 10. The control signal for the latch, which enables or disables the accumulator output, is simple: it is based on the current FFT butterfly stage and the RAM address bits $b_2 b_1 b_0$ (see Table 6).

Fig. 9. Basic structure of a pipelined CORDIC unit
Fig. 10 (block labels: CLK, Angle Latch, Control, Accumulator, Register, $2\pi/N$)

(Ma, 1999) design for 16-point radix-2 FFT

Fig. 11 shows the architecture of the proposed no-twiddle-factor-memory design for radix-2 FFT. Four registers and eight 2-to-1 multiplexers are used. Registers are needed before and after the butterfly unit to buffer the intermediate data in order to group two sequential butterfly operations together, so that conflict-free "in-place" data accessing can be realized. This register-buffer design can be extended to any radix FFT. For radix-2, the structure can be simplified by using just 4 registers, but for radix-r FFT, $2 \times r^2$ registers are needed. Fig. 12 shows the structure for radix-r FFT.

Fig. 11. Radix-2 FFT processor with no-twiddle-factor-memory (block labels: Butterfly, Angle Generator)

... will provide the address sequences and the control logic of the angle generator. In stage S, the memory address is given by
... ...

For a radix-2, $N = 2^n$-point, m-bit FFT (each data word is a 2m-bit complex number, with m bits each for the real part and imaginary part), the proposed angle generator reduces the $5mN/2$ bits of memory required by the conventional CORDIC to $4mN/2$, which corresponds to a 20% reduction. For higher radix FFT, the reduction is even more significant. For radix-r FFT, the saving is $(r-1)mN/r$ bits out of $(3r-1)mN/r$, which converges to a 33.3% reduction.
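The memory figures can be checked with exact arithmetic. The sketch below assumes the accounting used in the text: the data memory holds N complex words of 2m bits, and a conventional radix-r CORDIC design additionally stores (r-1)mN/r bits of angles. The function names are illustrative.

```python
from fractions import Fraction


def cordic_fft_memory_bits(m, n_points, r=2):
    """Return (conventional_bits, proposed_bits) for an m-bit, N-point FFT."""
    data_bits = 2 * m * n_points                      # N complex words, 2m bits each
    angle_bits = Fraction((r - 1) * m * n_points, r)  # angle memory removed by the accumulator
    return data_bits + angle_bits, data_bits


def saving_fraction(r):
    """Fraction of total memory saved when the angle memory is eliminated."""
    return Fraction(r - 1, 3 * r - 1)
```

For radix-2 the saving fraction is 1/5, and it approaches 1/3 as r grows, which is the 33.3% limit quoted above.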
Due to finite wordlength, as the accumulator operates, the precision loss will accumulate as well. In order to address this issue, more bits (wider wordlength) can be used for the fundamental angle 2π/N and the accumulator logic. For example, for 1024-point FFT, the accumulator is extended from 16 bits to 21 bits and no precision loss is observed compared to a conventional angle-stored CORDIC FFT processor.
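The effect of the accumulator wordlength can be illustrated numerically. The following is a minimal sketch, assuming the basic angle 2π/N is rounded to a fixed number of fractional bits and then accumulated once per step; the fixed-point format is an assumption made for illustration, and the bit counts below are fractional bits rather than the total wordlengths quoted in the text.

```python
import math


def max_accumulated_error(n_points, frac_bits):
    """Worst angle error (radians) over the first N/2 accumulated multiples of 2*pi/N."""
    scale = 1 << frac_bits
    exact_step = 2.0 * math.pi / n_points
    step_int = round(exact_step * scale)  # quantized basic angle
    worst = 0.0
    acc = 0                               # integer accumulator
    for k in range(n_points // 2):
        worst = max(worst, abs(acc / scale - k * exact_step))
        acc += step_int
    return worst
```

Because the rounding error of the basic angle is multiplied by the step index, each extra fractional bit halves the bound on the worst-case accumulated error.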
FPGA synthesis results
The proposed reduced memory CORDIC based FFT designs for both radix-2 and radix-4 FFT algorithms have been realized in Verilog HDL and implemented on an FPGA chip (STRATIX-III EP3SE50C2). The synthesis results in Table 7 show that these designs can reduce memory usage for FFT processors without any tangible increase in the number of logic elements used when compared against the conventional CORDIC implementation (i.e., angles are stored in memory). Furthermore, dynamic power consumption is reduced (up to 15%) with no delay penalties. The synthesis results match the theoretical analysis.
Low-power FFT addressing schemes
For embedded applications, power dissipation is often a crucial design goal. (Ma & Wanhammar, 1999) proposed a new addressing logic to improve the memory accessing speed and to reduce the power consumption. (Hasan et al., 2003) designed a new coefficient ordering method to reduce the power consumption of radix-4 short-length FFTs. Gate-level algorithms have also been proposed (Zainal et al., 2009; Saponara, 2003) to reduce the FFT processor's power consumption by using lower supply voltages and/or voltage scaling. Power consumption of FFT processors can be significantly reduced by optimizing both data and coefficient memory accesses. Dynamic power consumption in CMOS circuits can be characterized by the following equation:

$$P_{dynamic} = \alpha \cdot C_{total} \cdot V_{DD}^{2} \cdot f \tag{14}$$

where $\alpha$ is the switching activity, $V_{DD}$ is the supply voltage, $f$ is the frequency and $C_{total}$ is the total switching capacitance charging and discharging in the circuit. In particular, architectural techniques can reduce two parameters in (14), $C_{total}$ and $\alpha$. These techniques are discussed next. First, a multi-bank memory structure is proposed for data memory accesses, resulting in reduced overall capacitance load on the SRAM bit-lines. Second, a new butterfly calculation order reduces the memory access frequency for twiddle factors and minimizes the switching activity.
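The leverage of the two architectural parameters can be seen from a small numerical model. This sketch assumes the common CMOS dynamic power expression P = α · C_total · V_DD² · f; the numeric values in the usage note are arbitrary illustration values, not measurements from this work.

```python
def dynamic_power(alpha, c_total, v_dd, frequency):
    """Dynamic power in watts for switching activity alpha, capacitance in
    farads, supply voltage in volts and clock frequency in hertz."""
    return alpha * c_total * v_dd ** 2 * frequency
```

For instance, comparing `dynamic_power(0.5, 2e-12, 1.8, 100e6)` with `dynamic_power(0.5, 1e-12, 1.8, 100e6)` shows that halving the switched capacitance halves the dynamic power, and the same holds for halving the switching activity.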
Memory bank partitioning
Since FFT operation largely consists of data and twiddle factor memory accesses, it is desirable to reduce the power dissipation caused by memory accesses. Memory bank partitioning and bitline segmentation are important techniques to reduce the power dissipation in SRAMs. The bitlines (each read and write port is associated with one bitline) in the SRAM logic are a significant source of energy dissipation due to their large capacitive load. This capacitance has two components: the wire capacitance of the bitlines and the diffusion capacitance of each pass transistor connecting the bitline to the bitcells. Hence, the capacitive load increases linearly with the number of components attached to the bitline, i.e., the number of words or the size of the memory. In order to reduce this large capacitive load, the data memory can be partitioned into four memory banks instead of two. As a result, the capacitive loading in each memory bank is lowered, since the bitline wire length and the number of pass transistors connected to the bitline are now only one fourth of those of the original bitline. The first two memory banks, bank0 and bank1, are accessed by the upper leg of the butterfly structure, and bank2 and bank3 are accessed by the lower leg of the butterfly (see Fig. 13). The most significant bit (MSB) of the addresses determines which two memory banks will be accessed; the remaining two memory banks will be inactive. A multi-bank memory structure has been proposed before (Ma & Wanhammar, 2000), but a major advantage of the proposed addressing scheme is that the memory bank switching occurs only once, in the middle of a stage. In the first half of the stage, the same two memory banks are used, and in the second half of the stage, the other two memory banks are accessed. There is no precharging and discharging of bitlines in the inactive memory banks.
Reordering coefficient access sequence
The mbFFT architecture (see Section 2.2) can be used to generate the addressing scheme for reducing twiddle factor memory accesses and switching activity power. The twiddle factor access sequence is optimized for minimizing data bus changes. For all butterfly stages, the twiddle factor addresses are ordered in such a way that the twiddle factors at the same address are grouped together and accessed sequentially. This way, the twiddle factor ROM is not accessed every clock cycle. Reordering of the coefficient access sequences is shown in Table 8 and Table 9 . For example, in stage 1 in Table 9 , only 8 accesses are needed instead of 16, and in stage 2, only 4 accesses instead of 8 and so on. 
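The benefit of grouping identical twiddle addresses can be illustrated with a simple counting model. This is a toy sketch, not the ordering of Table 8 or Table 9: it assumes a radix-2 DIF FFT in which stage s (counted from 0) uses N/2^(s+1) distinct twiddle addresses, and it counts one ROM read whenever the requested address differs from the previous one. All names are hypothetical.

```python
def twiddle_sequence(n_points, stage, grouped):
    """Twiddle ROM addresses requested during one stage, in issue order."""
    half = n_points >> (stage + 1)   # distinct twiddle factors in this stage
    blocks = 1 << stage              # butterflies sharing each twiddle factor
    stride = 1 << stage
    if grouped:
        # All butterflies that share a twiddle factor are issued back to back.
        return [j * stride for j in range(half) for _ in range(blocks)]
    # Natural order: walk through each block in turn.
    return [j * stride for _ in range(blocks) for j in range(half)]


def rom_reads(sequence):
    """Count reads, assuming the ROM output is held while the address repeats."""
    reads = 0
    last = None
    for address in sequence:
        if address != last:
            reads += 1
            last = address
    return reads
```

In this model, a 32-point FFT needs 8 reads in stage 1 with grouping, compared with 16 when the butterflies are issued block by block.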
(Tables 8 and 9, column headings: Twiddle factor address, Bank 2,3 address, Bank 0,1 address)
Equations (15) and (16) show the twiddle factor memory access frequency for shared memory methods (Xiao et al., 2008) and the proposed reduced memory access method for an $N = 2^n$-point FFT.
Conventional method:
Reduced memory access method:

Table 10 shows the twiddle factor memory access frequency for different FFT lengths. As FFT length increases, the power saving also scales up.
Implementation
To implement an $N = 2^n$-point FFT with reduced coefficient memory accesses, an (n-1)-bit ...

Reduction 22% 40% 52% 61% 67% 72% 75% 78% 80% 82%
Table 10. Reduction in twiddle factor memory access frequency

For example, for the 32-point FFT shown in Table 9, at stage 2, the address of the upper legs of the butterfly is 210 102, and when $b_3 = 0$, memory bank0 will be accessed; when $b_3 = 1$, memory bank1 will be accessed. For the read and write addresses of the lower legs of the butterfly, (n-2) inverters are needed. The address is given by
FPGA synthesis results
The low-power FFT algorithm is implemented on an FPGA chip (ALTERA STRATIX EP1S25F780C5) with FFT length up to 8192 points as shown in Table 11 . The synthesis results demonstrate that dynamic power reduction grows with the transform size, making this architecture ideal for applications requiring long FFT operations.
Conclusion
This study focused on hardware efficient and low-power realization of FFT algorithms. Recent novel techniques have been discussed and presented to realize conflict-free memory addressing of FFT. The proposed methods reorder the data and coefficient address sequences in order to achieve significant logic reduction (24% fewer transistors) and delay improvements within FFT processors. Multiplierless implementation of FFT is shown using a CORDIC algorithm that does not need any coefficient angle memory, resulting in 33% memory and 15% power reduction. Finally, optimization of FFT dynamic power consumption is presented through memory partitioning and reduced coefficient memory access frequency (26% power reduction for an 8192-point FFT).
