Abstract. This chapter examines the superscalar pipeline Fast Fourier Transform (FFT) algorithm and architecture. The algorithm provides a memory management scheme that avoids memory contention throughout the pipeline stages. The fundamental algorithm, a switch-based FFT pipeline architecture and an example 64-point FFT implementation are presented. The pipeline consists of log_2 N stages, where N is the number of FFT points. Each stage can have M Processing Elements (PEs). As a result, the architecture speedup is M*log_2 N. The pipeline algorithm is configurable to any M > 1.
I. INTRODUCTION
The Fast Fourier Transform (FFT) algorithm, presented in [1], is a standard method for computing the Discrete Fourier Transform (DFT). The FFT algorithm consists of log_2 N loops, where each loop executes N/2 complex operations. FFT processor design has been researched extensively in the last few decades for speed, area and power optimization. As a result, many implementations have been proposed and developed to address one or more of the following optimization areas: architecture, memory access and power consumption. A variety of FFT architectures have been proposed, which employ different techniques such as pipelining, multiprocessing and cache design, as shown in Figure 1 [2]. A single memory architecture consists of a scalar processor connected to a single N-word memory via a bidirectional bus. While this architecture is simple, its performance suffers from limited memory bandwidth. A cache memory architecture adds a cache between the processor and the memory to increase the effective memory bandwidth. A dual memory architecture uses two memories connected to a digital array signal processor. A memory controller generates addresses to the memories in a ping-pong fashion. The processor array architecture consists of independent processing elements with local buffers, which are connected using an interconnect network. Finally, the pipeline FFT architecture utilizes log_r N blocks; each block consists of delay lines and radix-r butterfly units.
Processor memory access is another area of optimization that has received considerable research. Several algorithms have been proposed to avoid memory contention. Specifically, the address generation algorithm and logic are optimized for speed and area. Cohen presented a memory address generation scheme in [3] that allows a parallel organization of memory, so that the pairs of data that are used at any instant reside in different memories. The address generation is based on a counter, shifters and rotators. In [4], Pease proposed dividing the memory into sub-memories to overlap the accesses. He observed that the operand addresses differ only in the (n-i)-th bit for the butterfly operand pair in stage i, where n is the number of address bits. A multi-bank memory address assignment for a radix-r FFT was developed in [5]. A fast address generation scheme is described in [6] with hardware cost comparable to the address generation scheme in [3]. Ma and Wanhammar presented an address generation scheme in [7] to reduce the hardware complexity and power consumption. Power is reduced by activating only half of the memory during memory access and by minimizing the number of memory accesses. These methods do not address conflicts among multiple processors accessing memory simultaneously. Lastly, several power reduction techniques were designed for energy-efficient processors, including techniques to reduce memory accesses. A cache-memory architecture was described in [8] to reduce communication energy between FFT processors and memories. In [9] and [10], Zhong et al. described a power-scalable reconfigurable ring-architecture multiprocessor for a single-chip FFT/IFFT processor.
The processor is capable of processing different FFT sizes with scalable power across FFT sizes. However, while the use of the processor ring architecture seems to be an interesting idea, the case for using the ring architecture to compute FFTs is weak. The architecture seems to be better suited for more serialized computations such as FIR filters. Also, large values of N require more complex processor programs. Further, power does scale well for N ≤ 128.
This chapter presents a superscalar pipeline architecture to achieve maximum speed for FFT processing. A switch fabric controls and connects single-port memories and processing elements (PEs). A memory management algorithm avoids memory access contention. Because the algorithm rearranges data in the memories, the data must be tracked throughout the pipeline so that the right pair of operands is processed in each FFT computation. The ordering of data elements is used to calculate the twiddle factors and other important indices. The algorithm provides an implicit method to track data. The superscalar pipeline achieves a speedup of M*log_2 N.
The chapter is organized as follows. Section II discusses current pipeline designs. Next, Section III explains the pipeline architecture and analyzes pipeline speedup, hazards and optimizations. Section IV discusses hazard conditions and resolutions. It provides pseudocode for the pipeline memory management algorithm. Section V details the design of a 64-point FFT with emphasis on the data movement and storage in the pipeline and memories. Section VI compares the proposed design with other pipeline FFTs.
II. EXISTING PIPELINE FFT ARCHITECTURES
This section reviews the main pipeline FFT architectures. Groginsky and Works developed an early pipeline FFT design [11]. Several pipeline FFTs have been implemented [12]-[14]. Later, several pipeline architectures were proposed and designed [15]-[17]. Pipeline FFT processors consist of log_r N stages; each stage uses memories and complex multipliers/adders whose sizes depend on the pipeline type. Because it performs log_r N butterflies in parallel, the radix-r pipeline FFT processor has a speedup of (at least) log_r N compared to an FFT performed on a single radix-r FFT processor. Based on the number of paths between stages, FFT pipelines are classified into Single-path Delay Feedback (SDF) and Multi-path Delay Commutator (MDC). The modular pipeline constructs the pipeline from two smaller pipelines to reduce power. The rest of this section explains the SDF, MDC and modular pipelines.
SDF Pipeline FFT
The SDF pipeline FFT has one path between stages, as shown in Figure 2 . The pipeline uses feedback registers in each stage. The feedback registers store previous stage outputs for use by the butterfly. Figure 2 illustrates the SDF pipeline FFT for a radix-r N-point FFT and shows an example of an 8-point radix-2 pipeline [15] , [16] . Each SDF stage is comprised of:
• A radix-r FFT butterfly. Each butterfly is followed by a complex multiplier (shown explicitly in Figure 2).
• Feedback registers, e.g., stage 0 has (r-1)(N/r) registers.
The pipeline hardware complexity depends on the number of delay elements and multipliers. The total number of complex multipliers is (log_r N - 1) [15], [16]. Additionally, the total number of registers in the pipeline is N-1. A high-radix SDF (i.e., r > 2) can also be implemented by cascading several radix-2 processing elements, referred to as radix-2^s [15]. Calculating pipeline throughput and complexity is straightforward. The SDF pipeline accepts a new point each clock cycle and outputs one point per cycle. Therefore, the pipeline throughput is one point per cycle.
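As a consistency check on the register count, assume (this generalizes the stage-0 figure above and is not stated explicitly in the text) that stage i holds (r-1)N/r^(i+1) feedback registers. Summing over the log_r N stages then recovers the stated total of N-1:

```latex
\sum_{i=0}^{\log_r N - 1} (r-1)\,\frac{N}{r^{\,i+1}}
  = (r-1)\,N \cdot \frac{1}{r} \cdot \frac{1 - r^{-\log_r N}}{1 - r^{-1}}
  = N \left( 1 - \frac{1}{N} \right)
  = N - 1 .
```

For the 8-point radix-2 example, the three stages would hold 4, 2 and 1 registers, for a total of 7.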
MDC Pipeline FFT
The radix-r MDC pipeline FFT utilizes r paths between stages, as shown in Figure 3 [15], [16]. With the exception of one path, all paths utilize delays with different numbers of registers. Each stage receives r intermediate results from the previous stage, and passes r outputs to the next stage. An example of an 8-point radix-2 MDC pipeline FFT is shown in Figure 3. An MDC stage is comprised of:
• An r-input commutator.
• A radix-r butterfly, which includes (r-1) complex multipliers.
• Two sets of shift registers. The first set is located before the commutator (shown as D). This set does not exist in stage 0. The second set is situated after the commutator.
The number of registers in the j-th element of each set in stage i can be expressed as:
$D_{i,j} = DD_{i,j} = j \times N / r^{i+1}$
An example of the shift register sizes for a 1024-point radix-4 pipeline FFT is shown in Table 1. The pipeline complexity is a function of the number and size of delay shift registers, adders and multipliers. The total number of delay registers is (r+1)N/2 - r. In addition, there are (r-1)(log_r N - 1) complex multipliers and 2(r-1)(log_r N - 1) complex adders in the pipeline [12], [16]. In contrast to the SDF pipeline, the MDC pipeline receives r points and outputs r points in each clock cycle. Thus, the pipeline throughput is r points per cycle.
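The register-size expression can be checked numerically against the stated total of (r+1)N/2 - r. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that the path index j runs from 0 to r-1 (so that path 0 has no delay) and that every stage has a second set while only stage 0 lacks the first set.

```python
def num_stages(n: int, r: int) -> int:
    """Return log_r(n), checking that n is an exact power of r."""
    stages, size = 0, 1
    while size < n:
        size *= r
        stages += 1
    if size != n:
        raise ValueError("n must be a power of r")
    return stages


def mdc_register_sizes(n: int, r: int):
    """Per-stage shift-register sizes for a radix-r MDC pipeline.

    Returns one (d, dd) tuple per stage: d is the set before the
    commutator (empty for stage 0) and dd is the set after it.
    Assumption: path index j runs over 0..r-1.
    """
    sizes = []
    for i in range(num_stages(n, r)):
        dd = [j * n // r ** (i + 1) for j in range(r)]
        d = [] if i == 0 else list(dd)
        sizes.append((d, dd))
    return sizes


# 1024-point radix-4 example
sizes = mdc_register_sizes(1024, 4)
total = sum(sum(d) + sum(dd) for d, dd in sizes)
```

Under these assumptions the per-stage sizes sum to (r+1)N/2 - r for both the 1024-point radix-4 case and the 8-point radix-2 case of Figure 3.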
The Modular Pipeline
El-Khashab et al. developed the modular pipeline FFT detailed in [18]-[20]. The N-point modular pipeline FFT consists of two √N-point FFT modules joined by a specialized center element. The center element contains coefficient and data memory as well as addressing, routing and control logic. The modular pipeline FFT significantly reduces the size of the shift registers. Moreover, the coefficient storage is concentrated within the center element, which can be implemented using energy-efficient RAM memories. Further, the throughput of the modular pipeline FFT is identical to that of the standard pipeline FFT, although the end-to-end latency is very slightly higher.
The modular pipeline FFT algorithm is expressed mathematically by the following equation, which demonstrates the two-stage N-point FFT: 
where:
To obtain the correct results, the transforms of the first stage are combined (in a fixed way) and fed to the second stage. Table 2 compares the modular pipeline with a conventional N-point pipeline FFT. Although it requires a larger memory, the modular pipeline has fewer shift registers. The modular pipeline FFT requires an additional pre-rotation multiplication and has very slightly higher latency than the standard pipeline FFT.
III. THE SWITCH-BASED ARCHITECTURE
This section describes the superscalar pipeline architecture for a radix-2 FFT.
Superscalar Pipeline Architecture
The pipeline architecture of an N-point radix-2 FFT consists of log_2(N) stages.
[Figure 7. Overview of the pipeline architecture: pipeline stage i and its switch fabric.]
The Hazard-Free Superscalar Pipeline Fast Fourier Transform Architecture and Algorithm
Figure 7 shows an overview of the pipeline architecture. Each stage is capable of calculating M radix-2 butterfly results. Using the Instruction Level Parallelism (ILP) classification from [22], the architecture is a superscalar machine with Instruction Parallelism (IP) equal to M. It is also a super-pipeline where each cycle has N/(2*M) minor cycles. The architecture applies to the decimation-in-time FFT as well, where the specifications of stage i in the decimation-in-time algorithm are the same as those of stage log_2(N)-i in the decimation-in-frequency algorithm. A scalar machine takes (N/2)*log_2(N) steps to execute an N-point radix-2 FFT algorithm. The architecture consists of log_2(N) stages, where each stage executes M operations. Therefore, the pipeline speedup is M*log_2(N). The maximum pipeline speedup is (N/2)*log_2(N), when M = N/2. In this case the memories are reduced to registers, and the switch fabric connects any register to any PE. Clearly, while this case provides the greatest speedup, its hardware is expensive. The optimum value of M is decided by the design parameters: speed, area and power.
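The speedup argument can be made concrete with a few lines of Python. This is a minimal sketch with hypothetical function names; it assumes N is a power of two, that 2M divides N, and that speedup is measured in steady state, where the pipeline completes one FFT every N/(2*M) minor cycles.

```python
def log2_exact(n: int) -> int:
    """Return log2(n) for a power of two, else raise."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return n.bit_length() - 1


def scalar_steps(n: int) -> int:
    """Butterfly steps on a scalar machine: (N/2) * log2(N)."""
    return (n // 2) * log2_exact(n)


def minor_cycles_per_stage(n: int, m: int) -> int:
    """Minor cycles each stage needs for its N/2 butterflies on M PEs."""
    return n // (2 * m)


def pipeline_speedup(n: int, m: int) -> int:
    """Steady-state speedup over the scalar machine (assumes 2M divides N)."""
    return scalar_steps(n) // minor_cycles_per_stage(n, m)


speedup_64_4 = pipeline_speedup(64, 4)    # M * log2(N) = 4 * 6
speedup_64_32 = pipeline_speedup(64, 32)  # M = N/2, the maximum case
```

For N = 64, the sketch gives a speedup of 24 with M = 4 and 192 with M = N/2 = 32, which agrees with M*log_2(N) in both cases.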
Pipeline Design Optimization
Upon close examination of the FFT algorithm, it is clear that not all twiddle factors are used in all stages. Also, the algorithm allows PEs to have identical twiddle factors in some stages, and therefore, not all the ROMs are required. In fact, the number and size of ROMs per stage can be reduced as outlined in Table 3 . 
If the pipeline is designed for a fixed value of N, the pipeline connectivity and twiddle factors are fixed. As a result, the design implementation can be optimized since the connectivity of each stage is predetermined. Figure 8 illustrates the connectivity of a 16-point 2-PE pipeline. Furthermore, in many computations the value of the twiddle factor is one. A twiddle factor of one reduces the PE computation to add/subtract operations. Also, several PEs execute specific sets of twiddle factors, which can lead to design simplification.
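To illustrate how few distinct twiddle factors a stage needs, the sketch below enumerates the twiddle exponents per stage. It assumes the standard radix-2 decimation-in-frequency indexing, in which the butterfly at upper index k in stage s uses W_N raised to (k mod N/2^(s+1)) * 2^s; the chapter does not spell out this indexing, and the function name is hypothetical.

```python
def dif_twiddle_exponents(n: int, stage: int):
    """Exponents e of W_N^e used by the N/2 butterflies of one DIF stage.

    Assumes standard radix-2 decimation-in-frequency indexing.
    """
    half = n >> (stage + 1)            # distance between the two operands
    exps = []
    for k in range(n):
        if k & half == 0:              # k is the upper input of a butterfly
            exps.append((k % half) << stage)
    return exps


n = 16
distinct = [len(set(dif_twiddle_exponents(n, s))) for s in range(4)]
trivial = [dif_twiddle_exponents(n, s).count(0) for s in range(4)]
```

Under this indexing, a 16-point FFT uses 8, 4, 2 and 1 distinct twiddle factors in stages 0 to 3, and 1, 2, 4 and 8 of the eight butterflies per stage have a twiddle factor of one.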
Bassam Mohd
Earl E. Swartzlander, Jr. Adnan Aziz
[Figure 8. Connectivity of a 16-point, 2-PE pipeline, showing the inputs x(i) and the stage outputs x1(i) through x4(i).]
As indicated earlier, the speedup of the pipeline depends on two factors: the number of PEs per stage (i.e., M) and the number of stages (log_2(N)), since Speedup = M*log_2(N). One might ask, "Given a fixed target speedup (e.g., S), which factor should be increased to achieve a more efficient design: the number of stages or the number of PEs per stage?" Consider a pipeline with a speedup of S with two designs, design A and design B, as shown in Table 4. Design A has one PE per stage, while design B has one stage. Clearly,
• Design B requires less memory than design A, since the total memory of design A is proportional to S.
• The switch fabric of design A is simpler than that of design B. The complexity of the design B switch fabric is proportional to S^2.
The main disadvantage of increasing the number of stages is the increase in total memory. On the other hand, increasing the number of PEs per stage increases the complexity of the switch fabric. Hence, the tradeoff between the two factors depends on the constraints on the total memory and the maximum complexity of the switch. Only specific design goals and technology processes can determine the optimum solution.
Pipeline Hazards
The main source of hazards in the pipeline is memory contention. Memory contention occurs when one or more PEs request two or more accesses to a given memory at the same time. Memory contention stalls the pipeline and reduces the system speed. In the decimation-in-frequency FFT, memory contention does not occur in the early stages; it occurs from stage log_2(M)+1 to the last stage. In the decimation-in-time FFT, contention affects stage 0 to stage log_2(N) - log_2(M) - 1. Figure 8 shows an example of memory contention for N=16 and M=2. It is clear that stage 0 and stage 1 have no contention. However, contention occurs in stage 2 and stage 3. Observe the following:
• In stage 2 the inputs for the top PE are x_2(0) and x_2(2), both of which reside in MEM0.
• In stage 3 the inputs for the top PE are x_3(0) and x_3(1), both of which reside in MEM0.
One solution for memory contention is to use a multi-port memory. However, multi-port memories are expensive and can slow down the system performance. In addition, the later stages of the pipeline have a higher degree of contention, which requires more ports in the memory. Eventually, it becomes impractical to implement the required multi-port memory. Moreover, the number of memory ports varies across the memory hierarchy. Register files usually have more ports than caches and SRAMs. Requiring a certain number of memory ports restricts where the intermediate results can be saved in the memory system. Another solution is to employ a memory management mechanism to mitigate the hazard, as discussed in the next section.
IV. HAZARD FREE PIPELINE ALGORITHM
The main idea of the algorithm is to resolve memory contention in the early stages of the pipeline. The rest of the section describes the hazard conditions, memory management operations and the algorithm.
Detecting Pipeline Hazards
From Figure 8, in stage 0, x(0) and x(8) go to PE_0. Similarly, x(1) and x(9) go to PE_1, and so on. Define the stage distance as the difference between the indices of the two operands of a butterfly in a given stage. The stage distance for a 16-point pipeline FFT is shown in Table 5.
In general, for an N-point pipeline FFT, the stage distance for stage i is equal to N/2^(i+1). Memory contention occurs when the stage distance falls within a single memory space. From Section III, the memory size is equal to N/(2*M). Hence, memory contention occurs in stage i if the following condition is satisfied:
$N / 2^{i+1} < N / (2M)$    (2)
A stage that satisfies condition (2) will be referred to as a hazard stage; the rest of the stages are safe stages. For instance, in Figure 8, stage 2 and stage 3 are hazard stages. Define memory pair (i, j)_t as memory locations x(i) and x(j) for stage t. In stage 2, the following memory pairs are hazard pairs: (0, 2)_2, (1, 3)_2, (4, 6)_2, (5, 7)_2. Other pairs will be referred to as safe pairs, for instance (3, 5)_2. The stage distance can be represented in binary form:
Define pair (i, j)_t as a hazard pair if and only if:
1. t is a hazard stage.
2. The bitwise Exclusive-OR of addresses i and j is equal to the stage t distance.
For example, the address pair (5, 7)_2 is a hazard pair since:
Stage-2 distance $= 2_{10}$
$5_{10} \oplus 7_{10} = 0101_2 \oplus 0111_2 = 0010_2 = 2_{10}$
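The two hazard tests can be sketched in Python as follows. The function names are hypothetical, and the hazard-stage test encodes one reading of the text: a stage is a hazard stage when its stage distance N/2^(i+1) is strictly smaller than the memory size N/(2*M), which matches the Figure 8 example where stages 2 and 3 are hazard stages.

```python
def stage_distance(n: int, stage: int) -> int:
    """Index distance between butterfly operands in a stage: N / 2^(stage+1)."""
    return n >> (stage + 1)


def memory_size(n: int, m: int) -> int:
    """Words per memory, N / (2M), as given in Section III."""
    return n // (2 * m)


def is_hazard_stage(n: int, m: int, stage: int) -> bool:
    """True when both operands of a butterfly fall in the same memory."""
    return stage_distance(n, stage) < memory_size(n, m)


def is_hazard_pair(i: int, j: int, stage: int, n: int, m: int) -> bool:
    """Hazard pair: hazard stage and i XOR j equal to the stage distance."""
    return is_hazard_stage(n, m, stage) and (i ^ j) == stage_distance(n, stage)


# Figure 8 example: N = 16, M = 2
hazard_stages = [s for s in range(4) if is_hazard_stage(16, 2, s)]
```

For N = 16 and M = 2 the sketch flags stages 2 and 3, classifies (5, 7) in stage 2 as a hazard pair, and classifies (3, 5) in stage 2 as a safe pair.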
Memory Management Operations
Let x_i(t) and x_j(t) be the i-th and j-th elements in stage t, with i < j. Define the memory management operations as follows (see Figure 9):
• Normal operation: Inputs x_i(t) and x_j(t) are provided to the first and second inputs of the PE, a and b. The results c and d are saved in x_i(t+1) and x_j(t+1).
• Shuffle operation: The shuffle operation affects how PE results are saved back in memory. The results c and d are saved in x_j(t+1) and x_i(t+1), respectively.
• Swap operation: The swap operation affects the order of PE inputs. In a swap operation, x_i(t) is provided to b (instead of a) and x_j(t) is provided to a (instead of b). The swap operation is needed because the PE is an asymmetric unit and the memory management algorithm changes the normal order of data in the memory. If the algorithm detects a case where the inputs would arrive in the wrong order, the swap operation is performed.
• Swap and shuffle operation: A PE operation can have both swap and shuffle memory operations at the same time.
Fig. 9. Memory Management Operations [21]
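The four operations can be modeled with a small Python sketch. The PE is assumed here to be a radix-2 decimation-in-frequency butterfly, c = a + b and d = (a - b) * w, which is asymmetric in its inputs; the chapter does not give the PE equations in this section, so this choice and the function names are illustrative only.

```python
def butterfly(a, b, w):
    """Assumed PE function: radix-2 DIF butterfly, asymmetric in a and b."""
    return a + b, (a - b) * w


def pe_operation(x_i, x_j, w, swap=False, shuffle=False):
    """Return the values written to x_i(t+1) and x_j(t+1).

    swap:    x_i(t) drives input b and x_j(t) drives input a.
    shuffle: result c is stored in x_j(t+1) and result d in x_i(t+1).
    """
    a, b = (x_j, x_i) if swap else (x_i, x_j)
    c, d = butterfly(a, b, w)
    return (d, c) if shuffle else (c, d)


normal = pe_operation(3, 1, 1)
shuffled = pe_operation(3, 1, 1, shuffle=True)
swapped = pe_operation(3, 1, 1, swap=True)
both = pe_operation(3, 1, 1, swap=True, shuffle=True)
```

With inputs 3 and 1 and a twiddle factor of one, the normal operation stores (4, 2), the shuffle operation stores (2, 4), and the swap operation changes the difference output from 2 to -2, which shows why input order matters for an asymmetric PE.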
The Algorithm
The main idea of the pipeline algorithm is to identify hazard pairs in early stages and perform memory management operations to resolve the hazard. Because data is rearranged in memory, the algorithm has to track where each data element is. One way to track the movement of data is to use a separate memory to store the data indexes (i.e., pointers), as shown in Figure 10. This approach provides great flexibility in moving data in the memory. It also simplifies the reordering logic of the final stage hardware. The downside of this approach is that it increases the memory size. It also adds one cycle to operand loading in the PE, since the pointers must first be retrieved from memory. Another (less flexible) solution is to move data in memory in a fixed way to simplify data tracking in the pipeline. This approach resolves hazards for the next stage only. Because data is reordered in the pipeline, the results from the last stage must be reordered.
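The pointer-based tracking approach can be sketched as follows. This is a hypothetical illustration rather than the chapter's implementation: `data` models a stage memory, `index` models the separate pointer memory that records which logical element sits at each physical address, and all names are assumptions.

```python
def store_results(data, index, addr_i, addr_j, c, d, shuffle=False):
    """Write PE results and keep the pointer memory in step with the data."""
    if shuffle:
        # Results land in exchanged locations, so the pointers are exchanged too.
        data[addr_i], data[addr_j] = d, c
        index[addr_i], index[addr_j] = index[addr_j], index[addr_i]
    else:
        data[addr_i], data[addr_j] = c, d


def reorder(data, index):
    """Final-stage reordering: place each value at its logical position."""
    out = [None] * len(data)
    for addr, logical in enumerate(index):
        out[logical] = data[addr]
    return out


data = [10, 11, 12, 13]
index = [0, 1, 2, 3]                 # identity mapping before any shuffle
store_results(data, index, 1, 3, 100, 200, shuffle=True)
restored = reorder(data, index)
```

After the shuffled write, the pointer memory records that physical addresses 1 and 3 hold logical elements 3 and 1, so the final reordering step returns the values to their logical order.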
Fig. 10. Tracking Shuffled Data [21]
The algorithm utilizes several counters to calculate memory addresses and determine memory management operations. There are three main counters, which are described in the upper three rows of Table 6. Other counters are derived from the main counters and described in the rest of the table. The flow of the algorithm for stage i is shown in Figure 11. The pseudocode of the algorithm is listed at the end of the section. Figure 12 illustrates the shuffle and swap operations performed by the algorithm to resolve the memory contentions in the Figure 8 example. Tables 8-12 give the PE operand pairs for Stages 1-5. Underlined pairs indicate a shuffle operation. Since Stages 0-2 are safe stages, the first shuffle operation starts in Stage 2 to prevent hazards in Stage 3. Table 13 lists the memory contents for the pipeline stages. For example, the output of stage 2 has the following memory contents for Memory 0: 0, 1, 2, 3, 12, 13, 14, and 15.
[Table 13. Memory contents for the pipeline stages.]
VI. COMPARISON WITH OTHER FFT PIPELINES
The hardware complexity of a pipeline FFT is measured by the number of complex adders, complex multipliers and the memory size. A radix-2 butterfly consists of one complex multiplier and two complex adders which can be implemented using four real multipliers and six real adders. A radix-4 butterfly consists of three complex multipliers and eight complex adders and can be implemented using 12 real multipliers and 22 real adders. Less expensive (but slower) butterfly implementations exist especially for slow pipelines, e.g., SDF pipelines. The rest of this section uses counts of complex operations to compare different pipelines.
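The real-operation counts quoted above follow from the usual four-multiply complex product. The sketch below assumes that a complex multiplier costs four real multipliers and two real adders and that a complex adder costs two real adders; these per-unit costs are consistent with the totals in the text, and the names are illustrative.

```python
# Assumed per-unit costs (schoolbook complex arithmetic).
REAL_MULTS_PER_CMUL = 4
REAL_ADDS_PER_CMUL = 2
REAL_ADDS_PER_CADD = 2


def real_ops(complex_mults: int, complex_adds: int):
    """Return (real multipliers, real adders) for a butterfly."""
    mults = complex_mults * REAL_MULTS_PER_CMUL
    adds = complex_mults * REAL_ADDS_PER_CMUL + complex_adds * REAL_ADDS_PER_CADD
    return mults, adds


radix2 = real_ops(complex_mults=1, complex_adds=2)
radix4 = real_ops(complex_mults=3, complex_adds=8)
```

This reproduces four real multipliers and six real adders for the radix-2 butterfly, and 12 real multipliers and 22 real adders for the radix-4 butterfly.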
The SDF pipeline FFT has a total of (log_r N - 1) multipliers and N-1 delay elements. Further, the MDC pipeline FFT utilizes (r+1)N/2 - r delay elements, (r-1)(log_r N - 1) complex multipliers and roughly 2(r-1)(log_r N - 1) adders. Table 14 summarizes the hardware and timing complexities for the FFT pipeline architectures discussed in references [18], [20]. The table also illustrates the complexities for the switch-based architecture (shown in the last row of the table). The other pipeline architectures require delay elements in the pipeline implementation. Delays are implemented by shift registers (which dissipate high dynamic power) or by RAMs with additional address generation hardware (which increases design complexity). The modular pipeline reduces the number of delay elements to 2(√N - r). The switch-based pipeline uses SRAM memory arrays, which consume less power than registers and are easier to implement. Moreover, the throughputs of the other pipelines are limited to one (single-path) or a few (multi-path) data per clock, while the switch-based implementation has a throughput of M. Unfortunately, the switch-based pipeline requires a larger memory and more hardware in the data path.
VII. CONCLUSION AND FUTURE WORK
This chapter extends results from [21]. It presents a switch-based architecture for FFT engine implementation. It also presents an algorithm to predict and resolve memory contentions. As a result, the pipeline speedup is M*log_2 N, where N is the number of points and M is the number of processing elements per stage. An implementation of a 64-point FFT machine using the proposed architecture is presented. The architecture compares favorably to other FFT pipelines. Future research should focus on reducing the power consumption of the FFT pipeline.
