[...] (IFFT) are used in Orthogonal Frequency Division Multiplexing (OFDM) systems for data (de) [...]
I. INTRODUCTION
Orthogonal Frequency Division Multiplexing (OFDM) is a multi-carrier modulation technique that has been adopted in various wireless communication standards such as Wireless Local Area Networks (WLAN), Digital Video Broadcasting (DVB) and OFDM-based Ultra Wideband (UWB). The (de)modulation process uses the Fast Fourier Transformation (FFT) and its inverse (IFFT). These transformations are the most computationally intensive tasks in an OFDM system [1].
Flexibility is a key requirement in wireless systems beyond 3G (B3G) [2], where modems can be reprogrammed or reconfigured to support different radio standards and operating modes. Consequently, Application Specific Instruction Set Processors (ASIPs) and Digital Signal Processors (DSPs) with special support for FFT processing have been developed to meet both flexibility and processing time constraints [3], [4].
In ASIPs, parallelization can be used to meet timing requirements (throughput). However, the type and degree of parallelization influence the efficiency of the implementation, both with respect to energy dissipation and area utilization. Therefore, we analyzed the efficiency of parallelization techniques under high throughput requirements, when applied to FFT-ASIPs. Area and energy consumption were taken as a measure of efficiency. The intention was to single out a parallelization technique with an outstanding efficiency for exploitation in FFT-ASIPs.
There is a substantial number of publications on FFT parallelization; however, the focus has been on general purpose, [...]
The rest of the paper is organized as follows: the implemented FFT algorithm and the FFT-ASIP architecture for analysis are presented in section II. The analysis of instruction and data level parallelization techniques is presented in section III, followed by the proposed interleaved technique in sections IV and V. Concluding remarks are provided in section VI.
II. THE FFT ALGORITHM AND THE FFT-ASIP
The Cooley-Tukey (CT) [6] algorithm has been widely adopted for FFT computation because of its regularity. A cached FFT (CFFT) algorithm, which enables the exploitation of data locality for energy-efficient implementations, was proposed in [7]. Figure 1 shows a comparison of the energy consumption of two ASIPs for the two FFT algorithms. The ASIPs are described in detail in [8] and [9] respectively. Because of its higher energy-efficiency, the CFFT-ASIP was selected for analysis. This ASIP is similar to the CT-ASIP: it has the same pipeline length, the same data-path width, the same memory configuration, and a butterfly instruction. This instruction calculates the addresses of the coefficients and data samples, fetches the operands, and computes the complete butterfly with a latency of 6 clock cycles. The maximum clock frequency of the ASIP is 250MHz. In the ASIP, a 32x32 [...]
In the cached algorithm, the FFT computation is divided into Epochs (E), Groups (G) and Passes (P) [7]. The following pseudo-code shows how the algorithm is implemented.
FOR e = 0 to E-1
  FOR g = 0 to G-1
    Load_Cache(e, g);
    FOR p = 0 to P-1
      FOR b = 0 to NumBFLY-1
        Butterfly(e, g, p, b);
      END
    END
    Dump_Cache(e, g);
  END
END
In this analysis, a modified version of the algorithm with a better cache utilization is used [9]. The better cache utilization is achieved through an uneven distribution of passes between the epochs. Figure 2 shows the flow graph for N=64 (cache size = 16).
III. ANALYSIS OF EXISTING PARALLELIZATION TECHNIQUES
On the architectural level, existing techniques can be grouped into Instruction, Data and Thread-Level Parallelization (ILP, DLP and TLP). The last of these, TLP, is mainly useful for time-multiplexing the processor usage. Consequently, TLP cannot be used to increase the throughput of FFT computation on a dedicated ASIP.
Besides pipelining, ILP can be categorized according to instruction issue size: single and multiple issue. The latter can be further categorized into explicit (e.g. complex or fused instructions), static (e.g. VLIW) and dynamic issue (e.g. superscalar). However, since power consumption is a limiting factor, particularly for battery-powered modems, a dynamic issuing scheme is not considered.
Generally, longer pipelines increase the throughput of a processor. However, the analysis of power/performance metrics with pipeline length shows that as the power consumption becomes more important in the metric, shorter pipelines tend to yield optimum results [10]. Since in this case power consumption is a concern, short pipelines are preferred. Nevertheless, by using more complex instructions (higher explicitness), a high throughput can be obtained with short pipelines.
For a radix-2 FFT implementation, a butterfly instruction as described in section II offers the highest degree of explicit parallelization. However, this is not sufficient for meeting the throughput requirement as shown below.
In the CFFT-ASIP, the 4ns long critical path goes through the butterfly unit. The unit is distributed over three pipeline stages. By adding two additional stages, the critical path can be shortened to 2.5ns. In figure 3, the throughput of the ASIP is depicted for several FFT lengths. Clearly, the timing constraint for UWB cannot be met, even when an unfavorable 1ns clock is assumed to be possible.
For the last ILP alternative, the architecture was extended by three slots, so that four butterflies could be computed in parallel. Because of data dependencies between the passes [9], only a speed-up factor of 1.73 was achieved for N=128, which was far less than the required 11.41. The required speed-up cannot be reached by increasing the number of slots further because of the limited memory bandwidth, even with a 400MHz clock: only 312.5ns/2.5ns = 125 cycles are available for FFT processing (312.5ns is the OFDM symbol length in UWB), whereas cache loading and dumping alone already consume 256 cycles. The reason for this is the order of computation.
The FFT computation proceeds as shown in figure 4. In this case, there are 2 epochs, 4 groups and 7 passes. The groups are computed consecutively (0-3). In each group, the passes are computed in the order 0-4 for epoch 0, and 5-6 for epoch 1. Epoch 0 is computed first. Cache loading is done in passes 0 and 5, and the dumping in passes 4 [...]
[Figure: FFT Computation Time for the CFFT-ASIP]
[...] NBF units would be required to achieve the throughput at 400MHz, with a correspondingly very high area and energy consumption. This is the minimum number of NBF units. A mixed-radix approach such as [12] would lead to a lower area cost because of the higher radix. However, a similar approach is not applicable in this case for flexibility reasons.
For DLP, the analysis and the results are similar to the preceding ones. The difference is that in DLP techniques such as vector instructions, only one instruction would be decoded, rather than 7 as in the above 7-slot VLIW.
A careful analysis of the computation order reveals that it is possible to interleave the execution, so that the throughput is achieved with only 4 NBF units, and with no memory partitioning. This technique is described in the following section.
IV. INTERLEAVED EXECUTION
The idea behind interleaved execution is to hide some of the execution cycles through instruction scheduling. This means that, as opposed to ILP techniques, temporal parallelism between the passes is exploited, rather than spatial parallelism in the groups. In the subsequent discussion, a latency of 3 cycles for the actual execution of a butterfly is assumed.
With the preceding ILP technique in section III, the order of computation (figure 4) is G0:P0-P4, G1:P0-..., G3:P0-P4, G0:P5-..., G3:P5-P6 [7], [9]. It follows that the first 2 operands for the computation of G0:P1 are available 3 cycles after G0:P0 has started, the next 2 after 4 cycles, etc. Similarly, for G0:P2, the operands are available 6, 7, 8, ... cycles after G0:P1 has started, etc. Therefore, the execution can be interleaved as shown in figure 5. The next group (G1:P0) cannot start until the first 2 cache elements have been dumped in G0:P4. However, because of a latency of 3 cycles, the next loading can start 2 cycles prior to the dumping as shown in the figure. In this way, epoch 0 can be processed in 105 cycles. Similarly, epoch 1 can be finished in 48 cycles, making a total of 153 cycles for N=128, so that the cycle constraint from section II is violated by 153-125 = 28 cycles. This is largely attributable to the 13 cycles between successive cache loads of consecutive groups (greyed area in figure 5). If these 13 cycles are removed, then the total number of cycles can be reduced by 13x3 = 39.
Since in each cycle Bload loads 2 cache registers, 26 additional cache registers are required. This would only add approximately 30KGates to the CFFT-ASIP, which has, in the current version, 430KGates (of which 63% are occupied by memories). To simplify cache addressing in the presence of additional buffering registers, a modulo addressing scheme can be used.
The execution of control and initialization instructions can also be interleaved, so that the associated cycles are completely hidden. This technique requires a different instructions execu-
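
The computation order and the cycle counts quoted in sections III and IV for N=128 can be checked with a short script. The following is a minimal Python sketch, not the authors' tooling: the function names (`cycle_budget`, `computation_order`, `schedule_summary`) are hypothetical, and all numeric inputs (312.5ns symbol length, 2.5ns cycle, 105 and 48 cycles per epoch, 13 cycles per gap, 3 gaps) are taken directly from the text. The resulting cycle count after removing the gaps is plain arithmetic on those figures and is not a number reported in this section.

```python
# Sketch only: reproduces the arithmetic quoted in the text for N=128.
# All function names are illustrative assumptions, not part of the paper.

def cycle_budget(symbol_ns=312.5, cycle_ns=2.5):
    """Cycles available per OFDM symbol (312.5ns / 2.5ns in the text)."""
    return int(symbol_ns // cycle_ns)

def computation_order(passes_per_epoch=(5, 2), num_groups=4):
    """Yield (epoch, group, pass) in the non-interleaved order of figure 4:
    G0:P0-P4, G1:P0-P4, ..., G3:P0-P4, G0:P5-P6, ..., G3:P5-P6."""
    first_pass = 0
    for epoch, num_passes in enumerate(passes_per_epoch):
        for group in range(num_groups):
            for p in range(first_pass, first_pass + num_passes):
                yield epoch, group, p
        first_pass += num_passes  # pass numbering continues across epochs

def schedule_summary(epoch_cycles=(105, 48), gap_cycles=13, num_gaps=3):
    """Compare the interleaved schedule against the cycle budget."""
    budget = cycle_budget()
    total = sum(epoch_cycles)        # 105 + 48 cycles
    saved = gap_cycles * num_gaps    # 13 x 3 cycles
    return {
        "budget": budget,
        "total": total,
        "overrun": total - budget,
        "saved": saved,
        "after_removal": total - saved,
    }

if __name__ == "__main__":
    for epoch, group, p in computation_order():
        print(f"epoch {epoch}: G{group}:P{p}")
    print(schedule_summary())
```

Under these inputs the budget is 125 cycles and the interleaved schedule needs 153, an overrun of 28; removing the three 13-cycle gaps leaves 114 cycles, which is within the budget.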
