Abstract -Fast Fourier transform algorithms are vital in many digital signalprocessing (DSP) applications. In here, both, radix-2 and radix-4 complex fast Fourier transform (FFT) implementations for fixed-point applications, using single instruction multiple data (SIMD) instructions and sub-word parallelism (SWP) is presented. It is shown that data management, and memory access are key to unleashing the arithmetic power of highly parallel digital signal processing (DSP) cores. The presented radix-2 implementation works for unconditioned data with length, N, that are a power of 2, but cannot fully utilize multiply-accumulate (MAC) units. In contrast, the discussed mixed-radix-4 implementation works for pre-conditioned data as found in orthogonal frequency division multiplexing (OFDM) and is customized to length N=256. This leads to near optimal MAC utilization on the TigerSHARC™.
I Introduction
The FFT is one of the most important and widely used DSP algorithms. In fact, there is a multitude of FFT algorithms [4] . FFT algorithms are mathematically efficient methods for calculating the discrete Fourier transform (DFT). The DFT is used to convert an input signal from the discrete time domain, x(n), to the discrete frequency domain, X(r), and vice versa. For most DSP applications the computation time of an FFT plays a significant role. In this study, the performance of both radix-2 and radix-4 FFT is considered and in particular focuses on industrially adapted benchmarks for 256-point and 1024-point complex DFT's. Our radix-2 implementation operates on data-elements within the fixed-point range [-1, 1) , since each stage precedes scaling. In contrast, our radix-4 implementation does not include scaling. Therefore, data have to be pre-conditioned to prevent overflow and saturation. The radix-2 DIF is defined as [1] , Sub-algorithmic characteristic The characteristics of the target architecture The programming language used Sub-algorithmic characteristics such as the ways, in which data can be indexed, are closely related to the characteristics of the target architecture, since they have to be chosen to match the target capabilities. High-end DSP architectures exploit on-chip parallelism [3] on the instruction-level and data-level. Both have to be reflected in the subalgorithmic characteristics to exploit the targets computation units in full. The programming language for DSP is C and Assembler. Unfortunately, compiler technology has not been able to keep pace with the increasing hardware complexity [2] . Therefore, DSP kernel development in Assembly is not an option, but a must.
In this article, firstly, the TigerSHARC architecture is going to be reviewed in regards to major fixedpoint, memory, and scheduler features -many of which are shared with other static superscalar architectures found on state of the art architectures. Secondly, a fully optimised radix-2 DIF implementation for the TigerSHARC is devised for unconditioned complex data. Furthermore, an optimised radix-4 decimation in time (DIT) implementation is presented that operates on preconditioned complex data. In fact, our approach is more accurately described as mixed-radix-4 decomposition [10] . It will be shown that operating on short-vectors [9] leads to some interesting design challenges. The paper concludes with the implementation results and a brief discussion of the conclusions.
II Processor Architecture
This section summarizes the main characteristics of the TigerSHARC architecture that are required to describe the design challenge in section III. The focus is on the processor's 16-bit fixed-point number crunching capabilities. Floating-point processing, integer arithmetic, and the direct memory access (DMA) engine are not described. For a more comprehensive description refer to the manuals [5] [6] [7] . Figure 1 illustrates the functional on-chip units that are controllable in Assembly.
Each compute-block (CB) consists of one 32-entry general-purpose register file, arithmetic logic unit (ALU), multiplier (MAC), and shifter. The two CB's constitute the primary data path of the processor and each is connected to the three The maximum instruction latency is 2 cycles, but some instructions have a latency of 1 cycle. In Table 1 
III Implementations
This section develops the implementations by combining two common techniques: loop unrolling and software pipelining. These are used to prevent CB-unit blockage such as data-starvation, dataobesity, or pipeline stalls.
The pipeline is kept full by finding parallelism among instructions that can be overlapped in the pipeline. To avoid pipeline stalls, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction. The goal is that the MAC units in both compute blocks can be issued on every clock cycle. Further, data-load and data-store complexities should be ideally balanced with CB-unit complexities. Any shift in favour of onecomplexity results in a loss of performance. This has two important implications: Firstly, data accesses have to be vectorized to the maximum length. Secondly, both complex MAC operations must execute in parallel on multiple data.
a) Radix-2 DIF
Iterative FFT implementations are based on triplenested loops. Usually, the butterfly-loop is nested in the group-loop (GB nesting), see Table 2 , which mr1:0+=r24**r16 (c) q-store r15:14 IL4 sr27:26=(r3:2-r11:10)/2 sr8=mr1:0, mr1:0+=r25**r17 (c) q-store r1:0 IL5 sr3:2 =(r3:2+r11:10)/2 sr9 IL6 mr1:0+=r26**r18 (c) q-store r9:8 IL7 sr10=mr1:0, mr1:0+=r27**r19 (c) q-store r3:2 IL8 sr11=mr1:0 IL9 q-store r11:10 In stages m/2 to m, the group-loop execution dominates, and butterfly-loop overhead increases. Hence, after the first (m/2)-1 stages this loop structure is inversed. The group-loop is nested in the butterfly-loop (BG-nesting), see Table 3 , tominimise the loop-overhead that is induced by filling (inner-loop-prologue) and emptying (innerloop-epilogue) the pipeline.
It is obvious that a naive multiplication look-up mr1:0+=r24**r16 (c) q-store r15:14 IL4 sr8=mr1:0, mr1:0+=r25**r17 (c) q-store r1:0 IL5 sr27:26=(r3:2-r11:10)/2 sr9 IL6 sr3:2 =(r3:2+r11:10)/2 mr1:0+=r26**r16 (c) q-store r9:8 IL7 sr10=mr1:0, mr1:0+=r27**r17 (c) q-store r3:2 IL8 sr11=mr1:0 IL9 q-store r11:10 
Stages (m/2)-1 to m-3
The loop inversion to BG-nesting changes the indexing scheme that was established for GBnesting, although the established indexing rule can be applied again. The difference is that i is now the group-loop index, i=0, Gj, 2Gj,…,j/4Gj, and j is now the butterfly-loop index, j=0,1,…,Gj/4. In Table 3 the fully optimised inner-loop could be theoretically compressed to 8 IL's. This, however, is practically not possible, since an instruction slot is needed to schedule the loop-branch (not shown in the illustrations). Increasing the unroll-factor may create further improvement as this compensates for the added IL needed for the branch slot. The minor performance enhancement, however, does outdo the increasing program size.
Last 3 stages
The last 3 stages are performed separately in a single loop. Illustrations are not provided because of limitations in space. The interested reader may contact the authors for clarification. Iterations operate on 16 elements over 3 stages. Only 4 multiplication factors are needed. These are loaded prior to loop-entry and are stored in static registers. In contrary to a) and b) this loop operates on 2 independent pipeline-threads. The IL-count is 13 and all dependencies are resolved.
b) Mixed-Radix-4
Given a 1-dim N point DFT, it can be viewed in two dimensions as a N1/2 x N1/2 matrix of input data [10] . The algorithm processes the input-matrix in parallel over the columns. Then the matrix is transposed and again parallel processed over the columns. This can be expressed as follows. Given the signal x, and ) and is this function's M-point DFT.
It is worthwhile noting that the algorithm described here is perfectly applicable to FFTs with sizes not being a perfect square, e.g. N=512. The actual implementation is now shown for N=256 and uses equations (5), (6) and (7).
1. The input data x(n) is arranged linearly, but is regarded as a 16x16 matrix: 
2. Using equation (5), this can be re-written as: 
3. Parallel FFTs are computed on columns to obtaining:
(15) (15) (15) (15)
4. Following, a point-wise multiplication by rotation-matrix: 
Which, according to equation (10) 
(1)
6. Subsequently, parallel FFTs are computed on columns using equation (6) 
This is the desired FFT result. The implementation follows these 6 steps. Steps 1 and 2 can be discarded, since the input data is naturally arranged in the correct order. In step 3 it is required to compute 16 parallel 16-point FFTs on the columns of the input matrix, whereby four FFTs are done in parallel and repeated four times to compute all 16 FFTs. The 16-point FFTs are done according to equations (3) to (7) in radix-4 fashion. This results in two stages for each 16-point FFT. To minimize overhead, the first stage of each of the four FFTs is calculated first. Subsequently, the second stage of the four FFTs is calculated.
Step 4 is a point-wise complex multiply of each matrix element by the appropriate rotation factor of the multiplication Step 6 is identical to
Step 3 and operates on the new input matrix. Therefore, the code of Step 3 can be re-used. The block diagram of the code is shown in Figure 2 . 
IV Results
The radix-2 and mixed-radix-4 implementations are analysed in terms of performance. A direct comparison is not presented, since the implementations target different applications and work for different ranges of input data as mentioned above.
a) Radix-2
The developed implementation can be applied to any possible memory configuration. Additionally, it can operate in-place and out-of-place. The former overwrites the input-vector, whilst the latter makes use of a buffer of size N. Both approaches lead to identical cycle-counts. The memory configuration has a significant impact on performance.
In Table 2 and Table 3 , in each instruction cycle of the loop, a quad memory load and store is scheduled. This can be executed without any penalty in performance provided the load and store are to different memory blocks. Table 4 provides several memory configurations that illustrate this approach, leading to lowest cycle-complexity. The following cycle counts are collated using the optimal memory configuration.
Input Vector Temporary Buffer
Multiplication Table   Output Vector At the beginning of Section III we stated that peak performance is achieved if, and only if, in every IL of the innermost loop a complex MAC op ration is issued. This is possible given that all data dependencies are resolved. In theory, Equations (1) and (2) of resolution N=2m, are of cyclecomplexity 0.5*m*(N/2). As a result, for N=1024, the greatest lower bound is 2560 and for N=256 the infimum is 512 cycles. In Table 5 , the measured cycle-counts for the final implementation, composed of inner loops as described in a) are given for several lengths, N. Table 5 , several observations can be made. The greatest lower bounds have not been achieved. This is because GB nesting and BG nesting do not lead to optimal MAC utilization. GB nesting utilizes the MAC to 67%. Any further optimisation is not possible for radix-2. BG nesting achieves 89% MAC utilization. This figure increases by 5%, if the unroll-factor is doubled, though it influences the overall cycle-complexity of the FFT implementation by only 1.8%. Given these figures, the trade-off for greater code-size cannot be justified. In fact, BG nesting causes sub-optimal utilization on the TigerSHARC. Optimal utilization can only be achieved by fully unrolling the BG nests. The third column in Table 5 is confusing. Doubling of the length, N, increases the measured cycle-count by a factor of 2 and in addition by an additive figure. The latter is loop overhead induced by loop prologue and epilogue and is unavoidable. This overhead increases with length N. It is the middle-loop counter that increases with length N and induces, for each iteration, loop prologues and epilogues in the inner-loop. Further, doubling the length, N, increases the stage-count by 1. Both are reflected in the additive figure. b) Mixed-Radix-4
Length N Cycles Increase in Cycles
The cycle-complexity of the implementation depends on the data management, though it may be applied to any possible memory configuration. Additionally, it can operate in-place and out-ofplace. The former overwrites the input-vector, whilst the latter makes use of a buffer of size N. Both approaches lead to identical cycle-counts. This can be executed without any penalty in performance provided that load and store operations access differential memory blocks and the multiplication table resides in an independent memory block. Table 6 provides several memory configurations that illustrate this approach, leading to lowest cycle-complexity. The achieved cyclecomplexity is 593 cycles on the TS201.
V Conclusion
Optimised implementations to calculate the onedimensional DFT by means of radix-2 DIF and radix-4 DIT FFT's are discussed for unconditioned and pre-conditioned complex data, respectively, represented in 16-bit fixed-point format.
Both implementations are compared regarding cycle-complexity. The given designs are customised for the TigerSHARC architecture from Analog Devices, though it may be used for other processor families such as the Blackfin (Analog Devices), TMS320C6x (Texas Instruments) or TriMedia (Philips). Furthermore, it is shown that BG-nesting is superior to GB nesting, and achieves practical peak performance.
Input Vector
The radix-4 implementations is customised for N=256 and incorporated into the mixed-radix approach. We referred to it as mixed-radix-4 to indicate the underlying building blocks. The mixed-radix approach can be combined with other radices and is likely to lead to similar good results. The presented mixed-radix-4 implementation is found to lead to near optimal MAC utilization.
