ABSTRACT SIMD (single instruction multiple data) vector instructions, such as Intel's SSE family, are available on most architectures, but are difficult to exploit for speed-up. In many cases, such as the fast Fourier transform (FFT), signal processing algorithms have to undergo major transformations to map efficiently. Using the Kronecker product formalism, we rigorously derive a novel variant of the general-radix Cooley-Tukey FFT that is structured to map efficiently for any vector length ν and radix. Then, we include the new FFT into the program generator Spiral to generate actual C implementations. Benchmarks on Intel's SSE show that the new algorithms perform better on practically all sizes than the best available libraries Intel's MKL and FFTW.
INTRODUCTION
Most current off-the-shelf and many embedded computing platforms offer SIMD (single instruction multiple data) vector instructions. A prominent example is Intel's SSE family, which enables parallel operation on 2-way double precision, 4-way single-precision, or 8-way 16-bit and 16-way 8-bit fixed-point data types. The potential speedup is considerable, so these instructions should be used if highest performance is desired.
Unfortunately, achieving the possible speed-up with these instructions is difficult for all but the simplest computational problems since many constraints have to be obeyed to get performance. This includes proper data alignment during computation and the minimization of vector shuffle operations. As a consequence, existing fast algorithms often cannot be used directly but need to be restructured to map efficiently to vector architectures. This is certainly true for fast Fourier transform algorithms (FFTs) that possess a complex structure and pose a challenge even on platforms without special instructions.
In this paper we derive a novel variant of the general-radix Cooley-Tukey FFT, called general short-vector Cooley-Tukey FFT that has a structure that maps efficiently onto vector architectures. The new FFT applies to all radices, i.e., factorizations N = mn of the input size and for all vector lengths ν. The derivation is done rigorously using the Kronecker product formalism [1] . In previous work [2] we had derived a vectorized FFT under the condition ν 2 | N . The extension to arbitrary N is non-trivial. Further, we included the new FFT into the program generation and optimization system Spiral [3, 4] to produce actual code for benchmarking. The benchmarks are against the best available libraries FFTW [5, 6] (fully) reported in the literature. The results show that our generated code, based on the new vectorized FFT, outperforms these libraries for most sizes. Further, we show that we can successfully (i.e, with obtaining speed-up) vectorize a larger class of sizes than FFTW.
Organization. Section 2 provides the necessary background on the FFT, the Kronecker product formalism, and Spiral. The main contribution of this paper, the new vectorized FFT, is derived in Section 3. Benchmarks with generated code conclude the paper in Section 4.
BACKGROUND
SIMD vector instructions. SIMD extensions like Intel's MMX and SSE, Motorola's AltiVec, and IBM's VMX (available on the Cell BE processor) operate on vector registers (64-bit or 128-bit wide) that are broken into ν-way vectors of smaller data types. For instance, SSE supports (among other data types) 4-way singleprecision (ν = 4) and 8-way 16-bit integer arithmetic (ν = 8). SIMD extensions operate most efficiently if data is loaded and stored in full vectors that are naturally aligned (i.e., loading 16-byte aligned 128-bit words for SSE). Loading and storing of unaligned data and partial vectors incurs considerable overhead. To make things worse, loading or storing k out of the ν elements inside a vector, as required for handling arbitrary FFTs, requires the optimization of each combination of k and ν separately for each architecture.
Discrete Fourier transform. Computing the discrete Fourier transform (DFT) of an input signal x of length N is equivalent to the matrix-vector multiplication y = DFTN x, where
Fast Fourier transforms. Various fast Fourier transforms (FFTs) are available that enable the computation of the DFT with O(N log(N )) operations for all sizes N . The most important one is the Cooley-Tukey FFT. It can be expressed as a factorization of the DFTN into a product of structured sparse matrices provided that N = mn factorizes. Using the Kronecker, or tensor product notation [1] this FFT is given by
If x is viewed as an n × m matrix, stored in row-major order, then L mn m performs a transposition of this matrix. Further, In is the n × n identity matrix, and the tensor product is defined as
Dm,n is a diagonal matrix containing the so-called twiddle factors.
Further matrix notation. For later derivations, we also define the direct sum of matrices as
and introduce the notation A M = M T AM as well as a generalization of In, namely
Here, Om×n is the m × n all-zero matrix. Clearly, In = In×n. We call any expression that is constructed from the above matrices and operators ⊗, ⊕, · (product) a formula. For example, the right hand side of (1) is a formula.
The DFT is a complex transform, but implemented in real arithmetic. This can be expressed formally using an operator (·). Namely, if A is a complex m × n matrix, then A is the real 2m × 2n matrix obtained by replacing every entry a = a1 + ja2 with the matrix [
]. This way y = Ax becomes equivalent to y = Ax , where x , y are in the interleaved complex format: x contains alternating the real and imaginary parts of the elements in x. Implementing the DFT really means implementing y = DFTN x .
Efficient software implementation. When implemented in software, then, for efficiency, (1) is not performed in the four stages suggested by the four factors (from right to left: permutation, loop over smaller DFTs, scaling, loop over smaller DFTs), but instead the permutation L mn m is fused with the subsequent loop, and the scaling by Dm,n as well (as implemented, for example, in the library FFTW [5] ). This increases locality and reduces cache misses.
On parallel or vector architectures, more involved manipulations of the FFT (1) are necessary to achieve high performance. These manipulations, and the resulting variants of (1), can also be expressed using the tensor product notation. Early examples can be found in [1] . For the latest generation of platforms, [7] optimizes for shared memory parallelism, and [2] for ν-way vector instructions. However, [2] requires ν 2 | N . The generalization to arbitrary (composite) N is the subject of this paper.
Spiral. Spiral is a program generation and optimization system for transforms including the DFT [3] . Given a DFT and its size N , Spiral expands DFTN recursively using (1) or other FFTs until base cases (N = 2) are reached. The resulting formula (tensor product expression) is manipulated as explained in the previous paragraph and then translated into C code. Based on the runtime of this code, Spiral changes the initial recursive expansion (e.g., by choosing different factorizations N = mn) and repeats the process. This search eventually produces an implementation tuned to the given computer.
The derivation of a vectorized FFT in this paper is formalized to enable inclusion into Spiral. This way, we can generate code for this FFT automatically and search over alternatives for the fastest. More details are in the benchmark Section 4.
VECTORIZATION OF NON-TWOPOWER SIZES
Our goal is to derive a vectorized version of the Cooley-Tukey FFT (1) for any factorization N = mn and vector length ν. Our approach is as follows:
• We identify basic building blocks, which are formulas that can be efficiently mapped to vector code. The idea is that then any formula that consists exclusively of these constructs can be mapped as well.
• We identify a set of formula identities such that the application to (1) transforms the FFT into a vectorized FFT.
• Finally, we show the obtained vectorized FFT and discuss its structure.
The goals for the final FFT are 1) to make sure all arithmetic is performed using vector instructions on properly aligned data; and 2) to minimize the required load/store and shuffle instructions. We remind the reader that implementing or mapping to code of a matrix A means for the function y = Ax.
Basic building blocks. We list basic formulas that can be easily mapped to vector code and that are necessary to vectorize (1) .
For any matrix A, the formula
can be translated to vector code by generating scalar code for A and then replacing all scalar operations in that code by the corresponding vector operations and scalar variables by ν-way vector variables [2] . The permutations
are the only in-register permutations required to vectorize (1). For example, on Intel's SSE and ν = 4, the first two require only two assembly instructions. The construct
describes the loading of n consecutive values into m vector registers; the last mν − n entries of the last vector register are set to zero. In words, this matrix extends the input vector by zeros to make the length divisible by ν, and, when implemented, performs the actual load operation. The corresponding store operation is the construct
which writes m consecutive memory locations from n vector registers. The last nν − m entries of the last vector register are not written. We introduce the operator (.) n,ν to transform complex diagonal matrices into vectorizable formulas:
This somewhat complicated looking transformation first extends D with zeros to have a size divisible by ν. Then it permutes rows and columns simultaneously so the non-zero pattern is the same as that of I m⊗ n/ν ⊗A2⊗Iν , which matches (2) and can thus be implemented using vector operations only.
From the above formulas, we can build more complex ones that can still be mapped to ν-way vector code. This is captured in the following definition. Formula identities. We list the necessary formula identities to vectorize (1). They are divided into 3 classes:
Definition 1 We call a formula vectorized if it is either of the form
• The vectorization rules (7)-(9) ( • The simplification rules (17)-(19) ( Table 3 ) minimize the number of shuffle operations and partial vector loads/stores.
II 18
As a simple example of formula vectorization, consider again ν = 4 and DFT2 ⊗I3, which expresses a loop with 3 iterations (here, DFT2 is considered as operating on a real input). We use (7) to vectorize:
(10) (I2 ⊗ I4×3) loads two times 3 elements into a 4-way vector, zeroing the last element. (DFT2 ⊗I4) adds and subtracts both 4-way vectors. (I2 ⊗ I3×4) stores the first 3 elements of each 4-way vector consecutively to memory. Vectorized FFT. We now derive a vectorized Cooley-Tukey FFT for any ν and factorization N = mn. Due to space limitations, we only sketch the derivation and spend more time in interpreting the result.
The starting point is DFTN .
First, we expand (20) using (1) to expand the larger DFT into smaller DFTs. In the next step the vectorization rules (7)- (9) vectorize all factors in (1). Then, rules (11)-(16) translate the complex formula into its corresponding real version. Finally, rules (17)-(19) simplify the formula and drop base cases involving loads/stores or shuffles wherever possible. The result is the FFT in Table 4 , which we call the general short vector Cooley-Tukey FFT algorithm. It is parameterized by the vector length ν. Inspection shows that each of the factors (21)- (28) is indeed vectorized in the sense of Definition 1. Note, that (21)-(28) is vectorized independent of how the smaller DFTm and DFTn are further expanded using, for example, again (1) or a prime-factor or Rader FFT. In other words, (21)-(28), as (1), spans an entire set of algorithms that can be searched for the fastest, which is exactly what Spiral does (see also the discussion below). Also note that, as in (1), the occurring shuffle operations are never
(27)
(28) Table 4 . General short-vector Cooley-Tukey FFT for any ν and N = mn. The permutation Pm,n is shown in Table 5 .
explicitly performed but converted into readdressing or fused with vector load and store operations. We discuss the factors in Table 4 in the order they are performed from bottom to top. The input vector (of length 2N = 2mn) is initially in the interleaved format. (28) expands m, and thus the length of the input vector to be divisible by ν. (27) converts into vector interleaved format (ν real parts followed by ν imaginary parts and so on). (26) performs a perfect vectorized computation on aligned vectors and in its last step expands n to be divisible by ν. (25) is a permutation (shown in Table 5 ) that operates on entire vectors except for L Spiral. For implementation purposes, the main task was to identify the transformations in Tables 1-3 . We included them into Spiral's formula rewriting system, so Spiral derives autonomously the vectorized FFT in Table 4 , further expands the smaller DFTs in different ways, generates the actual C code including vector instructions, and searches for the fastest implementation. As m and n may be prime numbers, the general short vector Cooley-Tukey FFT (Table 4) does not exactly specialize to the short vector Cooley-Tukey FFT [2] which requires ν|m, n. Benchmarks with the generated code using our new vectorized FFT are shown next.
EXPERIMENTAL RESULTS
Benchmark setup. We evaluated our vectorization method on a 3.6 GHz Intel Pentium 4 running Windows XP, and using the Intel C++ compiler 8.1 with options "-O3 -QxKWP". We evaluated both the 4-way single-precision float and the 8-way 16-bit integer mode of the SSE vector instruction set.
All Spiral DFT programs are automatically generated and adapted through Spiral's search mechanism. Specifically for SSE, this means that Spiral first chooses a split (1), then rewrites into the form in Table 4 , and then has further freedom in expanding the smaller DFTs in (23) and (26) using also prime-factor and Rader FFTs. We compare our generated programs to the SSE version of FFTW 3.1.0 (pre-built Windows library available at fftw.org), and to the vendor library Intel Math Kernel Library (MKL) 8.1. Both libraries only provide vectorized floating-point code (except twopower sizes in MKL). We report performance in "pseudo Mflop/s" (float) and "pseudo Mfpop/s" (integer). Both are computed for DFTN by (5N log 2 N )/t, where t is the runtime in microseconds. Speed-up is computed by tscalar/tvector. Thus, in all performance metrics, higher is better. Fig. 1 summarizes the results. We evaluated all composite problem sizes N ≤ M , where M = 64 for float and M = 100 for 16-bit integer, with all prime factors p < 16.
Comparison to FFTW and MKL. Spiral-generated 4-way vector code is faster than FFTW and MKL for almost all considered sizes ( Fig. 1(a) ) and achieves a speed-up over the scalar (x87) code for all expect the smallest sizes. The same holds for the integer code ( Fig. 1(c) ). Thus, our vectorization methods works in general.
The actual speed-up obtained through vectorization over the scalar code is shown in Figs. 1(b) and (d) . Again, they show a speedup for Spiral for all except the smallest sizes. FFTW, for 4-way float, achieves a vectorization speed-up only for even sizes and is roughly comparable to Spiral only for sizes divisible by 4.
