This paper targets automatic performance tuning of numerical kernels in the presence of multilayered memory hierarchies and single-instruction, multiple-data (SIMD) extensions.
I. INTRODUCTION
In order to turn Moore's law (exponential growth of the number of transistors per unit area) into actual performance gains, modern microprocessor architectures include performance-boosting features like multilevel caches (resulting in a deep memory hierarchy), data prefetching, multiple execution units, and superscalar processor cores, as well as special instructions for compute-intensive applications. One important goal in compiler research is to map high-level language programs (written, for instance, in C or Fortran) to machine code that efficiently utilizes all performance-boosting features of current processors. However, some of these features are difficult to exploit without detailed knowledge of the characteristics of the underlying algorithm. This is particularly the case when conflicting optimization goals aim at different hardware features and only the algorithm's structure provides hints on how to resolve the conflict.
A. Short Vector Single-Instruction, Multiple-Data (SIMD) Extensions
Short vector instructions have been introduced to support compute-intensive multimedia applications on general-purpose processors. Originally, these instructions targeted integer computation but later were also expanded to include single and double precision floating-point computation, which makes them useful in scientific computing.
The main idea of short vector SIMD instructions is to have multiple floating-point units operating in parallel, while restricting them to operate only on data held in newly introduced vector registers. All floating-point SIMD instruction set extensions feature constrained vector memory access, in-register data shuffling, and parallel computation. They are all based on packing two or four floating-point numbers into their vector registers. Examples include Intel's SSE family, AMD's 3DNow!, Motorola's AltiVec, and the vector extensions implemented in IBM's BlueGene/L processors.
To utilize short vector SIMD extensions transparently within C and Fortran programs, automatic vectorization is a must. Vectorization is a well-studied topic: loop vectorization originates from vector computers, and the vectorization of basic blocks is related to instruction-level parallelism.
However, when these well-established techniques are targeted at short vector extensions, they introduce considerable overhead in the resulting codes. Only algorithms with very regular structure (e.g., algorithms in three-dimensional (3-D) graphics) can be mapped to short vector instructions in this way without significant overhead; numerical kernels, in particular, often give rise to a nonnegligible amount of overhead. It is, therefore, not surprising that only modest speedups are gained by vectorizing compilers [44].
To bypass this performance bottleneck, most compilers targeting short vector extensions provide proprietary programming interfaces allowing the programmer to insert SIMD instructions manually. This approach exposes architectural features and requires programmers to deal with low-level issues like data alignment and to select appropriate vector instructions. The use of such language extensions reduces portability of source code (by the introduction of vector extension specific optimization) and requires considerably higher programming expertise. Straightforward application of the new instructions usually leads to disappointingly low speedups and may even result in actual slowdowns.
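To make the flavor of such intrinsic interfaces concrete, the following minimal sketch uses Intel's SSE intrinsics; the function and its name are illustrative and not taken from any library discussed in this paper. Note how data alignment becomes part of the programmer's contract.

#include <xmmintrin.h> /* Intel SSE intrinsics: four-way single precision */

/* Four-way SIMD addition of two float arrays. Illustrative sketch only:
 * x, y, and z are assumed to be 16-byte aligned and n a multiple of 4,
 * exactly the kind of low-level obligation intrinsic programming imposes. */
void vadd4(const float *x, const float *y, float *z, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 a = _mm_load_ps(x + i);          /* aligned vector load  */
        __m128 b = _mm_load_ps(y + i);          /* aligned vector load  */
        _mm_store_ps(z + i, _mm_add_ps(a, b));  /* vector add and store */
    }
}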
In performance optimization, it is often necessary to trade operation counts for structure, i.e., to perform suboptimal computation in order to enable better vectorization by globally changing the algorithm's structure. A second tradeoff is between satisfactory utilization of SIMD instructions and vectorization overhead. Therefore, speedups close to the theoretical upper bound are usually achieved only by expert programmers dealing with very simple algorithms.
B. Automatic Performance Tuning
An important optimization target is the memory hierarchy of the computer used. Optimization for data locality is a must to achieve satisfactory performance on today's computer systems. Most production quality C or Fortran compilers include memory hierarchy optimization, typically by applying aggressive loop reorganization. These compilers also use profiling data gathered by running the actual program for further optimization. A vast amount of literature exists on the topic of loop optimization and feedback directed compilation (see, for instance, [47] ).
However, due to the complexity of modern architectures, even the most sophisticated compiler techniques aiming at the memory hierarchy are incapable of achieving performance on a level close to that of automatic performance tuning systems like FFTW [22]-[24], SPIRAL [39], [43], [53], PHIPAC [9], and ATLAS [13], [51]. These systems use an empirical approach to automatically adapt numerical kernels to a given computer system. They search in the space of possible implementations and optimizations of their algorithm base and use actual runtimes of the generated programs as the cost function in the optimization process.
In terms of the performance of numerical kernels, an important question is how automatic performance tuning systems can be adapted to support short vector SIMD extensions. For portability reasons, these systems automatically generate and optimize C or Fortran code. The obvious solution is to let a vectorizing compiler take care of the SIMD extensions while the automatic performance tuning system is responsible for the optimization with respect to the memory hierarchy. However, this approach turns out to be too simple-minded. There are two reasons for this.
1) Optimization dealing with loop vectorization conflicts with optimizations related to the memory hierarchy. Efficient vectorization of loops requires long vector lengths while data locality calls for a minimization of vector lengths.
2) Automatic performance tuning systems carry out loop unrolling and algebraic simplification, which results in large basic blocks with complicated structure. Such basic blocks are the worst case for extracting instruction-level parallelism.
As a result, vectorizing compilers in tandem with automatic performance tuning systems often produce poorly performing code.
C. Contributions of This Paper
This paper introduces methods that enable state-of-the-art automatic performance tuning systems to utilize floating-point short vector SIMD extensions efficiently. Support is provided at three different levels.
1) Code Generator Level. This paper introduces and describes SIMD-aware vectorization algorithms for digital signal processing (DSP) transforms designed to fit the needs of automatic performance tuning systems.
2) Vectorization of Automatically Generated Code. The Vienna MAP vectorizer utilizes the known structure of computational kernels generated by self-tuning numerical software to translate these kernels into efficient vector code.
3) Replacement of the C Compiler. The Vienna MAP back end translates vectorized code into high-performance vector assembly code.
The results from all three levels are connected to, and partly included in, the state-of-the-art automatic performance tuning systems SPIRAL, FFTW, and ATLAS. Moreover, FFTW-GEL [31], a specialized FFTW version including the Vienna MAP vectorizer and back end, is provided.
The main results of this paper can be summarized as follows. Our methods for utilizing SIMD extensions achieve the same level of performance as hand-tuned vendor libraries while providing performance portability. Combined with leading-edge self-tuning numerical software (FFTW, SPIRAL, and ATLAS) we have, therefore, produced:
1) the fastest real and complex FFTs running on x86 machines (Pentium III, Pentium 4, and AMD processors) for certain vector lengths;
2) the only FFT routines supporting IBM's PowerPC 440 FP2 double FPU used in BlueGene/L machines;
3) the only automatically tuned vectorized implementations of important DSP algorithms (including discrete sine and cosine transforms (DSTs and DCTs), Walsh-Hadamard transforms (WHTs), and multidimensional transforms); and
4) the only fully automatically vectorized ATLAS kernels.
Compared with industry-standard nonadaptive implementations, the SIMD-aware automatic performance tuning that we provide in tandem with our partners' software achieves an overall speedup of an order of magnitude (comparable to five years of advances in hardware development).
D. Synopsis
The paper begins with a survey of the state of the art in automatic performance tuning, compiler back ends, and short vector SIMD vectorization. Section II reviews compiler technology that plays an important role in the context of this paper. The idea of automatic performance tuning is introduced in this section, and the automatic performance tuning systems ATLAS, FFTW, and SPIRAL are reviewed. Section III provides details of current short vector SIMD extensions and discusses vectorization techniques and the problems that arise when they are applied to such extensions.
The remaining sections present the novel contributions of this paper. Section IV introduces symbolic vectorization of DSP transforms. Section V explains the vectorization techniques applied in the Vienna MAP vectorizer. Section VI introduces the Vienna MAP back end. All these novel technologies are assessed experimentally in Section VII.
II. COMPILER TECHNOLOGY AND AUTOMATIC PERFORMANCE TUNING
This section introduces and evaluates compiler techniques relevant to high-performance scientific computing. In addition, the concept of automatic performance tuning is introduced.
A. High-Performance Compilers
Today's production compilers for high-performance computing feature sophisticated optimization techniques to cope with the complexity of current computer systems. Optimization targets the utilization of resources within the CPU as well as memory locality, special instructions, and multiple processors.
Standard loop optimization techniques targeting locality of references and CPU resource utilization include loop pipelining, partial unrolling, exchanging, splitting, and fusing [40] . Other techniques to enhance CPU resource utilization include register allocation [2] , instruction scheduling, and strength reduction [40] . Features like data prefetching, multiple CPUs, and vector instructions are addressed by various algorithms. Feedback directed compilation [47] utilizes the actual runtime behavior of a program in the optimization process. An excellent survey of compiler techniques may be found in [7] .
Of these techniques, the quality of register allocation is particularly important for the compilation of numerical straight-line code. In [25], the simple yet effective farthest first algorithm [47] is assessed in a compiler back end for MIPS processors. Experiments demonstrate that this alternative spilling strategy is superior to the standard techniques (which are based on graph coloring [2]) when compiling numerical code.
B. Automatic Performance Tuning
For numerical kernels, which demand the highest possible level of performance, traditional optimizing compilers are not sufficient when used in isolation on portable source code.
Automatic performance tuning as a new software optimization paradigm targets this problem. It is a problem-specific approach to performance optimization beyond general-purpose compiler optimization. Automatic performance tuning systems feature code generators that utilize domain knowledge to generate many alternative implementations for a given algorithm, exploiting degrees of freedom in the problem to be solved. Actual runtimes of these alternative implementations drive a search process that automatically selects the best performing algorithm for a given machine.
ATLAS: ATLAS [13] , [50] , [51] uses empirical techniques to produce efficient performance portable implementations of Basic Linear Algebra Subprograms (BLAS). ATLAS automatically generates various implementations of a given BLAS operation and searches for optimal loop tiling, blocking, software pipelining, register blocking, instruction scheduling, etc., to find the best way of performing this operation on a given computer.
FFTW: The first effort to automatically generate high-performance FFT code was FFTW [23]. Typically, it produces code that runs faster than publicly available FFT codes and compares well with vendor libraries. A dynamic programming approach relying on a recursive implementation of the Cooley-Tukey FFT algorithm [49] provides for the adaptation of the FFT computation of a given size to a given target machine at runtime. The actual computation is done within routines called (twiddle and no-twiddle) codelets, produced by a program generator named genfft [22]. The newly released FFTW 3 [24] includes results of this paper.
SPIRAL: SPIRAL [39], [43] is a code generator for high-performance DSP transforms. For a given transform to be implemented, the rule-based formula generator produces one out of many possible fast algorithms, represented by mathematical formulas in the signal processing language SPL. The formula translator (SPL compiler) [53] translates these formulas into C or Fortran code. Based on measured program runtimes, the search engine generates alternative formulas and triggers their implementation. By iterating this process, SPIRAL searches for the implementation best suited to the given machine.
III. UTILIZING SHORT VECTOR SIMD EXTENSIONS EFFICIENTLY
Modern processors from embedded computing to supercomputers feature short vector SIMD extensions. These extensions operate on vectors of basic integer and floating-point data types. The vectors allow for fine-grained parallelism. To keep the complexity of the microprocessor design low, only restricted sets of operations are supported.

Table 1. Floating-Point SIMD Instruction Set Extensions. For each extension listed, the vector length, the calculation precision, and a list of supporting systems is provided.

SIMD instruction set architecture (ISA) extensions supporting floating-point operations include Intel's streaming SIMD extensions (SSE, SSE 2, and SSE 3), AMD's 3DNow! and its extension called "enhanced 3DNow!", Motorola's AltiVec, and the SIMD extensions of IBM's double floating-point unit for BlueGene/L machines. Table 1 gives an overview of the SIMD ISA extensions supported by the machines targeted in this paper. Fig. 1 presents typical examples of two-way and four-way SIMD operations.
A. Architectural Features of SIMD Extensions
Depending on the actual architecture and the precision used, floating-point SIMD extensions may boost the potential peak performance by up to a factor of four (see Table 2). However, speedup by itself does not indicate absolute performance. Implementations that utilize SIMD technology to obtain good speedup and, at the same time, achieve good absolute performance are rare.
Although short vector SIMD technology is similar in concept to the technology implemented in VLIW processors, vector processors, or superscalar processors, SIMD extensions have restrictions that distinguish them from these related concepts. These restrictions include the following.
1) Restricted Parallelism. Short vector SIMD floating-point instructions and data shuffling instructions are more general than those found on vector processors, but the respective vector length is much shorter, thus distinguishing SIMD processors from vector computers. The usage of vector registers makes SIMD technology more restrictive than VLIW and superscalar designs. An important issue is that an application's data flow may require many shuffle operations to allow for parallel computation of arithmetic instructions. Complicated data flow requires more shuffle instructions, which can lead to actual performance degradation.
2) Memory Access Peculiarities. Memory access on processors with short vector SIMD extensions is dominated by the memory hierarchy. Systems featuring such processors behave like modern superscalar processors and, thus, require optimization for the memory hierarchy. This is in contrast to the behavior of conventional vector computers, which do not feature a memory hierarchy. Another major restriction is that only properly aligned data can be loaded into vector registers efficiently. Unaligned or nonunit-stride memory access can, therefore, lead to extremely poor performance.
Due to the inherent restrictions of SIMD extensions, only very regular kernels lead to satisfactory speedup. Even when SIMD optimization in theory provides a large speedup (up to a factor of four on the Pentium 4), even highly tuned applications may realize only a fraction of it. Nevertheless, leaving SIMD extensions unused means forgoing a substantial part of a machine's attainable performance.
The techniques introduced in this paper overcome the restrictions of SIMD extensions and lead to demonstrable speedup factors of up to 1.8 for two-way SIMD extensions and 3.3 for four-way extensions as compared with the best-performing scalar FFT codes. In addition, our codes are twice as fast as the fastest codes obtained with vectorizing compilers. This impressive performance improvement has been obtained despite the fact that the structure of FFT computation is not well matched to the architectural features of SIMD extensions.
B. Vectorizing Codes for Short Vector SIMD Extensions
Currently available methods for producing programs that are able to utilize short vector SIMD instructions can be categorized as follows.
1) Interface Used. Currently, there are three commonly used interfaces available to programmers who want to utilize SIMD instructions: a) portable high-level language in tandem with vectorizing compilers, possibly requiring hints (pragmas) on how to vectorize; b) proprietary C language extensions that provide explicit access to short vector SIMD extensions on source level; and c) assembly language.
2) Who Vectorizes. The actual vectorization process can be done explicitly by programmers. Contrary to this approach is the use of vectorizing compilers (which may or may not be hinted by programmers) to extract parallelism from portable programs. As a third option, program generators may generate innately vectorized codes.
3) Structures Vectorized. Vectorization methods either extract parallelism from independent loop iterations or extract instruction-level parallelism from basic blocks.
4) Generality of Approach. Some vectorization approaches are general-purpose techniques (for instance, when they are applicable to any program that features loops or to any basic block) while other methods depend on the special structure of given algorithms like complex FFTs.
All these approaches involve tradeoffs between portability and generality on the one hand and achievable performance on the other.
Hand-Coding: For FFTs, there exist well-known hand-coded vector algorithms like Stockham's FFT algorithm [29] , [48] and vector computer libraries like the SCIPORT library [34] . However, without further adaptation to the memory hierarchy, these algorithms lead to disappointingly low performance on current SIMD architectures.
A general-purpose hand-coding method utilizing instruction-level parallelism is to implement a program using the complex arithmetic of C99 [5] and let an appropriate compiler map the complex operations to sequences of two-way vector instructions.
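As an illustration of this style, the following C99 sketch (with a hypothetical function name) performs a pointwise complex product; a compiler following the approach described above can keep each (real, imaginary) pair in one two-way vector register.

#include <complex.h> /* C99 complex arithmetic */

/* Pointwise product of two complex vectors in portable C99. Sketch of the
 * coding style only; not taken from any library discussed in this paper. */
void cmul(const double complex *x, const double complex *y,
          double complex *z, int n)
{
    for (int i = 0; i < n; i++)
        z[i] = x[i] * y[i]; /* 4 multiplications, 2 additions per element */
}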
To exploit an algorithm's intrinsic parallelism, ad hoc utilization of instruction-level parallelism within a program and the hand vectorization of loops is often performed. In this case, the programmer formulates the parallelism within the algorithm using proprietary language extensions, which inhibits portability.
In the field of DSP transforms, these hand-coding approaches are used in SIMD-enabled vendor libraries (examples include Intel's MKL and IPP [28], Apple's vDSP [6] and vBigDSP [12] , as well as AMD's core math library ACML [4] ), application notes (Intel's split radix FFT [26] ), and free implementations like the NEC V80R FFT [41] or the Linux SIMD library libSIMD [42] . SIMD-vectorized wavelet transforms are presented in [11] and a SIMD-vectorized FFT library is presented in [45] .
Vectorizing Compilers: There exist many research and production-quality compilers for SIMD extensions, including Intel's C++ compiler [27], IBM's XL C compiler for BlueGene/L supercomputers [3] , a vectorizing extension to the SUIF compiler [46] , Codeplay's VECTOR C compiler [10] , and the SWAR compiler scc [14] , [15] .
Automatic general-purpose loop-vectorizing compiler technology originates from vector computer research and is included in most vectorizing compilers [54]. These algorithms were designed for long vector lengths and other characteristics of conventional vector computers, like constant nonunit stride memory access. Some of these implicit assumptions are no longer valid for short vector SIMD extensions. In addition, vectorization- and locality-enhancing loop transformations often conflict. Compilers, therefore, often require user hints (for instance, by pragmas) in order to successfully vectorize loops [21].
Automatic general-purpose methods based on the extraction of instruction-level parallelism in basic blocks are used in many compilers (e.g., Intel, VectorC, and IBM compilers). They originate from very large instruction word (VLIW) research [15] , [35] . These algorithms search for code subblocks that feature parallelism. In order to map the full computation, these parallel blocks must be connected, either by scalar operations or by data shuffling operations, which can introduce considerable vectorization overhead. Due to an exploding search space, these algorithms tend to fail on large basic blocks having complicated structure. Experiments show that in this case, these algorithms may produce negligible speedup or may even slow down the code.
A graph-based code selection technique for DSPs with SIMD support has been introduced in [36] . Techniques for SIMD utilization in the context of energy-aware compilation for DSPs are presented in [38] .
Pre-Existing Methods Used in Code Generators: ATLAS allows for the insertion of hand-coded kernels featuring SIMD instructions into its optimization cycle. ATLAS depends on programmers contributing such hand-coded kernels for new architectures [52] . These kernels are typically coded in assembly language.
The code generator of FFTW 3 includes instruction-level vectorization for two-way vector extensions that is based on properties of complex FFTs and utilizes C language extensions. For four-way vector extensions, a combination of this method and loop vectorization is applied.
C. Our Approach
This paper introduces two approaches to domain-specific vectorization. Section IV introduces a loop vectorization method for DSP algorithms. It provides a DSP-specific approach to optimizing memory access operations (in the presence of a deep memory hierarchy) and SIMD vectorization simultaneously. This method was included in the code generators and the runtime systems of experimental SIMD versions of FFTW and SPIRAL.
Section V provides a vectorizing compiler for numerical kernels. It introduces a domain-specific method to extract instruction-level parallelism. Domain knowledge is utilized to search for a vectorization with low overhead and to avoid combinatorial explosion. The method is applied to the output of the automatic performance tuning systems ATLAS, FFTW, and SPIRAL.
IV. SYMBOLIC VECTORIZATION OF SIGNAL TRANSFORMS
This section introduces domain-specific vectorization techniques for DSP transforms [16]-[21]. The presented approach originates from the observation that neither original vector computer DSP transform algorithms [34] nor vectorizing compilers [27], [35] are generally capable of producing high-performance DSP transform implementations for short vector SIMD architectures, even in tandem with automatic performance tuning [21].
The intrinsic structure of DSP transforms prevents the successful application of these well-known methods. The main obstacle is that the structure of memory access (given by permutations and loop carried array references) occurring in DSP transform algorithms is incompatible with the features offered by currently available short vector SIMD target architectures.
To overcome these problems, we introduce new SIMD vectorization techniques that are designed to be used in the code generators and the runtime environment of automatic performance tuning systems, specifically targeting FFTW and SPIRAL. The most challenging problem in our approach is the recursive breakdown of DSP algorithms, or the manipulation of given recursive algorithms, such that all arithmetic operations occur in vectorizable loops and the resulting data access patterns can be handled efficiently across all current SIMD architectures.
A. Mathematical Framework
We represent DSP algorithms as matrix factorizations, applying the Kronecker product formalism [29], [43] as the language in which to express our vectorization algorithms. The approach is based on the fact that many Kronecker product formulas have an intuitive interpretation as programs [29], [53].
The DFT of size $mn$ can be factorized by the Cooley-Tukey rule

$$\mathrm{DFT}_{mn} = (\mathrm{DFT}_m \otimes I_n)\, T^{mn}_n\, (I_m \otimes \mathrm{DFT}_n)\, L^{mn}_m \qquad (1)$$

where $I_n$ is the identity matrix, $T^{mn}_n$ the complex twiddle diagonal matrix, and $L^{mn}_m$ the stride permutation matrix. Recursive application of rules like (1) yields a fast algorithm. The degree of freedom in choosing a factorization of the transform size in each step leads to a large number of mathematically equivalent formulas with similar arithmetic cost, but different structure (data flow).
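For example, instantiating rule (1) with $m = n = 2$ yields the radix-2 factorization of $\mathrm{DFT}_4$:

$$\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\, T^{4}_{2}\, (I_2 \otimes \mathrm{DFT}_2)\, L^{4}_{2}, \qquad \mathrm{DFT}_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}.$$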
To describe algorithms for short vector SIMD extensions, a formal translation from complex matrices into real matrices is required [17]. The complex multiplication $(a + ib)(x + iy)$ is equivalent to the real matrix-vector product

$$\begin{pmatrix} a & -b \\ b & a \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} \qquad (2)$$

inducing the definition of the bar operator $\overline{(\cdot)}$. Formally, the complex matrix-vector multiplication $y = Mx$ is translated into $\bar{y} = \bar{M}\bar{x}$, where $\bar{M}$ arises from $M$ by replacing every complex entry $a + ib$ by the $2 \times 2$ matrix in (2), and $\bar{x}$ is obtained by replacing each complex entry $x + iy$ of $x$ by the two-dimensional real vector in (2).
Algebraic manipulation of formulas by applying matrix identities allows the data flow of a fast algorithm to be changed [29]. Important examples in the context of this paper include stride permutation matrix factorizations like

$$L^{kmn}_{n} = (L^{kn}_{n} \otimes I_m)(I_k \otimes L^{mn}_{n}),$$

the conjugation of a tensor product by a stride permutation,

$$A_m \otimes B_n = L^{mn}_{m}\,(B_n \otimes A_m)\,L^{mn}_{n},$$

and manipulation rules for the bar operator $\overline{(\cdot)}$. Different data layouts for complex numbers can be expressed in terms of the bar operator and conjugation. The full set of identities required to derive the formal vectorization algorithm for DSP transforms, as well as the full derivation, can be found in [17].
B. Vectorizable Formulas
Formulas describing DSP transforms are viewed as being built from certain block matrices with block size $\nu$, where $\nu \in \{2, 4\}$ denotes the target machine's vector length, and from matrices that are not such block matrices. The nonblock matrices are mapped either to less efficient extension-specific vector code or to scalar code.
Block matrices are either built from diagonal matrices of size $\nu$ or they are special permutation matrices. Later on, the block matrices are mapped to highly efficient vector code independently of the target machine's short vector architecture. Their block structure ensures that all memory access operations are properly aligned vector loads and stores. All data reordering is done in the vector registers, and all arithmetic operations are pure vector operations. See Table 3 for vector instructions and examples of corresponding block matrices. The remainder of this section gives the particulars of the two types of block matrices.
Compute Matrices: All arithmetic operations are done in matrix-vector operations on vectors of length $\nu$. Their building blocks are the real diagonal matrices $\operatorname{diag}(d_0, \ldots, d_{\nu-1})$ and the zero matrix $0_\nu$. One of the most important constructs in this class of block matrices is the tensor product

$$A \otimes I_\nu. \qquad (3)$$

To obtain $A \otimes I_\nu$, all entries $a$ in $A$ are replaced by $a\,I_\nu = \operatorname{diag}(a, \ldots, a)$. It is crucial to our approach that the vectorization of constructs of the form $A \otimes I_\nu$ is done independently of the actual Kronecker formula describing the structure of $A$.
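As a small worked example (with the illustrative choice $A = \mathrm{DFT}_2$ and $\nu = 2$),

$$\mathrm{DFT}_2 \otimes I_2 = \begin{pmatrix} I_2 & I_2 \\ I_2 & -I_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{pmatrix},$$

so the corresponding matrix-vector product requires exactly two aligned vector loads, one vector addition, one vector subtraction, and two aligned vector stores.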
To describe block matrices arising from complex diagonal matrices with entries $c_j = a_j + i b_j$, the operator $\overline{(\cdot)}^{\,\nu}$, defined for a complex diagonal $D = \operatorname{diag}(c_0, \ldots, c_{n-1})$ as

$$\bar{D}^{\,\nu} = (I_n \otimes L^{2\nu}_{2})\;\overline{(D \otimes I_\nu)}\;(I_n \otimes L^{2\nu}_{\nu}) \qquad (4)$$

is used. The conjugation of the real matrix $\overline{(D \otimes I_\nu)}$ by $I_n \otimes L^{2\nu}_{\nu}$ produces a matrix built from real diagonal matrices of size $\nu$. As an example, consider a single twiddle factor $c = a + ib$ and $\nu = 2$:

$$\bar{c}^{\,2} = L^{4}_{2}\;\overline{c\,I_2}\;L^{4}_{2} = \begin{pmatrix} a\,I_2 & -b\,I_2 \\ b\,I_2 & a\,I_2 \end{pmatrix}. \qquad (5)$$

An important block matrix is the vectorized twiddle matrix, which is built from diagonals of size $\nu$ and does not originate from a tensor product of the twiddle diagonal $T^{mn}_n$.

Permutation Matrices: To be considered a block matrix, a permutation matrix must be a product of the form

$$(P \otimes I_\nu)\,\big(I_{n/\nu^2} \otimes L^{\nu^2}_{\nu}\big)\,(Q \otimes I_\nu), \quad \nu^2 \text{ divides the size } n, \qquad (6)$$

built from the permutation matrices $P$, $Q$, and $L^{\nu^2}_{\nu}$. Permutations of type (6) only require vector memory access (with the addressing encoded in $P$ and $Q$) and a moderate number (depending on $\nu$) of register-to-register data reordering operations, independently of the target architecture.
All other permutations must be implemented using scalar code or less efficient vector code.
C. Symbolic Vectorization Algorithm
Symbolic vectorization of DSP transforms translates the problem of vectorizing DSP transform algorithms into the problem of generating efficient scalar code for DSP transforms. First, formula manipulation is used to transform a DSP algorithm (given in Kronecker product notation) into a vectorizable algorithm with similar characteristics. This new transform algorithm is then implemented using vector instructions by utilizing existing code generators (FFTW's genfft [22] and SPIRAL's SPL compiler [53] ) and newly developed vector-specific extensions to these code generators.
Our formal vectorization approach uses tensor products as core constructs. Diagonal matrices and permutation matrices are vectorized with respect to tensor products. All vectorized constructs are transformed into symbols $S$, defined by

$$S = P\, D\, (A \otimes I_\nu)\, E\, Q \qquad (7)$$

with permutation matrices $P$ and $Q$. $D$ and $E$ are block matrices originating from real or complex diagonals. $A$ is an arbitrary formula in Kronecker product notation.
The implementation of a symbol $S$ is centered on the implementation of $A \otimes I_\nu$. First, existing code generators are utilized to generate efficient vector code for $A \otimes I_\nu$ by generating efficient scalar code for $A$ and then replacing each scalar instruction by the respective vector instruction (for instance, a scalar addition is replaced by a vector addition). In the resulting code for $A \otimes I_\nu$, all vector load and store operations are then replaced by the arithmetic operations required by $D$ and $E$ and the data reorganization operations required by $P$ and $Q$. Provided $P$ and $Q$ match (6), this approach leads to an efficient vector implementation of $S$. For example, Fig. 2 shows the generated code for a small DFT symbol.
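To illustrate the principle, the following hand-written sketch implements the construct $\mathrm{DFT}_2 \otimes I_2$ from the example above in SSE 2 intrinsics ($\nu = 2$, double precision). It shows only the replacement of scalar by vector instructions; the actual generated code of Fig. 2 additionally folds in the diagonals and permutations of the symbol.

#include <emmintrin.h> /* Intel SSE 2 intrinsics: two-way double precision */

/* y = (DFT_2 (x) I_2) x for 16-byte aligned arrays; illustrative sketch. */
void dft2_tensor_I2(const double *x, double *y)
{
    __m128d a = _mm_load_pd(x);             /* aligned load of x[0..1] */
    __m128d b = _mm_load_pd(x + 2);         /* aligned load of x[2..3] */
    _mm_store_pd(y,     _mm_add_pd(a, b));  /* y[0..1] = a + b */
    _mm_store_pd(y + 2, _mm_sub_pd(a, b));  /* y[2..3] = a - b */
}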
D. Short Vector Cooley-Tukey Vectorization
The most crucial part in achieving good performance by symbolic vectorization is to make the permutations $P$ and $Q$ of a symbol $S$ match (6) while keeping the changes to the original formula minimal. For many transform algorithms like WHTs or 2-D transforms, this is an easy task. In the case of the vectorization of Cooley-Tukey FFTs, however, a more sophisticated approach is required.
The short vector Cooley-Tukey rules given by (8)-(10) solve this problem for FFTs whose size is a multiple of $\nu^2$. All matrices occurring in FFT algorithms vectorized by (8)-(10) are block matrices with diagonal matrices of size $\nu$ as blocks. All permutations match (6). When applying the short vector Cooley-Tukey rules, the SIMD-enabled versions of FFTW and SPIRAL optimize the resulting SIMD implementations for the memory hierarchy by simultaneously searching for the best factorization of the transform size and for the best implementation of the occurring DFT constructs. We also derived a transposed version and versions for different complex data formats. Rule (9) originates from the vector recursion of FFTW 3 [24].
The short vector Cooley-Tukey entry rule (8) translates the original DFT into a symbol and a recursive part. Any recursive part is further factorized using the recursive rule (9), leading to a new symbol and a new recursive part. Considering the factorization $N = n_1 n_2 \cdots n_k$ of the transform size, each application of rule (9) consumes one factor by recursively breaking the remaining size into this factor and the product of the remaining factors. The recursion ends when all factors have been consumed by applications of rule (9), finally leading to the last recursive part. The application of the leaf rule (10) transforms this last recursive part into the required form, making the permutations $P$ and $Q$ of all symbols match (6).
V. VECTORIZATION TECHNIQUES FOR STRAIGHT-LINE CODE
This section introduces the Vienna MAP vectorizer [32] , [33] , [37] that automatically extracts two-way SIMD parallelism out of given numerical straight-line code. The MAP vectorizer additionally supports the extraction of SIMD FMA instructions.
MAP has been applied successfully to straight-line code produced by FFTW, SPIRAL, and ATLAS. Thus, a large variety of numerical computations ranging from FFTs and other DSP transforms to BLAS kernels can be vectorized automatically.
A. Fundamentals of Vectorization
Existing approaches to vectorizing basic blocks [15], [35], [36] try to find an efficient mix of SIMD and scalar instructions to perform the required computation. In contrast, MAP's vectorization mandates that all computation is performed by SIMD instructions, while attempting to keep the SIMD reordering overhead reasonably small.
MAP's vectorizer uses depth-first search with chronological backtracking to discover SIMD-style parallelism in a scalar code block, aiming at a reduction of the overall instruction count. Obtaining a satisfactory SIMD utilization is tantamount to minimizing the amount of SIMD data reorganization while maximizing the coverage of the scalar directed acyclic graph (DAG) by natively supported SIMD instructions. Fig. 3 gives an example of short vector SIMD code obtained by vectorizing straight-line complex FFT code.
B. Vectorization Engine
The central goal of vectorization is to replace all scalar instructions by vector instructions. The description of the vectorization technique relies on the following definitions.
Operation Pairing: Pairing rules specify ways of transforming pairs of scalar instructions into a sequence of SIMD instructions. A pairing rule often provides several alternatives to do so. The vectorizer supports pairing rules of the following instruction combinations: 1) load/load; 2) store/store; 3) unary/unary; 4) binary/binary; 5) unary/binary; 6) unary/load; and 7) load/binary.
Variable Fusion: Two scalar variables $s$ and $t$ can be fused to a SIMD variable of the form $(s, t)$ or $(t, s)$. The vectorization process ensures that no scalar variable is part of two different SIMD variables.
C. Vectorization Quality
To produce vector code of the highest quality, the vectorization engine starts out by constraining all SIMD memory operations to access consecutive locations and by disabling suboptimal operation pairing rules of type 5), 6), and 7).
If these restrictions cause the vectorization process to fail, it is restarted after enabling operation pairing rules of type 5), 6), and 7), and the support for less efficient, i.e. nonconsecutive, memory accesses. This step substantially extends the class of vectorizable codes by allowing the extraction of some less efficient instruction combinations.
In the worst case, a fallback to the vector implementation of scalar code is made by leaving half of each SIMD instruction's capacity unused. On all surveyed x86 machines, the resulting codes are faster than the scalar legacy x87 codes generated by standard general-purpose compilers.
D. Vectorization Algorithm
Before the actual vectorization process is carried out, the following preparatory steps are taken. First, dependencies of the scalar DAG are analyzed and instruction statistics are assembled. This data is used to speed up the vectorization process by avoiding useless vectorization. Then, store instructions are combined nondeterministically by fusing their respective source operands.
The actual vectorization algorithm consists of two steps.
1) Pick two scalar instructions $s_1$ and $s_2$ that have not been vectorized, with $(s_1, s_2)$ or $(s_2, s_1)$ being an existing fusion.
2) Nondeterministically pair the two scalar operations into one SIMD operation. This step may produce new fusions for the respective source operands.
The vectorizer alternatingly applies these two steps until either the vectorization succeeds, i.e., all scalar variables are part of at most one fusion and all scalar operations have been paired, or the vectorization fails. If the vectorizer succeeds, it commits to the first solution of the search process (see the sketch below).
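The following C sketch outlines this search; the instruction type and all helper functions are hypothetical placeholders for MAP's internal DAG representation.

/* Sketch of MAP's two-step vectorization with chronological backtracking. */
typedef struct Instr Instr;

int  pick_fused_unvectorized_pair(Instr **s1, Instr **s2);  /* step 1 */
int  num_pairing_alternatives(const Instr *s1, const Instr *s2);
void apply_pairing(Instr *s1, Instr *s2, int alt);   /* may create fusions */
void undo_pairing(Instr *s1, Instr *s2, int alt);
int  all_instructions_paired(void);

int vectorize(void)
{
    Instr *s1, *s2;
    if (!pick_fused_unvectorized_pair(&s1, &s2))
        return all_instructions_paired();   /* success, or a dead end */
    for (int alt = 0; alt < num_pairing_alternatives(s1, s2); alt++) {
        apply_pairing(s1, s2, alt);         /* step 2: pair nondeterministically */
        if (vectorize())
            return 1;                       /* commit to the first solution */
        undo_pairing(s1, s2, alt);          /* chronological backtracking */
    }
    return 0;                               /* no alternative worked */
}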
Nondeterminism in vectorization arises due to different vectorization choices. For a fusion $(d_1, d_2)$ there may be several ways (see Fig. 4) of fusing the source operands of the two instructions computing $d_1$ and $d_2$, depending on the chosen pairing, as depicted in Fig. 5.
The rule ranking, i.e. the order in which vectorization alternatives are tried, influences the order of the solutions of the vectorization process. As the vectorizer always commits to the first solution, the rule ranking is adapted in such a way that the first solution favors instruction sequences which are particularly well-suited for a given target machine. For instance, on an AMD K7 processor the vectorizer prefers extracting intraoperand (ACC) over interoperand (PAR, CHI) style SIMD instructions (see Fig. 5 ).
E. Peephole-Based Optimization
After vectorization, a local rewriting system is used to implement peephole optimization on the vector DAG.
The first group of rewriting rules aims at: 1) minimizing the number of instructions; 2) eliminating redundancies and dead code; 3) reducing the number of source operands; 4) copy propagation; and 5) constant folding.
The second group of rules can be used to extract SIMD-FMA instructions.
The third group of rules rewrites unsupported SIMD instructions into sequences of SIMD instructions actually supported by the target architecture.
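As an illustration of the second group, the following hypothetical C sketch fuses a vector multiply feeding a vector add into a single SIMD FMA node, assuming the multiply result is not used elsewhere; the node layout is invented for this example.

/* Hypothetical DAG node; VFMA computes src[0]*src[1] + src[2]. */
typedef struct Node { int op; struct Node *src[3]; int nrefs; } Node;
enum { VMUL, VADD, VFMA };

/* Rewrite n = (a*b) + c into n = fma(a, b, c) when the multiply feeds
 * only this addition. */
void try_fuse_fma(Node *n)
{
    Node *m = n->src[0];
    if (n->op == VADD && m != 0 && m->op == VMUL && m->nrefs == 1) {
        n->src[2] = n->src[1]; /* c becomes the addend          */
        n->src[0] = m->src[0]; /* a                             */
        n->src[1] = m->src[1]; /* b                             */
        n->op = VFMA;
        m->nrefs = 0;          /* the multiply becomes dead code */
    }
}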
Finally, the optimizer topologically sorts the instructions of the vector DAG. The scheduling algorithm minimizes the life span of variables by improving the locality of variable accesses. It is based on the scheduler of genfft [22] . 
VI. BACK-END TECHNIQUES FOR STRAIGHT-LINE CODE
The Vienna MAP back end [32] , [33] (see Fig. 6 ) introduced in this section generates assembly code optimized for short vector SIMD hardware. It is included in an experimental version of FFTW, FFTW-GEL [31] , and has been connected to SPIRAL and ATLAS. Currently supported targets include x86/3DNow! and x86/SSE2. A PowerPC version is currently being developed.
As in [25], the MAP back end uses the farthest first algorithm as its spilling policy in the register allocator. Additionally, as the MAP back end does not target a broad range of structurally different handwritten code, it makes use of domain-specific meta information by exploiting the following properties of array access operations occurring in automatically generated straight-line code: 1) any memory access is indexed, possibly with a stride as a runtime parameter, and 2) each memory location is read/written at most once.
A. Main Parts of the Back End
The MAP back end performs ISA-specific optimization in: 1) register allocation; 2) computation of effective addresses; 3) usage of in-memory operands; and 4) register reallocation. ISA-specific optimization addresses general properties such as the number (and type) of available logical registers, available instruction forms, and the existence of special instructions like lea (on x86).
The instruction scheduler performs processor-specific optimization by taking into account execution properties provided by a processor-specific execution model. This model specifies required execution resources, available resources, the maximum number of instructions issued at each clock cycle, and instruction latencies.
The last optimization step of MAP's back end is responsible for the avoidance of address generation interlocks (AGIs).
B. Nonrecurring Optimizations
Register allocation, computation of effective addresses, and AGI prevention are optimization techniques performed only once. Register allocation is followed by instruction scheduling, which tries to maximize the usage of the functional units and the pipeline of the processor at hand.
Register Allocation for Straight-Line Code: While there is no single best phase ordering in the context of general-purpose compilers targeting handwritten code, MAP's back end assumes that the input code has been reasonably scheduled on a higher level using domain-specific meta information. This kind of preprocessing can be done, for example, as genfft [22] does, by performing an FFT-specific topological sort of the computation DAG in an attempt to minimize the asymptotic number of register spills.
As any future access to temporary variables is known in advance, the MAP back end uses the farthest first algorithm [8] as its spilling heuristic. This means that whenever a temporary variable needs to be mapped to a logical register, the following strategy is used (see the sketch after this list).
1) Whenever possible, choose a fresh logical register.
2) If there is at least one dead logical register, i.e., a logical register holding a value that is not referenced in the future, choose the logical register that has been dead for the longest time. Notice that reusing any dead logical register introduces a false dependency.
3) If there are no dead logical registers, choose the logical register holding the value whose next reference lies farther in the future than the next references to all other logical registers.
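The following C sketch illustrates the three cases of this strategy; the bookkeeping arrays are hypothetical, but the selection logic follows the description above.

#include <limits.h>

/* Farthest first spilling for straight-line code: next_use[r] is the index
 * of the next instruction reading logical register r (INT_MAX if the value
 * is dead), is_free[r] marks registers not yet used, and died_at[r] records
 * when the value in r was last referenced. */
int choose_register(int num_regs, const int *next_use,
                    const int *is_free, const int *died_at)
{
    int r, best = -1;
    for (r = 0; r < num_regs; r++)      /* 1) prefer a fresh register     */
        if (is_free[r])
            return r;
    for (r = 0; r < num_regs; r++)      /* 2) register dead the longest   */
        if (next_use[r] == INT_MAX &&
            (best < 0 || died_at[r] < died_at[best]))
            best = r;
    if (best >= 0)
        return best;                    /* introduces a false dependency  */
    for (r = 0; r < num_regs; r++)      /* 3) farthest next use: spill it */
        if (best < 0 || next_use[r] > next_use[best])
            best = r;
    return best;
}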
Optimized Index Computation: Index computation is not normally a relevant issue in the performance optimization of non-DSP code. However, straight-line codes produced by DSP program generators feature an unusually large proportion of memory access operations relative to arithmetic operations. Thus, optimized index computation is crucial to achieving high performance in this case.

MAP targets code that operates on arrays of input and output data not necessarily stored contiguously in memory. Thus, access to an array element may result in a memory access operation either at a constant offset from the array's base address or at an offset involving a runtime stride, where the base address and the stride are parameters passed from the calling function. Memory access operations with constant stride do not need explicit effective address computation, whereas those with variable stride do.
To achieve an efficient effective address computation, the usage of general integer multiplication instructions is avoided by performing a combination of strength reduction and common subexpression elimination in the register allocator.
Whenever the register allocator needs to emit code for the calculation of an effective address, the (locally) shortest possible sequence of simple instructions (add, sub, shift, lea) is determined by forward chaining using depth-first iterative deepening with a depth limit that depends on the latency of the integer multiplication instruction on the given target architecture. In case the depth limit is exceeded, strength reduction is not applied.
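The following C sketch conveys the underlying feasibility test: can $k$ times the stride be built from additions, subtractions, and doublings (shifts) within a given depth limit? It is a simplified model; the real back end searches by forward chaining over values already held in integer registers and also exploits the x86 lea instruction.

/* Can k*stride be computed from the stride with at most 'depth' simple
 * instructions (shift = doubling, add/sub of the stride)? */
static int reachable(long k, int depth)
{
    if (k == 1) return 1;               /* the stride itself is available */
    if (depth == 0 || k < 1) return 0;
    if (k % 2 == 0 && reachable(k / 2, depth - 1)) return 1; /* shift */
    if (reachable(k - 1, depth - 1)) return 1;               /* add   */
    if (reachable(k + 1, depth - 1)) return 1;               /* sub   */
    return 0;
}

int strength_reducible(long k, int depth_limit)
{
    for (int d = 0; d <= depth_limit; d++)  /* iterative deepening */
        if (reachable(k, d)) return 1;
    return 0;                    /* fall back to integer multiplication */
}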
As the shortest sequences of code doing effective address calculation tend to eagerly reuse already calculated contents of the integer register file as factors, a replacement policy based on the least recently used (LRU) heuristic is employed in the allocation of integer registers.
Prevention of AGIs: Although modern x86-compatible processors allow the out-of-order execution of instructions, the definition of Intel's Pentium architecture mandates that instructions accessing memory, i.e., loads and stores, must be executed in order. This requirement is the reason why AGIs occur. Whenever a memory operation in need of additional effort for effective address calculation (in scaled index-addressing mode) directly precedes a memory operation not requiring additional effort, both memory operations are delayed for the duration of the more expensive address calculation: an AGI occurs. The MAP back end tries to avoid such AGIs by reordering the affected instructions after leaving the feedback-driven optimization loop.
C. Feedback-Driven Optimizations
The instruction scheduler and the register reallocator form a feedback driven optimization loop. The instruction scheduler serves as a basis for estimating the runtime of the entire basic block. As long as the code's estimated execution time can be improved, feedback-driven optimization is carried out.
Basic Block Instruction Scheduling: To maximize the overall performance, the respective code has to be scheduled in a way to take maximum advantage of the pipelines provided by the architecture [40] , [47] . Instruction scheduling is an optimization technique that rearranges the microoperations executed in a processor's pipeline, attempting to maximize the number of functional units operating in parallel and to minimize the time they spend waiting for each other [30] .
MAP's instruction scheduler deals with the instructions of a single basic block by using local list scheduling [40] , [47] . The scheduling algorithm utilizes information about the critical-path lengths of the underlying data dependency graph as a heuristic when selecting an instruction to be issued. The list scheduling algorithm implemented in MAP interacts with an execution model of the target processor to simulate the effects of superscalar in-order execution of an instruction sequence.
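A minimal sketch of such a list scheduler is given below; the priority array and the helper releasing newly ready successors are hypothetical stand-ins for MAP's dependency graph and execution model.

/* Local list scheduling by critical-path priority: prio[i] is the length
 * of the longest dependency path from instruction i to a leaf of the DAG;
 * ready[] holds the currently issuable instructions. */
void release_ready_successors(int ins, int *ready, int *nready); /* hypothetical */

void list_schedule(int n, int *ready, int nready, const int *prio, int *order)
{
    for (int issued = 0; issued < n; issued++) {
        int best = 0;
        for (int j = 1; j < nready; j++)  /* pick the ready instruction   */
            if (prio[ready[j]] > prio[ready[best]])
                best = j;                 /* with the longest critical path */
        int ins = ready[best];
        order[issued] = ins;              /* issue it next                */
        ready[best] = ready[--nready];    /* remove from the ready set    */
        release_ready_successors(ins, ready, &nready);
    }
}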
Register Reallocation: Register allocation and instruction scheduling have conflicting goals. As the register allocator tries to minimize the number of register spills, it prefers introducing a false dependency (by reusing a dead logical register) over spilling a live logical register. The instruction scheduler, on the other hand, tries to maximize the pipeline usage of the processor by spreading out the dependent parts of code sequences according to the latencies of the respective instructions.
As the MAP back end performs register allocation before instruction scheduling, false dependencies introduced by the register allocator severely reduce the degree of freedom of the instruction scheduler.
To address this problem, the register reallocator tries to lift some of the restrictive data dependencies introduced by register allocation, enabling the instruction scheduler to do a better job in a subsequent pass. Additionally, the register reallocator uses information about the spills introduced by the register allocator to minimize the code size by appropriately utilizing complex instruction set computer (CISC)-style instructions (with in-memory operands) and by optimizing x86 copy instructions.
VII. EXPERIMENTAL RESULTS
This section provides experimental evidence for the impressive performance gain unleashed by the methods introduced in this paper. In particular, high-performance implementations of DSP algorithms are obtained by including support for short vector SIMD extensions to SPIRAL and FFTW.
Symbolic vectorization and the Vienna MAP vectorizer provide the fastest FFTs currently available on x86 architectures featuring 3DNow!, SSE, or SSE 2, for certain problem sizes. In addition, these methods produce the only currently available FFT software for IBM's BlueGene/L PowerPC 440 FP2 processors which takes full advantage of their double FPU. Formal vectorization and the Vienna MAP vectorizer also provide the only current vectorized implementations of general one-dimensional (1-D) and multidimensional DSP transforms that are automatically tuned, as well as supplying the only fully automatically vectorized ATLAS kernels.
A. Experimental Setup
Numerical experiments were conducted on the following machines featuring different short vector extensions: 1) a prototype of BlueGene/L's PowerPC 440 FP2 running at 500 MHz featuring a double FPU; 2) various Pentium 4 machines featuring SSE and SSE 2 running at 1.8, 2, 2.6, and 3 GHz; 3) a 1.53-GHz Athlon XP 1800+ featuring 3DNow! Professional; and 4) a 933-MHz MPC7450 G4 featuring AltiVec.
SSE and AltiVec are four-way vector extensions, while 3DNow! and IBM's double FPU are two-way vector extensions (see Table 1). SSE (on Pentium 4 processors) and AltiVec (on the PowerPC) provide a theoretical speedup of four, while 3DNow! (on the AMD Athlon and Opteron) and IBM's double FPU (on the IBM PowerPC 440 FP2) provide a theoretical speedup of two (see Table 2). In addition, AltiVec and the double FPU provide fused multiply-add (FMA) instructions. Note that these speedup figures do not imply anything about the theoretical performance per cycle. For instance, both Athlon XP processors with 3DNow! and Pentium 4 processors with SSE can retire four single-precision flops per cycle, despite their different SIMD vector lengths of two and four, respectively.
The vectorization techniques introduced in this paper were assessed using ATLAS kernels and general DSP transforms with special focus on FFTs. Due to the variable flop count of different FFT algorithms, FFT performance is given in pseudo Gflop/s, i.e., $5N \log_2 N$ divided by the runtime in nanoseconds for an FFT of size $N$. For all other transforms, actual performance in Gflop/s is measured. These metrics preserve the runtime relations between different implementations of each transform.
B. Main Insights
Figs. 7-9 display a selection of instruction statistics and performance data to provide evidence for the most important experimental results.
The main insights provided by the numerical experiments can be summarized as follows.
Speed: Both symbolic vectorization and the Vienna MAP vectorizer considerably speed up computation across all the assessed machines. Figs. 8(a) and 9(a)-(h) show the performance of the generated codes across the test machines. Using four-way SIMD, speedups of up to 3.3 have been achieved leading to FFTs running at 7.3 pseudo Gflop/s on a 3-GHz Pentium 4, displayed in Fig. 9(a) .
Using two-way SIMD, high performance FFTs running at 2.4 pseudo Gflop/s on a 2-GHz Pentium 4 have been achieved, as displayed in Fig. 9(e) .
Versatility: High speedup values are maintained across a large variety of transforms: real and complex FFTs, both for power-of-two and for arbitrary vector lengths, as displayed in Figs. 8(a) and 9(a) and (c)-(f). More general transforms like 2-D DCTs have been sped up as well, as shown in Fig. 9(b).
The Vienna MAP vectorizer is able to vectorize ATLAS kernels [illustrated in Fig. 9(h) ] and FFT codes [ Fig. 9(e)-(g) ]. The relative instruction counts displayed in Fig. 7 show that by using SIMD instructions the total number of arithmetic instructions is in some cases reduced by 50%, and in most cases reduced significantly. Slightly increased instruction counts may occur in some cases, which does not introduce severe problems, as the execution units for SIMD data shuffling and SIMD floating-point operations can operate in parallel.
Vectorizing Compilers and Portable Nonadaptive FFT Libraries: Both tested portable nonadaptive FFT libraries (Numerical Recipes and GNU GSL) feature optimized algorithms but are much slower than the best scalar adaptive code. They are slower than the best SIMD-vectorized codes by a factor of five to ten, as can be seen in Figs. 8(a), 9(a), and 9(e). These experiments provide evidence that modern (vectorizing) compilers are not able to generate fast machine code in conjunction with portable nonadaptive libraries.
Vectorizing Compilers and Automatic Performance Tuning: Automatic performance tuning in combination with vectorizing compilers does not, in practice, reach the same level of performance as hand-tuned code and code generated using symbolic vectorization or the Vienna MAP vectorizer and back end. In the best case, when using loop vectorization, Intel's C++ compiler reaches 60% of the speedup of symbolic vectorization, as demonstrated in Fig. 9(a) and (b) . For automatically generated FFT codes, vectorization of basic blocks using instruction level parallelism sometimes achieves a slight acceleration, but in other cases leads to a significant performance degradation. This effect is especially striking when using IBM's XL C compiler for BlueGene/L's PowerPC 440 FP2 processor, as displayed in Fig. 8(a) and (b) . In all these cases the Vienna MAP vectorizer achieves satisfactory speedup values.
Hand-Tuned Libraries With SIMD Support: Symbolic vectorization and the MAP vectorizer provide about the same performance as vendor-supplied hand-tuned libraries that are obtained in a laborious and costly process. In contrast, the presented techniques utilize modern software technology to automatically adapt code to a given machine. The resulting performance behavior is illustrated by Fig. 9(a), (d) , and (e). The highly satisfactory performance level achieved by the MAP vectorizer when applied to linear algebra kernels [as illustrated by Fig. 9(h) ] is the same as that achieved by ATLAS.
Effect of the MAP Back End: The Vienna MAP back end accelerates automatically generated DSP code by up to 25%, compared with standard C compilers like the GNU C and Intel C++ compiler, as displayed in Fig. 9(g) .
VIII. CONCLUSION
This paper presents special-purpose compilation techniques targeting the SIMD vectorization of DSP transforms and straight-line code, both in the context of automatic performance tuning.
The paper addresses the difficulty of optimizing for SIMD extensions and memory hierarchy simultaneously. The problems of generic loop vectorization and instruction-level parallelism extraction present in state-of-the-art vectorizing compilers are avoided by utilizing knowledge of the properties of the algorithms underlying the programs to be vectorized.
Most approaches described in the literature try to find a good SIMD coverage and have to trade the scope of SIMD application against low vectorization overhead. In contrast, the presented techniques guarantee full SIMD coverage and a very small vectorization overhead for the most important BLAS and DSP kernels. This is achieved by utilizing knowledge about the algorithms at three different layers.
1) Symbolic vectorization of DSP transforms handles a large class of DSP transforms, including FFTs, WHTs, and all multidimensional DSP transforms.
2) Alternatively, automatic vectorization of straight-line code targets the large basic blocks generated by automatic performance tuning systems.
3) Finally, a special-purpose compiler for large basic blocks featuring vector instructions achieves fast object code.
The presented compiler technology obtains unusually high speedup values (1.85 for two-way and 3.3 for four-way SIMD extensions) on top of the fastest scalar codes available, thus leading to an unprecedentedly high overall performance. In this way, the performance level of hand-tuned vendor libraries is attained together with performance portability.
In conjunction with the leading automatic performance tuning systems SPIRAL, FFTW, and ATLAS, high-performance short vector SIMD implementations of FFTs, general DSP transforms, and BLAS kernels have been obtained. Some of the techniques described in this paper have been included in the current release of the industry-standard numerical library FFTW and will become part of IBM's numerical library for BlueGene/L supercomputers. 