Abstract. We present the automatic synthesis of the HPC Challenge's Global FFT, a large 1D FFT across a whole supercomputer system. We extend the Spiral system to synthesize specialized single-node FFT libraries that combine a data layout transformation with the actual on-node FFT computation, improving network performance by enabling all-to-all collectives. We ran our optimized Global FFT benchmark on up to 128k cores (32 racks) of ANL's BlueGene/P "Intrepid" and achieved 6.4 Tflop/s, outperforming ANL's 2008 HPC Challenge Class I Global FFT run (5 Tflop/s). Our code was part of IBM's winning 2010 HPC Challenge Class II submission. Further, we show first single-thread results on BlueGene/Q.
Background
In this section we discuss the necessary background for this paper. We review the Kronecker product formalism to describe fast Fourier transform (FFT) algorithms, the autotuning and program generation system Spiral, and the BlueGene/P supercomputer.
Fast Fourier Transform. Given n real or complex inputs x_0, ..., x_{n-1}, the discrete Fourier transform (DFT) is defined as

    y_k = \sum_{0 \le \ell < n} \omega_n^{k\ell} x_\ell, \quad 0 \le k < n,                         (1)

with \omega_n = \exp(-2\pi i/n), i = \sqrt{-1}. Stacking the x_\ell and y_k into vectors x = (x_0, ..., x_{n-1})^T and y = (y_0, ..., y_{n-1})^T yields the equivalent form of a matrix-vector product:

    y = \mathrm{DFT}_n\, x, \qquad \mathrm{DFT}_n = [\omega_n^{k\ell}]_{0 \le k,\ell < n}.           (2)
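To make the definition concrete, a direct C99 evaluation of (1) (illustration only, not generated code) looks as follows:

#include <complex.h>

/* Direct evaluation of (1): y[k] = sum_l omega_n^(k*l) * x[l] with
   omega_n = exp(-2*pi*i/n); Theta(n^2) operations. */
void dft_by_definition(int n, _Complex double *y, const _Complex double *x)
{
    const double pi = 3.14159265358979323846;
    for (int k = 0; k < n; ++k) {
        _Complex double acc = 0;
        for (int l = 0; l < n; ++l)
            acc += cexp(-2.0 * pi * I * (double)k * (double)l / n) * x[l];
        y[k] = acc;
    }
}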
Computing the DFT by its definition (2) requires Θ(n^2) operations. FFT algorithms reduce the runtime to O(n log(n)) and can be written as structured sparse matrix factorizations using the Kronecker product formalism [6, 12]. The two-point FFT is given by the butterfly matrix

    \mathrm{DFT}_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}.                                  (3)
In the following, we use I_n to denote an n × n identity matrix, and

    A \otimes B = [a_{k,\ell} B]_{k,\ell}

for the tensor (Kronecker) product of matrices. It replaces every entry a_{k,\ell} of A by the matrix a_{k,\ell} B. Most important for FFTs are the cases where A or B is the identity matrix. As examples consider, for a 2 × 2 matrix A = [a_{k,\ell}],

    I_2 \otimes A = \begin{bmatrix} A & \\ & A \end{bmatrix}, \qquad
    A \otimes I_2 = \begin{bmatrix} a_{0,0} I_2 & a_{0,1} I_2 \\ a_{1,0} I_2 & a_{1,1} I_2 \end{bmatrix}.
Further we introduce the stride permutation matrix defined by

    L^{mn}_m :\; jm + i \;\mapsto\; in + j, \qquad 0 \le i < m,\ 0 \le j < n,

i.e., the element at position jm + i is moved to position in + j. L^{mn}_m can be seen as transposing an n × m matrix that is obtained by reshaping an mn-dimensional vector into an n × m matrix stored in row-major order. As an example consider

    L^{6}_2 (x_0, x_1, x_2, x_3, x_4, x_5)^T = (x_0, x_2, x_4, x_1, x_3, x_5)^T.
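In code, applying L^{mn}_m amounts to exactly this matrix transposition; a minimal C illustration (not taken from the paper):

#include <complex.h>

/* y = L^{mn}_m x: transpose the n x m row-major matrix stored in x,
   i.e., y[i*n + j] = x[j*m + i] for 0 <= i < m, 0 <= j < n. */
void stride_permutation(int m, int n, _Complex double *y, const _Complex double *x)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            y[i * n + j] = x[j * m + i];
}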
Equation (4) shows the general mixed-radix Cooley-Tukey FFT algorithm:

    \mathrm{DFT}_{mn} = (\mathrm{DFT}_m \otimes I_n)\, T^{mn}_n\, (I_m \otimes \mathrm{DFT}_n)\, L^{mn}_m.    (4)
In (4), T^{mn}_n is a complex diagonal matrix of twiddle factors [12]. Using (3) and (4), an 8-point FFT can be derived by two recursive applications:

    \mathrm{DFT}_8 = (\mathrm{DFT}_2 \otimes I_4)\, T^{8}_4\, \big(I_2 \otimes \big((\mathrm{DFT}_2 \otimes I_2)\, T^{4}_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^{4}_2\big)\big)\, L^{8}_2.
Formulas can be manipulated using formula identities like

    A \otimes B = (A \otimes I_n)(I_m \otimes B),                                                    (5)
    I_{mn} = I_m \otimes I_n,                                                                        (6)
    (A \otimes B)^\top = A^\top \otimes B^\top,                                                      (7)
    (BC)^\top = C^\top B^\top,                                                                       (8)
    I_m \otimes BC = (I_m \otimes B)(I_m \otimes C),                                                 (9)
    L^{mn}_n (A \otimes B)\, L^{mn}_m = B \otimes A.                                                 (10)

In (5)-(10), A is an m × m matrix and B and C are n × n matrices. Further we denote the conjugation of a matrix A by a permutation P by A^P = P^\top A P and note that for permutation matrices P^{-1} = P^\top.
Spiral. Recursive application of rules like (3) and (4) yields many different algorithms for a given FFT size. Spiral [6] uses this fact to search for the fastest one on a given platform. A user-specified transform (like DFT_256) is expanded by Spiral using rules (see Table 1) into a formula; the formula is translated into code, applying traditional compiler techniques like unrolling, array scalarization, constant folding, and strength reduction to produce high-quality fixed-size FFT functions. The runtime of the program is measured and fed into a search module, which triggers, in a feedback loop, the generation of a modified formula based on a search strategy. Upon termination, Spiral outputs the fastest program found. Figure 1 shows an 8-point FFT generated by Spiral for BlueGene/P, using the complex data type extension of C99:

void FFT8(_Complex double *Y, _Complex double *X) {
    __alignx(16, Y);
    __alignx(16, X);
    _Complex double s34, s35, s36, s37, s38, t100, t101, t102,
                    t103, t104, t94, t95, t96, t97, t98, t99;
    t94 = (*(X) + *(X + 4));
    t95 = (*(X) - *(X + 4));
    t96 = (*(X + 2) + *(X + 6));
    s34 = (__I * (*(X + 2) - *(X + 6)));
    t97 = (t94 + t96);
    t98 = (t94 - t96);
    t99 = (t95 + s34);
    t100 = (t95 - s34);
    t101 = (*(X + 1) + *(X + 5));
    t102 = (*(X + 1) - *(X + 5));
    t103 = (*(X + 3) + *(X + 7));
    s35 = (__I * (*(X + 3) - *(X + 7)));
    t104 = (t101 + t103);
    s36 = (__I * (t101 - t103));
    s37 = ((0.70710678118654757 + __I * 0.70710678118654757) * (t102 + s35));
    s38 = ((-0.70710678118654757 + __I * 0.70710678118654757) * (t102 - s35));
    *(Y)     = (t97 + t104);
    *(Y + 4) = (t97 - t104);
    *(Y + 1) = (t99 + s37);
    *(Y + 5) = (t99 - s37);
    *(Y + 2) = (t98 + s36);
    *(Y + 6) = (t98 - s36);
    *(Y + 3) = (t100 + s38);
    *(Y + 7) = (t100 - s38);
}

For sizes too large to be implemented as a single basic block, Spiral automatically generates a recursive mixed-radix FFT library [7] similar to FFTW [5]. Spiral employs a rewriting system to symbolically expand breakdown rules like (4) and find a closure of recursive functions that is needed to implement the recursive FFT library. It then automatically implements these recursive functions as well as the recursion leaves (codelets) for a sufficiently large set of sizes. At runtime, a planner autotunes the recursive decomposition of the FFT in a one-time setup step. After tuning, a fast FFT library call for the respective problem size is available.
The key insight is that a straightforward implementation of (4) suggests four steps corresponding to the four factors, where two steps call smaller DFTs. However, to improve locality, the initial permutation L mn m is usually not performed but interpreted as data access for the subsequent computation, and the twiddle diagonal T mn n is fused with the subsequent DFTs. This strategy is chosen, for example, in the library FFTW 2.x and the code can be sketched as shown in Figure 2 . A simplified description of performing this process by hand can be found in [13] .
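The following C99 sketch shows this strategy for one level of (4). The two DFT variant declarations are taken from the sketch referenced above (Figure 2); the driver body and the assumed twiddle layout (d[j*m + i] = omega_{mn}^{i*j}) are illustrative only:

#include <complex.h>

typedef _Complex double cplx;

/* DFT variants needed (declarations as in Figure 2) */
void dft_iostride(int n, int istride, int ostride, cplx *y, cplx *x);
void dft_scaled(int n, int stride, cplx *d, cplx *y, cplx *x);

/* One level of (4) for a DFT of size m*n: L^{mn}_m becomes strided input
   access, and the twiddle diagonal T^{mn}_n is fused into the second pass
   of DFTs.  t1 is a scratch buffer of m*n elements. */
void dft_mn(int m, int n, cplx *d, cplx *y, cplx *x, cplx *t1)
{
    /* (I_m (x) DFT_n) L^{mn}_m: m DFTs of size n, read at stride m, write contiguously */
    for (int i = 0; i < m; ++i)
        dft_iostride(n, m, 1, t1 + i * n, x + i);
    /* (DFT_m (x) I_n) T^{mn}_n: n twiddle-scaled DFTs of size m at stride n */
    for (int i = 0; i < n; ++i)
        dft_scaled(m, n, d + i * m, y + i, t1 + i);
}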
BlueGene/P. BlueGene/P is the second generation BlueGene architecture from IBM, succeeding BlueGene/L [14] . In its compute nodes BlueGene/P uses four PowerPC 450 cores operating at 850 MHz with a double precision, dual pipe floating point
unit per core. Each node has 13.6 Gflop/s peak performance (3.4 Gflop/s per core) and 2 GB RAM with 13.6 GB/s memory bandwidth. Each core has a private 32 kB L1 cache and the four cores of a node share an 8 MB L3 cache. The compute nodes are connected with multiple interconnection networks including a 3-D torus (used for standard messaging), a global collective network (used for reductions), and a global barrier network. Each node has six bi-directional network links supporting 425 MB/s in each direction into the torus network, leading to 5.1 GB/s bidirectional bandwidth per node. The BlueGene/P system "Intrepid" installed at Argonne National Laboratory (ANL) consists of 40 BlueGene/P racks. Each rack contains 1,024 compute nodes (32 node cards, each holding 32 compute nodes), and each compute node has four cores (one quad-core CPU).
BlueGene/P messaging layer. We use the IBM UPC runtime system as the messaging layer. It provides an equivalent to the MPI all-to-all collective operation that fully utilizes BlueGene/P's 3D torus interconnection network. To achieve the best performance, every processor should send exactly one large, contiguous, equal-sized message to every other processor.
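In MPI terms, this communication pattern is a single MPI_Alltoall over equal-sized contiguous blocks (our implementation invokes the UPC runtime's equivalent collective instead); a minimal sketch:

#include <mpi.h>
#include <complex.h>

/* Every process sends one contiguous block of blk complex elements to every
   process (including itself).  Complex values are shipped as pairs of
   doubles to avoid depending on a complex MPI datatype. */
void exchange_blocks(_Complex double *sendbuf, _Complex double *recvbuf, int blk)
{
    MPI_Alltoall(sendbuf, 2 * blk, MPI_DOUBLE,
                 recvbuf, 2 * blk, MPI_DOUBLE, MPI_COMM_WORLD);
}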
BlueGene/Q. BlueGene/Q is the third generation BlueGene architecture from IBM, succeeding BlueGene/P [15]. The BlueGene/Q compute chip [16] is a system-on-a-chip (SoC) ASIC with 16 user-accessible 4-way SMT (simultaneous multithreading) A2 cores clocked at 1.6 GHz. A quad floating-point unit implementing the QPX instruction set is associated with each core. The BlueGene/Q compute chip achieves 204.8 Gflop/s peak performance. At 1,024 chips per rack, the 48-rack ANL "Mira" system achieves a peak performance of 10 Pflop/s and the 96-rack LLNL "Sequoia" system 20 Pflop/s, respectively.
Global FFT Algorithm
We now derive our novel 1D Global FFT algorithm, which is a variant of the Six Step FFT algorithm. Like the Six Step FFT algorithm, it has three global data exchanges. However, we block the global transpositions so that exactly one pair of large messages that are contiguous in memory is exchanged between every pair of processors in every communication step. This can be mapped efficiently to collective communication functions (all-to-all). We formally merge the ensuing data scrambling necessary to produce and consume the contiguous messages with the on-node FFT computations and derive modified FFT libraries (working on a custom scrambled data format) that perform the reordering at no extra cost compared to standard FFT libraries. We use the Kronecker product formalism to derive the algorithm and use Spiral to automatically build the modified node FFT libraries from the Kronecker product specification. Finally, we show pseudocode for the top-level parallel (single program multiple data, SPMD) function that calls the modified node FFT libraries.
Algorithm derivation. Using (5)-(13), the Six Step FFT algorithm can be derived from (4):

    \mathrm{DFT}_{mn} = L^{mn}_m\, (I_n \otimes \mathrm{DFT}_m)\, L^{mn}_n\, T^{mn}_n\, (I_m \otimes \mathrm{DFT}_n)\, L^{mn}_m.    (14)
With both tensor products in their parallel form (I_n ⊗ DFT_m and I_m ⊗ DFT_n), the algorithm is guaranteed to perform all DFT computations within the local memory of each node. This is achieved by reshaping the data vector of length mn into an n × m matrix and explicitly transposing it back and forth (a total of three transpositions is required). Typically, choosing m ≈ √(mn) and n = mn/m gives the so-called "square-root decomposition", which maximizes the number of processors that the DFT can be run on in parallel and provides good load balancing. A visual (data flow) representation of the 16-point six-step FFT is shown in Fig. 3. Details can be found in [17].
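A sequential C99 sketch of (14) makes the six steps explicit; a naive O(n^2) DFT stands in for the tuned node FFTs, and everything runs in one address space (illustration only):

#include <complex.h>
#include <stdlib.h>

typedef _Complex double cplx;

/* Naive out-of-place DFT of a contiguous row of length n (stand-in for a codelet). */
static void dft_row(int n, cplx *y, const cplx *x)
{
    const double pi = 3.14159265358979323846;
    for (int k = 0; k < n; ++k) {
        cplx acc = 0;
        for (int l = 0; l < n; ++l)
            acc += cexp(-2.0 * pi * I * (double)k * (double)l / n) * x[l];
        y[k] = acc;
    }
}

/* Six-step FFT (14) for N = m*n: transpose, m DFTs of size n, twiddle scaling,
   transpose, n DFTs of size m, transpose.  x (length N) is viewed as an
   n x m row-major matrix; y receives DFT_N x. */
void sixstep_fft(int m, int n, cplx *y, const cplx *x)
{
    const double pi = 3.14159265358979323846;
    const long N = (long)m * n;
    cplx *a = malloc(N * sizeof *a);
    cplx *b = malloc(N * sizeof *b);

    for (int j = 0; j < n; ++j)                      /* 1. L^{mn}_m          */
        for (int i = 0; i < m; ++i)
            b[(long)i * n + j] = x[(long)j * m + i];
    for (int i = 0; i < m; ++i)                      /* 2. I_m (x) DFT_n     */
        dft_row(n, a + (long)i * n, b + (long)i * n);
    for (int i = 0; i < m; ++i)                      /* 3. twiddles T^{mn}_n */
        for (int j = 0; j < n; ++j)
            a[(long)i * n + j] *= cexp(-2.0 * pi * I * (double)i * (double)j / N);
    for (int i = 0; i < m; ++i)                      /* 4. L^{mn}_n          */
        for (int j = 0; j < n; ++j)
            b[(long)j * m + i] = a[(long)i * n + j];
    for (int j = 0; j < n; ++j)                      /* 5. I_n (x) DFT_m     */
        dft_row(m, a + (long)j * m, b + (long)j * m);
    for (int j = 0; j < n; ++j)                      /* 6. L^{mn}_m          */
        for (int i = 0; i < m; ++i)
            y[(long)i * n + j] = a[(long)j * m + i];

    free(a);
    free(b);
}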
Below we assume p processors and that p divides both m and n. Using (5)-(13) together with associativity and distributivity, the stride permutation L^{mn}_m can be expressed as three permutation stages. First we use (12) to obtain (15).
Next we use (5) to obtain (16).
Inserting (16) into (15) and simplifying further yields (17). Equation (17) describes treating the n × m matrix as a block matrix of p × p blocks, each of size n/p × m/p. Each of the p processors holds p blocks in its local memory. (17) states that a distributed matrix can be transposed by transposing all blocks (p^2 local transpositions, each transposing a local block of size n/p × m/p) followed by transposing the blocks (one p × p transposition moving whole blocks, implemented as an all-to-all collective communication). The mechanics of the Kronecker product formalism requires three factors to describe the two steps. In our algorithm derivation we also require a transposed version of (17): we first swap m and n in (17) and then apply (7) to obtain the transposed expression (18) for L^{mn}_m.
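From one processor's point of view, the two steps can be sketched as follows (MPI flavor, for illustration only): the local block transposes produce the contiguous messages, the all-to-all moves whole blocks, and the received data is left in the scrambled block format that the modified node FFT libraries later consume.

#include <mpi.h>
#include <complex.h>

typedef _Complex double cplx;

/* Blocked distributed transpose in the spirit of (17): each of the p
   processes holds n/p consecutive rows of the n x m row-major matrix.
   Step 1: p local block transposes, each (n/p) x (m/p) -> (m/p) x (n/p),
           packed contiguously, one message per destination.
   Step 2: one p x p block exchange (all-to-all) with equal-sized messages. */
void blocked_transpose(int m, int n, int p,
                       const cplx *local,      /* (n/p) x m, row-major */
                       cplx *sendbuf, cplx *recvbuf)
{
    const int mloc = m / p, nloc = n / p;

    for (int dest = 0; dest < p; ++dest) {
        cplx *msg = sendbuf + (long)dest * mloc * nloc;
        for (int i = 0; i < mloc; ++i)           /* columns of the block */
            for (int j = 0; j < nloc; ++j)       /* rows of the block    */
                msg[i * nloc + j] = local[(long)j * m + (long)dest * mloc + i];
    }
    MPI_Alltoall(sendbuf, 2 * mloc * nloc, MPI_DOUBLE,
                 recvbuf, 2 * mloc * nloc, MPI_DOUBLE, MPI_COMM_WORLD);
}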
Inserting (17) and (18) into (14) and regrouping the ensuing expression using (5) leads to the final algorithm (19); for a more detailed derivation see [11, 18].
Eq. (19) makes the minimal necessary changes to the Six Step algorithm to make it compatible with highly optimized all-to-all communication calls, and to allow for specialized high-performance local recursive FFT libraries. Reading (19) from right to left, first each processor performs local data scrambling (a local transpose) in its own memory space to produce the first set of contiguous messages. This cannot be folded into any FFT library call but could be merged with the computation that produces the input data. Next all processors invoke the all-to-all collective communication; all p processors send one message of size mn/p^2 to each of the p processors (including themselves). Then a modified node FFT, the out-of-place scaled FFT library, is called; it performs the local FFT computation on scrambled data together with the twiddle scaling. Next the same all-to-all call is invoked a second time, followed by the second modified node FFT library, an in-place FFT library operating on scrambled data. Note that to make this stage in-place, one needs to choose (17) and (18) carefully. Lastly, the same all-to-all collective communication is called a third time to redistribute the data to the target processors, and a final local transpose (unscrambling) phase puts the data back into natural order. This final unscrambling cannot be merged with any of the modified libraries but could be merged with the code consuming the transformed data.
Specialized FFT node libraries. Our derivation extracted the formal definition of two modified node FFT libraries that are invoked independently but in parallel on all p processors. The first node library is specified by (20),
with T^{mn}_{n,i} being the global FFT twiddle factors for processor i. The library specified by (20) performs an out-of-place batch FFT (m/p FFTs of size n) plus twiddle scaling on a block-matrix data format. The second library is specified by (21)
and performs an in-place strided batch FFT (n/p FFTs of size m) that can be viewed as a column FFT. The modified node FFT libraries are automatically generated from the specification using Spiral's general-size library generation framework [7]. To turn the algorithmic advantage into a performance advantage, the automatically generated libraries need to perform on par with FFTW or the vendor library ESSL. Since we are targeting the Global FFT on 128k processors, the largest FFT sizes are up to mn = 2^38, and thus m and n can be up to 2^19. Thus, the node FFT libraries built from the specifications (20) and (21) need to provide good performance for batches of large FFTs. The generated libraries must perform all state-of-the-art optimizations including SIMD vectorization for the Double FPU [8, 9] and must be parallelized across the four cores of a BlueGene/P node [9] when running in SMP mode. Further, aggressive memory hierarchy optimizations like buffering and vector recursion need to be applied [5, 9]. All these optimizations need to be performed fully automatically [7].
Full Global FFT code. In Figure 4 we show the full HPCC Global FFT algorithm using a partitioned global address space (PGAS) abstraction similar to Unified Parallel C (UPC). The data vectors x and y are block distributed (mn/p elements reside in the local memory of each of the p nodes) and all parallel for loops are run across the p nodes of the parallel machine. For simplicity, on-node threading and SIMD vectorization are omitted.
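An MPI-flavored sketch of the same structure, read off (19) from right to left, is shown below; Figure 4 expresses it in the UPC-style PGAS abstraction, and the node-library and helper names used here are only placeholders for the Spiral-generated and hand-written routines:

#include <mpi.h>
#include <complex.h>

typedef _Complex double cplx;

/* Placeholders: local scrambling/unscrambling (pure local transposes) and
   the two Spiral-generated node FFT libraries specified by (20) and (21). */
void local_scramble(long m, long n, int p, cplx *out, const cplx *in);
void local_unscramble(long m, long n, int p, cplx *out, const cplx *in);
void node_dft_scaled(long m, long n, int p, int rank, cplx *out, const cplx *in);  /* (20) */
void node_dft_inplace(long m, long n, int p, int rank, cplx *buf);                 /* (21) */

/* One SPMD process's share of the Global FFT (19); x, y, tmp each hold the
   local mn/p complex elements.  cnt doubles are sent to each of the p peers. */
void global_fft(long m, long n, int p, int rank, cplx *x, cplx *y, cplx *tmp)
{
    const int cnt = (int)(2 * (m / p) * (n / p));   /* assumes it fits in an int */

    local_scramble(m, n, p, tmp, x);                               /* local transpose       */
    MPI_Alltoall(tmp, cnt, MPI_DOUBLE, y,   cnt, MPI_DOUBLE, MPI_COMM_WORLD);
    node_dft_scaled(m, n, p, rank, tmp, y);                        /* batch DFTs + twiddles */
    MPI_Alltoall(tmp, cnt, MPI_DOUBLE, y,   cnt, MPI_DOUBLE, MPI_COMM_WORLD);
    node_dft_inplace(m, n, p, rank, y);                            /* in-place batch DFTs   */
    MPI_Alltoall(y,   cnt, MPI_DOUBLE, tmp, cnt, MPI_DOUBLE, MPI_COMM_WORLD);
    local_unscramble(m, n, p, y, tmp);                             /* back to natural order */
}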
Experimental Results
We experimentally evaluated our optimized Global FFT benchmark on BlueGene/P configurations from one node card (32 quad-core nodes or 128 cores) up to 32 racks (32k quad-core nodes or 128k cores), with one process per node. We used the IBM UPC runtime for process and thread management and as the messaging layer. The benchmark is executed as a UPC program that calls external (C/C++) libraries for the on-node FFT computation. UPC uses IBM's XL C compiler as backend, and our synthesized on-node libraries were compiled with IBM's XL C compiler and options "-O3 -qarch=440d". We implemented a baseline Global FFT version that uses IBM's BlueGene/P ESSL for local FFTs and UPC coalesced transpose (the equivalent of MPI all-to-all) for messaging. This implementation requires explicit data reordering between the UPC messaging and the invocation of ESSL but provides the best performance for the FFT computation and the messaging in separation. This implementation is part of IBM's winning 2010 HPC Challenge Class II UPC submission. Figure 5 summarizes the performance results. We ran the UPC+ESSL baseline benchmark on the IBM T.J. Watson BlueGene/P system for up to eight racks. We ran our Spiral-generated library from one node card to 2 racks on the T.J. Watson machine and on ANL's "Intrepid" from 4 racks to 32 racks. The Spiral-generated Global FFT generally outperforms the UPC+ESSL baseline, which shows that (a) Spiral's automatically generated node libraries offer performance competitive with ESSL, and (b) the memory traffic savings obtained by merging the data scrambling with the node libraries improve performance. Finally, the Spiral-generated Global FFT reaches 6.4 Tflop/s on 32 racks of "Intrepid". The winning 2008 ANL HPC Challenge Class I submission reported 5 Tflop/s Global FFT performance on the same machine. Thus, the combination of algorithmic optimization and library generation improved the Global FFT on "Intrepid" by 1.4 Tflop/s or 28%.
Single node performance. Figure 6 shows single-node performance on the BlueGene/P quad-core PowerPC 450D. We compare the GNU Scientific Library (GSL) [19] to Spiral-generated sequential and multi-threaded scalar and Double FPU-vectorized code. Spiral-generated scalar single-core code significantly outperforms the GSL for in-cache sizes and performs on par with the GSL for memory-bound sizes, demonstrating the quality of Spiral's baseline code generation on BlueGene/P. Spiral's Double FPU two-way SIMD vector code provides between 50% and 2x speed-up on top of the scalar baseline. Using all four cores of the BlueGene/P multicore CPU yields a speed-up of 2x to 2.5x, except for the smallest sizes where parallelization overhead makes sequential code the fastest choice. Using only two threads is never a winning strategy.
Towards Global FFT on BlueGene/Q. We are in the process of porting the Spiral Global FFT code generation to the next-generation BlueGene machine, BlueGene/Q. One major difference is that BlueGene/Q features a new 4-way SIMD vector unit called QPX that is twice as wide as the Double FPU of BlueGene/P. In Figure 7 we show first performance results of Spiral-generated QPX code run on a single thread of a BlueGene/Q node. We observe that for small FFT sizes Spiral-generated code substantially outperforms both FFTW and ESSL. We are currently porting and adapting the remaining two levels of parallelism of the Global FFT (intra-node threading and inter-node message passing) to BlueGene/Q.
Conclusion
The increased complexity and performance levels of high-performance and supercomputing systems make the automatic generation and tuning of performance libraries for the petascale and beyond a promising alternative to hand-tuning. In this paper we present a novel 1D Global FFT algorithm for the HPC Challenge. Extending the Spiral system, we automatically generate specialized node FFT libraries that support the data layout required by the messaging layer while providing the FFT performance of the native data layout. The resulting reduction in memory traffic and high node performance enabled us to reach 6.4 Tflop/s on 128k cores of ANL's BlueGene/P system, improving performance by 28% over the previously reported Global FFT Class I benchmark. Finally, we show first single-node results on BlueGene/Q in which we significantly outperform FFTW and ESSL.
