Abstract-Graphic Processing Units (GPU) has been proved to be a promising platform to accelerate large size Fast Fourier Transform (FFT) computation. However, GPU performance is severely restricted by the limited memory size and the low bandwidth of data transfer through PCI channel. Additionally, current GPU based FFT implementation only uses GPU to compute, but employs CPU as a mere memory-transfer controller.
I. INTRODUCTION
Fast Fourier Transfonn (FFT) is one of the most widely used numerical algorithms in science and engineering domains. It is not rare that large scientific and engineering computation, such as large-scale physics simulations, signal processing and data compression, spend majority of execution time on large size FFTs. Such FFT implementations require large amount of computing resources and memory bandwidth. Compared with current multi-core CPUs, Graphical Processing Units (GPUs) have been recently proved to be a more promising platfonn to solve FFT problems since GPUs have much more parallel computing resources and can often achieve an order of magnitude performance improvement over CPUs on compute intensive applications [1] .
Efforts have been focused on solving in-card FFT problems whose sizes can fit into the device memory of GPU. It means that only two simple data transfers are needed in the solving of one FFT problem, one copying all the source data from CPU memory to GPU memory using the PCI bus, and the other copying all the results back. Since the data transfer does not have much to optimize, the prior works focused on the decomposition of FFT problems for the two-level organization of processing cores on GPU and the efficient usage of GPU on-device memory hierarchy. Libraries such as CUFFT from NVIDIA [2] , Nukada's work on 3D FFT [3] , [4] , Govindaraju's [5] , [6] and Gu's work on 2D and 3D FFT [1] can be classified into this group. Recently, Gu et.a!. [7] demonstrated a GPU-based out-of card FFT library that can solve FFT problems larger than GPU device memory. Since one data transfer cannot move all data between CPU and GPU, mUltiple data transfers are needed. Gu et.al. proposed a joint optimization paradigm that co-optimizes the communication and the computation phases of FFT, and an empirical searching method to find the best tradeoff be tween the two factors. For even larger FFT problems, Chen et.al. presented a GPU cluster based FFT implementation [8] . However, the work has been almost exclusively focused on the optimization of communication over inter-node channels.
In spite of highly influential results in prior FFT work on GPUs, no matter the on-card FFT libraries, the out-of card FFT libraries or the GPU-c1uster based solutions, the real performance of GPUs is not significantly higher than that of current high-performance CPUs as expected, since GPU performance is severely restricted by the limited GPU memory size and the low bandwidth of data transfer between CPU and GPU through PCIe channel. In addition, in the prior FFT research on GPUs, CPU is only used as a memory or communication controller, that is, managing memory transfer requests between CPU and GPU, or between nodes. The computing power of CPUs is wasted. Therefore, the objective of this paper is to propose a hybrid parallel FFT framework to concurrently execute both CPU and GPU for computing large-scale FFTs that exceed GPU memory. Incorporating CPU has several advantages: (1) Multi-core CPU is capable of computing partial work concurrently with GPU to release the pressure which is originally assigned to the GPU, and to make full use of available underlying computing resources for performance improvement, (2) CPU helps increase data transfer bandwidth remarkably by enabling efficient utilization of PCIe bus and save much GPU memory resource since partial work is kept into local CPU memory to execute without being transferred to GPU, and (3) CPU has much larger memory size than that of GPU in genera!.
Ogata et.al. [9] recently attempted to divide the compu tation to both CPU and GPU, though targeting at problems whose sizes can fit into the GPU memory. The small problem assumption makes the optimization of data communication between CPU and GPU trivial because all data can be copied to GPU in one data copying, which largely avoids the challenges of co-optimizing both computation and communication be tween two different types of devices. In this paper, we present a hybrid FFT library that engages both CPU and GPU in the solving of large FFT problems that can not fit into the GPU memory. The key problem we solve is to engage heteroge neous computing resources and newly improved optimization strategies in the acceleration of such large FFfs.
Making FFf run concurrently on CPU and GPU comes with significant challenges. First of all, CPU and GPU are two computer devices with totally different performance character istics. Even though FFf can be decomposed in many different ways, not a single method can arbitrarily divide a problem into subtasks with two different performance patterns. In FFT, a simple change to the division of computation will lead to global effects on the data transfers, because ultimately any single point in the output of a FFT problem is mathematically dependent on all input points. We cannot just optimize for CPU or just optimize for GPU. In simpler words, the first problem we need to solve is to divide a FFT workload between two types of computing devices that are connected by a slow communication channel, and to determine the optimal load distribution among such heterogeneous resources.
The second challenge is the magnitude of the vast space of possible hybrid implementations for one FFT problem. In addition to the large number of possible algorithmic trans formations, as outlined in the first challenge, CPU and GPU architectural features also need to be considered in the search. Reconciling CPU and GPU architectures is hard because they simply like different styles of computation/communication mix. Moreover, the decision of workload assignment needs to be put into a search space that consists of many different ways of decomposition and different ways of data transfer. In particular, computation and communication can be efficiently overlapped, an important performance booster, only if the data dependency between the CPU parts and the GPU parts is appropriately arranged. In other words, even if we already find the best algorithm for a FFf problem, i.e., the best division of computation, the implementation of the algorithm still needs to be co-tuned for two different architectures. This paper is the first effort to propose a hybrid implemen tation of FFf that concurrently executes both multithreaded CPU and GPU in a heterogeneous computer node to compute large FFT problems that cannot fit into GPU memory. The paper makes four main contributions: (I) a hybrid large-scale FFf decomposition framework that enables the extraction and the tailoring of different workload and data transfer patterns appropriate for the two different computing devices, (2) an empirical performance modeling to determine optimal load balancing between CPU and GPU, which estimates perfor mance based on several key parameters, and replaces an exhaustive walk-through of the vast space of possible hybrid implementations of FFT on CPU/GPU with a guided empirical search, (3) an optimizer that exploits substantial parallelism for both GPU and CPUs, and (4) effective heuristics to pur posefully expose opportunities of overlapping communication with computation in the process of decomposing FFT. Overall, our hybrid FFT implementation outperforms several latest and widely used large-scale FFT implementations. For instance, our double-precison Tesla C2070 performance is 29% and 1.4 x faster than that of Gu et.al.'s [7] and Ogata et.al.'s [9] work, and l. 9 x and 2.1 x faster than 4-thread SSE-enabled FFfW [10] , [II] , [12] and Intel MKL [13] , with max speedups 4.6 x and 2.8 x, respectively.
II. OVERVIEW OF FFT ALGORITHM
FFT algorithms recursively decompose aN -point DFf into several smaller DFfs [14] , and the divide-and-conquer approach reduces the operational complexity of a Discrete Fourier Transform (DFf) from 0(N2) into O(Nlo gN ). There are many FFf algorithms, or in other words, different ways to decompose DFf problems. In this section we briefly introduce the FFf algorithms used in this paper and overview how they are incorporated into our hybrid approach. The DFT transform of an input series x(n) , n = 0, I, . e -J2 nab/ c is the twiddle factor introduced by [15] . Therefore,
In essence, Cooley-Tukey introduces a decomposition approach that divides one dimensional compu tation into two. Moreover, the Radix algorithm is a special case of the Cooley-Tukey algorithm for power-of-two FFf problems.
In this paper, we extend the 110 tensor representation intro duced in FFfW [12] [15] to exploit more algorithmic parallelism. In our hybrid 2D FFr framework, the extended tensor representation of work distribution on GPU, i.e. ugpu, and on CPU, i.e. uc P /" are transformed as equation (2).
where Y = Y1 X Y2, Sync denotes data transfer and synchroniza tion between CPU and GPU within computation. As a result, three dimensional computations, i.e. Yl, Y2, X in order, need to be executed. Also note that a twiddle factor computation t�� is introduced by the Cooley-Tukey decomposition between Y1 and Y2 step. Figure 1 shows the high-level working flow of our hybrid 2D FFr framework.
On GPU, a portion of Yl dimensional FFrs exceeding GPU global memory need to be computed. In this case, we divide the 2D FFr of GPU part into several passes such that the sub-problem of each pass can fit into GPU memory and be executed with the CPU portions concurrently. The number of passes equals to 2) GPU Computation Optimizations: On GPU, all the Yl dimensional ID FFrs are calculated by codelets, i.e., highly efficient straight-line code segments that solve small FFr problems. The codelet provides automatic matrix transposing within FFr substeps such that much transposition time can be saved. The concept of codelet was first introduced in FFrW, though its codelet generator only generates CPU code. We extend the FFrW codelet generator to generate GPU-based codelets. In addition, if size Yl is still large, we further decom pose Yl = Yll X Y12 sized ID FFr into two dimensional FFTs with smaller sizes Yll and Y12, respectively. On GPU, device memory has much higher latency and lower bandwidth than on-chip memory. Therefore, shared memory is utilized for the decomposed FFrs to increase device memory bandwidth. For example, NVIDIA GTX480 GPU has 48KB shared memory which can store 6K complex single-precision data or 3K complex double-precision data in maximum. Yl W x Yll X Y12 sized shared memory needs to be allocated, where Yl W is chosen to 16 for half-warp of threads in GTX480 to enable coalesced access to device memory. The number of threads in each block, for both Yll and Y12-step sub-FFrs, is therefore
Yl W x max(Yll, Y12) to realize maximum data parallelism on GPu. To calculate each Yl-step ID FFr, a size Yll codelet is first executed to load data from global memory into shared memory for each block. Next, all threads in each block are synchronized to finish their work before data in shared memory is reused by the Y12-step codelet and subsequently written back to global memory. Such shared memory technique effectively hides global memory latency and increases data reuse, both contributing to the GPU performance.
3) Asynchronous Strided Data Transfer: Another perfor mance hurdle is strided memory transfers between CPU and GPu. Since we separate the load between CPU and GPU, the portion of input data required to be continuous in GPU memory is not contiguous in host memory. For each pass of our 2D FFT, if we use simple CUDA memory copy operations to transfer the total # ofp:;s��;�l o���e ams sized data into GPU, we need to utilize PCI bus Yl x # of passes �� of slTeams times because we transfer Xgpu sized data each tinle. Clearly, the high PCI transfer overhead will kill all potential perfornlance gain. CudaMemcpy2DAsyncO can make only one call to transfer a strided 2D memory area of size # of pa s�:::# x:l slr'earns into GPU at a time. Therefore, the total number of PCI transfers is reduced to only Yl. Moreover, such data transfer optimization supports for CUDA asynchronous concurrent execution. Dif ferent data transfers managed by different streams are executed concurrently and are overlapped with different streamed GPU kernel executions.
To best use cudaMemcpy2DAsyncO in hybrid FFrs, when copying the GPU output back to CPU, it is used to copy multiple 2D strided arrays. Each streamed PCI transfer at this . Id ter e 1-step FFrs, all the streams on GPU are synchronized, and a subsequent barrier is set to synchronize GPU with CPU.
Our asynchronous strided transfer scheme achieves higher bandwidth than that of PCI transmission approach proposed in Gu's out-of-card FFr work [7] since we do not need to waste additional CPU resource to prepare buffer and therefore we get rid of overhead caused by buffering method in Gu's paper. To demonstrate the improvement of our PCI bandwidth, we used the same subarray test [7] as that in Gu's work, where there are C regular subarra y s of length W each. Two regular subarrays are separated by a stride X -W in a large arra y of size C x X. Note that the large array is contiguous in system memory but regular subarrays are not contiguous. Figure 2 shows the improvement of our PCI bandwidth over Gu's work on GTX480 GPU. Overall, our 2D hybrid implementation can achieve 6 GB/s PCI bandwidth on average comparing to only 4.2 GB/s of Gu's work and 3.4 GB/s of naive PCI transfer. each ID FFT needs to do a strided transpose. Both strided memory accesses and strided transpose are very expensive on CPU. Instead, we group the transformation of multiple complex arrays into a concurrent group operation and allow it to operate on non-contiguous (strided) data. Therefore, we need no input or output transposition and save much execution time. We set the number of arrays-X cpu-to be the maximum of what a FFTW group plan could execute at a time. For each grouped array, the plan computes size Yl ID FFT across a stride of Y2 x X for input and X for output. We need to execute such kind of plan for totally Y2 times.
5) Co-Optimization oj 2D FF T Jor CPUs and GPU:
The overall coordination between CPU and GPU works like following. The workload of CPU, conceptually a loop of size Y2, is parallelized into 4 concurrent subsections using 4 threads. Each thread is responsible to execute a distinct subsection in which independent grouped FFT computations are carried out. Simultaneously, the workload of GPU, including data transfers and kernel executions, is parallelized with CPU. Afterwards, jobs on GPU driven by different streams are synchronized before the task synchronization completes between GPU and CPUs. There is no matrix transposition on either GPU or CPU since computations in either side is re-organized to naturally subsume the strided transposition.
The subsequent calculation of twiddle factor multiplication t�� and Y2 & X dimensional FFTs is left for GPU. For Y2 = Y2l X Y22 dimensional FFTs, Cooley-Tukey decomposition is again applied since relative large size of Y2 would hurt the performance of codelet based GPU computing. Similarly, shared memory is taken into account for reusing data between the decomposed Y2 1 -step FFTs and the subsequent Y22-step FFTs. For the last X-step, i.e., ID contiguous FFT sub problems, CUFFT library is used because it provides good performance for row-major contiguous ID FFTs. Instead of using ordinary CUFFT plan, we make use of stream-enabled CUFFT plan so that all Y2 and X dimensional computations plus both PCI transfers of Y2'S input and X's output become stream-based asynchronous executions.
6) Comparison to heterogeneous out-oj-card CUFF TIFF TW implementation:
To demonstrate the effective ness of our approach on parallelizing FFT into heterogeneous processors, we compare our hybrid FFT library against a naive out-of-card hybrid 2D FFT implementation which simply uses CUFFT for GPU computation and FFTW for the CPU part. This heterogeneous distribution was first proposed in Ogata's [9] paper. Computation is firstly distributed along X dimension in the pI-round of 2D FFT and along Y dimension for the 2n d -round. On GPU, sub-problems are divided into several passes to facilitate data transfer between GPU and CPU. Matrix transpose, CUFFT and data transfer are processed in asynchronous manner. On CPU, FFTW advanced interface is utilized to process strided data. The purpose of this comparison is to see how our optimization technique improves over a naive hybrid CUFFTIFFTW solution. In the experiment, we vary the CPU/GPU work ratio from 0% to 100% for the naive solution and show its double precision performance curve of size 2 l S x 2 13 on C2070 in Figure 3 . The peak performance of Ogata's naive hybrid FFT [9] gains only 7.7 GFLOPS which is far below that of our hybrid version. The main reason is the lacking of co-optimizations in the naive solution.
B. Hybrid 3D FF T Framework To describe how our hybrid 3D FFT works, we start with a simple hypothetical scenario where all the work is assigned to GPU, and then continue to reveal how computation is extracted from this GPU-only hypothetical case and is assigned to CPU. Suppose that Z = Zl X Z2 and Y = Yj X Y2, the U3d can be The problem with this initial formula is that the workload in the formula cannot be well balanced between two computing devices. To save the number of PCI transfers and to balance computations between CPU and GPU, the computations sub steps need to be reordered [7] . The reordered computations are summarized in equation (4). framework that if we want to achieve actual high performance from heterogeneous implementation, we need to minimize the uses of PCI bus transfers. As Formula (4) indicates, in addition to the initial input transfer and final output transfer through PCI bus, at the very minimum, only two extra PCI transfers are called for, including copying output of Y2 dimensional FFTs from GPU back to CPU, and copying the input of subsequent Z2 dimensional FFTs from CPU into GPU. Therefore if we want to employ CPU for computing along with GPU, we need to arrange the data exchange between CPU and GPU to occur between Y2 and Z2 dimensional FFTs to reduce the total number of PCI bus transfers. Otherwise, more PCI transfers have to be invoked to merge the partial results of CPU and GPU between the two sub-steps FFTs. Hence, CPU is used to compute Z" Y, and Y2 dimensional FFTs, and the subsequent calculation of Z2 and X dimensional FFTs would be left for GPu. In summary, the heterogeneous 30 FFT tensor u gpu on GPU and the tensor u cpu on CPU are represented as formula (5), where S y nc represents data transfer and synchro nization between CPU and GPU within computation. We use cudaMemcpy30AsyncO to transfer a strided 30
.
X"puxYxZ2 .
GPU t f memory area of Slze # of passesx# of streams mto a a lme.
Total number of PCI transfer is Z,. For copying output back from GPU to CPU, we still need the 30 strided memory coPY. Each streamed PCI transfer at this time c ? uld � opy XgpuxYxZ2xZ, sized data. All Z" Y, and Y2 dlmenslOnal # ofpassesx# of streams computations plus Z,'s input and Y2'S output transfers are asynchronous executions, and only required synchronization is set after Y2-step. Similar as 20 transfer scheme, our 30 method stills outperforms Gu's out-of-card FFT work. PCI bandwidth for our 30 case on GTX480 can achieve 5.9 GB/s on average comparing to only 3.9 GB/s of Gu's work.
3) Co-Optimization for CPUs and GPU: Similar to 20 hybrid FFT, we execute the size Z, FFTs in groups on CPU. For each grouped array, FFTW plan computes size Z, 10 FFTs across a stride of X x Y X Z2 for input and Xcpu* Y for output. The total number of executions of such plans is Xcpu x Z2. For the following Y, and Y2 dimensional FFTs on CPU, we only need to calculate size Y FFTs instead. The number of grouped array is Xcpu. The plan computes size Y 10 FFTs across a stride of Xcpu for input and X for output. Total number of such plans is Z. Moreover, all the grouped FFT tasks on CPU are distributed to 4 concurrent threads. There is only one invocation to cudaOeviceSynchronizeO to synchronize all the streams on GPu. A subsequent barrier is set to synchronize the work of GPU with CPUs.
IV. LOAO BALANCING BETWEEN GPU ANO CPU
The 20 and 30 hybrid FFT frameworks layout the basic schemes of workload distribution between CPU and GPU.
However, there are parameters whose values need to be tuned for the optimal load balancing for different CPUlGPU combi nations. In this work, we combine both performance modeling and empirical searching to finish the last mile towards the optimal load balancing. The empirical tuning is done at build time.
Our approach is to split the total execution in either GPU or CPU into several primitive sub-steps, analyze the heterogeneous execution flow, and derive a performance model for each primitives. The model provides estimated execution time that is parameterized with the load ratio of GPU to total work. For each hardware configuration, we calibrate the models with two profiling runs, one on CPU and one on GPU, to determine the values of model parameters in different distribution ratios. Afterwards, using those parameters, we can automatically estimate, rather than really measuring, the total execution time of our implementation under varying ratios. We further use dynamic-programming to find the optimal implementation for different problems using the primitives as building blocks. The estimated performance might not be completely precise. Therefore, we only use it to provide a small region of potentially good choices. Within the region we exhaustively measure the performance and choose the best one. Overall, we avoid a walk-through of the vast space of all possible combinations of primitives.
A. Load Balancing of 2D FF T
Using the hybrid 2D FFf as an example, suppose that the total problem size is Yl x Y 2 X X. The load ratio of GPU to total work is defined as R g = x'!t along X dimension, therefore the ratio of CPU to the total is I -R g • The execution time of the whole process can be modeled as 8 parameters, which are summarized in Table I . We used two runs, one on GPU and CPU each, to determine T 2 dH 2 D-gpu, TYI kernel-gpu, and T 2 dD 2 H-gpu as execution time of corresponding Table I's parameters in GPU-only case, and to determine TYI fft w-cpu as execution time of TYl fft w( 1-R g ) in CPU-only case. Therefore, each parameter value in Table I can be modeled with different distribution ratios. On GPU side, for hybrid Yl dimensional FFfs, the execu tion time is estimated as TG20 shown in equation (6) .
[Yl x T 2 dH 2 D(# streams-I,Rg ) (6) + TYI kernel (# streams-I, R g ) + T 2 dD 2 H (# streams-I, R g )]; } On CPU side, for hybrid Yl dimensional FFfs, the execu tion time is estimated as TC2D = #/�;dS X TYI fft w (1-R g ).
Since synchronization is set after Yl-step FFT on both GPU and CPU side to guarantee the correctness of results, the execution time of hybrid Yl dimensional FFf can be modeled as the maximum of the GPU time and CPU time, i.e., TYI = max{TG2D, TC20}. And the total time estimation will be consequently calculated as Tr otal = max{TG2D, TC20} + TY2 &X, Afterwards, empirical searching is employed to find the pa rameter values that can make TG2D equal to TC2D, as well as the sub-steps along other dimensions, which indicates the optimal load balancing.
B. Load Balancing of 3D FF T
The load balancing in the hybrid 3D FFf framework is similar to that of the 2D cases. Suppose that the total problem size is 21 x 2 2 X Yl x Y 2 x X. The load ratio of GPU to total work is denoted as R g = X3t along X dimension and ratio of CPU to total problem is 1-R g . Performance parameters for the sub-steps in 3D hybrid FFf are summarized in Table II . Two profiling runs still help determine T3dH 2 D-gpu, TZI kernel-gpu, TYI kernel-gpu, TY2kernel-gpu, T3dD 2 H-gpu, and TZI fftw-cpu, TYff t w-cpu as execution time in respective GPU-only and CPU-only case for the parameters in Table II . On GPU side, for hybrid Zl&Y dimensional FFrs, the execution time is estimated as TG30 shown in equation (7).
On CPU side, for hybrid Zl&Y dimensional FFrs, the ex ecution time is estimated as TC30 represented in equation (8).
Similarly, since synchronization is set after Z 1 &Y -step FFr on both GPU and CPU side, the execution time of the hybrid Zl&Y dimensional FFr can be modeled as the maximum of the GPU time and CPU time, i.e., T ZI & Y = max{TG3o, TC30}. The total time estimation is calculated as 1(ota l = max{TG3o, TC30} + Tq&x. Empirical searching techniques similar to 2D cases are used to balance the substeps, as well as those along other dimensions.
V

PERFORMANCE EVALUATION
In this section, we evaluate the hybrid 2D and 3D FFT im plementation on three heterogeneous computer configurations. A single model of CPU, Intel i7 920, is coupled with three different NVIDIA GPUs, i.e. GeForce GTX480, Tesla C2070 and Tesla C2075 in the three experiments. Configurations of the FFr libraries, the GPUs and the CPU with total 24GB host memory are summarized in Table III , where NVCC is the compiler driver for NVIDIA CUDA GPUs.
As mentioned in section III.A.3), the perfonnance reported here includes both computational time and data transferring time between host and device. Our library in both single-and double-precisions are compared to FFrW and Intel MKL, two of the best performing FFr implementations on CPU. More over, our hybrid FFr library is compared with Gu's out-of-card FFr work [7] , a highly efficient GPU-based FFT library and the only one that we know can handle the problems sizes larger than GPU memory. The whole design of this perfonnance evaluation is to let us see how much perfonnance improvement can be achieved by using both CPU and GPU in compu tation, against the best-perfonning GPU-only or CPU-only FFr implementations. Please note that we can't compare our library with pure CUFFr implementation because it requires problem size to be smaller than GPU memory, and therefore won't accept problem sizes used in this evaluation. In FFrW, Streaming Single Instruction Multiple Data Extensions (SSE) on Intel CPU is enabled for better performance. Also FFrW results are got with the 'MEASURE' flag, the second most extensive perfonnance tuning mode. The 'EXHAUSTIVE' flag in FFrW, which represents the most extensive searching and tuning, is not used because the problem sizes in this evaluation are so large that FFTW can't finish its search under the 'EXHAUSTIVE' mode. For example, we tried running FFrW in 'EXHAUSTIVE' mode for a 2 2 8 FFT problem, but found FFTW couldn't finish the search in 3 days. In addition, Intel MKL automatically enables SSE at run time. Both FFrW and MKL are chosen to run with four threads. Even though the i7 CPU supports 8 hyperthreads, the 8-thread FFrW and MKL didn't show perfonnance advantage over, actually in some cases were slower than, the 4-thread versions.
All FFr problem sizes are larger than GPU memory. For double-precision implementation on GTX480, we choose the test cases from 32M points (i.e. 2 2 5 ) to 256M points (i.e. 2 2 8). 32M-point FFT is twice the maximal problem size that GTX480 memory can accommodate and 256M-point FFr is the maximum problem size that can fit into host memory. For single precision tests on GTX480, the sizes are from 64M points (i.e. 2 26 ) to 512M points (i.e. 2 2 9 ). Similarly, for Tesla C20701C2075, test cases are from 256M points (i.e. 2 2 8) to 51 2M points (i.e. 2 2 9 ) for double precision implementation and from 512M points (i.e. 2 2 9 ) to 1024M points (i.e. 2 
A. Peifonnance Tuning
For both 2D and 3D FFrs, our perfonnance modeling and empirical searching find the optimal ratio and best performance for different input sizes. Figure 5 demonstrates the effect and accuracy of load distribution ratio tuning on overall FFr performance. The figure shows the actual and modeled double precision 2D FFr performance on three different GPUs under different load ratios with problem size 2 15 x 2 1 3 . The x-axis is the distribution ratio from 0% to 100%. In particular, 0% rep resents running our hybrid FFr library only on CPU and 100% represents running only on GPU. The two extreme cases will help demonstrating the intrinsic overhead incurred by splitting computation!communication into two devices. Table IV shows the values of model parameters from the profiling runs of GPU only and CPU-only case described in section IVA and IV,B, where # passes, # streams and # thds are 2, 8, 4, respectively.
The modeled optimal and average performance is 99%, 96%, 95%, and 98%, 93%, 91 % of the actual measured on GTX480, Tesla C2070 and C2075, respectively. Our perfonnance mod eling, even without empirical tuning, is accurate. Figure 6 shows the actual and modeled double-precision 3D FFT performance with size 2 10 x 2 9 X 2 9 . The modeled optimal and average performance is again very accurate, and is 99%, 95%, 93%, and 98%, 94%, 91 % of the actual values.
In addition, the final perfonnance of our library is also compared against the cases that run our library only on GPU or only on CPU. As shown in Figure 5 for 2D FFf, on GTX480, Tesla C2070 and C2075, the optimal ratio of GPU to total work is 71.9%, 78.2% and 78.2%. The library's best performance is 21.4%, 19.1 % and 20.7% faster than GPU-only performance, and is 1.09x, 1.59x and 1.76x faster than CPU-only cases. Moreover, for 3D FFf in Figure 6 , the optimal ratio of GPU to the total is 75.0%, 78.2% and 78.2%. The load-balanced library performs 25.6%, 22.8% and 23.1 % faster than GPU only performance, and is 1.25 x, 1.51 x and 1.62 x faster than CPU-only case.
Not shown in this figure, but the single-precision perfor mance tuning has similar curve as that of the double-precision case. Also the optimal ratio of GPU to CPU in single-precision version is larger than that of double precision since GPU has relatively higher performance on single precision operations than CPU. 
B. Evaluation fo r 2D Hybrid FFT
We evaluate various 2D FFf problems on the three het erogeneous configurations. In all the figures, the test points are indexed in an increasing order of Y in the problem sizes. Figure 7 shows our single-precision 2D FFf performance on Geforce GTX480 with problem sizes from 2 2 6 to 2 29 . On aver age, our single-precision 2D hybrid FFf on GTX480 achieves 25.5 GFLOPS. Our optimally-distributed performance is 16% faster than Gu's pure GPU version, and is also 95% faster than the 4-thread FFfW and 1.06x faster than the 4-thread MKL. In particular, even if we run our hybrid FFT only on GPU, it is still faster than Gu's work, a high-performance GPU-based FFT implementation, mainly attributing to the asynchronous transfer schemes in our hybrid algorithm.
Furthermore, we also test 2D hybrid FFf performance in double-precision as shown in Figure 8 . Our double-precision 2D hybrid FFf on GTX480 achieves 13.1 GFLOPS. Moreover, our optimal performance is 20% faster than Gu's pure GPU implementation, and is 98% faster than the 4-thread FFfW and 1.04x faster than the 4-thread MKL.
Additionally, Figure 9 and Figure 10 show our large 2D FFf results on the Tesla C2070/C2075 with even larger problem sizes in single and double precision. On average, our single-precision 2D hybrid FFf achieves 37.2 GFLOPS on Tesla C2075 and 33.7 GFLOPS on Tesla C2070, which represent speedups of 26% and 24% over Gu's pure GPU implementation, 2.23x and 1.93x over the 4-thread FFfW, and 2.41 x and 2.09x over the 4-thread MKL, respectively.
For double precision, the performance is 19.1 GFLOPS and 17.8 GFLOPS on Tesla C2075 and C2070, which represent 29% and 28% speedups over Gu's pure GPU implementation, 2.08 x and 1.87 x speedups over the 4-thread FFfW and 2.24 x and 2.02x speedups over the 4-thread MKL.
Particularly notable is that as Y increases, the performance of both FFfW and MKL decreases rapidly because the data lo cality loses rapidly along the Y dimensional computation when Y increases. On the contrary, our hybrid FFf demonstrates a much more stable performance. performance, FFfW's and MKL's 3D perfonnance decrease quickly as Z increases due to the loss of data locality though MKL generally performs better than FFfW for large Zs. Our hybrid library generally maintains its good performance for the same large Z cases.
D. Accuracy of Our Hybrid FFT
The correctness of our hybrid FFT library is verified against FFfW and MKL. All three libraries are tested with the same input data randomly chosen from -0.5 to 0.5 and the difference in output is quantified as normalized RMSE over the whole data set. The normalized RMSE evaluates the relative degree of deviations and is a wildly used metric for numeric accuracy. The normalized RMSE is defined as The normalized RMSEs of single-and double-precision for both 2D and 3D FFTs are shown in Figure 15 and Figure 16 . As we can see the normalized RMSE is extremely small and is in the range from 2.41 x 10-07 to 3.18 X 10-07 for single precision and 5.82 x 10-16 to 8.02 X 10-16 for double precision. In other words, our hybrid FFT library produces the same accurate results as FFTW and MKL.
VI. CONCLUSION
In this paper, we proposed a hybrid FFT library that con currently uses both CPU and GPU to compute large FFT prob lems. The library has four key components: a decomposition paradigm that mixes two FFT algorithms to extract different types of computation and communication patterns for the two different processor types; an optimizer that exploits substantial parallelism for both GPU and CPUs; a load balancer that assigns workloads to both GPU and CPU, and determines the optimal load balancing by effective performance modeling; and a heuristic that empirically tunes the library to best tradeoff among communication, computation and their overlapping. Overall, our hybrid library outperforms several latest and widely used large-scale FFT implementations.
