In this report we analyze the performance of the fast Fourier transform (FFT) on graphics hardware (the GPU), comparing it to the best-of-class CPU implementation FFTW.WedescribetheFFT,thearchitectureoftheGPU,andhowgeneral-purpose computation is structured on the GPU. We then identify the factors that influence FFT performance and describe several experiments that compare these factors between the CPUandtheGPU.Weconcludethattheoverheadoftransferringdataandinitiating GPU computation are substantially higher than on the CPU, and thus for latencycritical applications, the CPU is a superior choice. We show that the CPU implementation is limited by computation and the GPU implementation by GPU memory bandwidth and its lack of a writable cache. The GPU is comparatively better suited for larger FFTs with many FFTs computed in parallel in applications where FFT throughputismostimportant;ontheseapplicationsGPUandCPUperformanceisroughly on par. We also demonstrate that adding additional computation to an application that includes the FFT, particularly computation that is GPU-friendly, puts the GPU at an advantage compared to the CPU