Abstract-The Fast Fourier Transform is a fundamental tool in scientific and technical computation. The highly parallelizable nature of the algorithm makes it a suitable candidate for GPU acceleration. This paper focuses on exploiting the speedup due to using the half precision multiplication capability of the latest GPUs' tensor core hardware without significantly degrading the precision of the Fourier Transform result. We develop an algorithm that dynamically splits the input single precision dataset into two half precision sets at the lowest level, uses half precision multiplication, and recombines the result at a later step. This work paves the way for using tensor cores for high precision inputs.
I. INTRODUCTION
The Fast Fourier Transform (FFT) is a widely used numerical algorithm that plays a vital role in many scientific and engineering applications. In large computational applications, including image processing, speech recognition, and large scale simulations, a majority of execution time is allotted to computing the FFT [1] [2] [3] . In order to improve performance of the FFT, many investigations have been made on implementing the FFT on the computationally superior Graphical Processing Unit (GPU) platform [4] .
Recently, half precision floating point arithmetic (FP16) is gaining popularity with its faster speed and energy saving ability. With the introduction of the tensor cores on the NVIDIA Volta GPU Hardware, a large speed up, up to 12x, in half precision matrix multiplications, has been introduced [5] . The FFT can benefit greatly from the advantages offered by tensor cores, as it is a matrix multiplication intensive algorithm.
Unfortunately, this half precision hardware cannot be exploited in scientific FFT applications where single precision (FP32) is required. In order to satisfy the accuracy requirement while utilizing the advanced half precision hardware, a mixed precision method utilizing dynamic splitting is developed. This method efficiently uses the computational capability of tensor cores without a significant drop in precision.
II. BACKGROUND AND RELATED WORK
In numerical computing, there have been many attempts to utilize the fast operations on low precision numbers to emulate high precision computation [6] [7] [8] . The more compact numerical representation better utilizes the memory space and eliminates memory communication costs between memory hierarchy. Besides memory advantage, low precision calculation tends to outperform high precision calculation for more than two times in terms of GFLOPS. Therefore, it is desirable to emulate the accuracy of a high precision number with two low-precision number. Previous works have presented various methods to mix double precision and single precision computation to optimize the performance, but there has been little discussion on how to exploit the high-speed half precision hardware.
III. FFT ALGORITHM
The Discrete Fourier Transform (DFT) converts a finite discrete signal in the time domain to a one in the frequency domain according to the following equation:
The inverse is given by:
Where x(n) is the discrete signal in the time domain, X[k] is the discrete signal in the frequency domain and N is the entire length of the sequence. The DFT can clearly be rewritten as a matrix multiplication with the number of computations required of the order O(N 2 ). The class of algorithms that efficiently calculate the DFT with a lower number of computations is known as FFT. Gentleman and Sande described the FFT algorithm that rewrote the length N sequence as N = n1 × n2 in order to compute the DFT with a lower number of computations [9] . This can be represented as:
Where x n1×n2 is the vector x reshaped as a matrix of n1 × n2 and F N is the Fourier matrix defined by
For convenience, the × operation is used to denote scalar multiplication and the · operation is used to denote matrix multiplication.
A. Adapting the Algorithm 1) Choosing the radix:
The real and imaginary Fourier matrices are defined as:
By choosing a radix of 4, or only allowing N = 4, we can observe that the elements of the real and imaginary Fourier matrix is either 1, 0, or −1. This is exactly representable in FP16 without loss of precision.
2) Elimination of a Few Transposes:
Large matrix transpositions are bulky operations limited by communication and memory bandwidth. To reduce the number of transpositions required, we may employ common matrix properties to simplify the FFT equation.
This can be further simplified by observing that the real and imaginary Fourier matrices of a length 4 sequence are symmetrical. Therefore,
This simplification reduces the number of transpositions required from 3 to 1. But, this introduces a complexity in the code; the FFTs computed in step 3 and 6 cannot be calculated in an identical fashion. The order of matrix multiplication is interchanged in the two steps. However, this complication is preferable to computing extra transpositions.
B. Adapted FFT Algorithm
Keeping the previous adaptations in mind, the implemented FFT algorithm is as follows: 
IV. DYNAMIC SPLITTING
In order to exploit the throughput of the tensor cores, a mixed precision approach is developed. This method ensures that only the matrix multiplication operations are done on the half precision input dataset but the rest of the FFT algorithm operates on the single precision dataset.
At the lowest level of the FFT algorithm, the single precision data sequence is converted into two half precision datasets. Every FP32 number is expressed as a scaled sum of two FP16 numbers. As the FFT is a linear algorithm, Length-4 FFTs are applied separately to the half precision datasets and recombined. This splitting operation is called twice, right before the FFT matrix multiplication in step 3 of the adapted FFT algorithm and before the FFT matrix multiplication in step 5 of this algorithm.
In order to retain as much accuracy as possible, a dynamic splitting algorithm is employed. Scaling vectors, s 1 and s 2 are utilized to minimize the error caused by the FP32 to FP16 conversion. These scaling factors are determined for each column of the input matrix and are single precision numbers.
Note that if x fp32 is already well-scaled, then s1 is about O (1) and s2 about O(10 −3 ) so that x1 fp16 represents the first 3 significant digits, and x2 fp16 the next 3 significant digits.
A. Dynamic Splitting Algorithm:
Step 1: Find the absolute norm of each column of the input matrix to decide s 1 and divide the respective column by the scaling factor
Step 2: Convert the input FP32 matrix x fp32 into FP16 matrix x1 fp16
Step 3: Calculate the residual error caused by conversion and store as x2 fp32
Step 4: Find the absolute norm of each column of the residual matrix to decide s 2 and divide the respective column by the scaling factor
Step 5: Convert the residual FP32 matrix x2 fp32 into FP16 matrix x2 fp16
In order to avoid mathematical round off error, the scaling factors, s 1 and s 2 , are chosen to be powers of 2. In this case then multiplication or division may be done by simply shifting the exponent bit pattern, leaving the mantissa unmodified.
V. IMPLEMENTATION
Our experimental platform is a heterogeneous processor consisting of a CPU and a GPU. We implemented the FFT algorithm as described in sections 3 and 4 on an NVIDIA Volta V100 graphics card. Implementing the FFT on the graphics card is a relatively straightforward process simplified by utilizing the commonly used cuBLAS library API. The algorithm consists of 3 major arithmetic operations: splitting FP32 numbers into two FP16 numbers, transposing matrices, and multiplying matrices. Customized kernels are written for the splitting operation and the transpose operation. The matrix multiplication is computed using the CublasGemmEx and CublasGemmStridedBatch functions.
Of the three operations, only the matrix multiplication operation utilizes the tensor core hardware. We expect the overhead due to splitting the input is not greater than the speedup offered by computing the matrix multiplication on the computationally superior tensor cores.
VI. EXPERIMENTAL RESULTS
In this section, we evaluate our FFT implementation by comparing it with cuFFT library APIs. We run experiments to test their performance under various configurations. The problem size, i.e. the number of elements in a single vector, increases from 4 to 65536 1 . We also test the multi-vector (batched) cases and the batch size is from 1 to 1024. The input data are randomly generated and are restricted to a given range.
The major performance metrics in the evaluation are speed and accuracy. We choose not to compare the FLOPS as the algorithm used in cuFFT APIs are automatically determined, and the number of operations varies among different algorithms. We make use of CUDA Event to mark the beginning and completion of computation to measure the execution time. The error of our implementation and the half-precision cuFFT is assessed based on the results of single-precision cuFFT. We normalize the error by the input range: Figure 1 compares the accuracy of two half-precision FFT: our implementation and cuFFT that takes 16-bit inputs, with the 32-bit cuFFT taken as reference. This figure is on a logarithmic scale. 100 dynamically generated sequences were fed into the FFT implementations and each error was averaged. The error statistics indicate that our implementation reduces the computation error significantly. It preserves high accuracy even with growing input sizes, which may qualify it for many scientific applications.
An important aspect to consider is the execution time of the matrix multiplication as compared to the splitting and subsequent recombining of the FP32 numbers. If the overhead due to splitting is larger than the speed up due to the tensor core multiplication, then the mixed precision method would not achieve and benefit over the normal method. Figure 2 shows that this is not the case. As the input size increases, the execution time of the splitting and recombining operations stays relatively low as compared to the GEMM (General Matrix Multiply) matrix multiplication. This implies utilizing the tensor cores would provide speed up to the FFT algorithm despite the overhead incurred due to mixed precision.
Lastly, the execution time of our dynamic splitting implementation is studied. We also noted that our implementation is currently inferior to the highly optimized cuFFT library. However, from the previous results, we infer using this mixed precision method with optimized kernels would achieve lower execution time.
VII. CONCLUSION AND FUTURE WORK
We have designed and implemented a FP32-FP16 mixed precision FFT that takes advantage of the recent tensor core hardware. The dynamic splitting method effectively emulates single-precision calculation and produces highly accurate results from a variety of inputs. The speed of current cuBLASbased implementation is inferior to cuFFT APIs, but we expect it to gain an advantage with larger input size as the tensor core can be fully utilized and the setup cost can be amortized.
The current implementation can handle a large variety of inputs. The relative error exceeds 0.1 or the program throws an error when: input data range greater than 3E10; or input data range less than 5E-11 (close to the dynamic range of single precision number). The range may be further enlarged by pre-scaling all input numbers.
Our GPU-based mixed precision FFT implementation can easily be applied to many areas of scientific computing where increased accuracy is desired.
There are several interesting directions for further optimizations. Many operations can be tuned specifically for the problem size involved in the FFT calculation. The time spent on 16-bit GEMM grows quickly when input size exceeds 16384. This may due to the inefficient cuBLAS implementation for the problem set. This may be improved by implementing transpose and GEMM kernels. Another direction is to design an auto-tuning splitting algorithm that supports ill-conditioned inputs, and further optimizes the splitting overhead. A more sophisticated splitting algorithm may be designed. This could be done using bit manipulations. Lastly, our implementation of matrix transpose kernel has yet to take advantage of the shared memory. It can be accelerated by applying the tiled design.
