Large-Scale Discrete Fourier Transform on TPUs
In this work, we present two parallel algorithms for the large-scale discrete
Fourier transform (DFT) on Tensor Processing Unit (TPU) clusters. The two
parallel algorithms are associated with two formulations of DFT: one is based
on the Kronecker product, to be specific, dense matrix multiplications between
the input data and the Vandermonde matrix, denoted as KDFT in this work; the
other is based on the famous Cooley-Tukey algorithm and phase adjustment,
denoted as FFT in this work. Both KDFT and FFT formulations take full advantage
of TPU's strength in matrix multiplications. The KDFT formulation allows direct
use of nonuniform inputs without an additional step. In the two parallel
algorithms, the same strategy of data decomposition is applied to the input
data. Through the data decomposition, the dense matrix multiplications in KDFT
and FFT are kept local within TPU cores, which can be performed completely in
parallel. The communication among TPU cores is achieved through the one-shuffle
scheme in both parallel algorithms, with which sending and receiving data takes
place simultaneously between two neighboring cores and along the same direction
on the interconnect network. The one-shuffle scheme is designed for the
interconnect topology of TPU clusters, minimizing the time required by the
communication among TPU cores. Both KDFT and FFT are implemented in TensorFlow.
The three-dimensional complex DFT is performed on an example of dimension with a full TPU Pod: the run time of KDFT is 12.66
seconds and that of FFT is 8.3 seconds. Scaling analysis is provided to
demonstrate the high parallel efficiency of the two DFT implementations on
TPUs.
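
As a rough illustration of the two formulations named in the abstract, the following single-device sketch expresses a DFT purely through dense matrix multiplications. It is not the authors' TensorFlow implementation: it uses NumPy, omits the data decomposition and the one-shuffle communication entirely, and the function names (dft_matrix, kdft_2d, fft_one_split) are hypothetical. The first routine applies the Vandermonde (DFT) matrix along each axis, which is the Kronecker-product form in two dimensions; the second performs one Cooley-Tukey split of a length n1*n2 transform, with the phase adjustment applied between two smaller matrix multiplications.

```python
import numpy as np


def dft_matrix(n):
    """Return the n x n DFT (Vandermonde) matrix with entries exp(-2*pi*i*j*k/n)."""
    idx = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / n)


def kdft_2d(x):
    """2D DFT as two dense matrix multiplications (Kronecker-product form).

    The DFT matrix is symmetric, so no transpose is needed on the right factor.
    """
    n0, n1 = x.shape
    return dft_matrix(n0) @ x @ dft_matrix(n1)


def fft_one_split(x, n1, n2):
    """1D DFT of length n1*n2 via a single Cooley-Tukey split.

    Input index is n = n2*a + b and output index is k = k1 + n1*k2,
    with a, k1 in [0, n1) and b, k2 in [0, n2).
    """
    n = n1 * n2
    a = x.reshape(n1, n2)                      # a[a_idx, b_idx] = x[n2*a_idx + b_idx]
    b = dft_matrix(n1) @ a                     # length-n1 DFTs down each column
    k1 = np.arange(n1)[:, None]
    b_idx = np.arange(n2)[None, :]
    b = b * np.exp(-2j * np.pi * k1 * b_idx / n)   # phase adjustment (twiddle factors)
    c = b @ dft_matrix(n2)                     # length-n2 DFTs along each row
    return c.T.reshape(-1)                     # place c[k1, k2] at index k1 + n1*k2
```

Both routines can be checked against a library FFT on small inputs, for example by comparing kdft_2d(x) with np.fft.fft2(x). The dense form costs O(n^2) operations per axis, which the abstract argues is acceptable on hardware optimized for matrix multiplication.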