In this work, we present a parallel algorithm for the large-scale discrete Fourier transform (DFT) on Tensor Processing Unit (TPU) clusters. The algorithm is implemented in TensorFlow because of its rich set of functionalities for scientific computing and its simplicity in realizing parallel computing algorithms. The DFT formulation is based on matrix multiplications between the input data and the Vandermonde matrix. This formulation takes full advantage of the TPU's strength in matrix multiplications and allows nonuniformly sampled input data without modifying the implementation. For parallel computing, both the input data and the Vandermonde matrix are partitioned and distributed across TPU cores. Through the data decomposition, the matrix multiplications are kept local within TPU cores and can be performed completely in parallel. The communication among TPU cores is achieved through the one-shuffle scheme, with which sending and receiving data takes place simultaneously between two neighboring cores and along the same direction on the interconnect network. The one-shuffle scheme is designed for the interconnect topology of TPU clusters and requires minimal communication time among TPU cores. Numerical examples are used to demonstrate the high parallel efficiency of the large-scale DFT on TPUs.
I. INTRODUCTION
The discrete Fourier transform (DFT) is critical in many scientific and engineering applications, including time series and waveform analyses, convolution and correlation computations, solutions to partial differential equations, density functional theory in first-principles calculations, spectrum analysis, synthetic aperture radar, computed tomography, magnetic resonance imaging, and derivatives pricing [1]-[4]. However, the computation efficiency of the DFT is often a formidable bottleneck in handling large-scale problems due to large data sizes and real-time processing requirements [5], [6]. In general, efforts on enhancing the computation efficiency of the DFT fall into two categories: seeking fast algorithms and adapting the algorithms to parallel computing. One breakthrough in the fast-algorithm category is the Cooley-Tukey algorithm [7], also known as the fast Fourier transform (FFT), which reduces the complexity of the N-point DFT from O(N^2) to O(N log N). The Cooley-Tukey algorithm with the number of data points restricted to a power of two is known as the Radix-2 algorithm, which was followed by the Mixed-Radix [3] and Split-Radix [8] algorithms. In addition to the fast algorithms, the performance of parallel computing has been steadily driving the efficiency enhancement of DFT computation: the first implementation of the FFT algorithm was realized on the ILLIAC IV parallel computer [9], [10]; over the years, the DFT computation has been adapted to both shared-memory [11], [12] and distributed-memory architectures [13]-[17].
The advancement of hardware accelerators has enabled massive parallelization of the DFT computation. One such example is deploying the FFT computation on manycore processors [18]. Another example is implementing the FFT algorithm on clusters of graphics processing units (GPUs) [19]. A GPU cluster contains a number of nodes (machines), and within each node, GPUs are connected through PCIe, a high-speed serial interface. The Cooley-Tukey algorithm and its variants often require a large number of memory accesses per arithmetic operation, such that the bandwidth limitation of PCIe becomes the bottleneck of the overall FFT performance on GPU clusters. Prior to the recent development of novel high-speed interconnects such as NVLink [20], [21], many efforts related to GPU-accelerated DFT computation were devoted to minimizing the PCIe transfer time [3], [22]. It is worth mentioning that the route of algorithm-hardware codesign has also been taken with Field Programmable Gate Arrays (FPGAs) to optimize the configurations of a customized hardware accelerator for high-performance DFT computation [23]-[25].
The recent success of machine learning (ML), or deep learning (DL) in particular, has spurred a new wave of hardware accelerators. In many ML applications, it becomes increasingly challenging to balance the performance, cost, and energy of processors as the data grows. Domain-specific hardware is considered a promising approach to address this challenge [26]. One example of domain-specific hardware is Google's Tensor Processing Unit (TPU) [27]. As a reference, TPU v3 provides 480 teraflops and 128 GiB of high-bandwidth memory (HBM) [28]. Having witnessed how the DFT computation benefits from the development of hardware accelerators, it is tempting to ask whether TPUs can empower large-scale DFT computation. It is plausible for the following four reasons. First, the TPU is an ML application-specific integrated circuit (ASIC) devised for neural networks (NNs). NNs require massive amounts of multiplications and additions between data and parameters, and the TPU handles these computations very efficiently in the form of matrix multiplications [29]. Similarly, the DFT can also be formulated as matrix multiplications between the input data and the Vandermonde matrix. Second, TPU chips are connected directly to each other with dedicated high-speed interconnects of very low latency, without going through a host CPU or utilizing networking resources. Therefore, the large-scale DFT computation can be distributed among multiple TPUs with minimal communication time and hence very high parallel efficiency. Third, the large capacity of the in-package memory of the TPU makes it possible to handle large-scale DFT efficiently. Fourth, the TPU is programmable with software front ends such as TensorFlow [30] and PyTorch [31]. TensorFlow offers a rich set of functionalities for scientific computing, and the TensorFlow TPU programming stack can express parallel computing algorithms with simple and easy-to-understand code without sacrificing performance, both of which make it straightforward to implement the parallel computing of DFT on TPUs. In fact, all four of the aforementioned reasons have been verified in high-performance Monte Carlo simulations on TPUs [32], [33].
In this work, we present the implementation of large-scale DFT on TPUs. The implementation is in TensorFlow. The DFT formulation is based on matrix multiplications between the input data and the Vandermonde matrix, with a complexity of O(N^2) for the N-point DFT. This formulation takes full advantage of the TPU's strength in matrix multiplications and allows nonuniformly sampled input data without modifying the implementation. The nonuniform DFT has important applications in signal processing, medical imaging, numerical solutions of partial differential equations, and machine learning [34]-[37]. For parallel computing, both the input data and the Vandermonde matrix are partitioned and distributed across TPU cores. Through the data decomposition, the matrix multiplications are kept local within TPU cores and can be performed completely in parallel. The communication among TPU cores is achieved through the one-shuffle scheme, with which sending and receiving data takes place simultaneously between two neighboring cores and along the same direction on the interconnect network. The one-shuffle scheme is designed for the interconnect topology of TPU clusters and requires minimal communication time among TPU cores. Numerical examples are used to demonstrate the high parallel efficiency of the large-scale DFT on TPUs.
II. TPU SYSTEM ARCHITECTURE
A. Hardware architecture
Understanding the advantages of deploying the large-scale DFT computation on TPUs cannot be separated from the TPU hardware architecture. Each TPU core contains a matrix multiply unit (MXU) that performs the multiply-accumulate (MAC) operations, with the multiplications carried out in the reduced-precision bfloat16 format [38]. However, one float32 number can be decomposed into multiple bfloat16 numbers and, with appropriate accumulations, a high-precision MAC operation can be achieved [39]. The DFT computation in this work leverages this strategy of decomposition and accumulation and achieves float32 precision. As shown in Fig. 1(b), each TPU core has 16 GiB of high-bandwidth memory (HBM). The large capacity of in-package memory makes it possible to solve large-scale problems in a highly efficient manner. The TPU is designed as a coprocessor on the I/O bus: each board shown in Fig. 1(a) is paired with one host server consisting of CPU, RAM, and hard disks, and the TPU executes the instructions sent from the CPU of the host server through PCIe.
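To illustrate the decomposition-and-accumulation strategy, the sketch below emulates bfloat16 in numpy by truncating the float32 mantissa, splits each float32 operand into a high and a low bfloat16-representable part, and accumulates the partial products in float32. This is only an illustration of the idea under simplifying assumptions (true bfloat16 rounds to nearest even, and the MXU performs these operations in hardware), not the TPU's internal implementation.

```python
import numpy as np

def to_bf16(x):
    """Emulate bfloat16 by zeroing the lower 16 bits of the float32 representation."""
    x = np.asarray(x, dtype=np.float32)
    bits = x.view(np.uint32) & np.uint32(0xFFFF0000)
    return bits.view(np.float32)

def split_f32(x):
    """Split a float32 array into a high and a low bfloat16-representable part."""
    hi = to_bf16(x)
    lo = to_bf16(np.asarray(x, dtype=np.float32) - hi)
    return hi, lo

rng = np.random.default_rng(0)
a = rng.standard_normal(1000).astype(np.float32)
b = rng.standard_normal(1000).astype(np.float32)

a_hi, a_lo = split_f32(a)
b_hi, b_lo = split_f32(b)

# Single bfloat16 product vs. decomposed products accumulated in float32.
naive = to_bf16(a) * to_bf16(b)
# The a_lo * b_lo term is dropped; its contribution is far below the bfloat16 truncation error.
accurate = a_hi * b_hi + a_hi * b_lo + a_lo * b_hi

print(np.max(np.abs(naive - a * b)), np.max(np.abs(accurate - a * b)))
```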
In a Pod configuration, TPU chips are connected through dedicated high-speed, low-latency interconnects. Figure 2 shows a TPU v3 Pod in a data center, in which a total of 2048 cores are connected to one another. The interconnect topology is a two-dimensional (2D) toroidal mesh, with each chip connected to its four nearest neighbors, so that communication takes place in four directions. Because the interconnects are on the device, the communication among TPU cores does not go through a host CPU or any networking resource. To achieve high parallel efficiency, the communication time needs to be minimized, which in turn requires designing the communication strategy in accordance with the TPU interconnect topology. The load balancing and the corresponding one-shuffle communication scheme employed in the DFT computation are explained in the following section.
B. Software architecture
We program TPUs through TensorFlow. Running a program on TPUs takes four steps, as illustrated in Fig. 3. First, the TensorFlow client generates the computational graph and sends it to a TensorFlow server. Second, the TensorFlow server partitions the computational graph into portions that run on TPU and CPU, respectively; if multiple TPUs are to be employed, the graph is marked for replication. Third, the TensorFlow server compiles the sub-graph that runs on TPUs into an HLO program and invokes the accelerated linear algebra (XLA) compiler. Last, the XLA compiler takes in the HLO program and converts it into an LLO program, which is effectively the assembly code for TPUs. Both the generation and the compilation of the computational graph occur on the host server. The compiled LLO code is loaded onto TPUs for execution from the host server through PCIe.
The memory usage of a TPU is determined at compile time. Because both the hardware structure of the MXU and the memory subsystem on a TPU core prefer certain shapes for the tensor operands of an operation, XLA performs data layout transformations so that the hardware can process the operation efficiently. If a tensor variable does not align with the preferred shape, XLA pads it with zeros along one dimension to a multiple of eight and along the other dimension to a multiple of 128. Zero padding under-utilizes the TPU core and leads to sub-optimal performance, which should be taken into account in the load balancing for the parallel computing of DFT on TPUs.
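As a small illustration of the padding behavior described above, the helper below rounds a 2D tensor shape up to multiples of 8 and 128 and reports the fraction of the padded tensor occupied by actual data. The tile sizes are taken from the text; the helper itself is hypothetical and not part of XLA.

```python
import math

def padded_shape(rows, cols, row_tile=8, col_tile=128):
    """Round a 2D tensor shape up to the tile multiples preferred by the layout."""
    return (math.ceil(rows / row_tile) * row_tile,
            math.ceil(cols / col_tile) * col_tile)

def utilization(rows, cols):
    """Fraction of the padded tensor that holds actual data (1.0 means no padding waste)."""
    pr, pc = padded_shape(rows, cols)
    return (rows * cols) / (pr * pc)

print(padded_shape(1000, 1000), utilization(1000, 1000))  # columns padded to 1024
print(padded_shape(1024, 1024), utilization(1024, 1024))  # aligned shape, no waste
```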
III. DFT FORMULATION
In this section, we formulate the DFT as matrix multiplications, through the Kronecker product [40], between the input data and the Vandermonde matrix. This formulation fully utilizes the strength of the TPU, in particular the MXU, in matrix multiplications. In addition, it can be applied to nonuniformly sampled data [34], [35] without additional modification to the implementation.
A. Matrix formulation
The general form of the DFT is defined as

$$X_k \triangleq \sum_{n=0}^{N-1} x_n z_k^{-n}, \quad k = 0, 1, \ldots, N-1, \qquad (1)$$

where $\triangleq$ denotes "defined to be", $x_n$ represents the input, and $\{z_k\}_{k=0}^{N-1}$ are $N$ distinctly and arbitrarily sampled points on the $z$-plane. Equation (1) can be rewritten into the matrix form

$$[X] = [V][x], \qquad (2)$$

where

$$[X] = [X_0, X_1, \ldots, X_{N-1}]^T, \quad [x] = [x_0, x_1, \ldots, x_{N-1}]^T,$$

and

$$[V] = \begin{bmatrix} 1 & z_0^{-1} & \cdots & z_0^{-(N-1)} \\ 1 & z_1^{-1} & \cdots & z_1^{-(N-1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & z_{N-1}^{-1} & \cdots & z_{N-1}^{-(N-1)} \end{bmatrix}.$$

Note that $[V]$ is the Vandermonde matrix of dimension $N \times N$.

Computing the inverse DFT is equivalent to solving the linear system in Equation (2). When $\{z_k\}_{k=0}^{N-1}$ are uniformly sampled on the unit circle of the $z$-plane, i.e., $z_k = e^{j 2\pi k / N}$, the Vandermonde matrix $[V]$ contains the roots of unity and becomes unitary up to the normalization factor $1/\sqrt{N}$.
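As a minimal illustration of Equation (2), the numpy sketch below builds the Vandermonde matrix for arbitrary sample points $z_k$ and applies it to the input. For uniform sampling on the unit circle the result coincides with np.fft.fft, and the same code accepts nonuniformly sampled points without modification. This is an illustrative sketch, not the TPU implementation.

```python
import numpy as np

def dft_matrix(z):
    """Vandermonde matrix [V] with entries V[k, n] = z_k^{-n}, as in Equation (2)."""
    z = np.asarray(z, dtype=np.complex128)
    n = np.arange(len(z))
    return z[:, None] ** (-n[None, :])

def dft(x, z):
    """DFT of x evaluated at the (possibly nonuniform) sample points z."""
    return dft_matrix(z) @ np.asarray(x, dtype=np.complex128)

N = 256
x = np.random.default_rng(0).standard_normal(N)

# Uniform sampling on the unit circle recovers the ordinary DFT.
z_uniform = np.exp(2j * np.pi * np.arange(N) / N)
print(np.allclose(dft(x, z_uniform), np.fft.fft(x)))  # True

# Nonuniform sampling: the same code runs unchanged.
z_nonuniform = np.exp(2j * np.pi * np.sort(np.random.default_rng(1).random(N)))
X_nonuniform = dft(x, z_nonuniform)
```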
The general form of a 2D DFT can be written as

$$X_{k_1 k_2} \triangleq \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x_{n_1 n_2}\, z_{1 k_1}^{-n_1} z_{2 k_2}^{-n_2}, \qquad (4)$$

where $[x]$ has the dimension of $N_1 \times N_2$ and $\{(z_{1k}, z_{2k})\}$ represents the set of distinctly and arbitrarily sampled points in the $(z_1, z_2)$ space. It is worth mentioning that the sampling with $(z_{1k}, z_{2k})$ has to ensure the existence of the inverse DFT. If the sampling is performed on rectangular grids, Equation (4) can be rewritten into the matrix form

$$[X] = [V_1]\,[x]\,[V_2]^T, \qquad (5)$$

where $[V_1]$ and $[V_2]$ are built from $\{z_{1k}\}$ and $\{z_{2k}\}$ in the same way as $[V]$ in Equation (2). Note that both $[V_1]$ and $[V_2]$ are Vandermonde matrices of dimensions $N_1 \times N_1$ and $N_2 \times N_2$, respectively. One can further lump $[x]$ into a vector $[\bar{x}]$ (row-major, i.e., by stacking the rows of $[x]$) such that Equation (5) can be written into the same matrix form as Equation (2), $[\bar{X}] = [V][\bar{x}]$, in which $[V]$ for the 2D DFT is expressed as

$$[V] = [V_1] \otimes [V_2],$$

where $\otimes$ denotes the Kronecker product [40]. Similarly, the three-dimensional (3D) DFT defined by

$$X_{k_1 k_2 k_3} \triangleq \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} \sum_{n_3=0}^{N_3-1} x_{n_1 n_2 n_3}\, z_{1 k_1}^{-n_1} z_{2 k_2}^{-n_2} z_{3 k_3}^{-n_3} \qquad (11)$$

can be rewritten into the matrix form

$$[\bar{X}] = \left([V_1] \otimes [V_2] \otimes [V_3]\right) [\bar{x}]. \qquad (12)$$

For the 3D DFT in Equation (12), the sampling is performed on rectangular grids in the $(z_1, z_2, z_3)$ space and the Vandermonde matrix $[V_3]$ has the dimension of $N_3 \times N_3$. It can be seen that the Kronecker product bridges the gap between the matrix and tensor operations, through which the contraction between a rank-2 tensor and a rank-3 tensor in the 3D DFT can be formulated as matrix multiplications. The DFT formulation with the Kronecker product can be easily extended to higher dimensions.
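As a concrete check of the formulation, the numpy sketch below (an illustration, not the TPU implementation) evaluates the 3D DFT of Equation (11) as a single einsum contraction with the Vandermonde matrices, and verifies that it agrees with the Kronecker-product form of Equation (12) on the lumped (row-major flattened) vector and, for uniform sampling, with np.fft.fftn.

```python
import numpy as np

def vandermonde(z):
    """[V] with entries V[k, n] = z_k^{-n}."""
    z = np.asarray(z, dtype=np.complex128)
    n = np.arange(len(z))
    return z[:, None] ** (-n[None, :])

N1, N2, N3 = 8, 9, 10
rng = np.random.default_rng(0)
x = rng.standard_normal((N1, N2, N3)) + 1j * rng.standard_normal((N1, N2, N3))

# Uniform sampling along each dimension (roots of unity).
V1 = vandermonde(np.exp(2j * np.pi * np.arange(N1) / N1))
V2 = vandermonde(np.exp(2j * np.pi * np.arange(N2) / N2))
V3 = vandermonde(np.exp(2j * np.pi * np.arange(N3) / N3))

# Contraction of the rank-3 input with the three Vandermonde matrices (Equation (11)).
X = np.einsum('ka,lb,mc,abc->klm', V1, V2, V3, x)

# Equivalent Kronecker-product form on the lumped, row-major flattened vector (Equation (12)).
X_kron = (np.kron(np.kron(V1, V2), V3) @ x.reshape(-1)).reshape(N1, N2, N3)

print(np.allclose(X, X_kron), np.allclose(X, np.fft.fftn(x)))  # True True
```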
IV. PARALLEL IMPLEMENTATION ON TPUS
In this section, we take the 3D DFT as an example to illustrate the data decomposition and the corresponding oneshuffle scheme in the parallel computing of DFT on TPUs.
Data decomposition is applied to the input data along all three dimensions. The decomposition is based on a TPU computation shape $(P_1, P_2, P_3)$, where $P_1$, $P_2$, and $P_3$ denote the number of TPU cores along the first, the second, and the third dimension, respectively. Given the TPU computation shape $(P_1, P_2, P_3)$ and the input data of dimension $N_1 \times N_2 \times N_3$, after the data decomposition each TPU core contains a data block of dimension $N_1/P_1 \times N_2/P_2 \times N_3/P_3$, as shown in Fig. 4(a). The partition of the Vandermonde matrix is one-way and along the frequency domain. As shown in Fig. 4(b), each core holds a slice of the Vandermonde matrix of dimension $N_i/P_i \times N_i$, $i = 1, 2, 3$.

Two major operations are employed in the implementation: the tensor contraction between the Vandermonde matrix and the input data, and the communication among TPU cores to send and receive the blocks of input data. The tensor contraction is based on einsum, and the communication among TPU cores is carried out with collective_permute through the one-shuffle scheme.

Figure 5 illustrates the one-shuffle scheme. We use $c_0$, $c_1$, and $c_2$ to denote three adjacent TPU cores, the operations on which are highlighted with three different colors in Fig. 5. After the data decomposition, core $c_0$ contains a block of the input data $x_{01}$ and a slice of the Vandermonde matrix $[V_{00}, V_{01}, V_{02}]$, core $c_1$ contains $x_{11}$ and $[V_{10}, V_{11}, V_{12}]$, and core $c_2$ contains $x_{21}$ and $[V_{20}, V_{21}, V_{22}]$.
With three einsum and two collective_permute operations, one obtains the partial DFT written as $V_{00} \cdot x_{01} + V_{01} \cdot x_{11} + V_{02} \cdot x_{21}$ on core $c_0$, where $\cdot$ represents the tensor contraction. The steps taken by the partial DFT computation along one dimension are as follows:

1. apply einsum to compute $V_{00} \cdot x_{01}$ on core $c_0$, $V_{11} \cdot x_{11}$ on core $c_1$, and $V_{22} \cdot x_{21}$ on core $c_2$, as shown in Fig. 5(a);
2. apply collective_permute to shuffle the input blocks between neighboring TPU cores with source-target pairs $(c_1, c_0)$, $(c_2, c_1)$, and $(c_0, c_2)$, in the form of (source, target), such that core $c_0$ contains $x_{11}$, core $c_1$ contains $x_{21}$, and core $c_2$ contains $x_{01}$, as shown in Fig. 5(b);
3. apply einsum to compute $V_{01} \cdot x_{11}$ on core $c_0$, $V_{12} \cdot x_{21}$ on core $c_1$, and $V_{20} \cdot x_{01}$ on core $c_2$, and add the results to those from step 1;
4. apply collective_permute with source-target pairs $(c_1, c_0)$, $(c_2, c_1)$, and $(c_0, c_2)$, after which core $c_0$ contains $x_{21}$, core $c_1$ contains $x_{01}$, and core $c_2$ contains $x_{11}$, as shown in Fig. 5(c);
5. apply einsum to compute $V_{02} \cdot x_{21}$ on core $c_0$, $V_{10} \cdot x_{01}$ on core $c_1$, and $V_{21} \cdot x_{11}$ on core $c_2$, and add the results to those from step 3.

Fig. 5: The one-shuffle scheme illustrated with the example of the DFT along one dimension in the 3D case. We use $c_0$, $c_1$, and $c_2$ to denote three adjacent cores, the operations on which are highlighted with blue, yellow, and green, respectively. The data decomposition results in the block of the input data $x_{01}$ and the slice of the Vandermonde matrix $[V_{00}, V_{01}, V_{02}]$ on core $c_0$, $x_{11}$ and $[V_{10}, V_{11}, V_{12}]$ on core $c_1$, and $x_{21}$ and $[V_{20}, V_{21}, V_{22}]$ on core $c_2$. The steps involved in the one-shuffle scheme are: (a) computing $V_{00} \cdot x_{01}$ on core $c_0$, $V_{11} \cdot x_{11}$ on core $c_1$, and $V_{22} \cdot x_{21}$ on core $c_2$, with $\cdot$ representing the tensor contraction; (b) collectively permuting the inputs between two neighboring cores such that $x_{11}$ resides on core $c_0$, $x_{21}$ on core $c_1$, and $x_{01}$ on core $c_2$, and computing $V_{01} \cdot x_{11}$ on core $c_0$, $V_{12} \cdot x_{21}$ on core $c_1$, and $V_{20} \cdot x_{01}$ on core $c_2$; (c) collectively permuting the inputs such that $x_{21}$ resides on core $c_0$, $x_{01}$ on core $c_1$, and $x_{11}$ on core $c_2$, and computing $V_{02} \cdot x_{21}$ on core $c_0$, $V_{10} \cdot x_{01}$ on core $c_1$, and $V_{21} \cdot x_{11}$ on core $c_2$.
The DFT computations along the other two dimensions in the 3D case follow similar steps. With the one-shuffle scheme, sending and receiving data takes place simultaneously between two neighboring cores and along the same direction on the 2D toroidal network. The one-shuffle scheme thus makes the best use of the interconnect topology and hence its bandwidth. Together with the high-speed and low-latency nature of the interconnects, the one-shuffle scheme requires minimal communication time. In addition, all tensor contractions remain local within individual TPU cores and are computed completely in parallel. Therefore, the proposed implementation of DFT on TPUs can achieve very high parallel efficiency.
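For illustration, the scheme above can be emulated on a single machine. The numpy sketch below mimics three cores along one dimension: each simulated core holds a slice of the Vandermonde matrix and one block of the input, performs its local einsum contraction, and passes its block to the neighbor in the same direction, which stands in for collective_permute. This is a simulation of the communication pattern under hypothetical sizes, not the TensorFlow TPU code.

```python
import numpy as np

P = 3                      # number of simulated cores along this dimension
N = 6                      # global size along the transformed dimension (N/P rows per block)
rng = np.random.default_rng(0)
z = np.exp(2j * np.pi * np.arange(N) / N)            # uniform sampling, for the final check
V = z[:, None] ** (-np.arange(N)[None, :])            # global Vandermonde matrix
x = rng.standard_normal((N, 4))                        # input: N rows, 4 trailing columns

# Data decomposition: core i holds the row slice V[i*N/P:(i+1)*N/P, :] (split along the
# frequency domain) and the input block x_i of N/P rows.
V_slices = [V[i * N // P:(i + 1) * N // P, :] for i in range(P)]
V_blocks = [np.split(Vs, P, axis=1) for Vs in V_slices]   # V_blocks[i][j] is V_ij
x_blocks = np.split(x, P, axis=0)                         # x_blocks[i] lives on core i

partial = [np.zeros((N // P, 4), dtype=complex) for _ in range(P)]
held = list(range(P))       # held[i]: index of the x block currently residing on core i
for step in range(P):
    # Local einsum on every core; on real hardware these run completely in parallel.
    for i in range(P):
        j = held[i]
        partial[i] += np.einsum('kn,nm->km', V_blocks[i][j], x_blocks[j])
    if step < P - 1:
        # collective_permute: every core passes its block to the neighbor in the same
        # direction, i.e., source-target pairs (c1, c0), (c2, c1), (c0, c2) for P = 3.
        held = [held[(i + 1) % P] for i in range(P)]

print(np.allclose(np.vstack(partial), V @ x))  # True: the shuffled partial sums equal [V][x]
```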
V. PARALLEL EFFICIENCY ANALYSIS
In this section, numerical examples are provided to demonstrate the parallel efficiency of the DFT on TPUs for both the 2D and 3D cases. Strong scaling is adopted in the parallel efficiency analysis, in which the problem size is kept fixed as proportionally more TPU cores are employed.
A. 2D DFT
Figure 6 shows the computation time of the 2D DFT with up to 128 TPU cores on an example of dimension 8192 × 8192. It can be seen from Fig. 6 that a near-linear scaling of the computation time with respect to the number of TPU cores is achieved. As a reference, the ideal computation time is also provided in Fig. 6, which is defined by

$$T_{\mathrm{ideal}} = \frac{2\, T_2}{N_{\mathrm{core}}},$$

where $T_2$ denotes the total computation time with two TPU cores and $N_{\mathrm{core}}$ is the total number of TPU cores being used. As mentioned in the parallel implementation, the total computation time consists of two parts: the time of matrix multiplications, or einsum to be specific, and the communication time of sending and receiving the blocks of input data across TPU cores. It can be seen from Fig. 6 that the time of matrix multiplications scales linearly with respect to the total number of TPU cores. This is because the matrix multiplications are kept completely local within individual cores. The total computation time of the 2D DFT scales approximately linearly with respect to the number of TPU cores, with the gap between the total and the ideal computation time arising from the communication among TPU cores.
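The ideal-time definition above translates directly into a small helper for computing parallel efficiency from measured timings; the timings below are placeholders for illustration, not the measurements reported in Fig. 6.

```python
def ideal_time(t_ref, cores_ref, cores):
    """Ideal strong-scaling time relative to a reference measurement."""
    return t_ref * cores_ref / cores

def parallel_efficiency(t_ref, cores_ref, t_measured, cores):
    """Ratio of ideal to measured time; 1.0 corresponds to perfect linear scaling."""
    return ideal_time(t_ref, cores_ref, cores) / t_measured

# Placeholder timings in seconds keyed by core count (NOT the values behind Fig. 6).
measured = {2: 1.00, 8: 0.26, 32: 0.070, 128: 0.020}
for cores, t in measured.items():
    print(cores, round(parallel_efficiency(measured[2], 2, t, cores), 2))
```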
B. 3D DFT
The parallel efficiency of the 3D DFT is demonstrated through an example of dimension 2048 × 2048 × 2048. Similar to the 2D case, the problem size is fixed as proportionally more TPU cores are employed. The total computation time is depicted in Fig. 7. As a reference, the ideal computation time is provided in Fig. 7, which is defined by

$$T_{\mathrm{ideal}} = \frac{32\, T_{32}}{N_{\mathrm{core}}},$$
where $T_{32}$ denotes the total computation time with 32 TPU cores. It can be seen from Fig. 7 that the total computation time scales approximately linearly with respect to the number of TPU cores. The gap between the total and the ideal computation time in the 3D case also results from the communication among TPU cores. As mentioned in the parallel implementation, the data decomposition is applied to the input data along all three dimensions with a TPU computation shape. The computation shape in this example has the form of $(4, 4, n_2)$, with four TPU cores along the first dimension, four along the second dimension, and $n_2$ along the third dimension; it is the number of TPU cores along the third dimension that varies in Fig. 7. For example, the computation shapes $(4, 4, 16)$ and $(4, 4, 32)$ correspond to 256 and 512 TPU cores, respectively.

As the number of TPU cores along the third dimension doubles, the size of the input-data block contained on each core is reduced by half. As a result, the computation time associated with a single collective_permute or einsum operation is also reduced by half, as shown in Fig. 8. However, as more cores are used, the total number of collective_permute operations increases. For example, a total of 31 collective_permute operations is required for the Fourier transform along the third dimension in the case of 512 TPU cores, or the TPU computation shape $(4, 4, 32)$, whereas only 15 collective_permute operations are required in the case of 256 TPU cores, or the shape $(4, 4, 16)$. Even though the time associated with a single collective_permute operation decreases when more TPU cores are used, the total communication time for the DFT along the third dimension stays approximately the same. This explains the gap between the total and the ideal computation time in Fig. 7.
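This effect can be captured with a back-of-the-envelope model, sketched below with hypothetical constants: the time per shuffle is proportional to the block size, which shrinks as 1/P along the partitioned dimension, while the number of shuffles grows as P - 1, so the total shuffle time approaches a constant as P increases.

```python
def shuffle_time_along_dim(n_elems, p, time_per_elem=1.0):
    """Total collective_permute time along one dimension: (p - 1) shuffles,
    each moving a block of n_elems / p elements (hypothetical unit cost)."""
    return (p - 1) * (n_elems / p) * time_per_elem

# Doubling the cores along the third dimension halves the per-shuffle time but
# nearly doubles the number of shuffles, so the total barely changes.
for p in (16, 32, 64):
    print(p, shuffle_time_along_dim(1.0, p))
```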
In addition to the strong scaling analysis, the computation time of a few 3D DFT examples on a full TPU v3 Pod with 2048 cores is provided in Table I. As the problem size increases from 2048 × 2048 × 2048 to 4096 × 4096 × 4096, the computation time increases by a factor of 9.7. Similarly, the computation time increases by a factor of 11.8 when the problem size increases from 4096 × 4096 × 4096 to 8192 × 8192 × 8192.
VI. CONCLUSION AND DISCUSSION
In this work, we proposed a parallel implementation of the DFT on TPUs. Through this implementation, the TPU, a domain-specific hardware accelerator for machine learning applications, is employed in the parallel computing of large-scale DFT. The DFT is formulated through the Kronecker product and based on matrix multiplications between the input data and the Vandermonde matrix. There are two major advantages associated with this formulation: first, it takes full advantage of the TPU's strength in matrix multiplications; second, it applies to nonuniformly sampled input data without additional modification to the implementation. In the parallel implementation, data decomposition is applied to both the input data and the Vandermonde matrix. Through the data decomposition, the matrix multiplications are kept local within individual TPU cores and performed completely in parallel. As for the communication among TPU cores, the one-shuffle scheme is designed for the TPU interconnect topology, with which sending and receiving data takes place simultaneously between two neighboring cores and along the same direction on the interconnect network. The one-shuffle scheme requires minimal communication time among TPU cores and achieves very high parallel efficiency. Both 2D and 3D numerical examples are used to demonstrate the high parallel efficiency of the DFT on TPUs. The DFT on TPUs is implemented in TensorFlow owing to its rich set of functionalities for scientific computing and its simplicity in expressing parallel computing algorithms.
With the demonstrated computation efficiency, the large-scale DFT on TPUs opens an array of opportunities for scientific computing. As future work, the DFT formulation in this paper can be combined with the Cooley-Tukey algorithm such that the overall complexity is reduced while the TPU's utilization for matrix multiplications is not compromised; however, this may sacrifice the current implementation's capability of handling nonuniformly sampled data. Future work can also improve the precision of the matrix multiplications from float32 to float64. Another possibility of extending this work is to integrate the large-scale DFT on TPUs with medical image reconstruction problems, where a large number of iterations of the nonuniform Fourier transform is often required in the optimization scheme. Finally, it is also possible to extend the implementation into a framework that addresses large-scale Fourier transforms of higher dimensions.
VII. ACKNOWLEDGMENT