Large-Scale Discrete Fourier Transform on TPUs by Lu, Tianjian et al.
1Large-Scale Discrete Fourier Transform on TPUs
Tianjian Lu, Yi-Fan Chen, Blake Hechtman, Tao Wang, and John Anderson
Abstract—In this work, we present a parallel algorithm for
large-scale discrete Fourier transform (DFT) on Tensor Process-
ing Unit (TPU) clusters. The algorithm is implemented in Tensor-
Flow because of its rich set of functionalities for scientific comput-
ing and simplicity in realizing parallel computing algorithms. The
DFT formulation is based on matrix multiplications between the
input data and the Vandermonde matrix. This formulation takes
full advantage of TPU’s strength in matrix multiplications and
allows nonuniformly sampled input data without modifying the
implementation. For the parallel computing, both the input data
and the Vandermonde matrix are partitioned and distributed
across TPU cores. Through the data decomposition, the matrix
multiplications are kept local within TPU cores and can be
performed completely in parallel. The communication among
TPU cores is achieved through the one-shuffle scheme, with
which sending and receiving data takes place simultaneously
between two neighboring cores and along the same direction
on the interconnect network. The one-shuffle scheme is designed
for the interconnect topology of TPU clusters, requiring minimal
communication time among TPU cores. Numerical examples are
used to demonstrate the high parallel efficiency of the large-scale
DFT on TPUs.
Index Terms—Discrete Fourier transform, Hardware acceler-
ator, Parallel computing, TensorFlow, Tensor processing unit
I. INTRODUCTION
The discrete Fourier transform (DFT) is critical in many sci-
entific and engineering applications, including time series and
waveform analyses, convolution and correlation computations,
solutions to partial differential equations, density function the-
ory in first-principle calculations, spectrum analyzer, synthetic
aperture radar, computed tomography, magnetic resonance
imaging, and derivatives pricing [1]–[4]. However, the compu-
tation efficiency of DFT is often the formidable bottleneck in
handling large-scale problems due to the large data size and
real-time-processing requirement [5], [6]. In general, efforts
on enhancing the computation efficiency of DFT fall into two
categories: seeking fast algorithms and adapting the algorithms
to parallel computing. One breakthrough of the fast-algorithm
category is the Cooley-Tukey algorithm [7], also known as the
fast Fourier transform (FFT), which reduces the complexity
of N -point DFT from O(N2) to O(N logN). The Cooley-
Tukey algorithm assuming that the number of data is a power
of two is known as the Radix-2 algorithm and followed by
Mixed-Radix [3] and Split-Radix [8] algorithms. In addition to
the fast algorithms, the performance of parallel computing has
been steadily driving the efficiency enhancement of DFT com-
putation: the first implementation of the FFT algorithm was
realized on ILLIAC IV parallel computer [9], [10]; over the
years, the DFT computation has been adapted to both shared-
The authors are with Google Research, 1600 Amphitheatre Pkwy, Moun-
tain View, CA 94043, USA, e-mail: {tianjianlu, yifanchen, blakehechtman,
wangtao, janders}@google.com.
memory [11], [12] and distributed-memory architectures [13]–
[17].
The advancement of hardware accelerators has enabled mas-
sive parallelization for DFT computation. One such example
is deploying the FFT computation on manycore processors
[18]. Another example is implementing the FFT algorithm
on clusters of graphics processing units (GPUs) [19]. A GPU
cluster contains a number of nodes (machines) and within each
node, GPUs are connected through PCIe, a high-speed serial
interface. The Cookey-Tukey algorithm and its variants often
require a large number of memory accesses per arithmetic
operation such that the bandwidth limitation of PCIe becomes
the computation bottleneck of the overall performance of FFT
on GPU clusters. Prior to the recent development of novel
high-speed interconnects such as NVLink [20], [21], many
efforts related to the GPU-accelerated DFT computation are
spent on minimizing the PCIe transfer time [3], [22]. It is
worth mentioning that the route of algorithm-hardware co-
design has also been taken with Field Programmable Gate Ar-
rays (FPGAs) to optimize the configurations of a customized
hardware accelerator for high-performance computing of DFT
[23]–[25].
The recent success of machine learning (ML), or deep learn-
ing (DL) in particular, has spurred a new wave of hardware
accelerators. In many ML applications, it becomes increas-
ingly challenging to balance the performance-cost-energy of
processors with the growth of data. Domain-specific hardware
is considered as a promising approach to achieve so [26]. One
example of the domain-specific hardware is Google’s Tensor
Processing Unit (TPU) [27]. As a reference, TPU v3 provides
480 teraflops and 128 GiB high-bandwidth memory (HBM)
[28]. In witnessing how DFT computation benefits from the
development of hardware accelerators, it is tempting to ask
whether TPU can empower the large-scale DFT computation.
It is plausible with the following four reasons. First, TPU is
a ML application-specific integrated circuit (ASIC), devised
for neural networks (NNs). NNs require massive amounts of
multiplications and additions between the data and parameters
and TPU can handle these computations in terms of matrix
multiplications in a very efficient manner [29]. Similarly, DFT
can also be formulated as matrix multiplications between
the input data and the Vandemonde matrix. Second, TPU
chips are connected directly to each other with dedicated
high-speed interconnects of very low latency and without
going through a host CPU or utilizing networking resources.
Therefore, the large-scale DFT computation can be distributed
among multiple TPUs with minimal communication time and
hence very high parallel efficiency. Third, the large capacity
of the in-package memory of TPU makes it possible to handle
large-scale DFT efficiently. Forth, TPU is programmable with
software front ends such as TensorFlow [30] and PyTorch [31].
ar
X
iv
:2
00
2.
03
26
0v
1 
 [c
s.M
S]
  9
 Fe
b 2
02
0
2(a)
(b)
Fig. 1: (a) TPU v3 has four chips on the same board and (b)
each chip contains two cores.
TensorFlow offers a rich set of functionalities for scientific
computing and the TensorFlow TPU programming stack can
express the parallel computing algorithms with simple and
easy-to-understand code without sacrificing the performance,
both of which make it straightforward to implement the paral-
lel computing of DFT on TPUs. In fact, all the aforementioned
four reasons have been verified in the high-performance Monte
Carlo simulations on TPUs [32], [33].
In this work, we present the implementation of large-scale
DFT on TPUs. The implementation is in TensorFlow. The DFT
formulation is based on matrix multiplications between the
input data and the Vandermonde matrix with the complexity
of O(N2) for N -point DFT. This formulation takes full
advantage of TPU’s strength in matrix multiplications and
allows nonuniformly sampled input data without modifying
the implementation. The nonuniform DFT has important ap-
plications in signal processing, medical imaging, numerical
solutions of partial differential equations, and machine learn-
ing [34]–[37]. For the parallel computing, both the input data
and the Vandermonde matrix are partitioned and distributed
across TPU cores. Through the data decomposition, the matrix
multiplications are kept local within TPU cores and can be
performed completely in parallel. The communication among
TPU cores is achieved through the one-shuffle scheme, with
which sending and receiving data takes place simultaneously
between two neighboring cores and along the same direction
on the interconnect network. The one-shuffle scheme is de-
signed for the interconnect topology of TPU clusters, which
requires minimal communication time among TPU cores.
Numerical examples are used to demonstrate the high parallel
efficiency of the large-scale DFT on TPUs.
II. TPU SYSTEM ARCHITECTURE
Understanding the advantages of deploying the large-scale
DFT computation on TPUs cannot be separated from the
Fig. 2: TPU v3 Pod in a data center.
knowledge of TPU system architecture. In this section, we
provide an overview of the TPU system architecture on both
the hardware and software components.
A. Hardware architecture
Figure 1 shows one TPU board or unit: there are four
TPU chips on the same board; each chip has two cores; and
each core contains the scalar, vector, and matrix units (MXU).
MXU provides the bulk of the compute power of a TPU chip.
Structured as a 128 × 128 systolic array, MXU can handle
16 K multiply-accumulate (MAC) operations in one single
clock cycle. Both the inputs and outputs of MXU are float32,
whereas MXU performs MAC operations with bfloat16 [38].
However, one float32 number can be decomposed into multiple
bfloat16 numbers and with appropriate accumulations, high-
precision MAC operation can be achieved [39]. The DFT com-
putation in this work leverages the strategy of decomposition
and accumulation and achieves the precision of float32. As
shown in Fig. 1(b), each TPU core has 16 GiB high-bandwidth
memory (HBM). The large capacity of in-package memory
makes it possible to solve large-scale problems in a highly
efficient manner. TPU is designed as coprocessor on the I/O
bus: each board shown in Fig. 1(a) is paired with one host
server consisting of CPU, RAM, and hard disk; TPU executes
the instructions sent from CPU on the host server through
PCIe.
In a Pod configuration, TPU chips are connected through
dedicated high-speed interconnects of very low latency. Figure
2 shows a TPU v3 Pod in a data center where a total number
of 2048 cores are connected to each other. The interconnect
topology is a two-dimensional (2D) toroidal mesh with each
chip connected to its four nearest neighbors such that the com-
munication takes place in four directions. As the interconnects
are on the device, the communication among TPU cores does
not go through a host CPU or any networking resource. In
order to achieve a high parallel efficiency, the communication
time needs to be minimized, which further requires designing
the communication strategy in accordance with the TPU inter-
connect topology. The load balancing and the corresponding
one-shuffle communication schemes employed in the DFT
computation are explained in the following section.
B. Software architecture
We program TPUs through TensorFlow. It consists of four
steps to run a program on TPUs, namely, generating a com-
putational graph, preparing the graph, lowering to high-level
3Fig. 3: A computation graph of TensorFlow operations.
optimizer (HLO) operations, and lowering to low-level op-
timizer (LLO) operations. First, a TensorFlow client converts
the TensorFlow operations into a computational graph. A sam-
ple computation graph performing addition and convolution
operations is shown in Fig. 3. The TensorFlow client sends
the graph to a TensorFlow server. Second, the TensorFlow
server partitions the computational graph into portions that
run on TPU and CPU, respectively. If multiple TPUs are
to be employed, the graph is marked for replication. Third,
the TensorFlow server compiles the sub-graph that runs on
TPUs into a HLO program and invokes the accelerated linear
algebra (XLA) compiler. Last, the XLA compiler takes in the
HLO program and converts it into a LLO program, which is
effectively the assembly code for TPUs. Both the generation
and compilation of the computational graph occur on the host
server. The compiled LLO code is loaded onto TPUs for
execution from the host server through PCIe.
The memory usage of a TPU is determined at compile
time. Because both the hardware structure of MXU and the
memory subsystem on a TPU core prefer certain shapes of
a tensor variable for an operation, XLA performs the data
layout transformations in order for the hardware to efficiently
process the operation. If a tensor variable does not align with
the preferred shape, XLA pads zeros along one dimension
to make it a multiple of eight and the other dimension to a
multiple of 128. Zero padding under-utilizes the TPU core and
leads to sub-optimal performance, which should be taken into
account in load balancing for the parallel computing of DFT
on TPUs.
III. DFT FORMULATION
In this section, we formulate the DFT as matrix multi-
plications through the Kronecker product [40] between the
input data and the Vandermonde matrix. This formulation fully
utilizes the strength of TPU, in particular, MXU in matrix
multiplications. In addition, it can be applied to nonuniformly
sampled data [34], [35] without additional modification to the
implementation.
A. Matrix formulation
The general form of DFT is defined as
Xk , X(zk) =
N−1∑
n=0
xnz
−n
k , (1)
where , denotes “defined to be”, xn represents the input, and
{zk}N−1k=0 are N distinctly and arbitrarily sampled points on
the z-plane. Equation (1) can be rewritten into the matrix form
{X} = [V ] {x} , (2)
where
{X} = (X(z0), X(z1), · · · , X(zN−1))T ,
{x} = (x0, x1, · · · , xN−1)T ,
and
[V ] =

1 z−10 z
−2
0 · · · z−(N−1)0
1 z−11 z
−2
1 · · · z−(N−1)1
...
...
...
. . .
...
1 z−1N−1 z
−2
N−1 · · · z−(N−1)N−1
 . (3)
Note that [V ] is the Vandermonde matrix of dimension N×N .
Computing the inverse DFT is equivalent to solving the linear
system in Equation (2). In the situation when {zk}N−1k=0 are
uniformly sampled on the z-plane, the Vandermonde matrix
[V ] becomes unitary and contains the roots of unity.
The general form of a 2D DFT can be written as
X(z1k, z2k) =
N1−1∑
n1=0
N2−1∑
n2=0
x(n1, n2)z
−n1
1k z
−n2
2k , (4)
where [x] has the dimension of N1 × N2 and
{(z1k, z2k)}N1N2−1k=0 represents the set of distinctly and
arbitrarily sampled points in (z1, z2) space. It is worth
mentioning that the sampling with (z1k, z2k) has to ensure
the existence of the inverse DFT. If the sampling is performed
on rectangular grids, Equation (4) can be rewritten into the
matrix form as
[X] = [V1] [x] [V2]
T
, (5)
where
[X] =

X(z10,z20) X(z10,z21) ··· X(z10,z2,N2−1)
X(z11,z20) X(z11,z21) ··· X(z11,z2,N2−1)
...
...
. . .
...
X(z1,N1−1,z20) X(z1,N1−1,z21) ··· X(z1,N1−1,z2,N2−1)
 ,
(6)
[x] =
 x(0,0) x(0,1) ··· x(0,N2−1)x(1,0) x(1,1) ··· x(1,N2−1)... ... . . . ...
x(N1−1,0) x(N1−1,1) ··· x(N1−1,N2−1)
 , (7)
[V1] =

1 z−110 z
−2
10 · · · z−(N1−1)10
1 z−111 z
−2
11 · · · z−(N1−1)11
...
...
...
. . .
...
1 z−11,N1−1 z
−2
1,N1−1 · · · z
−(N1−1)
1,N1−1
 , (8)
4and
[V2] =

1 z−120 z
−2
20 · · · z−(N2−1)20
1 z−121 z
−2
21 · · · z−(N2−1)21
...
...
...
. . .
...
1 z−12,N2−1 z
−2
2,N2−1 · · · z
−(N2−1)
2,N2−1
 . (9)
Note that both [V1] and [V2] are Vandermonde matrices of
dimensions N1×N1 and N2×N2, respectively. One can further
lump [x] into a vector such that Equation (5) can be written
into the same matrix form as Equation (2), in which [V ] for
the 2D DFT is expressed as
[V ] = [V1]⊗ [V2] , (10)
where ⊗ denotes the Kronecker product [40].
Similarly, the three-dimensional (3D) DFT defined by
X(z1k, z2k, z3k) =
N1−1∑
n1=0
N2−1∑
n2=0
N3−1∑
n3=0
x(n1, n2, n3)z
−n1
1k z
−n2
2k z
−n3
3k
(11)
can be rewritten into the matrix form as
{X} = [V1]⊗ [V2]⊗ [V3] {x} , (12)
where
[Vj ] =

1 z−1j0 z
−2
j0 · · · z−(Nj−1)j0
1 z−1j1 z
−2
j1 · · · z−(Nj−1)j1
...
...
...
. . .
...
1 z−1j,Nj−1 z
−2
j,Nj−1 · · · z
−(Nj−1)
j,Nj−1
 ,
j ∈{1, 2, 3} . (13)
For the 3D DFT defined in Equation (12), the sampling is
performed on rectangular grids in (z1, z2, z3) space and the
Vandermonde matrix [V3] has the dimension of N3×N3. It can
be seen that the Kronecker product bridges the gap between the
matrix and tensor operations, through which the contraction
between a rank-2 tensor and another rank-3 tensor in the 3D
DFT can be formulated as matrix multiplications. The DFT
formulation with the Kronecker product can be easily extended
to higher dimensions.
IV. PARALLEL IMPLEMENTATION ON TPUS
In this section, we take the 3D DFT as an example to
illustrate the data decomposition and the corresponding one-
shuffle scheme in the parallel computing of DFT on TPUs.
Data decomposition is applied to the input data along all
three dimensions. The decomposition is based on a TPU com-
putation shape (P1, P2, P3) where P1, P2, and P3 denote the
number of TPU cores along the first, the second, and the third
dimension, respectively. Given the TPU computation shape
(P1, P2, P3) and the input data of dimension N1 ×N2 ×N3
and after the data decomposition, each TPU core contains a
data block of dimension N1P1 × N2P2 × N3P3 as shown in Fig. 4(a).
The partition of the Vandermonde matrix is one-way and along
the frequency domain. As shown in Fig. 4(b), each core has
a slice of the Vandermonde matrix with dimension NiPi ×Ni,
i = 1, 2, 3.
(a)
(b)
Fig. 4: Through the data decomposition with the TPU com-
putation shape (P1, P2, P3), each TPU core contains (a) the
Vandermonde matrix of dimension NiPi × Ni, i = 1, 2, 3 and
(b) the block of input data of dimension N1P1 × N2P2 × N3P3 for
3D DFT.
Two major operations are employed in the implementation:
the tensor contraction between the Vandermonde matrix and
the input data; and the communication among TPU cores to
send and receive the block of input data. The tensor contraction
is based on einsum and the communication among TPU
cores is with collective_permute through the one-
shuffle scheme.
Figure 5 illustrates the one-shuffle scheme. We use c0, c1,
and c2 to denote three adjacent TPU cores, the operations on
which are highlighted with three different colors accordingly
as shown in Fig. 5. After the data decomposition, core c0
contains a block of the input data x01 and a slice of the
Vandermonde matrix [V00, V01, V02], core c1 contains x11 and
[V10, V11, V12], and core c2 contains x21 and [V20, V21, V22].
With three einsum and two collective_permute op-
erations, one obtains the partial DFT written as V00 · x01 +
V01 · x11 + V02 · x21 on core c0, where · represents the tensor
contraction. The steps taken by the partial DFT computation
along one dimension are as follows:
1. apply einsum to compute V00 ·x01 on core c0, V11 ·x11
on core c1, and V22 · x21 on core c2 as shown in Fig.
5(a);
2. apply collective_permute to shuffle the input
between neighboring TPU cores with source-target
pairs (c1, c0), (c2, c1), and (c0, c2) in the form of
(source, target) such that core c0 contains x11, core c1
contains x21, and core c2 contains x01 as shown in Fig.
5(b);
5(a)
(b)
(c)
Fig. 5: The one-shuffle scheme is illustrated with the example
of the DFT along one dimension in the 3D case. We use c0, c1,
and c2 to denote three adjacent cores, the operations on which
are highlighted with blue, yellow, and green, respectively. The
data decomposition results in the block of the input data x01
and the slice of the Vandermonde matrix [V00, V01, V02] on
core c0, x11 and [V10, V11, V12] on core c1, and x21 and
[V20, V21, V22] on core c2. The steps involved in the one-shuffle
scheme are: (a) computing V00 · x01 on core c0, V11 · x11
on core c1, and V22 · x21 on core c2 with · representing the
operation of tensor contraction; (b) collectively permuting the
inputs between two neighboring cores such that x11 on core
c0, x21 on core c1, and x01 on core c2 and computing V01 ·x11
on core c0, V12 · x21 on core c1, and V20 · x01 on core c2; (c)
collectively permuting the inputs such that x21 on core c0,
x01 on core c1, and x11 on core c2 and computing V02 · x21
on core c0, V10 · x01 on core c1, and V21 · x11 on core c2.
The collective_permute operation in shuffling the input
between neighboring TPU cores is with source-target pairs
(c1, c0), (c2, c1), and (c0, c2) in the form of (source, target).
3. apply einsum to compute V01 ·x11 on core c0, V12 ·x21
on core c1, and V20 · x01 on core c2 and add the results
from step 1;
4. apply collective_permute with source-target pairs
(c1, c0), (c2, c1), (c0, c2), after which core c0 contains
x21, core c1 contains x01, and core c2 contains x11 as
shown in Fig. 5(c);
5. apply einsum to compute V02 ·x21 on core c0, V10 ·x01
on core c1, and V21 · x11 on core c2 and add the results
from step 3.
The DFT computations along the other two dimensions in the
3D case follow similar steps.
With the one-shuffle scheme, sending and receiving data
takes place simultaneously between two neighboring cores and
along the same direction on the 2D toroidal network. The one-
shuffle scheme best utilizes the topology of the interconnect
and hence its bandwidth. Together with the high-speed and
low-latency nature of the interconnects, the one-shuffle scheme
requires minimal communication time. Besides, all tensor
contractions remain local within individual TPU cores, which
are computed completely in parallel. Therefore, the proposed
implementation of DFT on TPUs can achieve very high
parallel efficiency.
V. PARALLEL EFFICIENCY ANALYSIS
In this section, numerical examples are provided to demon-
strate the parallel efficiency of DFT on TPUs for both the
2D and 3D cases. The strong scaling is adopted in the parallel
efficiency analysis, in which the problem size is kept the same
as proportionally more TPU cores are employed.
A. 2D DFT
Figure 6 shows the computation time of the 2D DFT with up
to 128 TPU cores on an example of dimension 8192× 8192.
It can be seen from Fig. 6 that a near-linear scaling of the
computation time with respect to the number of TPU cores is
achieved. As a reference, the ideal computation time is also
provided in Fig. 6, which is defined by
ideal time =
2× T2
Ncore
, (14)
where T2 denotes the total computation time with two TPU
cores and Ncore is the total number of TPU cores being
used. As mentioned in the parallel implementation, the total
computation time consists of two parts: the time of matrix
multiplications, or einsum to be specific, and the commu-
nication time of sending and receiving the block of input
data across TPU cores. It can be seen from Fig. 6 that the
time of matrix multiplications scales linearly with respect to
the total number of TPU cores. This is because the matrix
multiplications are kept completely local within individual
cores. The total computation time of the 2D DFT scales
approximately linearly with respect to the number of TPU
cores, with the gap between the total and the ideal computation
time arising from the communication among TPU cores.
62 4 8 16 32 64 128
Number of TPU Cores
10-3
10-2
10-1
1
Ti
m
e 
(se
co
nd
s)
Total time
Ideal time
Time on einsum
Fig. 6: The computation time of the 2D DFT with up to 128
TPU cores on an example of dimension 8192× 8192.
32 64 128 256 512 1024 2048
Number of TPU Cores
10-2
10-1
1
10
Ti
m
e 
(se
co
nd
s)
Total time
Ideal time
Fig. 7: The computation time of the 3D DFT with up to 2048
TPU cores on an example of dimension 2048× 2048× 2048.
B. 3D DFT
The parallel efficiency of the 3D DFT is demonstrated
through an example of dimension 2048×2048×2048. Similar
to the 2D case, the problem size is also fixed as proportionally
more TPU cores are employed. The total computation time is
depicted in Fig. 7. As a reference, the ideal computation time
is provided in Fig. 7, which is defined by
ideal time =
32× T32
Ncore
, (15)
where T32 denotes the total computation time with 32 TPU
cores. It can be seen from Fig. 7 that the total computation
time scales approximately linearly with respect to the number
of TPU cores.
The gap between the total and the ideal computation time
in the 3D case also results from the communication among
TPU cores. As mentioned in the parallel implementation, the
data decomposition is applied to the input data along all
the three dimensions with a TPU computation shape. The
computation shape in this example has the form of (4, 4, n2)
with four TPU cores along the first dimension, four along
16 32 64 128
Number of TPU Cores along the Third Dimension
10-4
10-3
10-2
Ti
m
e 
(se
co
nd
s)
Actual time of one collective permute
Actual time of one einsum
Ideal time
Fig. 8: The computation time of one single operation of
collective_permute and einsum, respectively, in the
3D DFT along the third dimension with respect to the number
of TPU cores.
TABLE I: Computation time of the 3D DFT on a full TPU
Pod with 2048 TPU cores.
Problem size Time (seconds)
1 8192× 8192× 8192 12.66
2 4096× 4096× 4096 1.07
3 2048× 2048× 2048 0.11
4 1024× 1024× 1024 0.02
the second dimension, and n2 along the third dimension. It
is indeed the number of TPU cores along the third dimension
that varies in Fig. 7. For example, the computation shapes
(4, 4, 16) and (4, 4, 32) are corresponding to 256 and 512
TPU cores, respectively. As the number of TPU cores along
the third dimension doubles itself, the size of the input data
contained on each core is reduced by half. As a result,
the computation time associated with a single operation of
collective_permute or einsum is also reduced by
half, which is shown in Fig. 8. However, as more cores are
being used, the total number of collective_permute
operations increases. For example, it requires a total number
of 31 collective_permute operations in the Fourier
transform along the third dimension in the case of 512 TPU
cores or with the TPU computation shape (4, 4, 32), whereas
only 15 collective_permute operations are required in
the case of 256 TPU cores or with the TPU computation shape
(4, 4, 16). It can be seen that even though the time associated
with one single collective_permute operation decreases
when more TPU cores are used, the total communication time
for the DFT along the third dimension does not change. This
explains the gap between the total and the ideal computation
time in Fig. 7.
In addition to the strong scaling analysis, the computation
time of a few 3D DFT examples on a full TPU Pod with 2048
cores is provided in Table I. As the problem size increases
7from 2048 × 2048 × 2048 to 4096 × 4096 × 4096, the com-
putation time increases 9.7 times. Similarly, the computation
time increases 11.8 times when the problem size increases
from 4096× 4096× 4096 to 8192× 8192× 8192.
VI. CONCLUSION AND DISCUSSION
In this work, we proposed the parallel implementation
of DFT on TPUs. Through the implementation, TPU—the
domain-specific hardware accelerator for machine learning
applications—is employed in the parallel computing of large-
scale DFT. The formulation of the DFT is through the Kro-
necker product and based on matrix multiplications between
the input data and the Vandemonde matrix. There are two
major advantages associated with this formulation: first, it
takes full advantage of TPU’s strength in matrix multiplica-
tions; second, it applies to nonuniformly sampled input data
without additional modification to the implementation. In the
parallel implementation, data decomposition is applied to both
the input data and the Vandermonde matrix. Through the data
decomposition, the matrix multiplications are kept local within
individual TPU cores and performed completely in parallel.
As for the communication among TPU cores, the one-shuffle
scheme is designed for the TPU interconnect topology, with
which sending and receiving data takes place simultaneously
between two neighboring cores and along the same direction
on the interconnect network. The one-shuffle scheme requires
minimal communication time among TPU cores and achieves
very high parallel efficiency. Numerical examples of both 2D
and 3D are used to demonstrate the high parallel efficiency
of the DFT on TPUs. The implementation of DFT on TPUs
is with TensorFlow owing to its rich set of functionalities for
scientific computing and simplicity in expressing the parallel
computing algorithms.
With the demonstrated computation efficiency, the large-
scale DFT on TPUs opens an array of opportunities for
scientific computing. As a future work, the DFT formulation in
this paper can be combined with the Cooley-Tukey algorithm
in a way that the overall complexity of the algorithm can
be reduced whereas the utilization of TPU on matrix mul-
tiplications is not compromised. However, this may sacrifice
the capability of dealing with nonuniformly sampled data in
the current implementation. Future work can also be done to
improve the precision of matrix multiplications from float32
to float64. Another possibility of extending this work is to
integrate the large-scale DFT on TPUs with the problems
of medical image reconstruction, where a large number of
iterations of nonuniform Fourier transform is often required in
the optimization scheme. Finally, it is also possible to extend
the implementation into a framework and address large-scale
Fourier transform of higher dimensions.
VII. ACKNOWLEDGMENT
We would like to thank David Majnemer, Reid Tatge, Dehao
Chen, Yusef Shafi, Damien Pierce, James Lottes, Matthias
Ihme, and Rene Salmon for valuable discussions and helpful
comments, which have greatly improved the paper.
REFERENCES
[1] R. N. Bracewell and R. N. Bracewell, The Fourier transform and its
applications. McGraw-Hill New York, 1986, vol. 31999.
[2] A. Grama, V. Kumar, A. Gupta, and G. Karypis, Introduction to parallel
computing. Pearson Education, 2003.
[3] D. Takahashi, Fast Fourier transform algorithms for parallel computers.
Springer, 2019.
[4] R. Cont, Frontiers in quantitative finance: Volatility and credit risk
modeling. John Wiley & Sons, 2009, vol. 519.
[5] G. B. Giannakis, F. Bach, R. Cendrillon, M. Mahoney, and J. Neville,
“Signal processing for big data [from the guest editors],” IEEE Signal
Processing Magazine, vol. 31, no. 5, pp. 15–16, 2014.
[6] E. Olshannikova, A. Ometov, Y. Koucheryavy, and T. Olsson, “Visualiz-
ing big data with augmented and virtual reality: challenges and research
agenda,” Journal of Big Data, vol. 2, no. 1, p. 22, 2015.
[7] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation
of complex Fourier series,” Mathematics of computation, vol. 19, no. 90,
pp. 297–301, 1965.
[8] P. Duhamel and H. Hollmann, “Split radix FFT algorithm,” Electronics
letters, vol. 20, no. 1, pp. 14–16, 1984.
[9] G. Ackins, “Fast fourier transform via ILLIAC IV,” ILLIAC IV Docu-
ment, no. 146, 1968.
[10] J. E. Stevens Jr et al., “A fast Fourier transform subroutine for ILLIAC
IV,” CAC document; no. 17, 1971.
[11] P. N. Swarztrauber, “Multiprocessor FFTs,” Parallel computing, vol. 5,
no. 1-2, pp. 197–210, 1987.
[12] D. H. Bailey, “FFTs in external or hierarchical memory,” The journal
of Supercomputing, vol. 4, no. 1, pp. 23–35, 1990.
[13] A. Gupta and V. Kumar, “The scalability of FFT on parallel computers,”
IEEE Transactions on Parallel and Distributed Systems, vol. 4, no. 8,
pp. 922–932, 1993.
[14] M. Frigo and S. G. Johnson, “The design and implementation of
FFTW3,” Proceedings of the IEEE, vol. 93, no. 2, pp. 216–231, 2005.
[15] D. Pekurovsky, “P3DFFT: A framework for parallel computations of
Fourier transforms in three dimensions,” SIAM Journal on Scientific
Computing, vol. 34, no. 4, pp. C192–C209, 2012.
[16] D. Takahashi, “An implementation of parallel 3-D FFT with 2-D
decomposition on a massively parallel cluster of multi-core processors,”
in International Conference on Parallel Processing and Applied Math-
ematics. Springer, 2009, pp. 606–614.
[17] D. T. Popovici, M. D. Schatz, F. Franchetti, and T. M. Low, “A
flexible framework for parallel multi-dimensional DFTs,” arXiv preprint
arXiv:1904.10119, 2019.
[18] J. Kim. (2018) Leveraging optimized FFT on Intel Xeon
platforms. [Online]. Available: https://www.alcf.anl.gov/support-center/
training-assets/leveraging-optimized-fft-intel-xeon-platforms
[19] K. R. Roe, K. Hester, and R. Pascual. (2019) Multi-GPU FFT
performance on different hardware configurations. [Online]. Available:
https://developer.nvidia.com/gtc/2019/video/S9158
[20] D. Foley and J. Danskin, “Ultra-performance Pascal GPU and NVLink
interconnect,” IEEE Micro, vol. 37, no. 2, pp. 7–17, Mar 2017.
[21] A. Li, S. L. Song, J. Chen, J. Li, X. Liu, N. Tallent, and K. Barker, “Eval-
uating modern GPU interconnect: PCIe, NVLink, NV-SLI, NVSwitch
and GPUDirect,” arXiv preprint arXiv:1903.04611, 2019.
[22] Y. Chen, X. Cui, and H. Mei, “Large-scale FFT on GPU clusters,” in
Proceedings of the 24th ACM International Conference on Supercom-
puting. ACM, 2010, pp. 315–324.
[23] S. K. Nag and H. K. Verma, “Method for parallel-efficient configuring
an FPGA for large FFTs and other vector rotation computations,” Feb. 1
2000, US Patent 6,021,423.
[24] M. Garrido, M. Acevedo, A. Ehliar, and O. Gustafsson, “Challenging the
limits of FFT performance on FPGAs,” in 2014 International Symposium
on Integrated Circuits (ISIC). IEEE, 2014, pp. 172–175.
[25] C.-L. Yu, K. Irick, C. Chakrabarti, and V. Narayanan, “Multidimensional
DFT IP generator for FPGA platforms,” IEEE Transactions on Circuits
and Systems I: Regular Papers, vol. 58, no. 4, pp. 755–764, 2010.
[26] I. Stoica, D. Song, R. A. Popa, D. Patterson, M. W. Mahoney,
R. Katz, A. D. Joseph, M. Jordan, J. M. Hellerstein, J. E. Gonzalez
et al., “A Berkeley view of systems challenges for AI,” arXiv preprint
arXiv:1712.05855, 2017.
[27] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,
S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter
performance analysis of a tensor processing unit,” in 2017 ACM/IEEE
44th Annual International Symposium on Computer Architecture (ISCA).
IEEE, 2017, pp. 1–12.
[28] Cloud TPUs. [Online]. Available: https://cloud.google.com/tpu/
8[29] N. Jouppi. (2017) Quantifying the performance
of the TPU, our first machine learning chip.
[Online]. Available: https://cloud.google.com/blog/products/gcp/
quantifying-the-performance-of-the-tpu-our-first-machine-learning-chip
[30] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey,
M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., “Google’s neural
machine translation system: Bridging the gap between human and
machine translation,” arXiv preprint arXiv:1609.08144, 2016.
[31] N. Ketkar, “Introduction to PyTorch,” in Deep learning with Python.
Springer, 2017, pp. 195–208.
[32] K. Yang, Y.-F. Chen, G. Roumpos, C. Colby, and J. Anderson,
“High performance Monte Carlo simulation of Ising model on
TPU clusters,” in Proceedings of the International Conference for
High Performance Computing, Networking, Storage and Analysis,
ser. SC ’19. ACM, 2019, pp. 83:1–83:15. [Online]. Available:
http://doi.acm.org/10.1145/3295500.3356149
[33] F. Belletti, D. King, K. Yang, R. Nelet, Y. Shafi, Y.-F. Chen, and
J. Anderson, “Tensor processing units for financial Monte Carlo,” arXiv
preprint arXiv:1906.02818, 2019.
[34] S. Bagchi and S. K. Mitra, “The nonuniform discrete Fourier transform
and its applications in filter design. I. 1-D,” IEEE Transactions on
Circuits and Systems II: Analog and Digital Signal Processing, vol. 43,
no. 6, pp. 422–433, 1996.
[35] ——, “The nonuniform discrete Fourier transform and its applications
in filter design. II. 2-D,” IEEE Transactions on Circuits and Systems
II: Analog and Digital Signal Processing, vol. 43, no. 6, pp. 434–444,
1996.
[36] J.-Y. Lee and L. Greengard, “The type 3 nonuniform FFT and its
applications,” Journal of Computational Physics, vol. 206, no. 1, pp.
1–5, 2005.
[37] A. Rahimi and B. Recht, “Random features for large-scale kernel
machines,” in Advances in neural information processing systems, 2008,
pp. 1177–1184.
[38] Using bfloat16 with TensorFlow models. [Online]. Available: https:
//cloud.google.com/tpu/docs/bfloat16
[39] G. Henry, P. T. P. Tang, and A. Heinecke, “Leveraging the bfloat16
artificial intelligence datatype for higher-precision computations,” arXiv
preprint arXiv:1904.06376, 2019.
[40] P. A. Regalia and M. K. Sanjit, “Kronecker products, unitary matrices
and signal processing applications,” SIAM review, vol. 31, no. 4, pp.
586–613, 1989.
