Wafer-Scale Fast Fourier Transforms
We have implemented fast Fourier transforms for one-, two-, and
three-dimensional arrays on the Cerebras CS-2, a system whose memory and
processing elements reside on a single silicon wafer. The wafer-scale engine
(WSE) encompasses a two-dimensional mesh of roughly 850,000 processing elements
(PEs) with fast local memory and equally fast nearest-neighbor
interconnections.
Our wafer-scale FFT (wsFFT) parallelizes an n^3 problem with up to n^2
PEs. At this limit a PE processes only a single vector of the 3D domain (known
as a pencil) per superstep, where each of the three supersteps performs FFT
along one of the three axes of the input array. Between supersteps, wsFFT
redistributes (transposes) the data to bring all elements of each
one-dimensional pencil being transformed into the memory of a single PE. Each
redistribution causes an all-to-all communication along one of the mesh
dimensions. Given the level of parallelism, the size of the messages
transmitted between pairs of PEs can be as small as a single word. In theory, a
mesh is not ideal for all-to-all communication due to its limited bisection
bandwidth. However, the mesh interconnecting PEs on the WSE lies entirely
on-wafer and achieves nearly peak bandwidth even with tiny messages.
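
The superstep structure described above can be illustrated with a minimal, serial Python sketch. It is not the wsFFT implementation and does not model the wafer's network: the nested list a[i][j] merely stands in for the pencil held by a hypothetical PE at mesh position (i, j), and the names fft1d, fft_local, swap_12, swap_02, pencil_fft3d, and elements_per_message are illustrative assumptions. The radix-2 recursion is a stand-in for whatever 1D kernel each PE actually runs, and the message-size formula assumes an even block distribution over a p x p mesh with p dividing n.

```python
import cmath


def fft1d(v):
    """Recursive radix-2 FFT of a list whose length is a power of two."""
    n = len(v)
    if n == 1:
        return list(v)
    even = fft1d(v[0::2])
    odd = fft1d(v[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft_local(a):
    """One superstep: every PE (i, j) transforms its own pencil a[i][j]."""
    return [[fft1d(pencil) for pencil in row] for row in a]


def swap_12(a):
    """Redistribution within each mesh row: b[i][k][j] = a[i][j][k].

    PE (i, j) hands element k of its pencil to PE (i, k).
    """
    n = len(a)
    return [[[a[i][j][k] for j in range(n)] for k in range(n)] for i in range(n)]


def swap_02(a):
    """Redistribution within each mesh column: b[k][j][i] = a[i][j][k]."""
    n = len(a)
    return [[[a[i][j][k] for i in range(n)] for j in range(n)] for k in range(n)]


def pencil_fft3d(a):
    """3D FFT of an n x n x n nested list as three pencil supersteps."""
    a = fft_local(a)              # superstep 1: transform along axis 2
    a = fft_local(swap_12(a))     # superstep 2: transform along axis 1
    a = fft_local(swap_02(a))     # superstep 3: transform along axis 0
    return swap_12(swap_02(a))    # restore the input's axis order


def elements_per_message(n, p):
    """Elements one PE sends to one peer in a single redistribution.

    Assumes an n^3 array spread evenly over a p x p mesh, p dividing n.
    """
    return (n // p) ** 3
```

In this model each PE owns (n/p)^2 pencils of n elements and splits them among the p PEs of its row or column, so elements_per_message(512, 512) evaluates to 1: at one pencil per PE, every pairwise message carries a single complex element.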
This high efficiency on fine-grain communication allows wsFFT to achieve
unprecedented levels of parallelism and performance. We analyse in detail
computation and communication time, as well as the weak and strong scaling,
using both FP16 and FP32 precision. With 32-bit arithmetic on the CS-2, we
achieve 959 microseconds for a 3D FFT of a 512^3 complex input array using a
512x512 subgrid of the on-wafer PEs. This is the largest ever parallelization
for this problem size and the first implementation that breaks the millisecond
barrier.