Efficient Ab-Initio Molecular Dynamic Simulations by Offloading Fast
  Fourier Transformations to FPGAs by Ramaswami, Arjun et al.
Efficient Ab-Initio Molecular Dynamic Simulations
by Offloading Fast Fourier Transformations to
FPGAs
Arjun Ramaswami∗, Tobias Kenter†, Thomas D. Khne‡ and Christian Plessl§
Paderborn University, Paderborn, Germany
{∗arjun.ramaswami, †kenter, §christian.plessl}@uni-paderborn.de
‡tkuehne@gmail.com
Abstract—A large share of today’s HPC workloads is used for
Ab-Initio Molecular Dynamics (AIMD) simulations, where the
interatomic forces are computed on-the-fly by means of accu-
rate electronic structure calculations. They are computationally
intensive and thus constitute an interesting application class for
energy-efficient hardware accelerators such as FPGAs. In this
paper, we investigate the potential of offloading 3D Fast Fourier
Transformations (FFTs) as a critical routine of plane-wave-based
electronic structure calculations to FPGA and in conjunction
demonstrate the tolerance of these simulations to lower precision
computations.
I. INTRODUCTION
AIMD simulations are one of the most computationally
intensive scientific simulations in current high performance
cluster systems. These simulations involve computing inter-
atomic forces using accurate electronic structure calculations
to simulate the motion of atoms. Accelerating these simula-
tions has led to massively parallel computations using mul-
ticore processors and hardware accelerators. Popular AIMD
simulation software packages such as Quantum Espresso [1]
and CP2K [2] offer CUDA implementations to offload spe-
cific routines to GPUs. This scale of processing has led to
maximizing performance through efficient computation.
One of the approaches to improve efficiency involves
sacrificing accuracy in computations using lower precision
arithmetic units, a strategy denoted as Approximate Com-
puting (AC). Techniques such as half-precision floating point
arithmetic are used in hardware accelerators, specifically in
recent Nvidia GPUs to support applications that tolerate ap-
proximation. Field Programmable Gate Arrays (FPGAs) are
becoming increasingly relevant in this regard, mainly due to
the flexibility in building custom data paths and arbitrary
widths/precision fixed or floating point units.
In AIMD, a key kernel on the critical path of plane-
wave-based density functional theory (DFT) calculations are
3D FFTs, which are exploited for an efficient evaluation of
the quantum-mechanical operators. Accelerating this routine
involves optimizing execution in ranges of microseconds. In
the context of classical force-field-based Molecular Dynamics
(MD) simulations, Humphries et al. implement high perfor-
mance 3D FFT with single precision floating point accuracy
by fitting the entire data set in the FPGA and by using vendor-
specific 1D FFT IP blocks. Their evaluation focuses on the
performance of the 3D FFT design, but does not evaluate
the impact of reduced precision on the MD application in
question. This raises the question whether acceleration of
specific routines with reduced precision can be tolerated by
MD simulations. Recently, Rengaraj et al. have shown that
errors introduced by low precisions computing can be modeled
as noise and can be perfectly corrected at the algorithmic level
by means of a modified Langevin equation [5].
In this paper, the resilience of AIMD simulations to approx-
imation from lower precision floating point arithmetic is inves-
tigated by offloading 3D FFT computations to FPGAs. This
is demonstrated by designing an Open Computing Language
(OpenCL) based 3D FFT design for Intel FPGAs that uses
single precision instead of double precision floating points for
computations. This is compared with FFTW for performance
and evaluated for AIMD simulations by creating an interface in
the CP2K framework. As an initial contribution, this interface
has already been adopted as part of CP2K 7.1 release1.
II. APPROACH
3D FFT is an efficient algorithm to compute a 3D discrete
Fourier transform (DFT) as defined in Equation (1), where
WN = exp(−i 2piN ):
F (kx, ky, kz) =
N−1∑
z=0
N−1∑
y=0
N−1∑
x=0
f(x, y, z)W xkxx W
yky
y W
zkz
z
(1)
To compute the 3D FFT of N3 points, the computation can
be decomposed into 1D FFTs in x-dimension followed by the
same in the y-dimension and then the z-dimension.
The platform to offload this routine is the Nallatech 520N
board that houses a Stratix 10 GX2800 FPGA along with four
8 GB banks of DDR4 memory. It is connected to the host CPU
using an 8-lane Gen3 PCIe bus. Developing an OpenCL 3D
FFT kernel design targeting this board requires transferring N3
points from the host CPU to the DDR (global) memory via
the PCIe bus, transforming the data and finally, transferring
the results back to the host CPU. The transformation step
involves fetching data from the global memory followed by
1https://github.com/cp2k/cp2k/releases/tag/v7.1.0
ar
X
iv
:2
00
6.
08
43
5v
1 
 [c
s.D
C]
  1
5 J
un
 20
20
three N-point 1D FFT computation phases intertwined by two
transposition phases, the 2d matrix transposition and transpo-
sition along the z-dimension (3D transpose), and storing the
result back to the global memory. The 1D FFT computations
are based on Intel’s 1D FFT OpenCL design sample, which
uses 8 single precision points per cycle in bit reversed order
as input and output, transforming N points in N8 cycles. The
2d transpose stage uses multiple buffers to avoid stalling the
pipeline, however, this cannot be realized for the 3D transpose
stage as it may not fit into the FPGA. Figure 1 illustrates this
design, where the entire 3D FFT fits into the local memory,
hence the 3D transpose is performed in the FPGA.
2d 
Transpose1d FFT 1d FFT
3d 
Transpose 1d FFT
Store to 
DDR 
Memory
 Fetch from 
DDR 
Memory
Fig. 1. 3D FFT Design for FPGAs
III. RESULTS
The 3D FFT design described above is implemented for
Intel’s Stratix 10 FPGAs. It uses single precision floating
point digital signal processing (DSP) blocks in 1D FFT
computations. The code is available as an open source project2.
The FPGA kernel code is synthesized using Intel OpenCL
SDK 19.3 for Nallatech 520N board. The host code, which
is compiled using GCC v8.3, measures the total wall clock
time of PCIe data transfers between the host and the DDR
memory as well as the time taken for kernel execution that
includes FPGA and DDR memory communication. This is
measured over an average of a hundred iterations of non-
batched executions.
Here, the performance of the Stratix 10 implementation is
compared with the single precision floating point variant of
FFTW v3.3.8. The FFTW application is linked using GCC
v8.3 and executed on a node with two Intel Xeon Skylake
Gold 6148 CPUs, each featuring 20 cores with hyperthreading
disabled operating at 2.4 GHz. Performance is measured by
scaling the number of threads from 1 to 40 threads pinned to
the 40 cores available for an average of hundred iterations
each, using all four planning heuristics; the best runtime
obtained among the different multithreaded executions is used
for comparison, as shown in Table I.
The performance of the FPGA is promising considering the
evaluation compares highly optimized libraries utilizing 40
core server CPUs. The latency of the FPGA implementation
can be reduced by overcoming the bottleneck with the 3D
transpose step. PCIe data transfer latency can be overcome us-
ing larger transforms, making use of the available bandwidth.
In order to evaluate the resiliency of AIMD simulations
to approximate floating point arithmetic, an interface to the
FPGA 3D FFT implementation is integrated into the CP2K
framework. This is then compared with the default double
precision CPU execution of 3D FFT using FFTW3 [6]. The
application chosen for evaluation uses the Gaussian and plane
2https://github.com/pc2/fft3d-fpga
TABLE I
COMPARISON OF RUNTIMES IN MILLISECONDS OF DIFFERENT 3D FFT
SIZES BETWEEN FFTW AND FPGA 3D FFT ALONG WITH THE LATENCY
OF PCIE DATA TRANSFERS BETWEEN HOST AND DDR MEMORY
N3 FFTW FPGAKernel Execution PCIe Data Transfer
163 0.01 0.11 0.05
323 0.03 0.22 0.21
643 0.14 0.74 0.87
wave DFT method to compute the molecular orbitals of a
single H2O molecule in the gas phase [7]. By restricting the
plane-wave density cutoff, the application is evaluated for the
specific 3D FFT sizes that matches the FPGA implementation.
Both experiments converge to nearly identical total energies,
leading to similar nuclear forces, thus showing that the AIMD
simulations are tolerant to approximations in floating point
arithmetic.
With these initial accomplishments, further work can be
focused on efficient acceleration of AIMD simulations by
developing a competitive 3D FFT design for FPGAs that
scales to larger FFT sizes, where FPGAs should have a
significant advantage. Moreover, investigating the tolerance
of AIMD simulations towards further approximation using
custom precision floating point units in FPGAs could help
achieve higher computational efficiency.
REFERENCES
[1] J. Romero, E. Phillips, G. Ruetsch, M. Fatica, F. Spiga,
and P. Giannozzi, “A performance study of quantum
ESPRESSO’s pwscf code on multi-core and GPU sys-
tems,” in Int. Workshop on Performance Modeling,
Benchmarking and Simulation of High Performance
Computer Systems, Springer, 2017, pp. 67–87.
[2] T. D. Ku¨hne et al., “CP2K: An Electronic Structure
and Molecular Dynamics Software Package – Quickstep:
Efficient and Accurate Electronic Structure Calculations,”
Journal of Chemical Physics, vol. 152, p. 194 103, 2020.
[3] B. Humphries, J. S. Hansen Zhang, R. Landaverde, and
M. C. Herbordt, “3d FFTs on a single FPGA,” in Proc.
Int. Symp. on Field-Programmable Custom Computing
Machines (FCCM), vol. 2014, 2014, p. 68.
[4] V. Rengaraj, M. Lass, C. Plessl, and T. D. Ku¨hne,
“Accurate sampling with noisy forces from approximate
computing,” Computation, vol. 8, no. 2, p. 39, 2020.
[5] T. Ku¨hne, M. Krack, F. Mohamed, and M. Par-
rinello, “Efficient and accurate Car-Parrinello-like Born-
Oppenheimer molecular dynamics simulations,” Phys.
Rev. Lett., vol. 98, p. 066 401, 6 2007.
[6] M. Frigo and S. G. Johnson, “The design and implemen-
tation of FFTW3,” Proceedings of the IEEE, vol. 93, no.
2, pp. 216–231, 2005.
[7] G. Lippert, J. Hutter, and M. Parrinello, “A hybrid
gaussian and plane wave density functional scheme,”
Mol. Phys., vol. 92, p. 477, 3 1997.
