On the Feasibility of FPGA Acceleration of Molecular Dynamics
  Simulations by Schaffner, Michael & Benini, Luca
On the Feasibility of FPGA Acceleration of
Molecular Dynamics Simulations
Technical Report (v0.1)
Michael Schaffner†, Luca Benini†
†ETH Zurich, Integrated Systems Lab IIS, Zurich, Switzerland
Abstract—Classical molecular dynamics (MD) simulations are
important tools in life and material sciences since they allow
studying chemical and biological processes in detail. However,
the inherent scalability problem of particle-particle interactions
and the sequential dependency of subsequent time steps render
MD computationally intensive and difficult to scale. To this
end, specialized FPGA-based accelerators have been repeatedly
proposed to ameliorate this problem. However, to date none of the
leading MD simulation packages fully support FPGA acceleration
and a direct comparison of GPU versus FPGA accelerated codes
has remained elusive so far.
With this report, we aim at clarifying this issue by comparing
measured application performance on GPU-dense compute nodes
with performance and cost estimates of a FPGA-based single-
node system. Our results show that an FPGA-based system can
indeed outperform a similarly configured GPU-based system, but
the overall application-level speedup remains in the order of 2×
due to software overheads on the host. Considering the price for
GPU and FPGA solutions, we observe that GPU-based solutions
provide the better cost/performance tradeoff, and hence pure
FPGA-based solutions are likely not going to be commercially
viable. However, we also note that scaled multi-node systems
could potentially benefit from a hybrid composition, where GPUs
are used for compute intensive parts and FPGAs for latency and
communication sensitive tasks.
I. INTRODUCTION
Classical molecular dynamics (MD) simulations [1] are im-
portant tools in life and material sciences since they allow
studying chemical and biological processes in detail. For
example, this enables researchers to study drug-target bindings
for drug discovery purposes [2] or to analyze protein folding
processes to understand their biological function [3].
However, MD is computationally intensive and difficult to
scale due to the sequential dependency between subsequent
timesteps and the many particle-particle interactions. The
timescales of interest are often orders of magnitude larger than
the simulation time steps (i.e., ns or µs timescales versus fs
time steps), which results in long simulation times even on
HPC computing infrastructure.
To this end, several approaches have been pursued to
improve simulation performance, ranging from novel algo-
rithms to approximate forces between particles [4] over al-
gorithmic tweaks [5] and biasing methods such as enhanced
sampling methods [6], to custom hardware solutions, such as
the MDGRAPE systems from Riken [7] and the Anton-1/2
supercomputers developed by D.E. Shaw Research LLC [3],
[8], [9]. However, we observe that algorithmic improvements
are often very problem specific, and it often takes a long time
until they are adopted by major production software packages.
Hence, the core algorithms used in classical MD simulations
have largely remained in recent years, and most simulation
speed improvements are due to the use of MPI parallelization
in combination with GPGPUs that handle the computationally
dominant parts. Specialized supercomputers such as the Anton
systems are very inaccessible and expensive, and are hence not
widely used today.
Besides these MPI and GPU-based solutions, FPGA accel-
erators have repeatedly been proposed as a viable alternative
to accelerate the compute intensive parts [10–22] . However,
these studies only show estimated or measured speedup with
respect to older CPU implementations. To date none of the
leading MD simulation packages fully support FPGA accel-
eration [23] and a direct comparison of GPU versus FPGA
accelerated codes has remained elusive so far.
This report aims at shedding some light onto the questions
whether and how FPGAs could be used to accelerate classical
MD simulations in the scope of biochemistry, and whether
such a solution would be commercially viable. To this end,
we revisit existing FPGA architectures, model their behavior
on current FPGA technology and estimate the performance
and price of an FPGA accelerated system in order to compare
with GPU accelerated solutions. We focus on single node
systems in this report (possibly carrying several accelerator
cards) since these represent the most common configuration
employed today1. Typical MD problems with in the order of
100k atoms do not scale well across several nodes, and hence
it is most economic to run these simulations on accelerator-
dense single node systems.
Our results show that, in principle, FPGAs can be used to
accelerate MD, and we estimate full application-level speedups
in the order of 2× with respect to GPU-based solutions.
However, our estimates also indicate that this speedup is
likely not high enough to compensate for the increased cost
and reduced flexibility of FPGA-based solutions. Hence we
conclude that FPGAs are likely not well suited as a replace-
ment for GPU accelerators. However, we observe that other
aspects of FPGAs like the low-latency networking capabilities
could be leveraged to augment scaled multi-node systems and
ameliorate scalability issues by providing network-compute
capabilities in addition to GPU acceleration.
This report is structured in three main sections: Sec. II
summarizes background and related work, performance bench-
marks of two widely used software packages are given in
Sec. III, and in Sec. IV we provide the FPGA estimates and
comparisons with GPU-based systems.
ar
X
iv
:1
80
8.
04
20
1v
1 
 [c
s.D
C]
  8
 A
ug
 20
18
II. BACKGROUND AND RELATED WORK
A. Classical MD
This section gives a brief overview of MD, for more details
we refer to [1], [2], [24], [25].
1) Simulation Loop and Force-Fields
A typical biomolecular MD simulation consists of a macro-
molecule that is immersed in a solvent (e.g., water). Each atom
in the system is assigned a coordinate xi, velocity vi and
acceleration ai. The aim of MD is to simulate the individual
trajectories of the atoms, and this is done by integrating the
forces acting on each particle. The integration is carried out in
discrete timesteps, with a resolution in the order of 2 fs, and
the forces are calculated by taking the negative derivative of
a potential function V w.r.t. to each particle coordinate xi:
fi (t) = miai (t) = −∂V (x (t))
∂(xi (t)
(1)
where the potential function is based on classical Newtonian
mechanics. The potential typically comprises the following
terms:
V =
bonds∑
i
kl,i
2
(li − l0,i)2 +
angles∑
i
kα,i
2
(αi − α0,i)2
+
torsions∑
i
(
M∑
k
Vik
2
cos (nik · θik − θ0,ik)
)
+
pairs∑
i,j
ij
((
r0,ij
rij
)12
− 2
(
r0,ij
rij
)6)
+
pairs∑
i,j
qiqj
4pi0rrij
(2)
Such a set of functions and the corresponding parameteriza-
tion is often also called a force-field (examples for specific
force field instances are CHARMM36m, GROMOS36 or AM-
BERff99). The first three terms in (2) cover the interactions
due to bonding relationships and are hence also called the
bonded terms. The last two terms cover the van der Waals
and electrostatic (Coulomb) interactions, and are hence called
the non-bonded terms. We can already anticipate from these
equations that the non-bonded terms will play a dominant
role during force calculation since it is a particle-particle (PP)
problem with O(N2). The bonded terms are usually much
less intensive (O(N)) since we do not have bonds among all
particles in the system.
Most simulations are carried out using periodic boundary
conditions (PBC) in order to correctly simulate the bulk
properties of the solvent. This means that the simulation cell
containing the N -particle system above is repeated infinitely
many times in all directions, which has implications on the
algorithms used for the non-bonded forces.
A typical MD simulation has an outer loop that advances the
simulation time in discretized steps in the order of fs, and in
each time step, all forces in the system have to be calculated,
1http://ambermd.org/gpus/recommended hardware.htm
the equations of motion have to be integrated, and additional
constraints2 have to be applied before restarting the loop.
2) Evaluation of Non-bonded Interactions
The first term with 1/r12 and 1/r6 modelling the Van der
Waals interaction decays very quickly. Therefore, a cutoff
radius r can be used (usually in the order of 8-16 Angstroms),
and only neighboring particles within the spanned sphere are
considered during the evaluation of this term.
The Coulomb interaction on the other hand is more difficult
to handle, as it decays relatively slowly (1/r). Earlier methods
also just used a cutoff radius, but this may introduce significant
errors. To this end, different algorithms that leverage the peri-
odic arrangement have been introduced. The most prominent
one is the Ewald sum method (and variations thereof). This is
basically a mathematical trick to split the Coulomb interaction
into two separate terms: a localized term that is evaluated
in space-domain, and a non-local term that is evaluated in
the inverse space (i.e., in the spatial frequency domain / k-
space). This decomposition enables efficient approximation
methods that have computational complexity O(N log(N))
instead of O(N2). One such method that is widely used
together with PBC today is called Particle-Mesh Ewald (PME)
[29–32]. Basically, the spatial term is handled locally with a
cutoff radius (same as for the Van der Waals term above),
and the second is approximated using a gridded interpolation
method combined with 3D FFTs (hence the O(N log(N)))
complexity). The spatial term is sometimes also called the
short-range interaction part, whereas the inverse space term is
called the long-range interaction part.
In terms of computation and communication, the non-
bonded terms are the heavy ones, and consist of two cutoff-
radius limited PP problems, and two 3D FFT problems. As
we shall see later in Sec. III, these non-bonded forces account
for more than 90% of the compute time in an unparallelized
MD simulation. Moreover, an intrinsic problem of MD is that
the simulation time steps are sequentially dependent, which
inherently limits the parallelizability, and means that we have
to parallelize a single time step as good as possible (which
essentially creates a communication problem). Note that the
dataset sizes are very manageable and only in the order of a
couple of MBytes.
3) Other Approaches
Today, the most serious contenders for replacing PME elec-
trostatics seem to be fast multipole methods (FMM) [29],
[33–36], multi-level summation methods (MSM) [4], [37] and
multi-grid (MG) methods [38]:
• FMM: The FMM makes use of a hierarchical tree decom-
position of simulation space and interacts particles with
multipole approximations of the farfield, thereby giving
rise to O (N) complexity. For low order expansions,
the multipole approximation of the electrostatic potential
function has large discontinuities. This leads to drifts
2Constraints are usually used to eliminate high-frequency oscillations of H
atoms that would lead to integration issues otherwise, see also [26–28].
of the total energy over time, and this violation of
energy conservation is not acceptable in MD applications.
To resolve this issue, one has to resort to high order
expansions that are more accurate, but also more complex
to calculate. With sufficiently high expansion orders, the
FMM can be slower [4], [30] than fast PME algorithms in
the important range of N = 10k - 100k particles (despite
the fact that FMM has better asymptotic complexity than
PME)3.
• MG: These methods discretize the Laplace operator and
solve the resulting linear system with a multigrid solver,
which results in linear complexity as well. however, these
methods suffer from relatively large discretization errors
compared with FFT methods, thereby leading to clear
energy drifts that need to be corrected for [33].
• MSM: This method splits the interaction kernels
smoothly into a sum of partial kernels of increasing
range and decreasing variability, and the long-range parts
are interpolated from grids of increasing coarseness, and
hence MSM provides O (N) complexity, too. The MSM
is relatively new, and hence not widely adopted yet. It
is unclear how it compares to PME in speed, but it is
shown to provide better performance than FMM in the
case of highly parallel computation and it can be used
for nonperiodic systems, where FFT-based methods do
not apply [4].
Due to the reasons mentioned, FMM, MSM and MG methods
have not yet been widely adopted by common software
packages. Another reason is that in a direct comparison
[33], FFT-based methods are still among the most efficient
in performance and stability – and hence it is difficult to
motivate a migration to an new experimental algorithm if it not
absolutely required. The log (N) contribution of FFT based
methods does not seem to be visible for common system sizes.
Hence, for single-node systems not operating at scale, FFT-
based methods still seem to be the best algorithmic choice for
simulations with PBC due to their efficiency and algorithmic
simplicity. As we will see later in Sections III and IV, one way
to address the communication bottleneck within PME can be
addressed by dedicating it to a single FPGA or GPU, where
it can be executed at very high speeds.
4) Classical versus Ab-Initio MD
Note that classical MD should not to be confused with ab-initio
MD (AIMD) that operates on quantum-mechanical potential
approximations. While AIMD has the advantages of being
more accurate and not requiring explicit parameterization, it is
also computationally much more involved and hence limited
to small MD systems comprising a few 100 to 1000 particles.
Classical MD works on a coarser abstraction4 by employing
classical mechanics in the form of force-fields that can be
evaluated more efficiently, hence allowing to simulate larger
3See also these LAMMPS and GROMACS forum entries: http://lammps.
sandia.gov/threads/msg72001.html, http://www.gromacs.org/Developer Zone/
Programming Guide/Fast multipole
4Mixed simulation modes where some potential calculations of classical
MD are augmented with QM are possible, but not considered in this report.
systems with up to several 100k atoms. Although classical MD
simluations are less accurate than their AIMD counterparts,
they are often accurate enough for many biomolecular systems,
where AIMD would not be feasible due to amount of particles
involved. The drawback of classical MD is that the force-fields
need to be parameterized with experimental data, and they can
often not model chemical reactions as the bonds are static.
An interesting new algorithmic development for AIMD-
based approaches is to use neural networks (NN) to accelerate
the evaluation of the quantum-mechanical potentials [39].
B. Dedicated ASICs for classical MD
There exist only a few dedicated systems that have been
built to enhance MD simulation performance. Two older ones
are the MD-ENGINE [40] and MDGRAPE-3 [41]. The MD-
ENGINEis a simple ASIC coprocessor that evaluates non-
bonded forces only (with a non-optimized Ewald sum method
which is O (N2)). Several such accelerators can be attached
to a SPARC host system that carries out the rest of the
calculations. Compared to newer architectures, the system does
not scale well since it is still based on the a direct Ewald
sum method. the interpolation units use a combination of
fixed point and extended single-precision (40 bit) FP formats.
MDGRAPE-3 is another MD co-processor chip which is
similar to the MD-ENGINE above in the sense that only the
non-bonded portion in the MD force calculation is accelerated,
and they still use the direct Ewald sum method, which is
accurate, but O(N2). However, the complete system is much
larger than previous ones: the complete system comprises
256 compute nodes equipped with 2 Itanium class CPUs and
two MDGRAPE boards with 24 chips each. This results in a
cluster with 512 CPUs and 6144 special purpose MDGRAPE
chips. Also, they use a ring interconnect topology on each
MDGRAPE board, which facilitates reduction operations such
as force summation on one board.
Apart from these older instances above, there are also a few
more recent incarnations of such special purpose machines
that leverage complete SoC integration to accelerate MD in
a more holistic fashion. The most notable machines are the
Anton-1 [3], [42], [43] and Anton-2 [9] computers by D.E.
Shaw Research LLC. The MDGRAPE-4 chip from Riken [7]
is another special purpose SoC for MD that has similarities
with the Anton chips, but the project does not seem to be as
successful as Anton.
Both Anton generations are similar from a conceptual
viewpoint, so we will mainly refer to numbers from Anton-2 in
the following. The system configuration comprises a set of 512
compute nodes, interconnected with a 8×8×8 3D Torus, and
each compute node consists of one Anton-2 ASIC fabricated in
40 nm. The 3D torus arrangement allows for natural problem
mapping using a domain decomposition and facilitates com-
munication in all directions. The Anton chips themselves are
quite innovative with respect to the previous MD accelerators
in the sense that they do not only contain fixed-function
accelerators, but both C++ programmable Tensilica processors
and specialized pairwise point interaction modules (PPIMs).
Each Anton-2 chip contains 64 simple 32 bit general purpose
processors with 4 SIMD lanes and two high-throughput inter-
action subsystem (HTIS) tiles containing an array of 38 PPIM
streaming accelerators each. The heterogenous arrangement
allows Anton to perform complete MD simulations without
having to closely interact with a host system. I.e., it can
also accelerate the other parts of the MD simulation (the
remaining 10% of the computations), thus enabling better
scalability. Moreover, Anton uses optimized PME algorithms
to circumvent the O(N2) issue of brute-force PP interactions.
[44] provides a detailed study of distributed 32×32×32 and
64×64×64 3D FFTs on Anton machines.
Interestingly, newer firmware versions running on Anton use
real-space convolution instead of 3D FFTs as this requires
less all-to-all communication phases that impact scalability.
Another interesting aspect to note is that external DRAM –
albeit supported by the Anton-2 chips – is not required even
in for large scale simulations, since the complete state of the
problem fits into a couple of MBytes, and hence into the
distributed on-chip SRAM available on the chips.
C. MD Acceleration using FPGAs
Most studies on FPGA accelerated MD target the PP calcu-
lations [10], [11], [13], [14], [20–22], since they make up for
over 90% of the sequential runtime in typical simulations.
Apart from these studies, only relatively few papers cover
PME and other aspects [16], [18], [19], [45].
We note that most work in the area has been done by
the group by M. Herbordt at Boston University, including
the design of several variations of PP interaction pipelines
[13], [14], [20], [21], [46] and a prototype where such a
PP interaction pipeline is integrated into a simplified variant
of NAMD [14]. Their PP interaction pipelines makes use
of particle filters that build the neighborlists on-the-fly. In
contrast to software-based implementations, such a filter-based
approach does not require large neighborlist buffers and can be
efficiently implemented using systolic filter arrays and reduced
arithmetic precision in hardware (the Anton PPIMs employ a
similar approach). Their preferred PP design employs 1st order
piece-wise polynomial interpolation with around 6 segments
and 256 LUT entries per segment [20]. Since their prototype
system uses relatively dated FPGA technology (Stratix III),
it is difficult make comparisons with current GPU-based
solutions, and updated performance estimates are required
(no rigorous performance comparison with contemporary GPU
technology has been carried out in their study). In [18], [19],
it is shown that commonly used 3D FFTs in the order of 643
can be conveniently solved on single FPGAs at competitive
speeds in the range of a few 100 µs. In [16] they present an
interpolator design able to carry out the charge spreading and
force gathering stems within PME, and in [45] a preliminary
analysis of bonded-force computation on FPGAs is performed.
The work by Kasap et al. [22] is another attempt to
implement a production grade MD accelerator using FPGAs
which is not from the Herbordt group. They target the
Maxwell platform, an FPGA based multi-node system that
has similarities with Microsoft’s Catapult [47], [48] (albeit
using more dated technology, i.e. Virtex 4 FPGAs). The
overall system architecture template is very similar to the
one used in this report. However, they only accelerate non-
bonded pair interactions on the FPGAs and do not use inter-
FPGA communication. Although they are able to accelerate
the interactions, the overall system is not competitive due
to a very limited bandwidth between host processor and the
SD RAM on the FPGA card (PCI-X bus, with only around
500MB s−1 for two FPGA cards). Further, they transfer many
parameters that could be shared among several particles on
a per atom basis onto the the accelerator which leads to
significant I/O overhead. Virials and potential energies are
calculated in each iteration, which is typically not required.
Due to these factors, their FPGA solution is actually slower
than the software baseline, since about 96% of the time is spent
in communication. Apart from the apparent improvements
in data communication volume (redundant parameters), the
situation in connectivity is quite different today, where PCIe
and NVLink provide much higher bandwidth between host and
device. Further, modern FPGAs provide enough fast on-chip
RAM resources such that no external SRAM/DRAM has to
be used (this is another performance bottleneck of the system
by Kasap et al.).
In general, we can observe that FPGA designs have to
leverage fixed-point arithemtic and operator fusing wherever
possible (e.g. in the interpolators and for particle coordinates).
Floating-point arithmetic should only be used where absolutely
needed (e.g., final force values) as these cause high pipeline
latencies and require a lot of LUT resources. According to
literature [20], the limiting resource types on modern FPGAs
are likely going to be the LUTs and registers in this context,
and the amount of DSP slices and memory blocks are of
secondary concern. Further, fixed-point arithmetic enables to
reach higher clock rates and maps better to DSP resources
than FP. Note that the Anton chips almost exclusively employ
fixed-point datapaths, too, since the use of fixed-point not
only improves performance, but also has a positive impact
on testability and reproducibility (out-of-order accumulations
remain bit-true, for instance).
We base the FPGA estimates for the PP interaction and
PME modules in this report on the resource figures provided
by Herbordt’s group, as these appear to be the best references
that can be found in literature.
D. High-Level Synthesis
High-level synthesis (HLS) and related methods for FPGAs
are often advertised as great productivity enhancers that sig-
nificantly reduce the complexity of design-entry compared to
traditional HDL flows. From experience, we can say that this
is not always the case, as the benefit of HLS is quite design
dependent. However, the results presented by [23], [49–51]
suggest that HLS or OpenCL based flows can indeed be used
for certain parts of the PP and PME units.
For example, Saunullah et al. [49] compare an IP-based de-
sign flow with FFT IP’s from Altera to an OpenCL kernel, and
report quite large improvements in terms of overall resources
(10× fewer ALMs, 25× less on-chip memory). While this
needs to be interpreted with care (the paper does not reveal
TABLE I
TEST NODES USED FOR BENCHMARKING. THE P[2,3].[2,8,16]XLARGE
NODES ARE COMPUTE INSTANCES AVAILABLE ON AWS.
Node Name CPU (GHz) vCPUs† GPU Config
Sassauna E5-2640v4 (2.4) 40 4×GTX1080Ti
P2.2xlarge E5-2686v4 (2.3) 8 1×K80‡
P2.8xlarge E5-2686v4 (2.3) 32 4×K80‡
P2.16xlarge E5-2686v4 (2.3) 64 8×K80‡
P3.2xlarge E5-2686v4 (2.3) 8 1×V100 SMX2
†
The number of vCPUs (virtual CPUs) corresponds to the
amount of concurrent threads.
‡
The K80 is a dual GPU card,
so the effective number of GPUs is double the listed number.
many implementation details), it seems to be reasonable that
such an approach yields good results in this setting. The FFT
is quite a regular algorithm which can be well described with
OpenCL, and the Altera FFT IP’s incur some overhead due to
internal buffering, interfaces, etc. which are not needed.
On the other hand, Saunullah et al. [50] attempt to use
OpenCL to design the same PP interaction pipeline that they
developed with HDL in an earlier papers [14], [20], [21]. Their
conclusion is that for such a highly tuned datapath OpenCL
does not provide competitive results (”...the OpenCL versions
are dramatically less efficient, with the Verilog design fitting
from 3.5× to 7× more logic.”). The work by Cong et al. [23]
is a similar study that investigates a Xilinx HLS flow for the
same PP interaction module. Their experiments reveal that part
of the inefficiency of HLS for this particular module is the fact
that HLS tools currently have difficulties producing efficient
results for datapaths with dynamic data flow behavior where
conditional execution exists within a processing element. Their
solution is to describe certain parts critical for dynamic data
flow in HDL, and generate the remaining subblocks of the
datapath using HLS.
So based on our own experience and the above literature,
we can say that OpenCL or HLS can increase productivity, and
yield good results for algorithms that can be described well and
that have regular compute patterns (e.g., systolic dataflow, no
feedback loops, no dynamic control behavior). But for designs
that are difficult to describe and tune with OpenCL such as
parts of the force-pipeline a highly tuned HDL implementation
turns out to be more efficient. An interesting design approach
is to use a mix between both methodologies, leveraging
HLS (and its automated verification functionality) for small
datapath subblocks that are then connected and orchestrated
using an HDL wrapper.
E. Vendor and Market Overview
Several well known high-performance software packages come
from the academic sector and some of them like GROMACS
[52–54] and LAMMPS [55] are released freely under open-
source licences. Others like such as AMBER [56], NAMD
[57] and CHARMM [58] provide reduced or free academic
licenses and require full licensing for commercial purposes.
Besides these codes originating from academia, there are
a couple of companies that develop and sell MD simulation
software such as Biovia from Dassault Systemes, AceMD
from Acellera Ltd [59], Yasara [5], and Desmond from Shaw
Research LLC [60] to name a few (see this market report
for more info [61]). Some of the main differentiators of these
commercial software packages comprise:
• Workflow integration and collaborative tools (e.g., Das-
sault Systems),
• Better visualization and GUI integration (e.g., Dassault
Systems, Yasara),
• Performance tweaks for workstation systems (e.g.,
Yasara, Acellera),
• Extreme scalability on clusters (e.g., Shaw Research),
• Costumer support and consulting options (all of them).
In terms of infrastructure, some companies sell application
certified workstations and server blades, such as Exxact-
Corp5 (GROMACS and AMBER workstations) and Acellera
(AceMD MetroCubo)6. Another interesting trend in this field
seem to be cloud services [61], [62]. E.g., Acellera AceMD
has built-in support for Amazon Web Services (AWS), that
allows users to conveniently outsource simulation runs with a
single command. The advantage of such cloud solutions are
scalability and low up-front costs, which can be attractive for
small labs and/or labs that have large variations in workload.
Otherwise on-prem solutions can be more cost-effective.
The overall market however can be considered a niche mar-
ket, since there are only a few 1000 to 10’000 users worldwide
that use such specialized software packages. According to
Goldbeck et al. [63], the overall spending on scientific software
is in the order of 0.1% of the total sector R&D spending,
which would amount to roughly 100M$ in pharma/biotech and
50M$ in the chemicals/materials industry in Europe. Now this
is an estimate for the total spending, so for a specific software
package the market size will only be a fraction of that. So
assuming a share in the order of 1% we can estimate that the
market for a specific software package is likely in the order
of several 100k$ to a few M$.
In this report, we use AMBER and GROMACS as bench-
mark baselines, since these provide very competitive runtimes
on single-node systems and are widely used in the community.
In addition to on-prem solutions, we also consider cloud
infrastructure, since FPGAs have recently become available as
a service (FaaS), in the form of the AWS F1 instance7. Edico
Genomics8 is an example for a company that successfully uses
F1 instances to accelerate Genome sequencing.
III. SOFTWARE BENCHMARKS
In this section we assess the performance of two widely
used GPU-accelerated MD packages using three different
benchmarks and recent hardware. This is done to get a solid
baseline for the comparisons with the FPGA performance
estimates in Sec. IV.
5https://www.exxactcorp.com/GROMACS-Certified-GPU-Systems
6https://www.acellera.com/
7https://aws.amazon.com/ec2/instance-types/f1/
8http://edicogenome.com/
TABLE II
MD SYSTEMS USED FOR BENCHMARKING.
Protein ID 4J3I 2LEM 2N5E
# Particles 35’022 91’030 157’853
System Size [A˚] ˜723 ˜993 ˜1193
Time Step [fs] 2 2 2
Cutoff Radii [A˚]† 12 12 12
PME Grid† 603 843 1003
PME Tuning Stepsg 6k 6k 6k
Benchmark Stepsg 9k/14k‡ 9k/14k‡ 9k/14k‡
Benchmark Stepsa 10k 10k 10k
† These are the initial values before load balancing, see also [64].
‡ CPU/GPU runs. g GMX runs. a AMBER runs.
A. MD Packages and Hardware Configuration
We use GROMACS 2018.1 and a licensed version of AMBER
16 (with AmberTools 17 and all patches that where available
by the end of June 2018) in this report, since these are widely
used and are among the fastest software packages for single
node systems. Both packages have been compiled from scratch
on Linux 14.06 LTS with GCC 5.5.0. GROMACS has been
compiled in mixed-precision mode for single node systems
using the built-in thread-MPI support and with FFTW 3.3.5
and CUDA 9.2. AMBER has been built in mixed-precision as
well with MPI support, FFTW 3.3 and CUDA 9.0.
We used one on-site machine and several different Amazon
Webservices (AWS) configurations to run the benchmarks. The
machine configurations are listed in Tbl. I. All machines have
almost identical processor models (only the core count and
maximum frequency differs slightly), and all machines ran
Linux 14.06 LTS. The configuration of the on-site machine
Sassauna is in line with the GROMACS benchmarking paper
by Kutzner et al. [64], where it has been found that a dual Xeon
machine with around 10 cores per socket and 2-4 customer
grade GPUs provides the best cost-efficiency in terms of ns/$.
B. Benchmark Systems
Existing benchmarks are often MD software specific (in terms
of input files), and hence we chose to recreate the input
files from scratch to make sure we are simulating the same
system. To accomplish that we used the CHARMM-GUI9
online tool [65], [66]. The benchmarks are listed in Tbl. II,
and consist of three different proteins that have been solvated
in a box of water. The amount of particles range from 35k to
157k, which represents typical problem sizes in biochemistry
today. The two smaller systems are similar to the DHFR and
Apoa1 benchmarks often used in literature in context with the
AMBER and NAMD software packages.
9Input generator on http://www.charmm-gui.org/
C. Benchmarking Results
We use the standard protocols provided by CHARMM-GUI
to equilibrate the systems. The equilibrated systems are then
simulated in the NPT ensemble for the amount of steps
specified in Tbl. II. In GROMACS, a Nose-Hoover thermostat
is used in combination with Parinello-Rahman pressure con-
trol and in Amber, a Langevin Thermostat is used together
with Montecarlo Pressure control (these were the default
settings from CHARMM-GUI). The force-fields employed
is CHARMM36m, which uses a TIPS3P water model (with
Coulomb and LJ interactions on the O and H atoms). Energies
are calculated every 100 and stored every 1k steps in both
cases. For each software package, we run different paralleliza-
tion options to find the optimal one.
1) GROMACS Benchmarks on Sassauna
Similar as described in [64], we sweeped different MPI
rank/openmp thread combinations, in combination with differ-
ent numbers of GPUs. The results are shown in Fig. 1a,c,e. For
standard GPU accelerated runs (blue lines) where the particle-
particle (PP) interactions are mapped to the GPUs and the
PME is handled on the CPUs10, we observe speedups in the
order of 4× to 6× with respect to the CPU-only version (green
line). Note that in GROMACS 2018 update, a new feature
became available that allows to map the PME evaluation to one
of the GPUs, which leads to significant runtime improvement
(see orange lines) on single-node systems as the costly 3D FFT
communication phases are now contained within one device
instead of being spread accross many CPU cores. This leads
to a performance improvement of another 2×.
2) AMBER Benchmarks on Sassauna
As can be seen in Fig. 1b,d,f, the CPU-only implementation
is quite slow when compared to GROMACS. However, the
GPU accelerated configuration with 2 GPUs provides similar
performance as GROMACS with 1-2 GPUs using the new
GPU PME feature. For more GPUs, the performance does not
scale on this system. Note that as opposed to GROMACS,
AMBER can run the whole simulation loop on the GPUs
without involving the CPU cores. In multi-GPU runs, several
GPUs communicate via the PCIe bus. This is only working
well for up to 2 cards in this machine, since each pair is
connected to one XEON socket, and connections bettween
these pairs have to go over QPI. Compute nodes with NVLINK
or optimized PCIe infrastructure will likely perform better on
such runs11. The advantage of AMBER is not necessarily the
speed of a single MD run (GROMACS is faster in that case),
but the fact that the no expensive multi-core CPUs have to
be used in order to get decent performance as in GROMACS.
This allows to run many simulations in parallel in a very cost-
effective manner.
Note that we used the licensed version of AMBER16 for
these benchmarks with all updates as of June 2018. There
also exists AMBER18 that has been released recently, and
10This is done without separate PME ranks in this case.
11See http://ambermd.org/gpus/benchmarks.htm, http://ambermd.org/gpus
R1
0 /
 T4
R8
 / T
5
R5
 / T
8
R4
 / T
10
R2
 / T
20
Rank / Thread Config
0
20
40
60
80
100
120
140
160
180
200
220
240
260
280
300
Pe
rfo
rm
an
ce
 [n
s/d
]
4J3I-35k NPT sim (dt=2fs, cuto=12A) 
gmx2018.1, sassauna, T40G4
0 GPUs
1 GPUs
2 GPUs
3 GPUs
4 GPUs
1 GPUs + GPU-PME
2 GPUs + GPU-PME
3 GPUs + GPU-PME
4 GPUs + GPU-PME
(a)
40
 M
PI 
Pro
cs
1 G
PU
2 G
PU
s
3 G
PU
s
4 G
PU
s
0
20
40
60
80
100
120
140
160
180
200
220
240
260
280
300
pe
rfo
rm
an
ce
 [n
s/d
]
4J3I-35k 
amber16+amberTools17, NPT sim (dt=2fs, cutoff=12A) 
(b)
R1
0 /
 T4
R8
 / T
5
R5
 / T
8
R4
 / T
10
R2
 / T
20
Rank / Thread Config
0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
Pe
rfo
rm
an
ce
 [n
s/d
]
2LEM-91k NPT sim (dt=2fs, cuto=12A) 
gmx2018.1, sassauna, T40G4
0 GPUs
1 GPUs
2 GPUs
3 GPUs
4 GPUs
1 GPUs + GPU-PME
2 GPUs + GPU-PME
3 GPUs + GPU-PME
4 GPUs + GPU-PME
(c)
40
 M
PI 
Pro
cs
1 G
PU
2 G
PU
s
3 G
PU
s
4 G
PU
s
0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
pe
rfo
rm
an
ce
 [n
s/d
]
2LEM-91k 
amber16+amberTools17, NPT sim (dt=2fs, cutoff=12A) 
(d)
R1
0 /
 T4
R8
 / T
5
R5
 / T
8
R4
 / T
10
R2
 / T
20
Rank / Thread Config
0
10
20
30
40
50
60
70
80
90
Pe
rfo
rm
an
ce
 [n
s/d
]
2N5E-158k NPT sim (dt=2fs, cuto=12A) 
gmx2018.1, sassauna, T40G4
0 GPUs
1 GPUs
2 GPUs
3 GPUs
4 GPUs
1 GPUs + GPU-PME
2 GPUs + GPU-PME
3 GPUs + GPU-PME
4 GPUs + GPU-PME
(e)
40
 M
PI 
Pro
cs
1 G
PU
2 G
PU
s
3 G
PU
s
4 G
PU
s
0
10
20
30
40
50
60
70
80
90
pe
rfo
rm
an
ce
 [n
s/d
]
2N5E-158k 
amber16+amberTools17, NPT sim (dt=2fs, cutoff=12A) 
(f)
Fig. 1. Performance of AMBER and GROMACS with different #thread vs. #GPU configurations on the Sassauna machine.
sa
ss
au
na
, 2
xla
rge
sa
ss
au
na
, 8
xla
rge
sa
ss
au
na
, T
40
G4
aw
sP
3.2
xla
rge
, 2
xla
rge
aw
sP
2.1
6x
lar
ge
, 2
xla
rge
aw
sP
2.1
6x
lar
ge
, 8
xla
rge
aw
sP
2.1
6x
lar
ge
, 1
6xl
arg
e
0
20
40
60
80
100
120
140
160
180
200
220
240
260
280
300
pe
rfo
rm
an
ce
 [n
s/d
]
4J3I-35k 
gmx2018.1, NPT sim (dt=2fs, cutoff=12A) 
(a)
sa
ss
au
na
, 2
xla
rge
sa
ss
au
na
, 8
xla
rge
sa
ss
au
na
, T
40
G4
aw
sP
3.2
xla
rge
, 2
xla
rge
aw
sP
2.1
6x
lar
ge
, 2
xla
rge
aw
sP
2.1
6x
lar
ge
, 8
xla
rge
aw
sP
2.1
6x
lar
ge
, 1
6xl
arg
e
0
20
40
60
80
100
120
140
pe
rfo
rm
an
ce
 [n
s/d
]
2LEM-91k 
gmx2018.1, NPT sim (dt=2fs, cutoff=12A) 
(b)
sa
ss
au
na
, 2
xla
rge
sa
ss
au
na
, 8
xla
rge
sa
ss
au
na
, T
40
G4
aw
sP
3.2
xla
rge
, 2
xla
rge
0
20
40
60
80
pe
rfo
rm
an
ce
 [n
s/d
]
2N5E-158k 
gmx2018.1, NPT sim (dt=2fs, cutoff=12A) 
(c)
Fig. 2. Comparison of the performance of GROMACS on different AWS instances and the corresponding configurations on the Sassauna machine. The largest
benchmark has not been run on AWS due to long benchmark runtimes.
that likely includes further optimizations for the Pascal and
Volta generation GPUs.
3) GROMACS Benchmarks on AWS
As mentioned in the introduction, cloud infrastructure/software
services (IaaS/SaaS) are interesting alternatives to on-site
solutions. Hence, we also benchmark a few AWS instances
in order to be able to include them in the performance cost
comparison later on in Sec. IV.
At the time of writing, the demand for P3 instances with
V100 SMX2 cards was extremely high, and hence we only
got access to one 2xlarge instance with one GPU. As such, the
performance improvements for multi-GPU runs with V100’s
has to be extrapolated from these measurements. As we can
observe in Fig. 2 the improvement is in the order of 20% with
respect to the GTX1080Ti for the larger benchmarks. This
improvement seems to be reasonable, since the raw increase in
FLOP/s is around 35%. We can also see that the P2 instances
with K80 GPUs are significantly slower than the P3 instance
or Sassauna.
4) GROMACS Breakdown of Simulation Loop
The walltime breakdown for the three different benchmarks is
shown in Tbl. III in % for three different cases: single threaded
(CPU-only), multi-threaded with PME on CPUs, and multi-
threaded with both PP interactions and PME on GPUs.
The walltime breakdown for the single-threaded case is
shows the well-known picture: the compute time is mainly
dominated by non-bonded force evaluations, which accounts
for more than 96%, including PME. The force time also
includes bonded forces in this case, but the fraction is in-
significant compared to the non-bonded part.
In the multi-threaded case with PP interactions on the GPU,
the picture already changes quite a bit. The runtime of the PP
interaction kernels is not visible since they are executed in
parallel to the CPU code, for which the walltime accounting
is performed. The breakdown is a bit more difficult to read,
but we see that PME starts to become the dominant factor
(more than 50%)..
In the third case, we note that other parts besides the PP
interactions and PME are starting to become significant as
well. In particular the operations that cannot be overlapped
with the accelerated PP and PME calculations (such as domain
decomposition, reduction operations, trajectory sampling, in-
tegration, global energy communication) start to amount for
a significant percentage of overall walltime (around 25% in
this benchmarks). As we shall see in Sec. IV on hardware
performance estimates, this inherently limits the maximum
acceleration that can be achieved by using a co-processor
solution. This issue is not exclusive to GROMACS, and has
also been noted, e.g., in [67] with NAMD.
It is worth noting that the constraints step is required to keep
high-frequency oscillations of bonds involving light atoms
(H) under control. Otherwise, significantly smaller timesteps
than the currently employed 2 fs have to be employed in
order to ensure correct integration and prevent the simulation
from blowing up. These constraints are implemented using a
combination of efficient parallel algorithms in GROMACS (P-
LINCS for general bonds [26], [27], analytic SETTLE [28] for
water molecules).
IV. PERFORMANCE ESTIMATIONS AND COMPARISONS
In this section, we estimate the performance of FPGA ac-
celerated systems, compare them with GPU-based solutions
TABLE III
WALL-TIME ACCOUNTING AS REPORTED BY GROMACS FOR DIFFERENT
CONFIGURATIONS (IN PERCENT).
Single Threaded
Compute Steps 4J3I-35k 2LEM-91k 2N5E-158k
Force 87.7 86.2 86.3
PME mesh 8.4 9.7 9.9
NB X/F buffer ops. 0.2 0.4 0.4
Write traj. 2.9 3.0 2.6
Update 0.2 0.3 0.3
Constraints 0.4 0.4 0.4
Rest 0.1 0.1 0.1
Multi Threaded (40T), PP Accelerated (4 GPUs)
Compute Steps 4J3I-35k 2LEM-91k 2N5E-158k
Domain decomp. 1.7 2.1 2.1
DD comm. load 0.0 0.0 0.0
DD comm. bounds 0.0 0.1 0.0
Neighbor search 1.2 1.4 1.4
Launch GPU ops. 14.2 9.5 5.9
Comm. coord. 9.4 12.2 8.9
Force 3.6 2.8 2.9
Wait + Comm. F 10.0 10.0 9.1
PME mesh 44.8 46.9 52.6
Wait GPU NB nonloc. 0.8 0.3 0.6
Wait GPU NB local 0.4 0.3 0.3
NB X/F buffer ops. 2.8 3.4 4.3
Write traj. 1.1 0.5 0.9
Update 0.9 1.0 2.5
Constraints 7.0 6.2 6.7
Comm. energies 1.1 2.4 0.8
Rest 1.0 0.7 0.9
Multi Threaded (40T), PP+PME Accelerated (4 GPUs)
Compute Steps 4J3I-35k 2LEM-91k 2N5E-158k
Domain decomp. 3.2 3.9 3.7
DD comm. load 0.0 0.0 0.0
Send X to PME 4.7 5.9 6.2
Neighbor search 1.9 2.1 2.1
Launch GPU ops. 14.4 7.9 5.0
Comm. coord. 14.0 14.7 14.1
Force 6.4 5.3 4.6
Wait + Comm. F 15.2 13.3 11.0
Wait + Recv. PME F 3.7 4.9 8.8
Wait PME GPU gather 8.7 13.1 14.6
Wait GPU NB nonloc. 6.5 7.6 7.3
Wait GPU NB local 0.6 0.3 2.2
NB X/F buffer ops. 4.3 4.6 4.3
Write traj. 2.2 1.9 2.0
Update 2.2 3.0 3.4
Constraints 11.8 11.3 10.6
Comm. energies 0.1 0.1 0.1
in terms of performance/cost and discuss the technical and
commercial implications.
A. Considered System Architectures
In order to narrow down the system-level architectural tem-
plates to focus on, we first make a couple of observations:
• While some kernels of the simulation loop can be well
implemented in HW (PME, PP interactions), there are
other parts such as the enforcement of constraints, bonded
interactions and integration (double precision) which are
MCH
XEON Host 
CPU0
FPGA
Card 0
FPGA
Card 1
FPGA
Card ..
FPGA
Card ..
MCH
XEON Host 
CPU1
FPGA
Card ..
FPGA
Card ..
FPGA
Card ..
FPGA
Card N
QPI
PCIe x40 PCIe x40
x16
Raw BW: ~39.5GB/s
Raw BW:
~15.8GB/s
x16Raw BW:
~15.8GB/s
Raw BW:
9.6GT/s=19.2GB/s
Bidirectional Ring with 200 or 400GBps (raw) in each direction
(b) Uniform Allocation:
(c) Non-Uniform Allocation:
PME PME PME PME PME PME PME PME
PME
PP PP PP PP PP PP PP PP
PP PP PP PP PP PP PP
(a) System Architecture Template:
...
Fig. 3. Architectural template used for the performance estimations and
different PP/PME unit allocations.
better suited for CPUs due to the control flow and
flexibility required for algorithmic changes. If we stick to
this partitioning, this means that the simulation loop will
always involve a complete state exchange (coordinates in,
forces out) in each simulation step.
• Bandwidth from CPU to logic and vice versa is similar for
a PCIe card (15GB s−1 raw bandwidth) as for an ARM
system on a SoC like Stratix 10, Zynq SoC (2×128 bit
plug at, say 400MHz gives 12.8GB s−1 of raw band-
width in total). The only benefit of a SoC will likely be
latency, but the embedded CPUs are not as capable as on
a server class system. Further, latency is the smaller evil
on small single node systems – it is the bandwidth that is
important. I.e., to transfer 2MB of simulation state over
12GB s−1 166 µs, while link latency is only in the order
of a few microseconds.
• As we have seen in Sec. III, a capable server class CPU
is required to handle the non-HW-accelerated part of the
simulation within the available time budget. An ARM-
based SoC system will likely be too slow.
Hence, it makes sense to stick to a system architecture tem-
plate of the form Xeon Class Server plus N FPGA cards with
own interconnect. This is also aligned with what is currently
available in cloud services (AWS F1 for instance), and such a
system is depicted in Fig. 3.
We do not consider IBM Power 8/9 systems in this report
as due to their cost and limited availability. Judging from the
literature, most people in this space seem to be used to x86
machines, and these machines together with customer grade
GPUs are more affordable.
B. Estimated System Components
As explained before, we assume the architectural template
shown in Fig. 3a). For the FPGA cards, we base the resource
and performance estimates on reported values in literature, as
explained below.
1) Interconnects
The assumed architectural template consists of two XEON
host CPUs that can host a maximum of 4 cards at full x16
PCIe bandwidth, or 8 cards at x8 bandwidth. For the PCIe
link efficiency (framing and other overheads), we assume a
bandwidth efficiency of 75% and a latency of 10 µs. Further
we assume that the FPGA cards are interconnected with a
bidirectional ring. This is readily possible due to the high
amount of transceivers available on todays high-end FPGAs.
In fact, almost every PCIe FPGA card features at least one
QSFP port. As explained later, the targeted FPGA platforms
either provide two or four QSFP28 cages, allowing for raw
ring bandwidths of 200 or 200-400Gbit. The assumed link ef-
ficiency is 85% including framing and packetizing overheads,
and a link latency of 0.5 µs is assumed.
2) PP Interaction Pipelines
We base our estimates on the PP interaction pipeline by M.
Herbordt’s group [20], [21] that employs particle filters that
generate the neighbor lists on-the-fly. This architecture is well
suited for hardware implementations, and does not require as
much memory as an implementation with explicit neighbor-
lists. The employed arithmetic is a mix between fixed-point
and single-precision floating-point, and has been optimized for
FPGAs. The employed domain decomposition is an optimized
variant of the half shell method. Better methods with less
inter-cell communication volume exist (e.g., neutral territory
and mid-point methods [60]), but these are likely to show
their benefits only in the highly scaled regime (which is
not the scope of this evaluation). According to the results
of Herbordt et al., the preferred design employs first-order
interpolation with piecewise polynomials, and this design
amounts to around 14.5k ALMs and 82 DSP multipliers on
a Stratix IV (including filters, LJ and Coulomb datapath,
distribution and accumulation logic). We estimate the needed
amount of memory separately as described further below.
The PP interaction pipelines can calculate one interaction
per cycle. In terms of utilization, the authors reported that it
is possible to achieve relatively high percentages in the order
of 95% by properly mapping the particles to the filters and
LJ/Coulomb force datapaths. However, since in our evaluation
we deal with many more parallel pipelines (several 100 instead
of several tens), we assume a slightly reduced utilization of
80% to account for potential imbalances.
The required RAM resources have been estimated by as-
suming that each particle position entry requires three 32 bit
coordinate entries plus one 32 bit entry holding metadata like
atom ID and type (we assume that atom specific LJ interaction
values and charge parameterizations can be compiled into
ROMs and indexed by these atom type IDs at runtime). The
particle force accumulation entries are assumed to comprise
TABLE IV
ASSUMED COSTS FOR THE COMPARISON. EQUIPMENT AND SOFTWARE
PACKAGES ARE AMORTIZED OVER 5 YEARS (STRAIGHT LINE), AND THE
ELECTRICITY IS ASSUMED TO COST 0.1$ PER KWH. FURTHER, FOR EACH
ON-SITE SOLUTION WE DOUBLE THE ELECTRICITY COST IN ORDER TO
ACCOUNT FOR COOLING.
Component Cost [$] [kW] Details
Server with 4×PCIe† 8’000 0.45 Dual E5-2640v4
Server with 8×PCIe† 11’000 0.45 Dual E5-2640v4
P100 6’000 0.25 Pascal Generation
GTX1080Ti 800 0.25 Pascal Generation
TITAN-XP 1’350 0.25 Pascal Generation
V100 SMX2 10’000 0.25 Volta Generation
TITAN-V 3’600 0.25 Volta Generation
VCU1525 (VU9P) 4’500 0.25 200GBit Ring
XUPP3R (VU9P) 10’000 0.25 400GBit Ring
XUPP3R (VU13P) 15’000 0.25 400GBit Ring
DE10-PRO (GX2800) 15’000 0.25 400GBit Ring
p3.2xlarge‡ 3.305 – 1×V100
p3.8xlarge‡ 13.22 – 4×V100
f1.2xlarge‡ 1.815 – 1×XUPP3R (VU9P)∗
f1.16xlarge‡ 14.52 – 8×XUPP3R (VU9P)∗
Amber License+ 20’000/4 – Commercial, Site
† Assuming the same dual Xeon configuration of the Sassauna
node used for benchmarking. ‡ For the AWS instances, the
prices are per hour. + The Amber license is a site license.
We assume here for simplicity that a site consists of 4 nodes
to amortize the license cost. ∗ Or a similar FPGA card that
provides up to 400Gbit ring interconnection capability.
three 32 bit values. The estimation for memory usage includes
all interpolator lookup tables, atom property ROMs, all posi-
tion+metadata entries and private accumulation buffers of the
PP interaction pipelines.
3) PME Unit
Herbordt’s group demonstrated as well that it is possible to
fit a 3D FFT unit with 643 grid points onto recent FPGAs
[18], [19]. The dominant factor here are the 1D FFT macros
provided by the FPGA vendors. Hence, we use the complexity
of these vendor provided macros to estimate the resources for
the 3D FFT part. Since our FFTs sizes are in the order of
843 grid points, we assume that the resource consumption
is similar to 128-point FFT macros, i.e., 5k LUT, 32 DSP
multipliers and 16kBit memory on a VUP FPGA12. The
latency corresponds to the amount of samples to be calculated.
For the particle-to-grid and grid-to-particle interpolators,
we use the results reported in [16], where an optimized
single-cycle datapath is designed and implemented. One such
unit requires 51k ALMs 192 DSP blocks (= 2 × 192 DSP
multipliers) on a Stratix V FPGA. Further, we assume that this
design can be optimized such that the same interpolators can
be used for both interpolation directions, as well as the PME
solution step in the frequency domain that entails a point-wise
multiplication with pre-computed constants.
12Note that these implementations likely use a Cooley-Tukey FFT that can
only be used for lengths that are powers of two. In practice, a split-radix FFT
would be required to handle other FFT lengths.
TABLE V
CONSIDERED FPGA CONFIGURATIONS (EACH LISTED SOLUTION EMPLOYS 7 PP FPGA CARDS AND 1 PME FPGA CARD).
Evaluated Configurations
Platform/Board VCU1525 XUPP3R XUPP3R DE10 Pro f1.x16large‡
FPGA VU9P VU9P VU13P 1SGX280 VU9P
Cutoff [A˚]† 12.6 12.3 12.6 12.0 12.0
PME Grid† 803 823 803 843 843
vCPUs 40 40 40 40 64
Core Freq. [MHz] 400 400 400 700 400
Ring BW (raw/eff) [GBit] 200/170§ 400/340§ 400/340§ 400/340§ 400/340§
PCIe BW (raw/eff) [GBit] 63/54+ 63/54+ 63/54+ 63/54+ 63/54+
PP Pipelines / FPGA 66 66 108 42 58
Grid Interpolators / FPGA 10 10 18 6 9
Mixed Radix FFT Units / FPGA 64 64 96 64 64
Estimated Resource Utilization
PP FPGA Logic [kLUT] 887.9 (75.1%) 885.8 (74.9%) 1452.9 (84.1%) 701.0 (75.1%) 782.1 (85.6%)
PP FPGA DSP 2706.0 (39.6%) 2706.0 (39.6%) 4428.0 (36.0%) 1722.0 (14.9%) 2378.0 (42.2%)
PP FPGA Memory [MBit] 78.3 (23.0%) 78.3 (23.0%) 124.4 (27.4%) 46.7 (20.4%) 62.2 (36.3%)
PME FPGA Logic [kLUT] 846.9 (71.6%) 846.9 (71.6%) 1428.4 (82.7%) 711.3 (76.2%) 794.2 (86.9%)
PME FPGA DSPs 5888.0 (86.1%) 5888.0 (86.1%) 9984.0 (81.2%) 4352.0 (37.8%) 5504.0 (97.6%)
PME FPGA Memory [MBit] 50.2 (14.7%) 54.0 (15.8%) 50.7 (11.2%) 57.9 (25.3%) 57.9 (33.8%)
Estimated Runtimes and Performance
PME + H2D Transfers[µs] 274.4 250.9 168.1 196.5 269.0
PP + H2D Transfers[µs] 268.4 250.6 168.1 210.4 264.1
PP/PME + D2H Transfers [µs] 307.5 284.0 201.2 243.5 302.1
SW Overheads [µs] 280.0 280.0 280.0 280.0 175.0
Total HW+SW [µs] 587.5 564.0 481.2 523.5 477.1
Performance [ns/d] 294.1 306.4 359.1 330.1 362.2
†We employ a similar load balancing mechanism as GROMACs between PME and PP cards.
‡ The maximum available resources are reduced in this case due to the AWS shell infrastructure (see text for more details).
+ We assume a PCIe link efficiency of 75%, and since we use 8 PCIe cards only half the 16x bandwidth is available per FPGA.
§ Assuming a link efficiency of 85% for the ring interconnect.
4) Resource Allocation
In terms of allocation of PME and PP units to accelerator
boards it seems natural to uniformly distribute them as illus-
trated in Fig. 3b). The advantage of this allocation is scalability
of compute resources. However, the communication pattern of
the two 3D FFT passes in PME leads to high communication
volume between these distributed PME units, and hence it has
been found that a contraction in the PME calculation can lead
to better performance [68]. E.g., for highly parallel scenarios
GROMACS supports the use of fewer separate MPI ranks [53].
Further, an additional optimization for single node systems
has been added in the newest GROMACS release where the
PME calculation can be allocated to one GPU. AMBER moved
to single GPU implementations already a while ago to solve
this communication issue. Hence, we do not consider uniform
allocation, but the non-uniform allocation shown in Fig. 3c).
5) Schedule
The computation schedule follows a relatively simple repeti-
tive pattern. In each iteration, all MPI processes on the host
push the particle coordinates and meta-information down to
the FPGA card using bulk DMA transfers, thereby utilizing the
full effectively available PCIe bandwidth. Computation on the
FPGA side can be largely overlapped with the PCIe transfers,
since we can start computing PP interactions already with a
small part of the simulation volume. Further, the coordinates
and meta data can be gathered and transferred to the PME card
on-the-fly (via the ring interconnect), and the grid interpolators
can start with particle-to-grid interpolation. As soon as the
complete simulation volume has been transferred, the PP cards
exchange the overlap regions required for PP interactions via
the ring. Once the inverse 3D FFT is complete, the grid-
to-particle interpolation can be started and overlapped with
the scatter operation that transfers the forces back to the
originating card. The forces are then accumulated within the
PP interaction force buffers, and once all non-bonded forces
have been calculated, the results are copied back to the host.
C. FPGA Targets and Hardware Costs
In order to achieve the required performance, large datacenter
grade FPGAs are required. In this report we target the Xilinx
Virtex UltraScale+ series, as well as the Intel Stratix 10 series,
since at the time of writing, these represent the best FPGA
technology that is available (or soon will be). On the Xilinx
side, we identified the VU9P device as an ideal target as it
is currently widely used and stable in production (this device
is also available in the AWS F1 instances). The roughly 50%
larger VU13P device will soon reach stable production, as
well, and can be seen as the natural successor of the VU9P
device in the forthcoming years. On the Intel side, the only
FPGA that can currently match the Virtex UltraScale+ devices
is the upcoming Stratix 10 series (the Arria 10 devices are too
small). In particular, it seems that the 1SGX280 device will
be the equivalent of the VU9P in terms of adoption (several
development boards feature this device). However, we found
that the availability of these devices is not yet guaranteed
(especially for the H-Tile devices with fast transceivers), and
we can expect that stable products are likely not to going to
be introduced before next year.
In order to compare system level costs, we assume the prices
and power consumptions13 listed in Tbl. IV. For simplicity,
we lump the costs for the optical modules together with the
corresponding FPGA boards (roughly 200$ per QSFP28 slot).
Equipment and software packages are amortized over 5 years
(straight line), and the electricity is assumed to cost 0.1$
per kWh. Further, for each on-site solution we double the
electricity cost in order to account for cooling. No sales margin
is added to the FPGA solutions in this comparison, but in a
commercial setting this has to be accounted for as well.
D. Evaluated Configurations and Assumptions
We calculate our estimates for a system with N = 8 FPGA
cards in the system (as discussed earlier, the PP interaction
pipelines are allocated on 7 FPGAs, while 1 FPGA card is
used for PME). These configurations are listed in Tbl. V.
The amount of units (e.g. FFTs) listed in that table is per
FPGA instance. We note that on these modern FPGAs, logic
resources are the ones that are going to be critical. In order
to allow for enough headroom for additional infrastructure
such as, e.g., Aurora and PCIe PHYs we target a LUT/ALM
usage of 75% on the VU9P and 1SGX280 devices (both offer
a similar amount of logic resources). On the larger VU13P
devices and AWS F1 instance, we target a higher utilization
in the order of 85%. This is possible since the VU13P offers
around 50% more logic resources than the VU9P, and on the
AWS instance, the infrastructure is already included in the
AWS F1 Shell that wraps the user logic14.
From test syntheses of an HDL design optimized for FPGAs
(NTX cores from [69]) we found that operating frequencies up
to 400MHz and 700MHz should be achievable on the Virtex
UltraScale+ and Stratix 10 devices, respectively. From these
test syntheses we also calculated logic conversion factors for
the LUTs and ALMs to derate the numbers reported for the
3D FFT and PP cores in literature (shown in Tbl. VI).
13For more information on the FPGA boards, see also:
http://www.hitechglobal.com/Boards/UltraScale+ X9QSFP28.htm,
https://www.xilinx.com/products/boards-and-kits/vcu1525-p.html,
https://www.bittware.com/fpga/xilinx/boards/xupp3r/,
https://www.altera.com/solutions/partners/partner-profile/terasic-inc-/board/
terasic-stratix-10-de10-pro-fpga-development-kit.html
14AWS uses partial reconfiguration, and the logic resources available in the
user partition amounts to 914k, 5640 DSPs, 3360 BRAMs and 400 URAMs,
which corresponds to 77%, 82%, 78% and 42% of the resources available on
the VU9P.
TABLE VI
DERATING FACTORS FOR DIFFERENT FPGA GENERATIONS. THE HIGHER
# ALMS ON THE STRATIX 10 DEVICE IS LIKELY DUE THE HIGH
OPERATING FREQUENCY TARGETED.
Source / Target VUP [LUTs] Stratix 10 [ALMs]
Stratix IV [ALMs] 0.95 1.18
Stratix V [ALMs] 1.04 1.29
For the estimations in this section, we consider the 91k
problem (2LEM) of the previous section, and assume a soft-
ware overhead of 280 µs on 40 logical cores that can not be
overlapped with the force computations (this amounts to 25%
of the overall walltime of a single step, see previous section
on benchmarks). Note that we adjust the PME grid resolution
and PP interaction cutoffs to balance the load between the
PP and PME cards at similar accuracy. This is analogous to
the PME load balancing procedure performed in GROMACS
simulation runs [64].
We also note that we do not explicitly account for potential
and virial calculations here as these are only carried out every
100 steps in the considered benchmarks. The support for these
calculations can be added to the hardware either by extending
the force pipelines (this leads to a small increase in DSP slices
which are still abundantly available) or by reusing existing
interpolator infrastructure and repeated evaluation, leading to
a decrease in performance of around 1%.
E. Results
The estimations for resources and timings are shown in Tbl. V
at the bottom. The highest performance is achieved by the
VU13P and f1.16xlarge designs. In the first case this is due
to the high amount of logic resources on the VU13P and in
the second case, more CPU cores translate into lower software
overheads. In all cases we note that the non-hideable software
overheads (domain decomposition, integration, constraints,
etc.) are either in similar in magnitude than the accelerated
PP and PME portions.
When comparing the VCU1525 and XUPP3R solutions with
VU9P FPGAs, we can observe that the faster ring interconnect
available with the more expensive XUPP3R does lead to a
small improvement in speed, but this is likely not worth the
price increase of around 2× in case of the VU9P. However,
for the larger/faster VU13P and 1SGX280 FPGAs, the faster
ring interconnect is desirable in order to match the bandwidth
with the increased throughput.
In order to better compare different solutions, we cast these
results into a performance (ns/d) versus cost ($/h) and add
different operating points of GPU-accelerated solutions. The
performance values for FPGA versions with less than eight
cards have been derated from the 8 card solutions (assuming
that one of the cards now contains both the complete PME
unit and some PP interaction pipelines). The performance
values for the GPU solutions employing 1-4×GTX1080Ti and
1 V100 (on AWS) have been measured (see previous section
on benchmarks). The remaining operating points have been
10-1 100 101
Cost [$/h]
102
103
Pe
rfo
rm
an
ce
 [n
s/d
]
4xV100
1xV100
2xVCU1525
1xp3.2xlarge
ideal
1xV100
4xTITAN-V
4xGTX1080Ti
2xGTX1080Ti
1xGTX1080Ti
1xP100
1xTITAN-V
1xTITAN-XP
1xGTX1080Ti
2xGTX1080Ti
4xVCU1525
8xVCU1525
8xXUPP3R-VU9P
8xXUPP3R-VU13P
8xDE10-PRO
1xp3.8xlarge
1xf1.2xlarge
1xf1.16xlarge
better
GMX18
AMBER17
FPGA
AWS
ideal
Fig. 4. Comparison of the performance vs cost tradeoff of different FPGA
and GPU-based solutions on the 2LEM-91k benchmark.
estimated using numbers from existing benchmarks15.
The blue dots are all for GROMACS 2018, the green ones
for AMBER 16/17 and the orange ones for FPGA solutions
based Virtex UltraScale+ and Stratix 10 FPGAs. On the right
side of the plot we have the AWS instances, and on the left
the on-prem versions. The closer solutions are to the upper left
corner of the plot, the better, and the diagonal lines represent
same performance/cost. The red dot in the upper left corner
shows the desirable performance/cost of an ideal solution that
domain experts would consider commercially feasible.
F. Discussion
As can be observed in Fig. 4, systems employing con-
sumer grade GTX1080Ti cards are clearly at the forefront
in terms of cost efficiency (slanted lines represent equal
performance/cost). And it should be noted that this efficiency
is bound to increase, since there seems to be a trend in
moving more computations or even the complete simulation
loop onto the GPUs and have them communicate in a peer-
to-peer fashion. AMBER already supports this, allowing to
assemble very cost efficient desktop systems, since in that case
no expensive high-core-count CPUs are required anymore.
This is not reflected in the plot above yet, as these costs have
been calculated assuming a dual XEON (20 core) server.
We can also see that with FPGA solutions based on Ul-
traScale+ and Stratix 10 devices there is not much to be
gained with respect to the GPU solutions. I.e., it is possible
to achieve speedups in the order of around 1.5× to 2×, but
the performance/cost ratio is similar to GPU solutions. This
is a combination of two key factors: first, FPGAs prices are
in the range of datacenter GPUs, which makes it difficult to
compete with consumer grade GPUs that offer very attractive
single-precision FLOPS/$ ratios. And second, the amount of
15See http://ambermd.org/gpus/benchmarks.htm
remaining work that can not be overlapped with non-bonded
force computations starts to become dominant, thereby leading
to a saturation of the achievable speedup. For instance, in
the case of a system with 8 VU13P FPGAs, our estimates
indicate that the PP and PME calculations take less time
than the remaining non-hideable parts in software (200 µs
versus 280 µs). Hence, we see that even a 4× speedup of the
calculation of the PP interactions and PME only leads to an
application performance improvement of only 2×.
1) Commercial Feasibility
We have been in contact with domain experts and according
to them, a new accelerator solution based on a different
technology than GPUs should offer at least 1 µs of simulation
performance at the cost of one high-end GPU in order to be
perceived as a viable alternative (this ideal solution is indicated
with a red dot in Fig. 4). Considering our results and this
desirable target, it becomes evident that a FPGA co-processor
solution will likely not be commercially successful since a
mere 2× improvement in speed at similar cost efficiency does
not represent a compelling value proposition for potential users
(we have to keep in mind here that an FPGA solution will
likely be less flexible in terms of functionality than a GPU-
based one). Instead, MD users will likely just run several
simulations in parallel at somewhat lower speed and better
cost efficiency, e.g., to improve sampling statistics or to screen
multiple chemical compounds in parallel.
2) A Note on Algorithmic Improvements
We see that when targeting non-bonded interactions, archi-
tectural specialization and current FPGA technology alone
will likely not be enough to get the needed improvement in
speed (with respect to GPU solutions), and some algorithmic
innovation will be needed as well (that is, unless FPGA prices
drop over 10× the coming years, which is not very probable).
The following points should however be kept in mind when
doing algorithmic work in this field:
• This is a mathematical field, and people would like to
have error guarantees and bounds. Despite the fact that
it is very challenging to come up with an algorithm that
is better than SOA, it takes a long time until such a new
method is established and accepted by the community.
For example, variations of FMM and multi-level summa-
tion methods (MSM) [4], [37] have been in development
for many years now, and they still not used on a regular
basis for production runs – even if they may have benefits
in some operating regions. One of the reasons for this
slow development and adoption is that it is difficult
to find and introduce optimizations/approximations that
do not introduce assumptions that violate physical laws
and eventually lead to erratic behavior. Further, it is
difficult to verify the correctness of a certain approach
or implementation.
• It is likely that GPU-based solutions will benefit from
algorithmic improvements, too (e.g., improvements of the
integration and constraints steps), and the implementation
turn-around time is shorter for GPUs than for FPGAs.
• Machine-learning approaches are well suited for a certain
class of ab-initio potential evaluations (AIMD), and they
have been shown to give quite impressive speedups in
that case. However, in pure classical MD, the force fields
already have a very simple and efficient functional form
(basically polynomial expressions for classical mechan-
ics), and contain far fewer parameters than, e.g., the ANI-
1 NN potential [39]. It is currently unclear how such an
NN-based approach could be used to accelerate potential
expressions in classical MD simulations.
3) Other Hardware Acceleration Opportunities
ASIC integration of the non-bonded interaction pipelines could
be a way of improving the performance with respect to FPGA-
based solutions, but in the end this approach suffers the
same shortcomings as the FPGA co-processor solution, and in
addition the market does not seem to be big enough to justify
the development costs. Moving to fully integrated solutions in
a similar fashion as this has been done in the Anton systems
can also lead to higher performance, but the design effort (and
involved risks) are quite large and hence difficult to justify. So
far, Anton-1/2 SoCs have been the only successful chips built
in this fashion. MDGRAPE-4, which is the only other SoC-
based system, seems to be fizzled out as there has been no
update on the project for 4 years (the last publication [7] is
from 2014). Another fact that has to be kept in mind is that
there are several patent applications and granted patents [70–
73] protecting features of the Anton-type SoCs, which could
complicate commercial exploitation of such a solution.
Considering the difficulties mentioned above, it becomes
clear that different approaches for leveraging FPGAs should
be considered. It could make sense to turn towards scaled
datacenter solutions and look into hybrid solutions that lever-
age FPGAs as near-network accelerators (similar as this has
been done in Catapult-I/II [47], [48], or applications such
as networking filtering, high-frequency trading, etc). I.e., we
could use FPGAs to complement multi-node GPU systems in
order to improve the scalability of such systems. Consider for
instance the scaling behavior of GPU-accelerated GROMACS
runs on the Hydra supercomputer (Max Planck Computing
Centre, 20-core Ivy Bridge nodes with 2xK20X and FDR-14)
in Fig. 5. We can observe the well-known hard-scaling issue
for typical problem sizes (top 2 curves, 81k atoms). Problems
with several millions of atoms (lower 4 curves) are often used
for benchmarking, but they do not represent common everyday
problems. What is interesting to note is that GPU accelerated
problems with ¡100k atoms often reach their 50% scaling limit
very quickly at around 8 nodes - and this is something that
can be observed on other clusters, too (see [64], for example).
From what can be read in literature this scaling bottleneck is
mainly due to PME and global communication phases.
So it is likely that a network accelerator could be used to
ameliorate this scaling bottleneck by bypassing the standard
InfiniBand interconnects and the MPI software stack, and
by providing a dedicated secondary network with functional
capabilities tailored towards PME and global communication
(e.g. for energies and virials) and even integration/constraints
Fig. 5. GROMACS scaling on HYDRA (Max Planck Computing Centre),
reproduced from [53].
handling. Possible arrangements could be 1 FPGA network
card per node and a 2D or 3D torus. To this end, initial studies
on 3D FFTs for molecular dynamics on FPGA clouds by
Herbordt, et al. [68] show promising results (also considering
phased contraction where certain parts of the computation
are carried out on a subset of all nodes to improve com-
munication patterns). Another arrangement that is similar to
phased contraction and suitable for small-scale clusters would
be a star arrangement, where all nodes have a connection
to a single external FPGA box that performs PME and
global reductions/broadcasts. As we have seen with the recent
GROMACS update, solving PME on one device alone can be
beneficial since this improves the communication volume.
V. CONCLUSIONS
In this report, we benchmarked two widely used GPU-
accelerated MD packages using typical MD model problems,
and compare them with estimates of an FPGA-based solution
in terms of performance and cost. Our results show that
while FPGAs can indeed lead to higher performance than
GPUs, the overall application-level speedup remains in the
order of 2×. Considering the price for GPU and FPGA
solutions, we observe that GPU-based solutions provide the
better performance/cost tradeoff, and hence pure FPGA-based
solutions are likely not commercially viable. However, we also
note that scaled systems could potentially benefit from a hybrid
composition, where GPUs are used for compute intensive
parts and FPGAs for in-network acceleration of latency and
communication sensitive tasks.
REFERENCES
[1] M. Karplus and G. A. Petsko, “Molecular dynamics simulations in
biology,” Nature, vol. 347, no. 6294, pp. 631–639, 1990.
[2] M. De Vivo, M. Masetti, G. Bottegoni et al., “Role of Molecular Dy-
namics and Related Methods in Drug Discovery,” Journal of Medicinal
Chemistry, vol. 59, no. 9, pp. 4035–4061, 2016.
[3] D. E. Shaw, K. J. Bowers, E. Chow et al., “Millisecond-scale molecular
dynamics simulations on Anton,” Proceedings of the Conference on High
Performance Computing Networking, Storage and Analysis - SC ’09,
no. c, p. 1, 2009.
[4] D. J. Hardy, M. A. Wolff, J. Xia et al., “Multilevel summation with
B-spline interpolation for pairwise interactions in molecular dynamics
simulations,” The Journal of chemical physics, vol. 144, no. 11, p.
114112, 2016.
[5] E. Krieger and G. Vriend, “New ways to boost molecular dynamics
simulations,” Journal of Computational Chemistry, vol. 36, no. 13, pp.
996–1007, 2015.
[6] A. Spitaleri, S. Decherchi, A. Cavalli et al., “Fast Dynamic Docking
Guided by Adaptive Electrostatic Bias: The MD-Binding Approach,”
Journal of Chemical Theory and Computation, vol. 14, no. 3, pp. 1727–
1736, 2018.
[7] I. Ohmura, G. Morimoto, Y. Ohno et al., “MDGRAPE-4: a special-
purpose computer system for molecular dynamics simulations,” Philo-
sophical Transactions of the Royal Society A: Mathematical, Phys-
ical and Engineering Sciences, vol. 372, no. 2021, pp. 20 130 387–
20 130 387, 2014.
[8] D. E. Shaw, M. M. Deneroff, R. O. Dror et al., “Anton, a special-purpose
machine for molecular dynamics simulation,” Commun. ACM, vol. 51,
no. 7, pp. 91–97, Jul. 2008.
[9] D. E. Shaw, J. Grossman, J. A. Bank et al., “Anton 2: raising the bar
for performance and programmability in a special-purpose molecular dy-
namics supercomputer,” in Proceedings of the International Conference
for High Performance Computing, Networking, Storage and Analysis.
IEEE Press, 2014, pp. 41–53.
[10] M. Malit¸a, D. Mihit¸, and G. M. S¸tefan, “Molecular dynamics on fpga
based accelerated processing units,” in MATEC Web of Conferences, vol.
125. EDP Sciences, 2017, p. 04012.
[11] N. Azizi, I. Kuon, A. Egier et al., “Reconfigurable molecular dynamics
simulator,” in Field-Programmable Custom Computing Machines, 2004.
FCCM 2004. 12th Annual IEEE Symposium on. IEEE, 2004, pp. 197–
206.
[12] F. A. Escobar, X. Chang, and C. Valderrama, “Suitability Analysis of
FPGAs for Heterogeneous Platforms in HPC,” IEEE Transactions on
Parallel and Distributed Systems, vol. 27, no. 2, pp. 600–612, Feb 2016.
[13] M. A. Khan, M. Chiu, and M. C. Herbordt, “Fpga-accelerated molecular
dynamics,” in High-Performance Computing Using FPGAs. Springer,
2013, pp. 105–135.
[14] M. Chiu and M. C. Herbordt, “Towards production FPGA-accelerated
molecular dynamics: Progress and challenges,” in High-Performance
Reconfigurable Computing Technology and Applications (HPRCTA),
2010 Fourth International Workshop on. IEEE, 2010, pp. 1–8.
[15] A. Sanaullah, K. Lewis, and M. Herbordt, “GPU-Accelerated Charge
Mapping,” in IEEE High Performance Extreme Computing Conference,
HPEC 2016, 2016.
[16] A. Sanaullah, A. Khoshparvar, and M. Herbordt, “FPGA-Accelerated
Particle-Grid Mapping,” in IEEE Symposium on Field Programmable
Custom Computing Machines, FCCM 2016, 2016.
[17] J. Sheng, B. Humphries, H. Zhang et al., “Design of 3D FFTs with
FPGA clusters,” in High Performance Extreme Computing Conference
(HPEC), 2014 IEEE. IEEE, 2014, pp. 1–6.
[18] B. Humphries, H. Zhang, J. Sheng et al., “3D FFT on a Single FPGA,” in
IEEE Symposium on Field Programmable Custom Computing Machines,
2014.
[19] B. Humphries and M. C. Herbordt, “3D FFT for FPGAs,” in High
Performance Extreme Computing Conference (HPEC), 2013 IEEE.
IEEE, 2013, pp. 1–2.
[20] Chiu, Matt and Khan, Md Ashfaquzzaman and Herbordt, Martin
C, “Efficient calculation of pairwise nonbonded forces,” in Field-
Programmable Custom Computing Machines (FCCM), 2011 IEEE 19th
Annual International Symposium on. IEEE, 2011, pp. 73–76.
[21] M. Chiu and M. C. Herbordt, “Efficient particle-pair filtering for
acceleration of molecular dynamics simulation,” in Field Programmable
Logic and Applications, 2009. FPL 2009. International Conference on.
IEEE, 2009, pp. 345–352.
[22] S. Kasap and K. Benkrid, “Parallel processor design and implementation
for molecular dynamics simulations on a FPGA-Based supercomputer,”
Journal of Computers, vol. 7, no. 6, pp. 1312–1328, 2012.
[23] J. Cong, Z. Fang, H. Kianinejad et al., “Revisiting FPGA Acceleration
of Molecular Dynamics Simulation with Dynamic Data Flow
Behavior in High-Level Synthesis,” 2016. [Online]. Available: http:
//arxiv.org/abs/1611.04474
[24] D. C. Rapaport, The art of molecular dynamics simulation, 2nd ed.
Cambridge, UK: Cambridge University Press, 2004.
[25] T. Hansson, C. Oostenbrink, and W. van Gunsteren, “Molecular dynam-
ics simulations,” Current opinion in structural biology, vol. 12, no. 2,
pp. 190–196, 2002.
[26] B. Hess, H. Bekker, and H. J. C. Berendsen, “LINCS: a linear constraint
solver for molecular simulations,” Journal of . . . , vol. 18, no. 12, pp.
1463–1472, 1997.
[27] B. Hess, “P-LINCS: A parallel linear constraint solver for molecular
simulation,” Journal of Chemical Theory and Computation, vol. 4, no. 1,
pp. 116–122, 2008.
[28] S. Miyamoto and P. A. Kollman, “Settle: An analytical version of the
SHAKE and RATTLE algorithm for rigid water models,” Journal of
Computational Chemistry, vol. 13, no. 8, pp. 952–962, 1992.
[29] K. Esselink, “A comparison of algorithms for long-range interactions,”
Computer Physics Communications, vol. 87, no. 3, pp. 375–395, 1995.
[30] H. G. Petersen, “Accuracy and efficiency of the particle mesh ewald
method,” The Journal of chemical physics, vol. 103, no. 9, pp. 3668–
3679, 1995.
[31] “A smooth particle mesh Ewald method,” The Journal of Chemical
Physics, vol. 103, no. 19, pp. 8577–8593, 1995.
[32] Y. Shan, J. L. Klepeis, M. P. Eastwood et al., “Gaussian split Ewald: A
fast Ewald mesh method for molecular simulation,” Journal of Chemical
Physics, vol. 122, no. 5, pp. 1–13, 2005.
[33] A. Arnold, M. Bolten, H. Dachsel et al., “Comparison of Scalable Fast
Methods for Long-Range Interactions,” submitted to Phys. Rev. E, 2013.
[34] L. Greengard and V. Rokhlin, “A fast algorithm for particle simulations,”
Journal of computational physics, vol. 73, no. 2, pp. 325–348, 1987.
[35] C. A. White and M. Head-Gordon, “Derivation and efficient implemen-
tation of the fast multipole method,” The Journal of Chemical Physics,
vol. 101, no. 8, pp. 6593–6605, 1994.
[36] M. Challacombe, C. White, and M. Head-Gordon, “Periodic boundary
conditions and the fast multipole method,” The Journal of Chemical
Physics, vol. 107, no. 23, pp. 10 131–10 140, 1997.
[37] D. J. Hardy, Z. Wu, J. C. Phillips et al., “Multilevel Summation Method
for Electrostatic Force Evaluation,” Journal of Chemical Theory and
Computation, vol. 11, no. 2, pp. 766–779, 2015, pMID: 25691833.
[38] R. D. Skeel, I. Tezcan, and D. J. Hardy, “Multiple grid methods for
classical molecular dynamics,” Journal of Computational Chemistry,
vol. 23, no. 6, pp. 673–684, 2002.
[39] J. S. Smith, O. Isayev, and A. E. Roitberg, “ANI-1: an
extensible neural network potential with DFT accuracy at force
field computational cost,” Chem. Sci., vol. 8, no. 4, pp. 3192–3203,
2017. [Online]. Available: http://xlink.rsc.org/?DOI=C6SC05720Ahttp:
//dx.doi.org/10.1039/C6SC05720A
[40] S. Toyoda, H. Miyagawa, K. Kitamura et al., “Development of md en-
gine: High-speed accelerator with parallel processor design for molecu-
lar dynamics simulations,” Journal of Computational Chemistry, vol. 20,
no. 2, pp. 185–199, 1999.
[41] M. Taiji, T. Narumi, Y. Ohno et al., “Protein explorer: A petaflops
special-purpose computer system for molecular dynamics simulations,”
in Supercomputing, 2003 ACM/IEEE Conference, Nov 2003, pp. 15–15.
[42] D. E. Shaw, M. M. Deneroff, R. O. Dror et al., “Anton, a special-
purpose machine for molecular dynamics simulation,” Communications
of the ACM, vol. 51, no. 7, pp. 91–97, 2008.
[43] J. Grossman, C. Young, J. A. Bank et al., “Simulation and embedded
software development for anton, a parallel machine with heterogeneous
multicore asics,” in Proceedings of the 6th IEEE/ACM/IFIP interna-
tional conference on Hardware/Software codesign and system synthesis.
ACM, 2008, pp. 125–130.
[44] C. Young, J. A. Bank, R. O. Dror et al., “A 32x32x32, spatially
distributed 3d fft in four microseconds on anton,” in Proceedings of
the Conference on High Performance Computing Networking, Storage
and Analysis. ACM, 2009, p. 23.
[45] Q. Xiong and M. C. Herbordt, “Bonded force computations on fpgas,” in
Field-Programmable Custom Computing Machines (FCCM), 2017 IEEE
25th Annual International Symposium on. IEEE, 2017, pp. 72–75.
[46] M. Chiu and M. C. Herbordt, “Molecular dynamics simulations on high-
performance reconfigurable computing systems,” ACM Transactions on
Reconfigurable Technology and Systems (TRETS), vol. 3, no. 4, p. 23,
2010.
[47] A. Putnam, A. M. Caulfield, E. S. Chung et al., “A reconfigurable
fabric for accelerating large-scale datacenter services,” ACM SIGARCH
Computer Architecture News, vol. 42, no. 3, pp. 13–24, 2014.
[48] A. M. Caulfield, E. S. Chung, A. Putnam et al., “A cloud-scale
acceleration architecture,” in The 49th Annual IEEE/ACM International
Symposium on Microarchitecture. IEEE Press, 2016, p. 7.
[49] A. Sanaullah, M. C. Herbordt, and V. Sachdeva, “Opencl for fpgas/hpc:
Case study in 3d fft.”
[50] C. Yang, J. Sheng, R. Patel et al., “Opencl for hpc with fpgas: Case study
in molecular electrostatics,” in High Performance Extreme Computing
Conference (HPEC), 2017 IEEE. IEEE, 2017, pp. 1–8.
[51] H. M. Waidyasooriya, M. Hariyama, and K. Kasahara, “An fpga accel-
erator for molecular dynamics simulation using opencl,” International
Journal of Networked and Distributed Computing, vol. 5, no. 1, pp.
52–61, 2017.
[52] B. Hess, C. Kutzner, D. Van Der Spoel et al., “GRGMACS 4: Algorithms
for highly efficient, load-balanced, and scalable molecular simulation,”
Journal of Chemical Theory and Computation, vol. 4, no. 3, pp. 435–
447, 2008.
[53] C. Kutzner, M. J. Abraham, B. Hess et al., “Tackling Exascale Software
Challenges in Molecular Dynamics Simulations with GROMACS,” pp.
1–27, 2015.
[54] M. J. Abraham, T. Murtola, R. Schulz et al., “Gromacs: High per-
formance molecular simulations through multi-level parallelism from
laptops to supercomputers,” SoftwareX, vol. 1-2, pp. 19–25, 2015.
[55] S. Plimpton, “Fast parallel algorithms for short-range molecular dynam-
ics,” Journal of Computational Physics, vol. 117, no. 1, pp. 1 – 19,
1995.
[56] D. Case, I. Ben-Shalom, S. Brozell et al., “AMBER 2018,” 2018.
[Online]. Available: http://ambermd.org
[57] J. C. Phillips, R. Braun, W. Wang et al., “Scalable molecular dynamics
with namd,” Journal of computational chemistry, vol. 26, no. 16, pp.
1781–1802, 2005.
[58] B. R. Brooks, C. L. Brooks III, A. D. Mackerell Jr et al., “Charmm: the
biomolecular simulation program,” Journal of computational chemistry,
vol. 30, no. 10, pp. 1545–1614, 2009.
[59] M. J. Harvey, G. Giupponi, and G. De Fabritiis, “ACEMD: Accelerating
biomolecular dynamics in the microsecond time scale,” Journal of
Chemical Theory and Computation, vol. 5, no. 6, pp. 1632–1639, 2009.
[60] K. J. Bowers, F. D. Sacerdoti, J. K. Salmon et al., “Molecular
dynamics—Scalable algorithms for molecular dynamics simulations on
commodity clusters,” Proceedings of the 2006 ACM/IEEE conference
on Supercomputing - SC ’06, no. November, p. 84, 2006. [Online].
Available: http://portal.acm.org/citation.cfm?doid=1188455.1188544
[61] “Structural Biology & Molecular Modeling Techniques Market
Analysis By Tools (SaaS & Standalone Modeling, Homology
Modeling, Threading, Molecular Dynamics, Ab Initio, Visualization
& Analysis, Databases, By Application, And Segment Forecasts,
2018 - 2025,” Grand View Research, Tech. Rep., Jan 2017. [On-
line]. Available: https://www.grandviewresearch.com/industry-analysis/
structural-biology-and-molecular-modeling-technique-market
[62] K. Rubenstein, “Cloud Computing in Life Sciences R&D,”
InsightPharmaReports.com, Tech. Rep., Apr 2010. [Online].
Available: http://www.insightpharmareports.com/uploadedFiles/Reports/
Reports/Cloud Computing/SamplePages.pdf
[63] G. Goldbeck, “The scientific software industry: a general overview,”
Goldbeck Consulting, Tech. Rep., Jan 2017. [Online]. Available: https:
//zenodo.org/record/260408/files/Scientific%20software%20industry.pdf
[64] C. Kutzner, S. Pa´ll, M. Fechner et al., “Best bang for your buck: GPU
nodes for GROMACS biomolecular simulations,” Journal of computa-
tional chemistry, vol. 36, no. 26, pp. 1990–2008, 2015.
[65] J. Sunhwan, K. Taehoon, I. V. G. et al., “Charmm-gui: A web-
based graphical user interface for charmm,” Journal of Computational
Chemistry, vol. 29, no. 11, pp. 1859–1865, 2008.
[66] J. Lee, X. Cheng, J. M. Swails et al., “Charmm-gui input generator
for namd, gromacs, amber, openmm, and charmm/openmm simulations
using the charmm36 additive force field,” Journal of Chemical Theory
and Computation, vol. 12, no. 1, pp. 405–413, 2016.
[67] D. J. Hardy, “Improving NAMD Performance on Multi-GPU Platforms,”
April 2018, slides from the 16th Annual Workshop on Charmm++ and
its Applications.
[68] J. Sheng, C. Yang, A. Sanaullah et al., “Hpc on fpga clouds: 3d ffts
and implications for molecular dynamics,” in Field Programmable Logic
and Applications (FPL), 2017 27th International Conference on. IEEE,
2017, pp. 1–4.
[69] F. Schuiki, M. Schaffner, F. K. Gu¨rkaynak et al., “A scalable
near-memory architecture for training deep neural networks on large
in-memory datasets,” CoRR, vol. abs/1803.04783, 2018. [Online].
Available: http://arxiv.org/abs/1803.04783
[70] K. J. Bowers, R. O. Dror, and D. E. Shaw, “Determining computational
units for computing multiple body interactions,” feb 2012, uS Patent
8,126,956.
[71] K. J. Bowers, R. O. Dror, D. E. Shaw et al., “Zonal methods for com-
putation of particle interactions,” may 2012, uS Patent App. 13/329,852.
[72] D. E. Shaw, M. M. Deneroff, R. O. Dror et al., “Parallel processing sys-
tem for computing particle interactions,” apr 2017, uS Patent 9,612,832.
[73] D. E. Shaw, M. M. Deneroff, R. O. Dror et al., “Parallel computer
architecture for computation of particle interactions,” aug 2017, uS
Patent 9,747,099.
