Abstract-Classical molecular dynamics (MD) simulations are important tools in life and material sciences since they allow studying chemical and biological processes in detail. However, the inherent scalability problem of particle-particle interactions and the sequential dependency of subsequent time steps render MD computationally intensive and difficult to scale. To this end, specialized FPGA-based accelerators have been repeatedly proposed to ameliorate this problem. However, to date none of the leading MD simulation packages fully support FPGA acceleration and a direct comparison of GPU versus FPGA accelerated codes has remained elusive so far.
I. INTRODUCTION
Classical molecular dynamics (MD) simulations [1] are important tools in life and material sciences since they allow studying chemical and biological processes in detail. For example, this enables researchers to study drug-target bindings for drug discovery purposes [2] or to analyze protein folding processes to understand their biological function [3] .
However, MD is computationally intensive and difficult to scale due to the sequential dependency between subsequent timesteps and the many particle-particle interactions. The timescales of interest are often orders of magnitude larger than the simulation time steps (i.e., ns or µs timescales versus fs time steps), which results in long simulation times even on HPC computing infrastructure.
To this end, several approaches have been pursued to improve simulation performance, ranging from novel algorithms to approximate forces between particles [4] over algorithmic tweaks [5] and biasing methods such as enhanced sampling methods [6] , to custom hardware solutions, such as the MDGRAPE systems from Riken [7] and the Anton-1/2 supercomputers developed by D.E. Shaw Research LLC [3] , [8] , [9] . However, we observe that algorithmic improvements are often very problem specific, and it often takes a long time until they are adopted by major production software packages.
Hence, the core algorithms used in classical MD simulations have largely remained in recent years, and most simulation speed improvements are due to the use of MPI parallelization in combination with GPGPUs that handle the computationally dominant parts. Specialized supercomputers such as the Anton systems are very inaccessible and expensive, and are hence not widely used today.
Besides these MPI and GPU-based solutions, FPGA accelerators have repeatedly been proposed as a viable alternative to accelerate the compute intensive parts [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] . However, these studies only show estimated or measured speedup with respect to older CPU implementations. To date none of the leading MD simulation packages fully support FPGA acceleration [23] and a direct comparison of GPU versus FPGA accelerated codes has remained elusive so far.
This report aims at shedding some light onto the questions whether and how FPGAs could be used to accelerate classical MD simulations in the scope of biochemistry, and whether such a solution would be commercially viable. To this end, we revisit existing FPGA architectures, model their behavior on current FPGA technology and estimate the performance and price of an FPGA accelerated system in order to compare with GPU accelerated solutions. We focus on single node systems in this report (possibly carrying several accelerator cards) since these represent the most common configuration employed today 1 . Typical MD problems with in the order of 100k atoms do not scale well across several nodes, and hence it is most economic to run these simulations on acceleratordense single node systems.
Our results show that, in principle, FPGAs can be used to accelerate MD, and we estimate full application-level speedups in the order of 2× with respect to GPU-based solutions. However, our estimates also indicate that this speedup is likely not high enough to compensate for the increased cost and reduced flexibility of FPGA-based solutions. Hence we conclude that FPGAs are likely not well suited as a replacement for GPU accelerators. However, we observe that other aspects of FPGAs like the low-latency networking capabilities could be leveraged to augment scaled multi-node systems and ameliorate scalability issues by providing network-compute capabilities in addition to GPU acceleration.
This report is structured in three main sections: Sec. II summarizes background and related work, performance benchmarks of two widely used software packages are given in Sec. III, and in Sec. IV we provide the FPGA estimates and comparisons with GPU-based systems.
II. BACKGROUND AND RELATED WORK

A. Classical MD
This section gives a brief overview of MD, for more details we refer to [1] , [2] , [24] , [25] .
1) Simulation Loop and Force-Fields
A typical biomolecular MD simulation consists of a macromolecule that is immersed in a solvent (e.g., water). Each atom in the system is assigned a coordinate x i , velocity v i and acceleration a i . The aim of MD is to simulate the individual trajectories of the atoms, and this is done by integrating the forces acting on each particle. The integration is carried out in discrete timesteps, with a resolution in the order of 2 fs, and the forces are calculated by taking the negative derivative of a potential function V w.r.t. to each particle coordinate x i :
where the potential function is based on classical Newtonian mechanics. The potential typically comprises the following terms: Such a set of functions and the corresponding parameterization is often also called a force-field (examples for specific force field instances are CHARMM36m, GROMOS36 or AMBERff99). The first three terms in (2) cover the interactions due to bonding relationships and are hence also called the bonded terms. The last two terms cover the van der Waals and electrostatic (Coulomb) interactions, and are hence called the non-bonded terms. We can already anticipate from these equations that the non-bonded terms will play a dominant role during force calculation since it is a particle-particle (PP) problem with O(N 2 ). The bonded terms are usually much less intensive (O(N )) since we do not have bonds among all particles in the system.
Most simulations are carried out using periodic boundary conditions (PBC) in order to correctly simulate the bulk properties of the solvent. This means that the simulation cell containing the N -particle system above is repeated infinitely many times in all directions, which has implications on the algorithms used for the non-bonded forces.
A typical MD simulation has an outer loop that advances the simulation time in discretized steps in the order of fs, and in each time step, all forces in the system have to be calculated, 1 http://ambermd.org/gpus/recommended hardware.htm the equations of motion have to be integrated, and additional constraints 2 have to be applied before restarting the loop.
2) Evaluation of Non-bonded Interactions
The first term with 1/r 1 2 and 1/r 6 modelling the Van der Waals interaction decays very quickly. Therefore, a cutoff radius r can be used (usually in the order of 8-16 Angstroms), and only neighboring particles within the spanned sphere are considered during the evaluation of this term.
The Coulomb interaction on the other hand is more difficult to handle, as it decays relatively slowly (1/r). Earlier methods also just used a cutoff radius, but this may introduce significant errors. To this end, different algorithms that leverage the periodic arrangement have been introduced. The most prominent one is the Ewald sum method (and variations thereof). This is basically a mathematical trick to split the Coulomb interaction into two separate terms: a localized term that is evaluated in space-domain, and a non-local term that is evaluated in the inverse space (i.e., in the spatial frequency domain / kspace). This decomposition enables efficient approximation methods that have computational complexity O(N log(N )) instead of O(N 2 ). One such method that is widely used together with PBC today is called Particle-Mesh Ewald (PME) [29] [30] [31] [32] . Basically, the spatial term is handled locally with a cutoff radius (same as for the Van der Waals term above), and the second is approximated using a gridded interpolation method combined with 3D FFTs (hence the O(N log(N ))) complexity). The spatial term is sometimes also called the short-range interaction part, whereas the inverse space term is called the long-range interaction part.
In terms of computation and communication, the nonbonded terms are the heavy ones, and consist of two cutoffradius limited PP problems, and two 3D FFT problems. As we shall see later in Sec. III, these non-bonded forces account for more than 90% of the compute time in an unparallelized MD simulation. Moreover, an intrinsic problem of MD is that the simulation time steps are sequentially dependent, which inherently limits the parallelizability, and means that we have to parallelize a single time step as good as possible (which essentially creates a communication problem). Note that the dataset sizes are very manageable and only in the order of a couple of MBytes.
3) Other Approaches
Today, the most serious contenders for replacing PME electrostatics seem to be fast multipole methods (FMM) [29] , [33] [34] [35] [36] , multi-level summation methods (MSM) [4] , [37] and multi-grid (MG) methods [38] :
• FMM: The FMM makes use of a hierarchical tree decomposition of simulation space and interacts particles with multipole approximations of the farfield, thereby giving rise to O (N ) complexity. For low order expansions, the multipole approximation of the electrostatic potential function has large discontinuities. This leads to drifts of the total energy over time, and this violation of energy conservation is not acceptable in MD applications. To resolve this issue, one has to resort to high order expansions that are more accurate, but also more complex to calculate. With sufficiently high expansion orders, the FMM can be slower [4] , [30] than fast PME algorithms in the important range of N = 10k -100k particles (despite the fact that FMM has better asymptotic complexity than PME) 3 .
• MG: These methods discretize the Laplace operator and solve the resulting linear system with a multigrid solver, which results in linear complexity as well. however, these methods suffer from relatively large discretization errors compared with FFT methods, thereby leading to clear energy drifts that need to be corrected for [33] .
• MSM: This method splits the interaction kernels smoothly into a sum of partial kernels of increasing range and decreasing variability, and the long-range parts are interpolated from grids of increasing coarseness, and hence MSM provides O (N ) complexity, too. The MSM is relatively new, and hence not widely adopted yet. It is unclear how it compares to PME in speed, but it is shown to provide better performance than FMM in the case of highly parallel computation and it can be used for nonperiodic systems, where FFT-based methods do not apply [4] . Due to the reasons mentioned, FMM, MSM and MG methods have not yet been widely adopted by common software packages. Another reason is that in a direct comparison [33] , FFT-based methods are still among the most efficient in performance and stability -and hence it is difficult to motivate a migration to an new experimental algorithm if it not absolutely required. The log (N ) contribution of FFT based methods does not seem to be visible for common system sizes. Hence, for single-node systems not operating at scale, FFTbased methods still seem to be the best algorithmic choice for simulations with PBC due to their efficiency and algorithmic simplicity. As we will see later in Sections III and IV, one way to address the communication bottleneck within PME can be addressed by dedicating it to a single FPGA or GPU, where it can be executed at very high speeds.
4) Classical versus Ab-Initio MD
Note that classical MD should not to be confused with ab-initio MD (AIMD) that operates on quantum-mechanical potential approximations. While AIMD has the advantages of being more accurate and not requiring explicit parameterization, it is also computationally much more involved and hence limited to small MD systems comprising a few 100 to 1000 particles. Classical MD works on a coarser abstraction 4 by employing classical mechanics in the form of force-fields that can be evaluated more efficiently, hence allowing to simulate larger systems with up to several 100k atoms. Although classical MD simluations are less accurate than their AIMD counterparts, they are often accurate enough for many biomolecular systems, where AIMD would not be feasible due to amount of particles involved. The drawback of classical MD is that the force-fields need to be parameterized with experimental data, and they can often not model chemical reactions as the bonds are static.
An interesting new algorithmic development for AIMDbased approaches is to use neural networks (NN) to accelerate the evaluation of the quantum-mechanical potentials [39] .
B. Dedicated ASICs for classical MD
There exist only a few dedicated systems that have been built to enhance MD simulation performance. Two older ones are the MD-ENGINE [40] and MDGRAPE-3 [41] . The MDENGINEis a simple ASIC coprocessor that evaluates nonbonded forces only (with a non-optimized Ewald sum method which is O N 2 ). Several such accelerators can be attached to a SPARC host system that carries out the rest of the calculations. Compared to newer architectures, the system does not scale well since it is still based on the a direct Ewald sum method. the interpolation units use a combination of fixed point and extended single-precision (40 bit) FP formats. MDGRAPE-3 is another MD co-processor chip which is similar to the MD-ENGINE above in the sense that only the non-bonded portion in the MD force calculation is accelerated, and they still use the direct Ewald sum method, which is accurate, but O(N 2 ). However, the complete system is much larger than previous ones: the complete system comprises 256 compute nodes equipped with 2 Itanium class CPUs and two MDGRAPE boards with 24 chips each. This results in a cluster with 512 CPUs and 6144 special purpose MDGRAPE chips. Also, they use a ring interconnect topology on each MDGRAPE board, which facilitates reduction operations such as force summation on one board.
Apart from these older instances above, there are also a few more recent incarnations of such special purpose machines that leverage complete SoC integration to accelerate MD in a more holistic fashion. The most notable machines are the Anton-1 [3] , [42] , [43] and Anton-2 [9] computers by D.E. Shaw Research LLC. The MDGRAPE-4 chip from Riken [7] is another special purpose SoC for MD that has similarities with the Anton chips, but the project does not seem to be as successful as Anton.
Both Anton generations are similar from a conceptual viewpoint, so we will mainly refer to numbers from Anton-2 in the following. The system configuration comprises a set of 512 compute nodes, interconnected with a 8×8×8 3D Torus, and each compute node consists of one Anton-2 ASIC fabricated in 40 nm. The 3D torus arrangement allows for natural problem mapping using a domain decomposition and facilitates communication in all directions. The Anton chips themselves are quite innovative with respect to the previous MD accelerators in the sense that they do not only contain fixed-function accelerators, but both C++ programmable Tensilica processors and specialized pairwise point interaction modules (PPIMs). Each Anton-2 chip contains 64 simple 32 bit general purpose processors with 4 SIMD lanes and two high-throughput interaction subsystem (HTIS) tiles containing an array of 38 PPIM streaming accelerators each. The heterogenous arrangement allows Anton to perform complete MD simulations without having to closely interact with a host system. I.e., it can also accelerate the other parts of the MD simulation (the remaining 10% of the computations), thus enabling better scalability. Moreover, Anton uses optimized PME algorithms to circumvent the O(N 2 ) issue of brute-force PP interactions. [44] provides a detailed study of distributed 32×32×32 and 64×64×64 3D FFTs on Anton machines.
Interestingly, newer firmware versions running on Anton use real-space convolution instead of 3D FFTs as this requires less all-to-all communication phases that impact scalability. Another interesting aspect to note is that external DRAMalbeit supported by the Anton-2 chips -is not required even in for large scale simulations, since the complete state of the problem fits into a couple of MBytes, and hence into the distributed on-chip SRAM available on the chips.
C. MD Acceleration using FPGAs
Most studies on FPGA accelerated MD target the PP calculations [10] , [11] , [13] , [14] , [20] [21] [22] , since they make up for over 90% of the sequential runtime in typical simulations. Apart from these studies, only relatively few papers cover PME and other aspects [16] , [18] , [19] , [45] .
We note that most work in the area has been done by the group by M. Herbordt at Boston University, including the design of several variations of PP interaction pipelines [13] , [14] , [20] , [21] , [46] and a prototype where such a PP interaction pipeline is integrated into a simplified variant of NAMD [14] . Their PP interaction pipelines makes use of particle filters that build the neighborlists on-the-fly. In contrast to software-based implementations, such a filter-based approach does not require large neighborlist buffers and can be efficiently implemented using systolic filter arrays and reduced arithmetic precision in hardware (the Anton PPIMs employ a similar approach). Their preferred PP design employs 1st order piece-wise polynomial interpolation with around 6 segments and 256 LUT entries per segment [20] . Since their prototype system uses relatively dated FPGA technology (Stratix III), it is difficult make comparisons with current GPU-based solutions, and updated performance estimates are required (no rigorous performance comparison with contemporary GPU technology has been carried out in their study). In [18] , [19] , it is shown that commonly used 3D FFTs in the order of 64 3 can be conveniently solved on single FPGAs at competitive speeds in the range of a few 100 µs. In [16] they present an interpolator design able to carry out the charge spreading and force gathering stems within PME, and in [45] a preliminary analysis of bonded-force computation on FPGAs is performed.
The work by Kasap et al. [22] is another attempt to implement a production grade MD accelerator using FPGAs which is not from the Herbordt group. They target the Maxwell platform, an FPGA based multi-node system that has similarities with Microsoft's Catapult [47] , [48] (albeit using more dated technology, i.e. Virtex 4 FPGAs). The overall system architecture template is very similar to the one used in this report. However, they only accelerate nonbonded pair interactions on the FPGAs and do not use inter-FPGA communication. Although they are able to accelerate the interactions, the overall system is not competitive due to a very limited bandwidth between host processor and the SD RAM on the FPGA card (PCI-X bus, with only around 500 MB s −1 for two FPGA cards). Further, they transfer many parameters that could be shared among several particles on a per atom basis onto the the accelerator which leads to significant I/O overhead. Virials and potential energies are calculated in each iteration, which is typically not required. Due to these factors, their FPGA solution is actually slower than the software baseline, since about 96% of the time is spent in communication. Apart from the apparent improvements in data communication volume (redundant parameters), the situation in connectivity is quite different today, where PCIe and NVLink provide much higher bandwidth between host and device. Further, modern FPGAs provide enough fast on-chip RAM resources such that no external SRAM/DRAM has to be used (this is another performance bottleneck of the system by Kasap et al.) .
In general, we can observe that FPGA designs have to leverage fixed-point arithemtic and operator fusing wherever possible (e.g. in the interpolators and for particle coordinates). Floating-point arithmetic should only be used where absolutely needed (e.g., final force values) as these cause high pipeline latencies and require a lot of LUT resources. According to literature [20] , the limiting resource types on modern FPGAs are likely going to be the LUTs and registers in this context, and the amount of DSP slices and memory blocks are of secondary concern. Further, fixed-point arithmetic enables to reach higher clock rates and maps better to DSP resources than FP. Note that the Anton chips almost exclusively employ fixed-point datapaths, too, since the use of fixed-point not only improves performance, but also has a positive impact on testability and reproducibility (out-of-order accumulations remain bit-true, for instance).
We base the FPGA estimates for the PP interaction and PME modules in this report on the resource figures provided by Herbordt's group, as these appear to be the best references that can be found in literature.
D. High-Level Synthesis
High-level synthesis (HLS) and related methods for FPGAs are often advertised as great productivity enhancers that significantly reduce the complexity of design-entry compared to traditional HDL flows. From experience, we can say that this is not always the case, as the benefit of HLS is quite design dependent. However, the results presented by [23] , [49] [50] [51] suggest that HLS or OpenCL based flows can indeed be used for certain parts of the PP and PME units.
For example, Saunullah et al. [49] compare an IP-based design flow with FFT IP's from Altera to an OpenCL kernel, and report quite large improvements in terms of overall resources (10× fewer ALMs, 25× less on-chip memory). While this needs to be interpreted with care (the paper does not reveal Node Name CPU (GHz) vCPUs †
GPU Config
The number of vCPUs (virtual CPUs) corresponds to the amount of concurrent threads. ‡ The K80 is a dual GPU card, so the effective number of GPUs is double the listed number. many implementation details), it seems to be reasonable that such an approach yields good results in this setting. The FFT is quite a regular algorithm which can be well described with OpenCL, and the Altera FFT IP's incur some overhead due to internal buffering, interfaces, etc. which are not needed.
On the other hand, Saunullah et al. [50] attempt to use OpenCL to design the same PP interaction pipeline that they developed with HDL in an earlier papers [14] , [20] , [21] . Their conclusion is that for such a highly tuned datapath OpenCL does not provide competitive results ("...the OpenCL versions are dramatically less efficient, with the Verilog design fitting from 3.5× to 7× more logic."). The work by Cong et al. [23] is a similar study that investigates a Xilinx HLS flow for the same PP interaction module. Their experiments reveal that part of the inefficiency of HLS for this particular module is the fact that HLS tools currently have difficulties producing efficient results for datapaths with dynamic data flow behavior where conditional execution exists within a processing element. Their solution is to describe certain parts critical for dynamic data flow in HDL, and generate the remaining subblocks of the datapath using HLS.
So based on our own experience and the above literature, we can say that OpenCL or HLS can increase productivity, and yield good results for algorithms that can be described well and that have regular compute patterns (e.g., systolic dataflow, no feedback loops, no dynamic control behavior). But for designs that are difficult to describe and tune with OpenCL such as parts of the force-pipeline a highly tuned HDL implementation turns out to be more efficient. An interesting design approach is to use a mix between both methodologies, leveraging HLS (and its automated verification functionality) for small datapath subblocks that are then connected and orchestrated using an HDL wrapper.
E. Vendor and Market Overview
Several well known high-performance software packages come from the academic sector and some of them like GROMACS [52] [53] [54] and LAMMPS [55] are released freely under opensource licences. Others like such as AMBER [56] , NAMD [57] and CHARMM [58] provide reduced or free academic licenses and require full licensing for commercial purposes.
Besides these codes originating from academia, there are a couple of companies that develop and sell MD simulation software such as Biovia from Dassault Systemes, AceMD from Acellera Ltd [59] , Yasara [5] , and Desmond from Shaw Research LLC [60] to name a few (see this market report for more info [61] ). Some of the main differentiators of these commercial software packages comprise:
• Workflow integration and collaborative tools (e.g., Dassault Systems), • Better visualization and GUI integration (e.g., Dassault
Systems, Yasara), • Performance tweaks for workstation systems (e.g., Yasara, Acellera), • Extreme scalability on clusters (e.g., Shaw Research), • Costumer support and consulting options (all of them). In terms of infrastructure, some companies sell application certified workstations and server blades, such as ExxactCorp 5 (GROMACS and AMBER workstations) and Acellera (AceMD MetroCubo) 6 . Another interesting trend in this field seem to be cloud services [61] , [62] . E.g., Acellera AceMD has built-in support for Amazon Web Services (AWS), that allows users to conveniently outsource simulation runs with a single command. The advantage of such cloud solutions are scalability and low up-front costs, which can be attractive for small labs and/or labs that have large variations in workload. Otherwise on-prem solutions can be more cost-effective.
The overall market however can be considered a niche market, since there are only a few 1000 to 10'000 users worldwide that use such specialized software packages. According to Goldbeck et al. [63] , the overall spending on scientific software is in the order of 0.1% of the total sector R&D spending, which would amount to roughly 100M$ in pharma/biotech and 50M$ in the chemicals/materials industry in Europe. Now this is an estimate for the total spending, so for a specific software package the market size will only be a fraction of that. So assuming a share in the order of 1% we can estimate that the market for a specific software package is likely in the order of several 100k$ to a few M$.
In this report, we use AMBER and GROMACS as benchmark baselines, since these provide very competitive runtimes on single-node systems and are widely used in the community. In addition to on-prem solutions, we also consider cloud infrastructure, since FPGAs have recently become available as a service (FaaS), in the form of the AWS F1 instance 7 . Edico Genomics 8 is an example for a company that successfully uses F1 instances to accelerate Genome sequencing.
III. SOFTWARE BENCHMARKS
In this section we assess the performance of two widely used GPU-accelerated MD packages using three different benchmarks and recent hardware. This is done to get a solid baseline for the comparisons with the FPGA performance estimates in Sec. IV. We used one on-site machine and several different Amazon Webservices (AWS) configurations to run the benchmarks. The machine configurations are listed in Tbl. I. All machines have almost identical processor models (only the core count and maximum frequency differs slightly), and all machines ran Linux 14.06 LTS. The configuration of the on-site machine Sassauna is in line with the GROMACS benchmarking paper by Kutzner et al. [64] , where it has been found that a dual Xeon machine with around 10 cores per socket and 2-4 customer grade GPUs provides the best cost-efficiency in terms of ns/$.
B. Benchmark Systems
Existing benchmarks are often MD software specific (in terms of input files), and hence we chose to recreate the input files from scratch to make sure we are simulating the same system. To accomplish that we used the CHARMM-GUI 9 online tool [65] , [66] . The benchmarks are listed in Tbl. II, and consist of three different proteins that have been solvated in a box of water. The amount of particles range from 35k to 157k, which represents typical problem sizes in biochemistry today. The two smaller systems are similar to the DHFR and Apoa1 benchmarks often used in literature in context with the AMBER and NAMD software packages. 9 Input generator on http://www.charmm-gui.org/
C. Benchmarking Results
We use the standard protocols provided by CHARMM-GUI to equilibrate the systems. The equilibrated systems are then simulated in the NPT ensemble for the amount of steps specified in Tbl. II. In GROMACS, a Nose-Hoover thermostat is used in combination with Parinello-Rahman pressure control and in Amber, a Langevin Thermostat is used together with Montecarlo Pressure control (these were the default settings from CHARMM-GUI). The force-fields employed is CHARMM36m, which uses a TIPS3P water model (with Coulomb and LJ interactions on the O and H atoms). Energies are calculated every 100 and stored every 1k steps in both cases. For each software package, we run different parallelization options to find the optimal one.
1) GROMACS Benchmarks on Sassauna
Similar as described in [64] , we sweeped different MPI rank/openmp thread combinations, in combination with different numbers of GPUs. The results are shown in Fig. 1a ,c,e. For standard GPU accelerated runs (blue lines) where the particleparticle (PP) interactions are mapped to the GPUs and the PME is handled on the CPUs 10 , we observe speedups in the order of 4× to 6× with respect to the CPU-only version (green line). Note that in GROMACS 2018 update, a new feature became available that allows to map the PME evaluation to one of the GPUs, which leads to significant runtime improvement (see orange lines) on single-node systems as the costly 3D FFT communication phases are now contained within one device instead of being spread accross many CPU cores. This leads to a performance improvement of another 2×.
2) AMBER Benchmarks on Sassauna
As can be seen in Fig. 1b,d ,f, the CPU-only implementation is quite slow when compared to GROMACS. However, the GPU accelerated configuration with 2 GPUs provides similar performance as GROMACS with 1-2 GPUs using the new GPU PME feature. For more GPUs, the performance does not scale on this system. Note that as opposed to GROMACS, AMBER can run the whole simulation loop on the GPUs without involving the CPU cores. In multi-GPU runs, several GPUs communicate via the PCIe bus. This is only working well for up to 2 cards in this machine, since each pair is connected to one XEON socket, and connections bettween these pairs have to go over QPI. Compute nodes with NVLINK or optimized PCIe infrastructure will likely perform better on such runs 11 . The advantage of AMBER is not necessarily the speed of a single MD run (GROMACS is faster in that case), but the fact that the no expensive multi-core CPUs have to be used in order to get decent performance as in GROMACS. This allows to run many simulations in parallel in a very costeffective manner.
Note that we used the licensed version of AMBER16 for these benchmarks with all updates as of June 2018. There also exists AMBER18 that has been released recently, and (dt=2fs, cuto =12A) gmx2018.1, sassauna, T40G4 0 GPUs 1 GPUs 2 GPUs 3 GPUs 4 GPUs 1 GPUs + GPU-PME 2 GPUs + GPU-PME 3 GPUs + GPU-PME 4 GPUs + GPU-PME (a) 0 GPUs 1 GPUs 2 GPUs 3 GPUs 4 GPUs 1 GPUs + GPU-PME 2 GPUs + GPU-PME 3 GPUs + GPU-PME 4 GPUs + GPU-PME 0 GPUs 1 GPUs 2 GPUs 3 GPUs 4 GPUs 1 GPUs + GPU-PME 2 GPUs + GPU-PME 3 GPUs + GPU-PME 4 GPUs + GPU-PME (e) that likely includes further optimizations for the Pascal and Volta generation GPUs.
3) GROMACS Benchmarks on AWS
As mentioned in the introduction, cloud infrastructure/software services (IaaS/SaaS) are interesting alternatives to on-site solutions. Hence, we also benchmark a few AWS instances in order to be able to include them in the performance cost comparison later on in Sec. IV. At the time of writing, the demand for P3 instances with V100 SMX2 cards was extremely high, and hence we only got access to one 2xlarge instance with one GPU. As such, the performance improvements for multi-GPU runs with V100's has to be extrapolated from these measurements. As we can observe in Fig. 2 the improvement is in the order of 20% with respect to the GTX1080Ti for the larger benchmarks. This improvement seems to be reasonable, since the raw increase in FLOP/s is around 35%. We can also see that the P2 instances with K80 GPUs are significantly slower than the P3 instance or Sassauna.
4) GROMACS Breakdown of Simulation Loop
The walltime breakdown for the three different benchmarks is shown in Tbl. III in % for three different cases: single threaded (CPU-only), multi-threaded with PME on CPUs, and multithreaded with both PP interactions and PME on GPUs.
The walltime breakdown for the single-threaded case is shows the well-known picture: the compute time is mainly dominated by non-bonded force evaluations, which accounts for more than 96%, including PME. The force time also includes bonded forces in this case, but the fraction is insignificant compared to the non-bonded part.
In the multi-threaded case with PP interactions on the GPU, the picture already changes quite a bit. The runtime of the PP interaction kernels is not visible since they are executed in parallel to the CPU code, for which the walltime accounting is performed. The breakdown is a bit more difficult to read, but we see that PME starts to become the dominant factor (more than 50%)..
In the third case, we note that other parts besides the PP interactions and PME are starting to become significant as well. In particular the operations that cannot be overlapped with the accelerated PP and PME calculations (such as domain decomposition, reduction operations, trajectory sampling, integration, global energy communication) start to amount for a significant percentage of overall walltime (around 25% in this benchmarks). As we shall see in Sec. IV on hardware performance estimates, this inherently limits the maximum acceleration that can be achieved by using a co-processor solution. This issue is not exclusive to GROMACS, and has also been noted, e.g., in [67] with NAMD.
It is worth noting that the constraints step is required to keep high-frequency oscillations of bonds involving light atoms (H) under control. Otherwise, significantly smaller timesteps than the currently employed 2 fs have to be employed in order to ensure correct integration and prevent the simulation from blowing up. These constraints are implemented using a combination of efficient parallel algorithms in GROMACS (P-LINCS for general bonds [26] , [27] , analytic SETTLE [28] for water molecules).
IV. PERFORMANCE ESTIMATIONS AND COMPARISONS
In this section, we estimate the performance of FPGA accelerated systems, compare them with GPU-based solutions 
A. Considered System Architectures
In order to narrow down the system-level architectural templates to focus on, we first make a couple of observations:
• While some kernels of the simulation loop can be well implemented in HW (PME, PP interactions), there are other parts such as the enforcement of constraints, bonded interactions and integration (double precision) which are (c) Non-Uniform Allocation: PME PME PME PME PME PME PME PME PME ... Fig. 3 . Architectural template used for the performance estimations and different PP/PME unit allocations.
better suited for CPUs due to the control flow and flexibility required for algorithmic changes. If we stick to this partitioning, this means that the simulation loop will always involve a complete state exchange (coordinates in, forces out) in each simulation step.
• Bandwidth from CPU to logic and vice versa is similar for a PCIe card (15 GB s −1 raw bandwidth) as for an ARM system on a SoC like Stratix 10, Zynq SoC (2×128 bit plug at, say 400 MHz gives 12.8 GB s −1 of raw bandwidth in total). The only benefit of a SoC will likely be latency, but the embedded CPUs are not as capable as on a server class system. Further, latency is the smaller evil on small single node systems -it is the bandwidth that is important. I.e., to transfer 2MB of simulation state over 12 GB s −1 166 µs, while link latency is only in the order of a few microseconds.
• As we have seen in Sec. III, a capable server class CPU is required to handle the non-HW-accelerated part of the simulation within the available time budget. An ARMbased SoC system will likely be too slow.
Hence, it makes sense to stick to a system architecture template of the form Xeon Class Server plus N FPGA cards with own interconnect. This is also aligned with what is currently available in cloud services (AWS F1 for instance), and such a system is depicted in Fig. 3 .
We do not consider IBM Power 8/9 systems in this report as due to their cost and limited availability. Judging from the literature, most people in this space seem to be used to x86 machines, and these machines together with customer grade GPUs are more affordable.
B. Estimated System Components
As explained before, we assume the architectural template shown in Fig. 3a) . For the FPGA cards, we base the resource and performance estimates on reported values in literature, as explained below.
1) Interconnects
The assumed architectural template consists of two XEON host CPUs that can host a maximum of 4 cards at full x16 PCIe bandwidth, or 8 cards at x8 bandwidth. For the PCIe link efficiency (framing and other overheads), we assume a bandwidth efficiency of 75% and a latency of 10 µs. Further we assume that the FPGA cards are interconnected with a bidirectional ring. This is readily possible due to the high amount of transceivers available on todays high-end FPGAs. In fact, almost every PCIe FPGA card features at least one QSFP port. As explained later, the targeted FPGA platforms either provide two or four QSFP28 cages, allowing for raw ring bandwidths of 200 or 200-400 Gbit. The assumed link efficiency is 85% including framing and packetizing overheads, and a link latency of 0.5 µs is assumed.
2) PP Interaction Pipelines
We base our estimates on the PP interaction pipeline by M. Herbordt's group [20] , [21] that employs particle filters that generate the neighbor lists on-the-fly. This architecture is well suited for hardware implementations, and does not require as much memory as an implementation with explicit neighborlists. The employed arithmetic is a mix between fixed-point and single-precision floating-point, and has been optimized for FPGAs. The employed domain decomposition is an optimized variant of the half shell method. Better methods with less inter-cell communication volume exist (e.g., neutral territory and mid-point methods [60] ), but these are likely to show their benefits only in the highly scaled regime (which is not the scope of this evaluation). According to the results of Herbordt et al., the preferred design employs first-order interpolation with piecewise polynomials, and this design amounts to around 14.5k ALMs and 82 DSP multipliers on a Stratix IV (including filters, LJ and Coulomb datapath, distribution and accumulation logic). We estimate the needed amount of memory separately as described further below.
The PP interaction pipelines can calculate one interaction per cycle. In terms of utilization, the authors reported that it is possible to achieve relatively high percentages in the order of 95% by properly mapping the particles to the filters and LJ/Coulomb force datapaths. However, since in our evaluation we deal with many more parallel pipelines (several 100 instead of several tens), we assume a slightly reduced utilization of 80% to account for potential imbalances.
The required RAM resources have been estimated by assuming that each particle position entry requires three 32 bit coordinate entries plus one 32 bit entry holding metadata like atom ID and type (we assume that atom specific LJ interaction values and charge parameterizations can be compiled into ROMs and indexed by these atom type IDs at runtime). The particle force accumulation entries are assumed to comprise ‡ For the AWS instances, the prices are per hour.
+ The Amber license is a site license. We assume here for simplicity that a site consists of 4 nodes to amortize the license cost.
* Or a similar FPGA card that provides up to 400 Gbit ring interconnection capability.
three 32 bit values. The estimation for memory usage includes all interpolator lookup tables, atom property ROMs, all position+metadata entries and private accumulation buffers of the PP interaction pipelines.
3) PME Unit
Herbordt's group demonstrated as well that it is possible to fit a 3D FFT unit with 64 3 grid points onto recent FPGAs [18] , [19] . The dominant factor here are the 1D FFT macros provided by the FPGA vendors. Hence, we use the complexity of these vendor provided macros to estimate the resources for the 3D FFT part. Since our FFTs sizes are in the order of 84 3 grid points, we assume that the resource consumption is similar to 128-point FFT macros, i.e., 5k LUT, 32 DSP multipliers and 16kBit memory on a VUP FPGA 12 . The latency corresponds to the amount of samples to be calculated.
For the particle-to-grid and grid-to-particle interpolators, we use the results reported in [16] , where an optimized single-cycle datapath is designed and implemented. One such unit requires 51k ALMs 192 DSP blocks (= 2 × 192 DSP multipliers) on a Stratix V FPGA. Further, we assume that this design can be optimized such that the same interpolators can be used for both interpolation directions, as well as the PME solution step in the frequency domain that entails a point-wise multiplication with pre-computed constants. We employ a similar load balancing mechanism as GROMACs between PME and PP cards. ‡ The maximum available resources are reduced in this case due to the AWS shell infrastructure (see text for more details). + We assume a PCIe link efficiency of 75%, and since we use 8 PCIe cards only half the 16x bandwidth is available per FPGA.
§ Assuming a link efficiency of 85% for the ring interconnect.
4) Resource Allocation
In terms of allocation of PME and PP units to accelerator boards it seems natural to uniformly distribute them as illustrated in Fig. 3b ). The advantage of this allocation is scalability of compute resources. However, the communication pattern of the two 3D FFT passes in PME leads to high communication volume between these distributed PME units, and hence it has been found that a contraction in the PME calculation can lead to better performance [68] . E.g., for highly parallel scenarios GROMACS supports the use of fewer separate MPI ranks [53] . Further, an additional optimization for single node systems has been added in the newest GROMACS release where the PME calculation can be allocated to one GPU. AMBER moved to single GPU implementations already a while ago to solve this communication issue. Hence, we do not consider uniform allocation, but the non-uniform allocation shown in Fig. 3c ).
5) Schedule
The computation schedule follows a relatively simple repetitive pattern. In each iteration, all MPI processes on the host push the particle coordinates and meta-information down to the FPGA card using bulk DMA transfers, thereby utilizing the full effectively available PCIe bandwidth. Computation on the FPGA side can be largely overlapped with the PCIe transfers, since we can start computing PP interactions already with a small part of the simulation volume. Further, the coordinates and meta data can be gathered and transferred to the PME card on-the-fly (via the ring interconnect), and the grid interpolators can start with particle-to-grid interpolation. As soon as the complete simulation volume has been transferred, the PP cards exchange the overlap regions required for PP interactions via the ring. Once the inverse 3D FFT is complete, the gridto-particle interpolation can be started and overlapped with the scatter operation that transfers the forces back to the originating card. The forces are then accumulated within the PP interaction force buffers, and once all non-bonded forces have been calculated, the results are copied back to the host.
C. FPGA Targets and Hardware Costs
In order to achieve the required performance, large datacenter grade FPGAs are required. In this report we target the Xilinx Virtex UltraScale+ series, as well as the Intel Stratix 10 series, since at the time of writing, these represent the best FPGA technology that is available (or soon will be). On the Xilinx side, we identified the VU9P device as an ideal target as it is currently widely used and stable in production (this device is also available in the AWS F1 instances). The roughly 50% larger VU13P device will soon reach stable production, as well, and can be seen as the natural successor of the VU9P device in the forthcoming years. On the Intel side, the only FPGA that can currently match the Virtex UltraScale+ devices is the upcoming Stratix 10 series (the Arria 10 devices are too small). In particular, it seems that the 1SGX280 device will be the equivalent of the VU9P in terms of adoption (several development boards feature this device). However, we found that the availability of these devices is not yet guaranteed (especially for the H-Tile devices with fast transceivers), and we can expect that stable products are likely not to going to be introduced before next year. In order to compare system level costs, we assume the prices and power consumptions 13 listed in Tbl. IV. For simplicity, we lump the costs for the optical modules together with the corresponding FPGA boards (roughly 200$ per QSFP28 slot). Equipment and software packages are amortized over 5 years (straight line), and the electricity is assumed to cost 0.1$ per kWh. Further, for each on-site solution we double the electricity cost in order to account for cooling. No sales margin is added to the FPGA solutions in this comparison, but in a commercial setting this has to be accounted for as well.
D. Evaluated Configurations and Assumptions
We calculate our estimates for a system with N = 8 FPGA cards in the system (as discussed earlier, the PP interaction pipelines are allocated on 7 FPGAs, while 1 FPGA card is used for PME). These configurations are listed in Tbl. V. The amount of units (e.g. FFTs) listed in that table is per FPGA instance. We note that on these modern FPGAs, logic resources are the ones that are going to be critical. In order to allow for enough headroom for additional infrastructure such as, e.g., Aurora and PCIe PHYs we target a LUT/ALM usage of 75% on the VU9P and 1SGX280 devices (both offer a similar amount of logic resources). On the larger VU13P devices and AWS F1 instance, we target a higher utilization in the order of 85%. This is possible since the VU13P offers around 50% more logic resources than the VU9P, and on the AWS instance, the infrastructure is already included in the AWS F1 Shell that wraps the user logic 14 . From test syntheses of an HDL design optimized for FPGAs (NTX cores from [69] ) we found that operating frequencies up to 400MHz and 700MHz should be achievable on the Virtex UltraScale+ and Stratix 10 devices, respectively. From these test syntheses we also calculated logic conversion factors for the LUTs and ALMs to derate the numbers reported for the 3D FFT and PP cores in literature (shown in Tbl. VI). For the estimations in this section, we consider the 91k problem (2LEM) of the previous section, and assume a software overhead of 280 µs on 40 logical cores that can not be overlapped with the force computations (this amounts to 25% of the overall walltime of a single step, see previous section on benchmarks). Note that we adjust the PME grid resolution and PP interaction cutoffs to balance the load between the PP and PME cards at similar accuracy. This is analogous to the PME load balancing procedure performed in GROMACS simulation runs [64] .
We also note that we do not explicitly account for potential and virial calculations here as these are only carried out every 100 steps in the considered benchmarks. The support for these calculations can be added to the hardware either by extending the force pipelines (this leads to a small increase in DSP slices which are still abundantly available) or by reusing existing interpolator infrastructure and repeated evaluation, leading to a decrease in performance of around 1%.
E. Results
The estimations for resources and timings are shown in Tbl. V at the bottom. The highest performance is achieved by the VU13P and f1.16xlarge designs. In the first case this is due to the high amount of logic resources on the VU13P and in the second case, more CPU cores translate into lower software overheads. In all cases we note that the non-hideable software overheads (domain decomposition, integration, constraints, etc.) are either in similar in magnitude than the accelerated PP and PME portions.
When comparing the VCU1525 and XUPP3R solutions with VU9P FPGAs, we can observe that the faster ring interconnect available with the more expensive XUPP3R does lead to a small improvement in speed, but this is likely not worth the price increase of around 2× in case of the VU9P. However, for the larger/faster VU13P and 1SGX280 FPGAs, the faster ring interconnect is desirable in order to match the bandwidth with the increased throughput.
In order to better compare different solutions, we cast these results into a performance (ns/d) versus cost ($/h) and add different operating points of GPU-accelerated solutions. The performance values for FPGA versions with less than eight cards have been derated from the 8 card solutions (assuming that one of the cards now contains both the complete PME unit and some PP interaction pipelines). The performance values for the GPU solutions employing 1-4×GTX1080Ti and 1 V100 (on AWS) have been measured (see previous section on benchmarks). The remaining operating points have been estimated using numbers from existing benchmarks 15 . The blue dots are all for GROMACS 2018, the green ones for AMBER 16/17 and the orange ones for FPGA solutions based Virtex UltraScale+ and Stratix 10 FPGAs. On the right side of the plot we have the AWS instances, and on the left the on-prem versions. The closer solutions are to the upper left corner of the plot, the better, and the diagonal lines represent same performance/cost. The red dot in the upper left corner shows the desirable performance/cost of an ideal solution that domain experts would consider commercially feasible.
F. Discussion
As can be observed in Fig. 4 , systems employing consumer grade GTX1080Ti cards are clearly at the forefront in terms of cost efficiency (slanted lines represent equal performance/cost). And it should be noted that this efficiency is bound to increase, since there seems to be a trend in moving more computations or even the complete simulation loop onto the GPUs and have them communicate in a peerto-peer fashion. AMBER already supports this, allowing to assemble very cost efficient desktop systems, since in that case no expensive high-core-count CPUs are required anymore. This is not reflected in the plot above yet, as these costs have been calculated assuming a dual XEON (20 core) server.
We can also see that with FPGA solutions based on UltraScale+ and Stratix 10 devices there is not much to be gained with respect to the GPU solutions. I.e., it is possible to achieve speedups in the order of around 1.5× to 2×, but the performance/cost ratio is similar to GPU solutions. This is a combination of two key factors: first, FPGAs prices are in the range of datacenter GPUs, which makes it difficult to compete with consumer grade GPUs that offer very attractive single-precision FLOPS/$ ratios. And second, the amount of remaining work that can not be overlapped with non-bonded force computations starts to become dominant, thereby leading to a saturation of the achievable speedup. For instance, in the case of a system with 8 VU13P FPGAs, our estimates indicate that the PP and PME calculations take less time than the remaining non-hideable parts in software (200 µs versus 280 µs). Hence, we see that even a 4× speedup of the calculation of the PP interactions and PME only leads to an application performance improvement of only 2×.
1) Commercial Feasibility
We have been in contact with domain experts and according to them, a new accelerator solution based on a different technology than GPUs should offer at least 1 µs of simulation performance at the cost of one high-end GPU in order to be perceived as a viable alternative (this ideal solution is indicated with a red dot in Fig. 4 ). Considering our results and this desirable target, it becomes evident that a FPGA co-processor solution will likely not be commercially successful since a mere 2× improvement in speed at similar cost efficiency does not represent a compelling value proposition for potential users (we have to keep in mind here that an FPGA solution will likely be less flexible in terms of functionality than a GPUbased one). Instead, MD users will likely just run several simulations in parallel at somewhat lower speed and better cost efficiency, e.g., to improve sampling statistics or to screen multiple chemical compounds in parallel.
2) A Note on Algorithmic Improvements
We see that when targeting non-bonded interactions, architectural specialization and current FPGA technology alone will likely not be enough to get the needed improvement in speed (with respect to GPU solutions), and some algorithmic innovation will be needed as well (that is, unless FPGA prices drop over 10× the coming years, which is not very probable). The following points should however be kept in mind when doing algorithmic work in this field:
• This is a mathematical field, and people would like to have error guarantees and bounds. Despite the fact that it is very challenging to come up with an algorithm that is better than SOA, it takes a long time until such a new method is established and accepted by the community. For example, variations of FMM and multi-level summation methods (MSM) [4] , [37] have been in development for many years now, and they still not used on a regular basis for production runs -even if they may have benefits in some operating regions. One of the reasons for this slow development and adoption is that it is difficult to find and introduce optimizations/approximations that do not introduce assumptions that violate physical laws and eventually lead to erratic behavior. Further, it is difficult to verify the correctness of a certain approach or implementation.
• It is likely that GPU-based solutions will benefit from algorithmic improvements, too (e.g., improvements of the integration and constraints steps), and the implementation turn-around time is shorter for GPUs than for FPGAs.
• Machine-learning approaches are well suited for a certain class of ab-initio potential evaluations (AIMD), and they have been shown to give quite impressive speedups in that case. However, in pure classical MD, the force fields already have a very simple and efficient functional form (basically polynomial expressions for classical mechanics), and contain far fewer parameters than, e.g., the ANI-1 NN potential [39] . It is currently unclear how such an NN-based approach could be used to accelerate potential expressions in classical MD simulations.
3) Other Hardware Acceleration Opportunities ASIC integration of the non-bonded interaction pipelines could be a way of improving the performance with respect to FPGAbased solutions, but in the end this approach suffers the same shortcomings as the FPGA co-processor solution, and in addition the market does not seem to be big enough to justify the development costs. Moving to fully integrated solutions in a similar fashion as this has been done in the Anton systems can also lead to higher performance, but the design effort (and involved risks) are quite large and hence difficult to justify. So far, Anton-1/2 SoCs have been the only successful chips built in this fashion. MDGRAPE-4, which is the only other SoCbased system, seems to be fizzled out as there has been no update on the project for 4 years (the last publication [7] is from 2014). Another fact that has to be kept in mind is that there are several patent applications and granted patents [70] [71] [72] [73] protecting features of the Anton-type SoCs, which could complicate commercial exploitation of such a solution.
Considering the difficulties mentioned above, it becomes clear that different approaches for leveraging FPGAs should be considered. It could make sense to turn towards scaled datacenter solutions and look into hybrid solutions that leverage FPGAs as near-network accelerators (similar as this has been done in Catapult-I/II [47] , [48] , or applications such as networking filtering, high-frequency trading, etc). I.e., we could use FPGAs to complement multi-node GPU systems in order to improve the scalability of such systems. Consider for instance the scaling behavior of GPU-accelerated GROMACS runs on the Hydra supercomputer (Max Planck Computing Centre, 20-core Ivy Bridge nodes with 2xK20X and FDR-14) in Fig. 5 . We can observe the well-known hard-scaling issue for typical problem sizes (top 2 curves, 81k atoms). Problems with several millions of atoms (lower 4 curves) are often used for benchmarking, but they do not represent common everyday problems. What is interesting to note is that GPU accelerated problems with ¡100k atoms often reach their 50% scaling limit very quickly at around 8 nodes -and this is something that can be observed on other clusters, too (see [64] , for example). From what can be read in literature this scaling bottleneck is mainly due to PME and global communication phases.
So it is likely that a network accelerator could be used to ameliorate this scaling bottleneck by bypassing the standard InfiniBand interconnects and the MPI software stack, and by providing a dedicated secondary network with functional capabilities tailored towards PME and global communication (e.g. for energies and virials) and even integration/constraints handling. Possible arrangements could be 1 FPGA network card per node and a 2D or 3D torus. To this end, initial studies on 3D FFTs for molecular dynamics on FPGA clouds by Herbordt, et al. [68] show promising results (also considering phased contraction where certain parts of the computation are carried out on a subset of all nodes to improve communication patterns). Another arrangement that is similar to phased contraction and suitable for small-scale clusters would be a star arrangement, where all nodes have a connection to a single external FPGA box that performs PME and global reductions/broadcasts. As we have seen with the recent GROMACS update, solving PME on one device alone can be beneficial since this improves the communication volume.
V. CONCLUSIONS
In this report, we benchmarked two widely used GPUaccelerated MD packages using typical MD model problems, and compare them with estimates of an FPGA-based solution in terms of performance and cost. Our results show that while FPGAs can indeed lead to higher performance than GPUs, the overall application-level speedup remains in the order of 2×. Considering the price for GPU and FPGA solutions, we observe that GPU-based solutions provide the better performance/cost tradeoff, and hence pure FPGA-based solutions are likely not commercially viable. However, we also note that scaled systems could potentially benefit from a hybrid composition, where GPUs are used for compute intensive parts and FPGAs for in-network acceleration of latency and communication sensitive tasks.
