24 research outputs found

    A GPU-accelerated Direct-sum Boundary Integral Poisson-Boltzmann Solver

    Full text link
    In this paper, we present a GPU-accelerated direct-sum boundary integral method to solve the linear Poisson-Boltzmann (PB) equation. In our method, a well-posed boundary integral formulation is used to ensure the fast convergence of Krylov subspace based linear algebraic solver such as the GMRES. The molecular surfaces are discretized with flat triangles and centroid collocation. To speed up our method, we take advantage of the parallel nature of the boundary integral formulation and parallelize the schemes within CUDA shared memory architecture on GPU. The schemes use only 11N+6Nc11N+6N_c size-of-double device memory for a biomolecule with NN triangular surface elements and NcN_c partial charges. Numerical tests of these schemes show well-maintained accuracy and fast convergence. The GPU implementation using one GPU card (Nvidia Tesla M2070) achieves 120-150X speed-up to the implementation using one CPU (Intel L5640 2.27GHz). With our approach, solving PB equations on well-discretized molecular surfaces with up to 300,000 boundary elements will take less than about 10 minutes, hence our approach is particularly suitable for fast electrostatics computations on small to medium biomolecules

    High-productivity, high-performance workflow for virus-scale electrostatic simulations with Bempp-Exafmm

    Full text link
    Biomolecular electrostatics is key in protein function and the chemical processes affecting it.Implicit-solvent models expressed by the Poisson-Boltzmann (PB) equation can provide insights with less computational power than full atomistic models, making large-system studies -- at the scale of viruses, for example -- accessible to more researchers. This paper presents a high-productivity and high-performance computational workflow combining Exafmm, a fast multipole method (FMM) library, and Bempp, a Galerkin boundary element method (BEM) package. It integrates an easy-to-use Python interface with well-optimized computational kernels that are written in compiled languages. Researchers can run PB simulations interactively via Jupyter notebooks, enabling faster prototyping and analyzing. We provide results that showcase the capability of the software, confirm correctness, and evaluate its performance with problem sizes between 8,000 and 2 million boundary elements. A study comparing two variants of the boundary integral formulation in regards to algebraic conditioning showcases the power of this interactive computing platform to give useful answers with just a few lines of code. As a form of solution verification, mesh refinement studies with a spherical geometry as well as with a real biological structure (5PTI) confirm convergence at the expected 1/N1/N rate, for NN boundary elements. Performance results include timings, breakdowns, and computational complexity. Exafmm offers evaluation speeds of just a few seconds for tens of millions of points, and O(N)\mathcal{O}(N) scaling. This allowed computing the solvation free energy of a Zika virus, represented by 1.6 million atoms and 10 million boundary elements, at 80-min runtime on a single compute node (dual 20-core Intel Xeon Gold 6148). All results in the paper are presented with utmost care for reproducibility.Comment: 14 pages, 6 figure

    A Tuned and Scalable Fast Multipole Method as a Preeminent Algorithm for Exascale Systems

    Full text link
    Among the algorithms that are likely to play a major role in future exascale computing, the fast multipole method (FMM) appears as a rising star. Our previous recent work showed scaling of an FMM on GPU clusters, with problem sizes in the order of billions of unknowns. That work led to an extremely parallel FMM, scaling to thousands of GPUs or tens of thousands of CPUs. This paper reports on a a campaign of performance tuning and scalability studies using multi-core CPUs, on the Kraken supercomputer. All kernels in the FMM were parallelized using OpenMP, and a test using 10^7 particles randomly distributed in a cube showed 78% efficiency on 8 threads. Tuning of the particle-to-particle kernel using SIMD instructions resulted in 4x speed-up of the overall algorithm on single-core tests with 10^3 - 10^7 particles. Parallel scalability was studied in both strong and weak scaling. The strong scaling test used 10^8 particles and resulted in 93% parallel efficiency on 2048 processes for the non-SIMD code and 54% for the SIMD-optimized code (which was still 2x faster). The weak scaling test used 10^6 particles per process, and resulted in 72% efficiency on 32,768 processes, with the largest calculation taking about 40 seconds to evaluate more than 32 billion unknowns. This work builds up evidence for our view that FMM is poised to play a leading role in exascale computing, and we end the paper with a discussion of the features that make it a particularly favorable algorithm for the emerging heterogeneous and massively parallel architectural landscape

    Petascale turbulence simulation using a highly parallel fast multipole method on GPUs

    Full text link
    This paper reports large-scale direct numerical simulations of homogeneous-isotropic fluid turbulence, achieving sustained performance of 1.08 petaflop/s on gpu hardware using single precision. The simulations use a vortex particle method to solve the Navier-Stokes equations, with a highly parallel fast multipole method (FMM) as numerical engine, and match the current record in mesh size for this application, a cube of 4096^3 computational points solved with a spectral method. The standard numerical approach used in this field is the pseudo-spectral method, relying on the FFT algorithm as numerical engine. The particle-based simulations presented in this paper quantitatively match the kinetic energy spectrum obtained with a pseudo-spectral method, using a trusted code. In terms of parallel performance, weak scaling results show the fmm-based vortex method achieving 74% parallel efficiency on 4096 processes (one gpu per mpi process, 3 gpus per node of the TSUBAME-2.0 system). The FFT-based spectral method is able to achieve just 14% parallel efficiency on the same number of mpi processes (using only cpu cores), due to the all-to-all communication pattern of the FFT algorithm. The calculation time for one time step was 108 seconds for the vortex method and 154 seconds for the spectral method, under these conditions. Computing with 69 billion particles, this work exceeds by an order of magnitude the largest vortex method calculations to date

    GPU fast multipole method with lambda-dynamics features

    Get PDF
    A significant and computationally most demanding part of molecular dynamics simulations is the calculation of long-range electrostatic interactions. Such interactions can be evaluated directly by the naĂŻve pairwise summation algorithm, which is a ubiquitous showcase example for the compute power of graphics processing units (GPUS). However, the pairwise summation has O(N^2) computational complexity for N interacting particles; thus, an approximation method with a better scaling is required. Today, the prevalent method for such approximation in the field is particle mesh Ewald (PME). PME takes advantage of fast Fourier transforms (FFTS) to approximate the solution efficiently. However, as the underlying FFTS require all-to-all communication between ranks, PME runs into a communication bottleneck. Such communication overhead is negligible only for a moderate parallelization. With increased parallelization, as needed for high-performance applications, the usage of PME becomes unprofitable. Another PME drawback is its inability to perform constant pH simulations efficiently. In such simulations, the protonation states of a protein are allowed to change dynamically during the simulation. The description of this process requires a separate evaluation of the energies for each protonation state. This can not be calculated efficiently with PME as the algorithm requires a repeated FFT for each state, which leads to a linear overhead with respect to the number of states. For a fast approximation of pairwise Coulombic interactions, which does not suffer from PME drawbacks, the Fast Multipole Method (FMM) has been implemented and fully parallelized with CUDA. To assure the optimal FMM performance for diverse MD systems multiple parallelization strategies have been developed. The algorithm has been efficiently incorporated into GROMACS and subsequently tested to determine the optimal FMM parameter set for MD simulations. Finally, the FMM has been incorporated into GROMACS to allow for out-of-the-box electrostatic calculations. The performance of the single-GPU FMM implementation, tested in GROMACS 2019, achieves about a third of highly optimized CUDA PME performance when simulating systems with uniform particle distributions. However, the FMM is expected to outperform PME at high parallelization because the FMM global communication overhead is minimal compared to that of PME. Further, the FMM has been enhanced to provide the energies of an arbitrary number of titratable sites as needed in the constant-pH method. The extension is not fully optimized yet, but the first results show the strength of the FMM for constant pH simulations. For a relatively large system with half a million particles and more than a hundred titratable sites, a straightforward approach to compute alternative energies requires the repetition of a simulation for each state of the sites. The FMM calculates all energy terms only a factor 1.5 slower than a single simulation step. Further improvements of the GPU implementation are expected to yield even more speedup compared to the actual implementation.2021-11-1
    corecore