1,496 research outputs found
Multi-Architecture Monte-Carlo (MC) Simulation of Soft Coarse-Grained Polymeric Materials: SOft coarse grained Monte-carlo Acceleration (SOMA)
Multi-component polymer systems are important for the development of new
materials because of their ability to phase-separate or self-assemble into
nano-structures. The Single-Chain-in-Mean-Field (SCMF) algorithm in conjunction
with a soft, coarse-grained polymer model is an established technique to
investigate these soft-matter systems. Here we present an implementation of
this method: SOft coarse grained Monte-carlo Acceleration (SOMA). It is
suitable to simulate large system sizes with up to billions of particles, yet
versatile enough to study properties of different kinds of molecular
architectures and interactions. We achieve efficient simulations by
using accelerators such as GPUs on both workstations and
supercomputers. The implementation remains flexible and maintainable because
it is written in a scientific programming language enhanced by
OpenACC pragmas for the accelerators. We present implementation details and
features of the program package, investigate the scalability of our
implementation SOMA, and discuss two applications, which cover system sizes
that are difficult to reach with other, common particle-based simulation
methods.
Simulation of 1+1 dimensional surface growth and lattices gases using GPUs
Restricted solid on solid surface growth models can be mapped onto binary
lattice gases. We show that efficient simulation algorithms can be realized on
GPUs either by CUDA or by OpenCL programming. We consider a
deposition/evaporation model following Kardar-Parisi-Zhang growth in 1+1
dimensions related to the Asymmetric Simple Exclusion Process and show that for
sizes that fit into the shared memory of GPUs, one can achieve the maximum
parallelization speedup (~100x for a Quadro FX 5800 graphics card with respect
to a single 2.67 GHz CPU). This permits us to study the effect of quenched
columnar disorder, requiring extremely long simulation times. We compare the
CUDA realization with an OpenCL implementation designed for processor clusters
via MPI. A two-lane traffic model with randomized turning points is also
realized and the dynamical behavior has been investigated. Comment: 20 pages, 12 figures, 1 table, to appear in Comp. Phys. Com
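The mapping described in the abstract above, from a 1+1 dimensional growth model to a binary lattice gas, can be illustrated with a minimal serial sketch. It assumes the common convention that a particle (1) stands for a down-slope and a hole (0) for an up-slope, so that deposition at a local minimum is a particle hop to the right and evaporation is a hop to the left. The names `sweep` and `heights_from_occupancy` and the rates `p` and `q` are hypothetical, and this plain Python loop is not the CUDA or OpenCL algorithm of the paper.

```python
import random

def heights_from_occupancy(occ):
    """Rebuild a height profile: particle (1) = down-slope, hole (0) = up-slope."""
    h = [0]
    for n in occ:
        h.append(h[-1] + (1 - 2 * n))
    return h

def sweep(occ, p, q, rng):
    """One sweep of random-sequential updates on a periodic ring (in place).

    p: deposition probability (particle hops right),
    q: evaporation probability (particle hops left).
    """
    size = len(occ)
    for _ in range(size):
        i = rng.randrange(size)
        j = (i + 1) % size
        r = rng.random()
        if occ[i] == 1 and occ[j] == 0 and r < p:
            occ[i], occ[j] = 0, 1   # deposition: local minimum becomes a local maximum
        elif occ[i] == 0 and occ[j] == 1 and r < q:
            occ[i], occ[j] = 1, 0   # evaporation: local maximum becomes a local minimum
    return occ

if __name__ == "__main__":
    rng = random.Random(42)
    occ = [1, 0] * 16               # flat (zig-zag) initial surface at half filling
    for _ in range(100):
        sweep(occ, p=1.0, q=0.0, rng=rng)
    print(heights_from_occupancy(occ))
```

Because every update only swaps two neighbouring bits, the particle number is conserved and all height differences stay at ±1, which is what makes bit-coded storage of the surface possible.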
The GENGA Code: Gravitational Encounters in N-body simulations with GPU Acceleration
We describe an open source GPU implementation of a hybrid symplectic N-body
integrator, GENGA (Gravitational ENcounters with Gpu Acceleration), designed to
integrate planet and planetesimal dynamics in the late stage of planet
formation and stability analyses of planetary systems. GENGA uses a hybrid
symplectic integrator to handle close encounters with very good energy
conservation, which is essential in long-term planetary system integration. We
extended the second order hybrid integration scheme to higher orders. The GENGA
code supports three simulation modes: Integration of up to 2048 massive bodies,
integration with up to a million test particles, or parallel integration of a
large number of individual planetary systems. We compare the results of GENGA
to Mercury and pkdgrav2 with respect to energy conservation and performance, and
find that the energy conservation of GENGA is comparable to Mercury and around
two orders of magnitude better than pkdgrav2. GENGA runs up to 30 times faster
than Mercury and up to eight times faster than pkdgrav2. GENGA is written in
CUDA C and runs on all NVIDIA GPUs with compute capability of at least 2.0. Comment: Accepted by ApJ. 18 pages, 17 figures, 4 tables
Mixing multi-core CPUs and GPUs for scientific simulation software
Recent technological and economic developments have led to widespread availability of
multi-core CPUs and specialist accelerator processors such as graphical processing units
(GPUs). The accelerated computational performance possible from these devices can be very
high for some application paradigms. Software languages and systems such as NVIDIA's
CUDA and Khronos consortium's open compute language (OpenCL) support a number of
individual parallel application programming paradigms. To scale up the performance of some
complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and
very many core GPUs for data parallelism is necessary. We describe our use of hybrid
applications using threading approaches and multi-core CPUs to control independent GPU devices.
We present speed-up data and discuss multi-threading software issues for the applications
level programmer and offer some suggested areas for language development and integration
between coarse-grained and fine-grained multi-thread systems. We discuss results from three
common simulation algorithmic areas including: partial differential equations; graph cluster
metric calculations and random number generation. We report on programming experiences
and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs;
a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and
trends in multi-core programming for scientific applications developers.
GPU-accelerated algorithms for many-particle continuous-time quantum walks
Many-particle continuous-time quantum walks (CTQWs) represent a resource for several tasks in quantum technology, including quantum search algorithms and universal quantum computation. In order to design and implement CTQWs in a realistic scenario, one needs effective simulation tools for Hamiltonians that take into account static noise and fluctuations in the lattice, i.e. Hamiltonians containing stochastic terms. To this aim, we suggest a parallel algorithm based on the Taylor series expansion of the evolution operator, and compare its performance with that of algorithms based on the exact diagonalization of the Hamiltonian or a 4th-order Runge–Kutta integration. We prove that both the Taylor-series expansion and Runge–Kutta algorithms are reliable and have a low computational cost, with the Taylor-series expansion showing the additional advantage of a memory allocation that does not depend on the precision of the calculation. Both algorithms are also highly parallelizable within the SIMT paradigm, and are thus suitable for GPGPU computing. We have benchmarked 4 NVIDIA GPUs and 3 quad-core Intel CPUs for a 2-particle system over lattices of increasing dimension, showing that the speedup provided by GPU computing, with respect to the OpenMP parallelization, lies in the range between 8x and (more than) 20x, depending on the frequency of post-processing. GPU-accelerated codes thus allow one to overcome concerns about the execution time, and make simulations with many interacting particles on large lattices possible, limited only by the memory available on the device.
Program summary
Program Title: cuQuWa
Licensing provisions: GNU General Public License, version 3
Program Files doi: http://dx.doi.org/10.17632/vjpnjgycdj.1
Programming language: CUDA C
Nature of problem: Evolution of many-particle continuous-time quantum-walks on a multidimensional grid in a noisy environment. The submitted code is specialized for the simulation of 2-particle quantum-walks with periodic boundary conditions.
Solution method: Taylor-series expansion of the evolution operator. The density-matrix is calculated by averaging multiple independent realizations of the system.
External routines: cuBLAS, cuRAND
Unusual features: Simulations are run exclusively on the graphic processing unit within the CUDA environment. An undocumented misbehavior in the random-number generation routine (cuRAND package) can corrupt the simulation of large systems, though no problems are reported for small and medium-size systems. Compiling the code with the -arch=sm_30 flag for compute capability 3.5 and above fixes this issue.
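The Taylor-series solution method named in the abstract above can be illustrated with a small sketch. It is not the cuQuWa code: it is a serial, pure-Python example for a single state vector, assuming units with ħ = 1, a time-independent Hamiltonian stored as a dense list of lists, and hypothetical names (`taylor_step`, `matvec`, `norm`). Each term of the series is built from the previous one, so only two work vectors are needed regardless of the truncation order.

```python
import math

def matvec(H, v):
    """Dense matrix-vector product for list-of-lists H and list v."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def taylor_step(H, psi, dt, order=12):
    """Advance psi by dt using exp(-i H dt) ~ sum_k (-i H dt)^k / k!.

    The k-th term is obtained from the (k-1)-th by one matrix-vector
    product, so memory use does not grow with the truncation order.
    """
    term = list(psi)
    out = list(psi)
    for k in range(1, order + 1):
        term = [(-1j * dt / k) * x for x in matvec(H, term)]
        out = [a + b for a, b in zip(out, term)]
    return out

def norm(psi):
    """Euclidean norm of the state vector (should stay close to 1)."""
    return math.sqrt(sum(abs(x) ** 2 for x in psi))

if __name__ == "__main__":
    # Two-site hopping Hamiltonian; the exact solution is (cos t, i sin t).
    H = [[0.0, -1.0], [-1.0, 0.0]]
    psi = [1.0 + 0j, 0.0 + 0j]
    for _ in range(10):
        psi = taylor_step(H, psi, dt=0.1)
    print(psi, norm(psi))
```

On a GPU the matrix-vector product would be the parallelized kernel; here it is kept as a plain loop so the structure of the expansion is visible.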
Extremely large scale simulation of a Kardar-Parisi-Zhang model using graphics cards
The octahedron model introduced recently has been implemented on graphics
cards, which permits extremely large scale simulations via binary lattice gases
and bit coded algorithms. We confirm scaling behaviour belonging to the 2d
Kardar-Parisi-Zhang universality class and find a surface growth exponent:
beta=0.2415(15) on 2^17 x 2^17 systems, ruling out beta=1/4 suggested by field
theory. The maximum speed-up with respect to a single CPU is 240. The steady
state has been analysed by finite size scaling and a roughness exponent
alpha=0.393(4) is found. Correction-to-scaling exponents are computed and the
power-spectrum density of the steady state is determined. We calculate the
universal scaling functions and cumulants, and show that the limit distribution can
be obtained with the sizes considered. We provide numerical fitting for the small
and large tail behaviour of the steady state scaling function of the interface
width. Comment: 7 pages, 8 figures, slightly modified, accepted version for PR
Petascale turbulence simulation using a highly parallel fast multipole method on GPUs
This paper reports large-scale direct numerical simulations of
homogeneous-isotropic fluid turbulence, achieving sustained performance of 1.08
petaflop/s on GPU hardware using single precision. The simulations use a vortex
particle method to solve the Navier-Stokes equations, with a highly parallel
fast multipole method (FMM) as numerical engine, and match the current record
in mesh size for this application, a cube of 4096^3 computational points solved
with a spectral method. The standard numerical approach used in this field is
the pseudo-spectral method, relying on the FFT algorithm as numerical engine.
The particle-based simulations presented in this paper quantitatively match the
kinetic energy spectrum obtained with a pseudo-spectral method, using a trusted
code. In terms of parallel performance, weak scaling results show the FMM-based
vortex method achieving 74% parallel efficiency on 4096 processes (one GPU per
MPI process, 3 GPUs per node of the TSUBAME-2.0 system). The FFT-based spectral
method is able to achieve just 14% parallel efficiency on the same number of
MPI processes (using only CPU cores), due to the all-to-all communication
pattern of the FFT algorithm. The calculation time for one time step was 108
seconds for the vortex method and 154 seconds for the spectral method, under
these conditions. Computing with 69 billion particles, this work exceeds by an
order of magnitude the largest vortex method calculations to date.
Efficient Algorithms And Optimizations For Scientific Computing On Many-Core Processors
Designing efficient algorithms for many-core and multicore architectures requires using different strategies to allow for the best exploitation of the hardware resources on those architectures. Researchers have ported many scientific applications to modern many-core and multicore parallel architectures, and by doing so they have achieved significant speedups over running on single CPU cores. While many applications have achieved significant speedups, some applications still require more effort to accelerate due to their inherently serial behavior.
One class of applications with this serial behavior is Monte Carlo simulation. Monte Carlo simulations have been used to study many problems in statistical physics and statistical mechanics that could not be simulated using Molecular Dynamics. While there are a fair number of well-known and recognized GPU Molecular Dynamics codes, the existing Monte Carlo ensemble simulations have not been ported to the GPU, so they are relatively slow and cannot run large systems in a reasonable amount of time. Because of these shortcomings, and because researchers want a fast Monte Carlo simulation framework that can simulate large systems, a new GPU framework called GOMC is implemented to simulate different particle and molecular-based force fields and ensembles. GOMC simulates different Monte Carlo ensembles such as the canonical, grand canonical, and Gibbs ensembles. This work describes many challenges in developing a GPU Monte Carlo code for such ensembles and how I addressed these challenges.
This work also describes efficient many-core and multicore large-scale energy calculations for the Monte Carlo Gibbs ensemble using cell lists. Designing Monte Carlo molecular simulations is challenging as they have less computation and parallelism when compared to similar molecular dynamics applications. The modified cell list allows for more speedup gains for energy calculations on both many-core and multicore architectures when compared to other implementations without using the conventional cell lists. The work presents results and analysis of the cell list algorithms for each of the parallel architectures using top-of-the-line GPUs, CPUs, and Intel’s Phi coprocessors. In addition, the work evaluates the performance of the cell list algorithms for different problem sizes and different radial cutoffs.
In addition, this work evaluates two cell list approaches, a hybrid MPI+OpenMP approach and a hybrid MPI+CUDA approach. The cell list methods are evaluated on a small cluster of multicore CPUs, Intel Phi coprocessors, and GPUs. The performance results are evaluated using different combinations of MPI processes, threads, and problem sizes.
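For readers unfamiliar with cell lists, the sketch below shows the conventional technique that the abstract above takes as its baseline. It is not the modified cell list of the dissertation and not GOMC code. It assumes a cubic periodic box, positions in [0, box), and at least three cells per edge, and the function names are hypothetical. The brute-force function is included only to check the result.

```python
import itertools
import random

def min_image_dist2(a, b, box):
    """Squared distance under the minimum-image convention in a cubic box."""
    d2 = 0.0
    for x, y in zip(a, b):
        d = x - y
        d -= box * round(d / box)
        d2 += d * d
    return d2

def pairs_cell_list(pos, box, rc):
    """Return all pairs (i, j), i < j, closer than rc, using a cell list."""
    n = int(box // rc)              # cells per edge, each at least rc wide
    assert n >= 3, "need at least 3 cells per edge to avoid double images"
    cells = {}
    for idx, p in enumerate(pos):
        key = tuple(int(c / box * n) % n for c in p)
        cells.setdefault(key, []).append(idx)
    pairs = set()
    for key, members in cells.items():
        # Only the 27 neighbouring cells (including the cell itself) are scanned.
        for off in itertools.product((-1, 0, 1), repeat=3):
            nb = tuple((k + o) % n for k, o in zip(key, off))
            for i in members:
                for j in cells.get(nb, ()):
                    if i < j and min_image_dist2(pos[i], pos[j], box) < rc * rc:
                        pairs.add((i, j))
    return pairs

def pairs_brute_force(pos, box, rc):
    """Reference O(N^2) pair search used to validate the cell list."""
    return {(i, j) for i, j in itertools.combinations(range(len(pos)), 2)
            if min_image_dist2(pos[i], pos[j], box) < rc * rc}

if __name__ == "__main__":
    rng = random.Random(1)
    pos = [[rng.uniform(0.0, 10.0) for _ in range(3)] for _ in range(200)]
    print(len(pairs_cell_list(pos, box=10.0, rc=2.5)))
```

The cost of the cell-list search grows roughly linearly with the number of particles at fixed density, whereas the brute-force search grows quadratically, which is why the choice of radial cutoff and problem size matters for the speedups reported.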
Another application presented in this dissertation involves the understanding of the properties of crystalline materials, and their design and control. Recent developments include the introduction of new models to simulate system behavior and properties that are of large experimental and theoretical interest. One of those models is the Phase-Field Crystal (PFC) model. The PFC model has enabled researchers to simulate 2D and 3D crystal structures and study defects such as dislocations and grain boundaries. In this work, GPUs are used to accelerate the computation of various dynamic properties of polycrystals in the 2D PFC model. Some properties require very intensive computation that may involve hundreds of thousands of atoms. The GPU implementation has achieved significant speedups of more than 46 times for some large-system simulations.