GreeM : Massively Parallel TreePM Code for Large Cosmological N-body Simulations
In this paper, we describe the implementation and performance of GreeM, a
massively parallel TreePM code for large-scale cosmological N-body simulations.
GreeM uses a recursive multi-section algorithm for domain decomposition. The
sizes of the domains are adjusted so that the total calculation time of the
force becomes the same for all processes. The loss of performance due to
non-optimal load balancing is around 4%, even for more than 10^3 CPU cores.
GreeM runs efficiently on PC clusters and massively-parallel computers such as
a Cray XT4. The measured calculation speed on Cray XT4 is 5 \times 10^4
particles per second per CPU core, for the case of an opening angle of
\theta=0.5, if the number of particles per CPU core is larger than 10^6.
Comment: 13 pages, 11 figures, accepted by PAS
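The cost-balanced decomposition described in the GreeM abstract above can be illustrated with a small sketch. This is not GreeM's implementation: the function names, the `divisions` tuple, and the per-particle `cost` weight (a stand-in for measured force-calculation time) are all assumptions made for illustration. The sketch sorts particles along one axis, cuts them into sections of roughly equal total cost, and recurses into each section along the next axis.

```python
def split_equal_cost(items, n_sections, axis):
    """Cut items into n_sections runs of roughly equal total cost along one axis.

    Each item is a (position_tuple, cost) pair; `cost` is a hypothetical
    per-particle weight standing in for measured calculation time.
    """
    ordered = sorted(items, key=lambda item: item[0][axis])
    sections = [[] for _ in range(n_sections)]
    total = sum(cost for _, cost in ordered)
    if total <= 0:
        # Nothing to balance: keep everything in the first section.
        sections[0] = ordered
        return sections
    running = 0.0
    for position, cost in ordered:
        # Assign the item to the section containing the midpoint of its cost.
        k = int((running + 0.5 * cost) / total * n_sections)
        sections[min(k, n_sections - 1)].append((position, cost))
        running += cost
    return sections


def multi_section(items, divisions, axis=0):
    """Recursively split along axis 0, 1, ... with divisions[axis] sections each."""
    if axis == len(divisions):
        return [items]
    domains = []
    for section in split_equal_cost(items, divisions[axis], axis):
        domains.extend(multi_section(section, divisions, axis + 1))
    return domains
```

Calling `multi_section(items, (2, 2))` on a two-dimensional particle list yields four domains. A production code would estimate the boundaries from a sample of particles and update them as the measured cost changes, rather than sorting every particle.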
A hybrid MPI-OpenMP scheme for scalable parallel pseudospectral computations for fluid turbulence
A hybrid scheme that utilizes MPI for distributed memory parallelism and
OpenMP for shared memory parallelism is presented. The work is motivated by the
desire to achieve exceptionally high Reynolds numbers in pseudospectral
computations of fluid turbulence on emerging petascale, high core-count,
massively parallel processing systems. The hybrid implementation derives from
and augments a well-tested scalable MPI-parallelized pseudospectral code. The
hybrid paradigm leads to a new picture for the domain decomposition of the
pseudospectral grids, which is helpful in understanding, among other things,
the 3D transpose of the global data that is necessary for the parallel fast
Fourier transforms that are the central component of the numerical
discretizations. Details of the hybrid implementation are provided, and
performance tests illustrate the utility of the method. It is shown that the
hybrid scheme achieves near ideal scalability up to ~20000 compute cores with a
maximum mean efficiency of 83%. Data are presented that demonstrate how to
choose the optimal number of MPI processes and OpenMP threads in order to
optimize code performance on two different platforms.
Comment: Submitted to Parallel Computin
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS
GROMACS is a widely used package for biomolecular simulation, and over the
last two decades it has evolved from small-scale efficiency to advanced
heterogeneous acceleration and multi-level parallelism targeting some of the
largest supercomputers in the world. Here, we describe some of the ways we have
been able to realize this through the use of parallelization on all levels,
combined with a constant focus on absolute performance. Release 4.6 of GROMACS
uses SIMD acceleration on a wide range of architectures, GPU offloading
acceleration, and both OpenMP and MPI parallelism within and between nodes,
respectively. The recent work on acceleration made it necessary to revisit the
fundamental algorithms of molecular simulation, including the concept of
neighbor searching, and we discuss the present and future challenges we see for
exascale simulation - in particular a very fine-grained task parallelism. We
also discuss the software management, code peer review and continuous
integration testing required for a project of this complexity.
Comment: EASC 2014 conference proceedin
Solving the Klein-Gordon equation using Fourier spectral methods: A benchmark test for computer performance
The cubic Klein-Gordon equation is a simple but non-trivial partial
differential equation whose numerical solution has the main building blocks
required for the solution of many other partial differential equations. In this
study, the library 2DECOMP&FFT is used in a Fourier spectral scheme to solve
the Klein-Gordon equation and strong scaling of the code is examined on
thirteen different machines for a problem size of 512^3. The results are useful
in assessing likely performance of other parallel fast Fourier transform based
programs for solving partial differential equations. The problem is chosen to
be large enough to solve on a workstation, yet also of interest to solve
quickly on a supercomputer, in particular for parametric studies. Unlike other
high performance computing benchmarks, for this problem size, the time to
solution will not be improved by simply building a bigger supercomputer.
Comment: 10 page
An efficient parallel immersed boundary algorithm using a pseudo-compressible fluid solver
We propose an efficient algorithm for the immersed boundary method on
distributed-memory architectures, with the computational complexity of a
completely explicit method and excellent parallel scaling. The algorithm
utilizes the pseudo-compressibility method recently proposed by Guermond and
Minev [Comptes Rendus Mathematique, 348:581-585, 2010] that uses a directional
splitting strategy to discretize the incompressible Navier-Stokes equations,
thereby reducing the linear systems to a series of one-dimensional tridiagonal
systems. We perform numerical simulations of several fluid-structure
interaction problems in two and three dimensions and study the accuracy and
convergence rates of the proposed algorithm. For these problems, we compare the
proposed algorithm against other second-order projection-based fluid solvers.
Lastly, the strong and weak scaling properties of the proposed algorithm are
investigated.
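The abstract above notes that directional splitting reduces the linear systems to one-dimensional tridiagonal systems. As a generic illustration of why such systems are cheap to solve, here is a minimal sketch of the Thomas algorithm, which solves one tridiagonal system in O(n) operations. It does not reproduce the Guermond and Minev scheme or the authors' solver; the function name and argument layout are assumptions, and the sketch performs no pivoting, so it assumes a well-conditioned (for example, diagonally dominant) matrix.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm (no pivoting).

    lower[i] multiplies x[i - 1] (lower[0] is unused),
    diag[i] multiplies x[i],
    upper[i] multiplies x[i + 1] (upper[-1] is unused).
    """
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n

    # Forward sweep: eliminate the sub-diagonal.
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x
```

In a directionally split scheme, a solver of this kind would be applied independently to every grid line along each coordinate direction, which is what makes the approach attractive on distributed-memory machines.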
- …