Performance analysis of parallel gravitational N-body codes on large GPU clusters
We compare the performance of two very different parallel gravitational
N-body codes for astrophysical simulations on large GPU clusters, NBODY6++ and
Bonsai, both pioneers in their own fields that also overlap at certain mutual
scales. We benchmark the two codes by analyzing their performance, accuracy,
and efficiency through the modeling of structure decomposition and through
timing measurements. We find that both codes are heavily optimized to leverage
the computational potential of GPUs, as their performance approaches half of
the maximum single-precision performance of the underlying GPU cards. With
such performance, we predict that a substantial speed-up can be achieved when
up to 1k processors and GPUs are employed simultaneously. We discuss
quantitative comparisons of the two codes, finding that in the same cases
Bonsai adopts larger time steps, and correspondingly larger relative energy
errors, than NBODY6++, by factors that depend on the chosen parameters of the
codes. While the two codes are built for different astrophysical applications,
under specified conditions they may overlap in performance at certain physical
scales, allowing the user to choose either one with fine-tuned parameters
accordingly. Comment: 15 pages, 7 figures, 3 tables, accepted for publication
in Research in Astronomy and Astrophysics (RAA)
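As an illustration of the kind of extrapolation involved, here is a minimal sketch of an Amdahl-style strong-scaling model of the sort used to predict speed-ups at large processor counts; the serial fraction and communication term are illustrative assumptions, not parameters measured in the paper.

```python
import math

# Hedged sketch: a toy strong-scaling model. The serial fraction `f` and the
# log2(p) communication term are illustrative assumptions, not values taken
# from the NBODY6++/Bonsai benchmarks.
def predicted_speedup(p, f=0.01, comm_coeff=1e-4):
    """Speed-up on p processors under Amdahl's law plus a crude
    communication overhead that grows as log2(p)."""
    return 1.0 / (f + (1.0 - f) / p + comm_coeff * math.log2(p))

for p in (64, 256, 1024):
    print(f"{p:5d} processors -> predicted speed-up {predicted_speedup(p):6.1f}x")
```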
A Parallel Monte Carlo Code for Simulating Collisional N-body Systems
We present a new parallel code for computing the dynamical evolution of
collisional N-body systems with up to N~10^7 particles. Our code is based on
the Hénon Monte Carlo method for solving the Fokker-Planck equation, and
makes assumptions of spherical symmetry and dynamical equilibrium. The
principal algorithmic developments involve optimized data structures, the
introduction of a parallel random number generation scheme, and a parallel
sorting algorithm required to find nearest neighbors for interactions
and to compute the gravitational potential. The new algorithms we introduce
along with our choice of decomposition scheme minimize communication costs and
ensure optimal distribution of data and workload among the processing units.
The implementation uses the Message Passing Interface (MPI) library for
communication, which makes it portable to many different supercomputing
architectures. We validate the code by calculating the evolution of clusters
with initial Plummer distribution functions up to core collapse with the number
of stars, N, spanning three orders of magnitude, from 10^5 to 10^7. We find
that our results are in good agreement with self-similar core-collapse
solutions, and the core collapse times generally agree with expectations from
the literature. We also observe good total energy conservation, to within
0.04% throughout all simulations. We analyze the performance of the code,
and demonstrate near-linear scaling of the runtime with the number of
processors up to 64 processors for N=10^5, 128 for N=10^6 and 256 for N=10^7.
The runtime saturates as more processors are added beyond these limits, a
characteristic of the parallel sorting algorithm. The
resulting maximum speedups we achieve are approximately 60x, 100x, and 220x,
respectively. Comment: 53 pages, 13 figures, accepted for publication in ApJ Supplement
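As a concrete illustration of the initial conditions used in the validation runs, here is a minimal sketch of drawing positions from a Plummer distribution by inverting its cumulative mass profile; this is the standard textbook recipe, not the paper's actual initialization code, and the variable names are illustrative.

```python
import math, random

def plummer_positions(n, a=1.0, seed=42):
    """Draw n positions from a Plummer sphere of scale radius a by inverting
    the cumulative mass fraction M(<r)/M = r^3 / (r^2 + a^2)^(3/2).
    Standard textbook recipe; not code from the paper."""
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        x = max(rng.random(), 1e-10)     # enclosed-mass fraction in (0, 1)
        r = a / math.sqrt(x ** (-2.0 / 3.0) - 1.0)
        cos_t = rng.uniform(-1.0, 1.0)   # isotropic direction on the sphere
        sin_t = math.sqrt(1.0 - cos_t * cos_t)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        pts.append((r * sin_t * math.cos(phi),
                    r * sin_t * math.sin(phi),
                    r * cos_t))
    return pts
```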
4.45 Pflops Astrophysical N-Body Simulation on K computer -- The Gravitational Trillion-Body Problem
As an entry for the 2012 Gordon-Bell performance prize, we report performance
results of astrophysical N-body simulations of one trillion particles performed
on the full system of K computer. This is the first gravitational trillion-body
simulation in the world. We describe the scientific motivation, the numerical
algorithm, the parallelization strategy, and the performance analysis. Unlike
many previous Gordon-Bell prize winners that used the tree algorithm for
astrophysical N-body simulations, we used the hybrid TreePM method, in which
the short-range force is calculated by the tree algorithm and the long-range
force is solved by the particle-mesh algorithm, for a similar level of accuracy.
We developed a highly-tuned gravity kernel for short-range forces, and a novel
communication algorithm for long-range forces. The average performance on 24576
and 82944 nodes of K computer is 1.53 and 4.45 Pflops, respectively,
corresponding to 49% and 42% of the peak speed. Comment: 10 pages, 6 figures,
Proceedings of Supercomputing 2012 (http://sc12.supercomputing.org/), Gordon
Bell Prize Winner. Additional information is available at
http://www.ccs.tsukuba.ac.jp/CCS/eng/gbp201
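To make the hybrid force decomposition concrete, here is a minimal sketch of the standard erfc-based short-range/long-range splitting commonly used in TreePM codes; the splitting scale r_s is an illustrative parameter, and the actual cutoff function used in the paper may differ.

```python
import math

def split_pair_force(r, m=1.0, G=1.0, r_s=0.1):
    """Split the 1/r^2 pair force into a short-range part (computed by the
    tree walk) and a long-range part (computed by the particle-mesh solver)
    using the common Gaussian/erfc splitting. Illustrative sketch only."""
    x = r / (2.0 * r_s)
    total = G * m / r ** 2
    short = total * (math.erfc(x) + (2.0 * x / math.sqrt(math.pi)) * math.exp(-x * x))
    return short, total - short

# At r << r_s the tree part carries the force; at r >> r_s the mesh part does.
for r in (0.01, 0.1, 1.0):
    s, l = split_pair_force(r)
    print(f"r = {r:5.2f}: short = {s:10.3e}, long = {l:10.3e}")
```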
Improving the scalability of parallel N-body applications with an event driven constraint based execution model
The scalability and efficiency of graph applications are significantly
constrained by conventional systems and their supporting programming models.
Technology trends like multicore, manycore, and heterogeneous system
architectures are introducing further challenges and possibilities for emerging
application domains such as graph applications. This paper explores the space
of effective parallel execution of ephemeral graphs that are dynamically
generated using the Barnes-Hut algorithm to exemplify dynamic workloads. The
workloads are expressed using the semantics of an Exascale computing execution
model called ParalleX. For comparison, results using conventional execution
model semantics are also presented. We find that the advanced semantics for
Exascale computing improve load balancing at runtime and enable automatic
parallelism discovery, improving efficiency. Comment: 11 figures
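For readers unfamiliar with the workload being generated, here is a minimal sketch of the Barnes-Hut opening-angle criterion that drives the tree walk; the node layout and parameter names are illustrative, not ParalleX semantics.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    com: tuple                  # (x, y, z) center of mass
    mass: float
    size: float                 # side length of the node's bounding cube
    children: list = field(default_factory=list)

def accel(node, pos, theta=0.5, G=1.0, eps=1e-3):
    """Barnes-Hut walk: use a node's monopole when it subtends an angle
    below the opening angle theta (size/distance < theta); otherwise open
    it and recurse into its children. Illustrative sketch only."""
    dx = [c - p for c, p in zip(node.com, pos)]
    r = math.sqrt(sum(d * d for d in dx) + eps * eps)
    if not node.children or node.size < theta * r:
        f = G * node.mass / r ** 3
        return [f * d for d in dx]
    total = [0.0, 0.0, 0.0]
    for child in node.children:
        for k, a in enumerate(accel(child, pos, theta, G, eps)):
            total[k] += a
    return total
```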
GreeM : Massively Parallel TreePM Code for Large Cosmological N-body Simulations
In this paper, we describe the implementation and performance of GreeM, a
massively parallel TreePM code for large-scale cosmological N-body simulations.
GreeM uses a recursive multi-section algorithm for domain decomposition. The
sizes of the domains are adjusted so that the total force-calculation time
becomes the same for all processes. The loss of performance due to
non-optimal load balancing is around 4%, even for more than 10^3 CPU cores.
GreeM runs efficiently on PC clusters and massively parallel computers such as
the Cray XT4. The measured calculation speed on the Cray XT4 is 5 × 10^4
particles per second per CPU core for an opening angle of θ = 0.5, provided
the number of particles per CPU core is larger than 10^6. Comment: 13 pages,
11 figures, accepted by PASJ
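A minimal sketch of the load-balancing idea, assuming per-slab force-calculation costs have already been measured along one axis; GreeM's actual recursive multi-section algorithm applies this in three dimensions, and the function name is illustrative.

```python
def balanced_cuts(costs, n_domains):
    """Choose cut positions along one axis so that each domain receives
    roughly the same total measured force-calculation cost. This mimics a
    single level of a recursive multi-section decomposition (sketch only)."""
    total = sum(costs)
    target = total / n_domains
    cuts, running, want = [], 0.0, target
    for i, c in enumerate(costs):
        running += c
        if running >= want and len(cuts) < n_domains - 1:
            cuts.append(i + 1)      # slab index where the next domain starts
            want += target
    return cuts

# Example: slabs near the center cost more; cuts shift to keep work even.
print(balanced_cuts([1, 1, 2, 4, 8, 8, 4, 2, 1, 1], 4))
```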
Sapporo2: A versatile direct N-body library
Astrophysical direct N-body methods have been one of the first production
algorithms to be implemented using NVIDIA's CUDA architecture. Now, almost
seven years later, the GPU is the most used accelerator device in astronomy for
simulating stellar systems. In this paper we present the implementation of the
Sapporo2 N-body library, which allows researchers to use the GPU for N-body
simulations with little to no effort. The first version, released five years
ago, is actively used, but lacks advanced features and versatility in numerical
precision and support for higher order integrators. In this updated version we
have rebuilt the code from scratch and added support for OpenCL,
multi-precision and higher order integrators. We show how to tune these codes
for different GPU architectures and how to continue utilizing the GPU
optimally even when only a small number of particles is integrated.
This careful tuning allows Sapporo2 to be faster than Sapporo1 even with the
added options and double precision data loads. The code runs on a range of
NVIDIA and AMD GPUs in single and double precision accuracy. With the addition
of OpenCL support the library is also able to run on CPUs and other
accelerators that support OpenCL. Comment: 15 pages, 7 figures. Accepted for
publication in Computational Astrophysics and Cosmology
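As context for what the library accelerates, here is a minimal pure-Python sketch of the O(N^2) direct-summation kernel that GPU N-body libraries such as Sapporo2 offload; the softening parameter and function name are illustrative, and a production kernel would be vectorized on the device.

```python
import math

def direct_accelerations(pos, mass, G=1.0, eps=1e-4):
    """All-pairs gravitational accelerations with Plummer softening eps.
    Reference sketch of the O(N^2) kernel that GPU libraries offload;
    not Sapporo2's actual implementation."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = dx[0] ** 2 + dx[1] ** 2 + dx[2] ** 2 + eps * eps
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * mass[j] * dx[k] * inv_r3
    return acc

# Two equal masses attract each other with equal and opposite acceleration.
print(direct_accelerations([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0]))
```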