242,632 research outputs found
Performance analysis of parallel gravitational -body codes on large GPU cluster
We compare the performance of two very different parallel gravitational
-body codes for astrophysical simulations on large GPU clusters, both
pioneer in their own fields as well as in certain mutual scales - NBODY6++ and
Bonsai. We carry out the benchmark of the two codes by analyzing their
performance, accuracy and efficiency through the modeling of structure
decomposition and timing measurements. We find that both codes are heavily
optimized to leverage the computational potential of GPUs as their performance
has approached half of the maximum single precision performance of the
underlying GPU cards. With such performance we predict that a speed-up of
can be achieved when up to 1k processors and GPUs are employed
simultaneously. We discuss the quantitative information about comparisons of
two codes, finding that in the same cases Bonsai adopts larger time steps as
well as relative energy errors than NBODY6++, typically ranging from
times larger, depending on the chosen parameters of the codes. While the two
codes are built for different astrophysical applications, in specified
conditions they may overlap in performance at certain physical scale, and thus
allowing the user to choose from either one with finetuned parameters
accordingly.Comment: 15 pages, 7 figures, 3 tables, accepted for publication in Research
in Astronomy and Astrophysics (RAA
Performance analysis of parallel gravitational N-body codes on large GPU clusters
We compare the performance of two very different parallel gravitational N-body codes for astrophysical simulations on large Graphics Processing Unit (GPU) clusters, both of which are pioneers in their own fields as well as on certain mutual scales - NBODY6++ and Bonsai. We carry out benchmarks of the two codes by analyzing their performance, accuracy and efficiency through the modeling of structure decomposition and timing measurements. We find that both codes are heavily optimized to leverage the computational potential of GPUs as their performance has approached half of the maximum single precision performance of the underlying GPU cards. With such performance we predict that a speed-up of 200 - 300 can be achieved when up to 1k processors and GPUs are employed simultaneously. We discuss the quantitative information about comparisons of the two codes, finding that in the same cases Bonsai adopts larger time steps as well as larger relative energy errors than NBODY6++, typically ranging from 10 - 50 times larger, depending on the chosen parameters of the codes. Although the two codes are built for different astrophysical applications, in specified conditions they may overlap in performance at certain physical scales, thus allowing the user to choose either one by fine-tuning parameters accordingly
Performance Analysis of Direct N-Body Algorithms on Special-Purpose Supercomputers
Direct-summation N-body algorithms compute the gravitational interaction
between stars in an exact way and have a computational complexity of O(N^2).
Performance can be greatly enhanced via the use of special-purpose accelerator
boards like the GRAPE-6A. However the memory of the GRAPE boards is limited.
Here, we present a performance analysis of direct N-body codes on two parallel
supercomputers that incorporate special-purpose boards, allowing as many as
four million particles to be integrated. Both computers employ high-speed,
Infiniband interconnects to minimize communication overhead, which can
otherwise become significant due to the small number of "active" particles at
each time step. We find that the computation time scales well with processor
number; for 2*10^6 particles, efficiencies greater than 50% and speeds in
excess of 2 TFlops are reached.Comment: 34 pages, 15 figures, submitted to New Astronom
Performance analysis of direct N-body algorithms for astrophysical simulations on distributed systems
We discuss the performance of direct summation codes used in the simulation
of astrophysical stellar systems on highly distributed architectures. These
codes compute the gravitational interaction among stars in an exact way and
have an O(N^2) scaling with the number of particles. They can be applied to a
variety of astrophysical problems, like the evolution of star clusters, the
dynamics of black holes, the formation of planetary systems, and cosmological
simulations. The simulation of realistic star clusters with sufficiently high
accuracy cannot be performed on a single workstation but may be possible on
parallel computers or grids. We have implemented two parallel schemes for a
direct N-body code and we study their performance on general purpose parallel
computers and large computational grids. We present the results of timing
analyzes conducted on the different architectures and compare them with the
predictions from theoretical models. We conclude that the simulation of star
clusters with up to a million particles will be possible on large distributed
computers in the next decade. Simulating entire galaxies however will in
addition require new hybrid methods to speedup the calculation.Comment: 22 pages, 8 figures, accepted for publication in Parallel Computin
A pilgrimage to gravity on GPUs
In this short review we present the developments over the last 5 decades that
have led to the use of Graphics Processing Units (GPUs) for astrophysical
simulations. Since the introduction of NVIDIA's Compute Unified Device
Architecture (CUDA) in 2007 the GPU has become a valuable tool for N-body
simulations and is so popular these days that almost all papers about high
precision N-body simulations use methods that are accelerated by GPUs. With the
GPU hardware becoming more advanced and being used for more advanced algorithms
like gravitational tree-codes we see a bright future for GPU like hardware in
computational astrophysics.Comment: To appear in: European Physical Journal "Special Topics" : "Computer
Simulations on Graphics Processing Units" . 18 pages, 8 figure
A sparse octree gravitational N-body code that runs entirely on the GPU processor
We present parallel algorithms for constructing and traversing sparse octrees
on graphics processing units (GPUs). The algorithms are based on parallel-scan
and sort methods. To test the performance and feasibility, we implemented them
in CUDA in the form of a gravitational tree-code which completely runs on the
GPU.(The code is publicly available at:
http://castle.strw.leidenuniv.nl/software.html) The tree construction and
traverse algorithms are portable to many-core devices which have support for
CUDA or OpenCL programming languages. The gravitational tree-code outperforms
tuned CPU code during the tree-construction and shows a performance improvement
of more than a factor 20 overall, resulting in a processing rate of more than
2.8 million particles per second.Comment: Accepted version. Published in Journal of Computational Physics. 35
pages, 12 figures, single colum
Recommended from our members
Leveraging legacy codes to distributed problem solving environments: A web service approach
This paper describes techniques used to leverage high performance legacy codes as CORBA components to a distributed problem solving environment. It first briefly introduces the software architecture adopted by the environment. Then it presents a CORBA oriented wrapper generator (COWG) which can be used to automatically wrap high performance legacy codes as CORBA components. Two legacy codes have been wrapped with COWG. One is an MPI-based molecular dynamic simulation (MDS) code, the other is a finite element based computational fluid dynamics (CFD) code for simulating incompressible Navier-Stokes flows. Performance comparisons between runs of the MDS CORBA component and the original MDS legacy code on a cluster of workstations and on a parallel computer are also presented. Wrapped as CORBA components, these legacy codes can be reused in a distributed computing environment. The first case shows that high performance can be maintained with the wrapped MDS component. The second case shows that a Web user can submit a task to the wrapped CFD component through a Web page without knowing the exact implementation of the component. In this way, a userâs desktop computing environment can be extended to a high performance computing environment using a cluster of workstations or a parallel computer
- âŠ