
    Scaling Hierarchical N-body Simulations on GPU Clusters

    Abstract — This paper focuses on the use of GPGPU-based clusters for hierarchical N-body simulations. Whereas the behavior of these hierarchical methods has been studied in the past on CPU-based architectures, we investigate key performance issues in the context of clusters of GPUs. These include kernel organization and efficiency, the balance between tree traversal and force computation work, grain size selection through the tuning of offloaded work request sizes, and the reduction of sequential bottlenecks. The effects of various application parameters are studied and experiments are conducted to quantify gains in performance. Our studies are carried out in the context of a production-quality parallel cosmological simulator called ChaNGa. We highlight the re-engineering of the application to make it more suitable for GPU-based environments. Finally, we present performance results from experiments on the NCSA Lincoln GPU cluster, including a note on GPU use in multistepped simulations.
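
    To picture the grain-size issue this abstract raises, the hedged CUDA sketch below treats an offloaded work request as a flat batch of particle-node (monopole) interactions, so the request size directly sets how much parallel work each host-to-device transfer carries. This is an illustrative assumption about what such a request might look like, not ChaNGa's actual data layout; the struct, kernel name, and softening parameter are hypothetical.

```cuda
// Hedged sketch only: a batched "work request" of particle-node (monopole)
// interactions offloaded to the GPU. The struct layout, names, and the use
// of atomics are illustrative assumptions, not ChaNGa internals.
#include <cuda_runtime.h>

struct Interaction {
    int    target;   // index of the particle receiving the force
    float4 node;     // tree-node centre of mass (x, y, z) and mass (w)
};

__global__ void work_request_kernel(const Interaction* __restrict__ list,
                                    int n_interactions,
                                    const float4* __restrict__ pos,
                                    float4* __restrict__ acc,   // per-particle accumulator
                                    float eps2)                 // softening squared (assumed)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n_interactions) return;

    Interaction it = list[k];
    float4 pi = pos[it.target];
    float dx = it.node.x - pi.x, dy = it.node.y - pi.y, dz = it.node.z - pi.z;
    float r2 = dx * dx + dy * dy + dz * dz + eps2;
    float inv_r = rsqrtf(r2);
    float w = it.node.w * inv_r * inv_r * inv_r;  // monopole: m / r^3

    // Several interactions in one request may target the same particle,
    // so the partial accelerations are accumulated atomically.
    atomicAdd(&acc[it.target].x, w * dx);
    atomicAdd(&acc[it.target].y, w * dy);
    atomicAdd(&acc[it.target].z, w * dz);
}
```

    Under this reading, the tuning knob is the number of interactions per transfer: larger requests amortize transfer latency and expose more parallelism, while smaller ones overlap better with the ongoing CPU tree walk.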

    A pilgrimage to gravity on GPUs

    In this short review we present the developments over the last 5 decades that have led to the use of Graphics Processing Units (GPUs) for astrophysical simulations. Since the introduction of NVIDIA's Compute Unified Device Architecture (CUDA) in 2007, the GPU has become a valuable tool for N-body simulations and is now so popular that almost all papers about high-precision N-body simulations use methods that are accelerated by GPUs. With the GPU hardware becoming more advanced and being used for more advanced algorithms, such as gravitational tree-codes, we see a bright future for GPU-like hardware in computational astrophysics. Comment: To appear in European Physical Journal Special Topics: "Computer Simulations on Graphics Processing Units". 18 pages, 8 figures.

    NBODY6++GPU: Ready for the gravitational million-body problem

    Accurate direct N-body simulations help to obtain detailed information about the dynamical evolution of star clusters. They also enable comparisons with analytical models and Fokker-Planck or Monte Carlo methods. NBODY6 is a well-known direct N-body code for star clusters, and NBODY6++ is the extended version designed for large-particle-number simulations on supercomputers. We present NBODY6++GPU, an optimized version of NBODY6++ with hybrid parallelization methods (MPI, GPU, OpenMP, and AVX/SSE) to accelerate large direct N-body simulations, and in particular to solve the million-body problem. We discuss the new features of the NBODY6++GPU code, benchmarks, as well as the first results from a simulation of a realistic globular cluster initially containing a million particles. For million-body simulations, NBODY6++GPU is 400-2000 times faster than NBODY6 with 320 CPU cores and 32 NVIDIA K20X GPUs. With this computing cluster specification, the simulations of million-body globular clusters including 5% primordial binaries require about an hour per half-mass crossing time. Comment: 13 pages, 9 figures, 3 tables.
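
    The computational core of any direct-summation code is the all-pairs force loop. The CUDA kernel below is a minimal, hedged sketch of softened direct summation with shared-memory tiling; it is not the NBODY6++GPU implementation (which combines Hermite integration with block time steps), and the kernel name, tile size, and Plummer softening parameter are illustrative assumptions.

```cuda
// Minimal sketch of a softened direct-summation force kernel with
// shared-memory tiling (illustrative only, not the NBODY6++GPU code;
// names, the tile size, and the softening are assumptions).
#include <cuda_runtime.h>

#define TILE 256  // assumed thread-block size; launch with blockDim.x == TILE

__global__ void direct_forces(const float4* __restrict__ pos,  // x, y, z, mass
                              float3* __restrict__ acc,
                              int n, float eps2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 pi = (i < n) ? pos[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float3 ai = make_float3(0.f, 0.f, 0.f);

    __shared__ float4 tile[TILE];
    for (int base = 0; base < n; base += TILE) {
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < n) ? pos[j] : make_float4(0.f, 0.f, 0.f, 0.f);
        __syncthreads();

        int count = min(TILE, n - base);
        for (int k = 0; k < count; ++k) {
            float4 pj = tile[k];
            float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
            float r2 = dx * dx + dy * dy + dz * dz + eps2;   // Plummer softening
            float inv_r = rsqrtf(r2);
            float w = pj.w * inv_r * inv_r * inv_r;          // m_j / r^3
            ai.x += w * dx; ai.y += w * dy; ai.z += w * dz;
        }
        __syncthreads();
    }
    if (i < n) acc[i] = ai;
}

// Launch sketch: direct_forces<<<(n + TILE - 1) / TILE, TILE>>>(d_pos, d_acc, n, eps2);
```

    With eps2 > 0 the self-interaction term contributes exactly zero, so no special-casing of j == i is needed in this simplified form.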

    Performance analysis of parallel gravitational N-body codes on large GPU clusters

    We compare the performance of two very different parallel gravitational N-body codes for astrophysical simulations on large GPU clusters, both pioneering in their own fields as well as at certain overlapping scales: NBODY6++ and Bonsai. We carry out the benchmark of the two codes by analyzing their performance, accuracy and efficiency through the modeling of structure decomposition and timing measurements. We find that both codes are heavily optimized to leverage the computational potential of GPUs, as their performance has approached half of the maximum single-precision performance of the underlying GPU cards. With such performance we predict that a speed-up of 200-300 can be achieved when up to 1k processors and GPUs are employed simultaneously. We present quantitative comparisons of the two codes, finding that in the same cases Bonsai adopts larger time steps as well as larger relative energy errors than NBODY6++, typically 10-50 times larger, depending on the chosen parameters of the codes. While the two codes are built for different astrophysical applications, under specified conditions they may overlap in performance at certain physical scales, allowing the user to choose either one with appropriately fine-tuned parameters. Comment: 15 pages, 7 figures, 3 tables, accepted for publication in Research in Astronomy and Astrophysics (RAA).

    A sparse octree gravitational N-body code that runs entirely on the GPU processor

    We present parallel algorithms for constructing and traversing sparse octrees on graphics processing units (GPUs). The algorithms are based on parallel-scan and sort methods. To test the performance and feasibility, we implemented them in CUDA in the form of a gravitational tree-code which runs completely on the GPU. (The code is publicly available at: http://castle.strw.leidenuniv.nl/software.html) The tree construction and traversal algorithms are portable to many-core devices which have support for the CUDA or OpenCL programming languages. The gravitational tree-code outperforms tuned CPU code during tree construction and shows a performance improvement of more than a factor of 20 overall, resulting in a processing rate of more than 2.8 million particles per second. Comment: Accepted version. Published in Journal of Computational Physics. 35 pages, 12 figures, single column.
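
    In the usual formulation of "parallel-scan and sort" tree construction, each particle first receives a space-filling-curve key, the keys are sorted on the device, and tree cells are then extracted by scanning runs of shared key prefixes. The fragment below is a hedged sketch of just the key-and-sort stage using Thrust; it is not the authors' published code, and the function names are illustrative.

```cuda
// Sketch of the Morton-key + sort stage of GPU octree construction
// (illustrative only; not the published tree-code's implementation).
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <cstdint>
#include <cmath>

// Spread the 10 low bits of v so they occupy every third bit position.
__host__ __device__ inline uint32_t expand_bits(uint32_t v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// 30-bit Morton key from coordinates normalized to [0,1).
__host__ __device__ inline uint32_t morton_key(float x, float y, float z)
{
    uint32_t xi = (uint32_t)fminf(fmaxf(x * 1024.0f, 0.0f), 1023.0f);
    uint32_t yi = (uint32_t)fminf(fmaxf(y * 1024.0f, 0.0f), 1023.0f);
    uint32_t zi = (uint32_t)fminf(fmaxf(z * 1024.0f, 0.0f), 1023.0f);
    return (expand_bits(xi) << 2) | (expand_bits(yi) << 1) | expand_bits(zi);
}

struct KeyOp {
    __host__ __device__ uint32_t operator()(const float4& p) const {
        return morton_key(p.x, p.y, p.z);
    }
};

// Compute keys for all particles and sort particle indices by key.
// After the sort, particles sharing a key prefix are contiguous, so
// octree cells can be extracted with parallel scans over the keys.
void build_key_order(const thrust::device_vector<float4>& pos,
                     thrust::device_vector<uint32_t>& keys,
                     thrust::device_vector<int>& order)
{
    keys.resize(pos.size());
    order.resize(pos.size());
    thrust::transform(pos.begin(), pos.end(), keys.begin(), KeyOp());
    thrust::sequence(order.begin(), order.end());
    thrust::sort_by_key(keys.begin(), keys.end(), order.begin());
}
```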

    Sapporo2: A versatile direct N-body library

    Astrophysical direct N-body methods have been one of the first production algorithms to be implemented using NVIDIA's CUDA architecture. Now, almost seven years later, the GPU is the most used accelerator device in astronomy for simulating stellar systems. In this paper we present the implementation of the Sapporo2 N-body library, which allows researchers to use the GPU for N-body simulations with little to no effort. The first version, released five years ago, is actively used, but lacks advanced features and versatility in numerical precision and support for higher-order integrators. In this updated version we have rebuilt the code from scratch and added support for OpenCL, multi-precision and higher-order integrators. We show how to tune these codes for different GPU architectures and how to continue utilizing the GPU optimally even when only a small number of particles (N < 100) is integrated. This careful tuning allows Sapporo2 to be faster than Sapporo1 even with the added options and double-precision data loads. The code runs on a range of NVIDIA and AMD GPUs in single and double precision accuracy. With the addition of OpenCL support the library is also able to run on CPUs and other accelerators that support OpenCL. Comment: 15 pages, 7 figures. Accepted for publication in Computational Astrophysics and Cosmology.
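
    One common way to keep a GPU busy at very small N is to parallelize over both the i- and j-loops, so extra thread blocks each handle a slice of the interaction sum and their partial forces are combined afterwards. The CUDA sketch below illustrates that idea; it is not Sapporo2's actual scheme, and the kernel name, slice size, and launch configuration are assumptions for illustration.

```cuda
// Hedged sketch of ij-parallel force evaluation for very small N:
// blockIdx.x picks the target particle i, blockIdx.y a slice of the
// j-sum, so even N < 100 yields enough blocks to occupy the GPU.
// (Not Sapporo2's actual scheme; names and sizes are assumptions.)
#include <cuda_runtime.h>

#define JSLICE 32   // assumed j-particles per block; launch with blockDim.x == JSLICE

__global__ void forces_small_n(const float4* __restrict__ pos,   // x, y, z, mass
                               float4* __restrict__ acc,         // zero-initialised accumulator
                               int n, float eps2)
{
    int i = blockIdx.x;                          // target particle
    int j = blockIdx.y * JSLICE + threadIdx.x;   // this block's j-slice
    if (i >= n) return;

    float4 pi = pos[i];
    float3 a  = make_float3(0.f, 0.f, 0.f);
    if (j < n && j != i) {
        float4 pj = pos[j];
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float r2 = dx * dx + dy * dy + dz * dz + eps2;
        float inv_r = rsqrtf(r2);
        float w = pj.w * inv_r * inv_r * inv_r;
        a = make_float3(w * dx, w * dy, w * dz);
    }

    // Reduce the contributions of this block's j-slice, then add the
    // partial force to the global accumulator for particle i.
    __shared__ float3 partial[JSLICE];
    partial[threadIdx.x] = a;
    __syncthreads();
    for (int s = JSLICE / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            partial[threadIdx.x].x += partial[threadIdx.x + s].x;
            partial[threadIdx.x].y += partial[threadIdx.x + s].y;
            partial[threadIdx.x].z += partial[threadIdx.x + s].z;
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        atomicAdd(&acc[i].x, partial[0].x);
        atomicAdd(&acc[i].y, partial[0].y);
        atomicAdd(&acc[i].z, partial[0].z);
    }
}

// Launch sketch: dim3 grid(n, (n + JSLICE - 1) / JSLICE);
//                forces_small_n<<<grid, JSLICE>>>(d_pos, d_acc, n, eps2);
```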

    Petascale turbulence simulation using a highly parallel fast multipole method on GPUs

    This paper reports large-scale direct numerical simulations of homogeneous-isotropic fluid turbulence, achieving sustained performance of 1.08 petaflop/s on GPU hardware using single precision. The simulations use a vortex particle method to solve the Navier-Stokes equations, with a highly parallel fast multipole method (FMM) as numerical engine, and match the current record in mesh size for this application, a cube of 4096^3 computational points solved with a spectral method. The standard numerical approach used in this field is the pseudo-spectral method, relying on the FFT algorithm as numerical engine. The particle-based simulations presented in this paper quantitatively match the kinetic energy spectrum obtained with a pseudo-spectral method, using a trusted code. In terms of parallel performance, weak scaling results show the FMM-based vortex method achieving 74% parallel efficiency on 4096 processes (one GPU per MPI process, 3 GPUs per node of the TSUBAME-2.0 system). The FFT-based spectral method is able to achieve just 14% parallel efficiency on the same number of MPI processes (using only CPU cores), due to the all-to-all communication pattern of the FFT algorithm. The calculation time for one time step was 108 seconds for the vortex method and 154 seconds for the spectral method, under these conditions. Computing with 69 billion particles, this work exceeds by an order of magnitude the largest vortex method calculations to date.
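
    In a vortex particle method the velocity at each particle is induced by the vorticity carried by all the others, which is exactly the N-body sum the FMM accelerates. The CUDA kernel below is a hedged sketch of only the direct (particle-particle) part of such an evaluation, using a simple algebraically smoothed Biot-Savart sum; the FMM far field, the vortex-stretching term, and the specific smoothing kernel used in the paper are omitted, and all names and the smoothing radius are illustrative assumptions.

```cuda
// Hedged sketch: direct (P2P) Biot-Savart velocity evaluation for vortex
// particles with simple algebraic smoothing. The paper's FMM far field,
// vortex stretching, and its particular smoothing kernel are not shown;
// names and the smoothing radius sigma2 are illustrative assumptions.
#include <cuda_runtime.h>

__global__ void biot_savart_p2p(const float4* __restrict__ x,      // position (x, y, z), w unused
                                const float4* __restrict__ alpha,  // vortex strength (x, y, z), w unused
                                float4* __restrict__ u,            // output velocity
                                int n, float sigma2)               // assumed smoothing radius squared
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    const float inv4pi = 0.0795774715f;   // 1 / (4 * pi)
    float4 xi = x[i];
    float3 ui = make_float3(0.f, 0.f, 0.f);

    for (int j = 0; j < n; ++j) {
        float4 xj = x[j], aj = alpha[j];
        float dx = xi.x - xj.x, dy = xi.y - xj.y, dz = xi.z - xj.z;
        float r2 = dx * dx + dy * dy + dz * dz + sigma2;   // algebraic smoothing
        float inv_r = rsqrtf(r2);
        float w = inv4pi * inv_r * inv_r * inv_r;
        // u_i -= (1/4pi) * (x_i - x_j) x alpha_j / (r^2 + sigma^2)^(3/2)
        ui.x -= w * (dy * aj.z - dz * aj.y);
        ui.y -= w * (dz * aj.x - dx * aj.z);
        ui.z -= w * (dx * aj.y - dy * aj.x);
    }
    u[i] = make_float4(ui.x, ui.y, ui.z, 0.f);
}
```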