Improving the scalability of parallel N-body applications with an event driven constraint based execution model

Author: Aarseth SJ
Alfieri RA
Bonachea D
Chandra R
Dekate C
El-Ghazawi T
Hewitt C
Kale L
Message Passing Interface Forum
O’Shea BW
Salmon JK
Singh JP
Publication venue: 'SAGE Publications'
Publication date: 23/09/2011

The scalability and efficiency of graph applications are significantly constrained by conventional systems and their supporting programming models. Technology trends like multicore, manycore, and heterogeneous system architectures are introducing further challenges and possibilities for emerging application domains such as graph applications. This paper explores the space of effective parallel execution of ephemeral graphs that are dynamically generated using the Barnes-Hut algorithm to exemplify dynamic workloads. The workloads are expressed using the semantics of an Exascale computing execution model called ParalleX. For comparison, results using conventional execution model semantics are also presented. We find that the advanced semantics for Exascale computing improve load balancing at runtime and discover parallelism automatically, which improves efficiency. Comment: 11 figures

arXiv.org e-Print Archive

Crossref

A Tuned and Scalable Fast Multipole Method as a Preeminent Algorithm for Exascale Systems

Author: Bergman K
Chandramowlishwaran A
Hamada T
Lorena A Barba
Rahimian A
Rio Yokota
Warren M
Yokota R
Publication venue: 'SAGE Publications'
Publication date: 16/10/2011

Among the algorithms that are likely to play a major role in future exascale computing, the fast multipole method (FMM) appears as a rising star. Our recent work showed scaling of an FMM on GPU clusters, with problem sizes on the order of billions of unknowns. That work led to an extremely parallel FMM, scaling to thousands of GPUs or tens of thousands of CPUs. This paper reports on a campaign of performance tuning and scalability studies using multi-core CPUs on the Kraken supercomputer. All kernels in the FMM were parallelized using OpenMP, and a test using 10^7 particles randomly distributed in a cube showed 78% efficiency on 8 threads. Tuning of the particle-to-particle kernel using SIMD instructions resulted in a 4x speed-up of the overall algorithm on single-core tests with 10^3 - 10^7 particles. Parallel scalability was studied in both strong and weak scaling. The strong scaling test used 10^8 particles and resulted in 93% parallel efficiency on 2048 processes for the non-SIMD code and 54% for the SIMD-optimized code (which was still 2x faster). The weak scaling test used 10^6 particles per process and resulted in 72% efficiency on 32,768 processes, with the largest calculation taking about 40 seconds to evaluate more than 32 billion unknowns. This work builds up evidence for our view that FMM is poised to play a leading role in exascale computing, and we end the paper with a discussion of the features that make it a particularly favorable algorithm for the emerging heterogeneous and massively parallel architectural landscape.
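
The strong- and weak-scaling efficiencies quoted in this abstract are usually computed from wall-clock timings with the standard textbook definitions. The sketch below shows those definitions only; the function names are hypothetical and the timings in the usage note are made-up illustrative values, not measurements from the paper.

```python
def strong_scaling_efficiency(t_base, p_base, t_p, p):
    """Parallel efficiency for a fixed total problem size.

    t_base: wall-clock time on the baseline run with p_base processes
    t_p:    wall-clock time on p processes
    Ideal scaling keeps (time * processes) constant, giving 1.0.
    """
    return (t_base * p_base) / (t_p * p)


def weak_scaling_efficiency(t_base, t_p):
    """Parallel efficiency for a fixed problem size per process.

    Ideal scaling keeps the wall-clock time constant, giving 1.0.
    """
    return t_base / t_p


if __name__ == "__main__":
    # Illustrative timings only (seconds); not taken from the paper.
    print(strong_scaling_efficiency(t_base=100.0, p_base=1, t_p=25.0, p=8))
    print(weak_scaling_efficiency(t_base=30.0, t_p=40.0))
```

With these illustrative inputs, a run that is 4x faster on 8 processes has a strong-scaling efficiency of 0.5, and a per-process workload whose time grows from 30 s to 40 s has a weak-scaling efficiency of 0.75.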

arXiv.org e-Print Archive

Crossref

A Parallel Adaptive P3M code with Hierarchical Particle Reordering

Author: Anderson
Bagla
Balsara
Barnes
Becciani
Blumenthal
Bode
Boris
Brieu
Couchman
Couchman
Dave
Decyk
Dubinski
Dubinski
Eastwood
Efstathiou
Evrard
Ferrell
Frenk
Frigo
Gingold
Greengard
H.M.P. Couchman
Hernquist
Hernquist
Hockney
Kawata
Kravtsov
Li
Lia
MacFarland
Miocchi
Monaghan
Navarro
Pearce
Robert J. Thacker
Serna
Snir
Spergel
Springel
Springel
Steinmetz
Sugimoto
Swarztrauber
Thacker
Thacker
Thacker
Thacker
Theuns
Vetterling
Wadsley
White
Wisdom
Wood
Publication venue: 'Elsevier BV'
Publication date: 01/01/2005

We discuss the design and implementation of HYDRA_OMP, a parallel implementation of the Smoothed Particle Hydrodynamics-Adaptive P3M (SPH-AP3M) code HYDRA. The code is designed primarily for conducting cosmological hydrodynamic simulations and is written in Fortran77+OpenMP. A number of optimizations for RISC processors and SMP-NUMA architectures have been implemented, the most important being hierarchical reordering of particles within chaining cells, which greatly improves data locality, thereby removing the cache misses typically associated with linked lists. Parallel scaling is good, with a minimum parallel scaling of 73% achieved on 32 nodes for a variety of modern SMP architectures. We give performance data in terms of the number of particle updates per second, which is a more useful performance metric than raw MFlops. A basic version of the code will be made available to the community in the near future. Comment: 34 pages, 12 figures, accepted for publication in Computer Physics Communications
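
The locality benefit of reordering particles by chaining cell can be illustrated with a much simpler scheme than the one the abstract describes. The sketch below is a single-level sort by cell index in Python, not the hierarchical Fortran77+OpenMP implementation in HYDRA_OMP; the function names, the cubic box, and the row-major cell numbering are all assumptions made for illustration.

```python
def cell_index(pos, box_size, cells_per_side):
    """Map a 3D position in [0, box_size) to a row-major chaining-cell index."""
    n = cells_per_side
    # Clamp so a particle exactly on the upper boundary stays in the last cell.
    ix, iy, iz = (min(int(c / box_size * n), n - 1) for c in pos)
    return (ix * n + iy) * n + iz


def reorder_by_cell(positions, box_size, cells_per_side):
    """Return the positions sorted by cell index, plus the permutation used.

    After the sort, particles sharing a chaining cell are contiguous in
    memory, so a sweep over one cell reads a contiguous slice instead of
    chasing linked-list pointers through scattered storage.
    """
    keys = [cell_index(p, box_size, cells_per_side) for p in positions]
    # sorted() is stable, so particles keep their relative order within a cell.
    order = sorted(range(len(positions)), key=keys.__getitem__)
    return [positions[i] for i in order], order


if __name__ == "__main__":
    particles = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.8, 0.9, 0.95), (0.2, 0.1, 0.1)]
    reordered, order = reorder_by_cell(particles, box_size=1.0, cells_per_side=2)
    print(order)
```

The permutation is returned alongside the sorted positions so that any other per-particle arrays (velocities, masses) can be reordered consistently.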

arXiv.org e-Print Archive

CiteSeerX

Crossref

CERN Document Server