
286 research outputs found

A Parallel Adaptive P3M code with Hierarchical Particle Reordering

Author: Anderson
Bagla
Balsara
Barnes
Becciani
Blumenthal
Bode
Boris
Brieu
Couchman
Dave
Decyk
Dubinski
Eastwood
Efstathiou
Evrard
Ferrell
Frenk
Frigo
Gingold
Greengard
H.M.P. Couchman
Hernquist
Hockney
Kawata
Kravtsov
Li
Lia
MacFarland
Miocchi
Monaghan
Navarro
Pearce
Robert J. Thacker
Serna
Snir
Spergel
Springel
Steinmetz
Sugimoto
Swarztrauber
Thacker
Theuns
Vetterling
Wadsley
White
Wisdom
Wood
Publication venue: Elsevier BV
Publication date: 01/01/2005

We discuss the design and implementation of HYDRA_OMP, a parallel implementation of the Smoothed Particle Hydrodynamics-Adaptive P3M (SPH-AP3M) code HYDRA. The code is designed primarily for conducting cosmological hydrodynamic simulations and is written in Fortran77+OpenMP. A number of optimizations for RISC processors and SMP-NUMA architectures have been implemented, the most important being hierarchical reordering of particles within chaining cells, which greatly improves data locality and thereby removes the cache misses typically associated with linked lists. Parallel scaling is good, with a minimum parallel scaling of 73% achieved on 32 nodes for a variety of modern SMP architectures. We give performance data in terms of the number of particle updates per second, which is a more useful performance metric than raw MFlops. A basic version of the code will be made available to the community in the near future.

Comment: 34 pages, 12 figures, accepted for publication in Computer Physics Communications
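The locality argument in the abstract above can be illustrated with a minimal sketch. It is not the HYDRA_OMP implementation (which is Fortran77+OpenMP and uses a hierarchical scheme); it only shows the simpler, single-level idea of storing particles contiguously by chaining cell instead of following a linked list. The names `cell_index` and `reorder_by_cell`, the cubic box, and the flattened cell numbering are assumptions made for illustration.

```python
from typing import List, Tuple

Particle = Tuple[float, float, float]


def cell_index(p: Particle, box_size: float, n: int) -> int:
    """Map a position in [0, box_size)^3 to a flattened chaining-cell index."""
    ix, iy, iz = (min(int(c / box_size * n), n - 1) for c in p)
    return (ix * n + iy) * n + iz


def reorder_by_cell(particles: List[Particle], box_size: float, n: int):
    """Counting sort: group particles by cell and return per-cell start offsets."""
    cells = [cell_index(p, box_size, n) for p in particles]
    counts = [0] * (n ** 3)
    for c in cells:
        counts[c] += 1
    # start[c] is the offset of the first particle of cell c;
    # start[n**3] equals the total number of particles.
    start = [0] * (n ** 3 + 1)
    for c in range(n ** 3):
        start[c + 1] = start[c] + counts[c]
    cursor = start[:-1].copy()          # next free slot in each cell
    ordered = [None] * len(particles)
    for p, c in zip(particles, cells):
        ordered[cursor[c]] = p
        cursor[c] += 1
    return ordered, start


particles = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.8, 0.9, 0.9), (0.2, 0.1, 0.1)]
ordered, start = reorder_by_cell(particles, box_size=1.0, n=2)
# Particles of cell c now occupy the contiguous slice ordered[start[c]:start[c + 1]].
```

After reordering, a loop over one cell touches a contiguous slice of memory rather than chasing pointers, which is the effect the abstract attributes to the reordering.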

arXiv.org e-Print Archive

CiteSeerX

Crossref

CERN Document Server

A Flexible Thread Scheduler for Hierarchical Multiprocessor Machines

Author: Thibault Samuel
Publication venue: HAL CCSD
Publication date: 19/06/2005

With the current trend of multiprocessor machines towards more and more hierarchical architectures, exploiting the full computational power requires careful distribution of execution threads and data so as to limit expensive remote memory accesses. Existing multi-threaded libraries provide only limited facilities for applications to express distribution hints, so programmers end up distributing tasks explicitly according to the underlying architecture, which is difficult and not portable. In this article, we present: (1) a model for dynamically expressing the structure of the computation; (2) a scheduler interpreting this model so as to make judicious hierarchical distribution decisions; (3) an implementation within the Marcel user-level thread library. We evaluated our proposal on a scientific application running on a ccNUMA Bull NovaScale with 16 Intel Itanium II processors; results show a 30% gain compared to a classical scheduler, and are similar to what a handmade scheduler achieves in a non-portable way.

INRIA a CCSD electronic archive server

Un ordonnanceur flexible pour machines multiprocesseurs hiérarchiques (A flexible scheduler for hierarchical multiprocessor machines)

Author: Thibault Samuel
Publication venue: HAL CCSD
Publication date: 06/04/2005

The evolution of multiprocessor machines towards increasingly hierarchical architectures means that, to get the most out of them, execution threads and data must be distributed with great care so as to minimize non-local memory accesses. Current multithreading libraries provide very few facilities for expressing distribution directives at the application level, which forces programmers to perform this distribution explicitly according to the underlying architecture, and therefore in a non-portable way. In this article we present: (1) a model that lets the program dynamically express the structure of the computation; (2) a scheduler able to interpret this model in order to make judicious hierarchical placement decisions; (3) an implementation within the Marcel user-level thread library. An experiment was carried out on a scientific application running on a ccNUMA Bull NovaScale machine with 16 Intel Itanium II processors; the results show a gain of 30% over a classical scheduler and are comparable to those obtained with manual placement, which is not portable.

INRIA a CCSD electronic archive server

Dense matrix computations on NUMA architectures with distance-aware work stealing

Author: Al-Omairy Rabab
Badia Sala Rosa Maria
Keyes David E.
Labarta Mancho Jesús José
Ltaief Hatem
Martorell Bofill Xavier
Miranda Álamo Guillermo
Publication date: 01/03/2015

We employ the dynamic runtime system OmpSs to decrease the overhead of data motion in the now ubiquitous non-uniform memory access (NUMA) high-concurrency environment of multicore processors. The dense numerical linear algebra algorithms of Cholesky factorization and symmetric matrix inversion are employed as representative benchmarks. Work stealing occurs within an innovative NUMA-aware scheduling policy to reduce data movement between NUMA nodes. The overall approach achieves separation of concerns by abstracting the complexity of the hardware from the end users so that high productivity can be achieved. On a large NUMA system, the approach outperforms existing state-of-the-art implementations by up to a factor of two for both the Cholesky factorization and the symmetric matrix inversion, while the OmpSs-enabled code maintains strong similarity to its original sequential version.

The authors would like to thank the National Institute for Computational Sciences for granting us access to the Nautilus system. The KAUST authors acknowledge support of the Extreme Computing Research Center. The BSC-affiliated authors gratefully acknowledge the support of the European Commission through the HiPEAC-3 Network of Excellence (FP7-ICT 287759), Intel-BSC Exascale Lab and IBM/BSC Exascale Initiative collaboration, Spanish Ministry of Education (FPU), Computación de Altas Prestaciones VI (TIN2012-34557), Generalitat de Catalunya (2014-SGR-1051) and the grant SEV-2011-00067 of the Severo Ochoa Program.

Peer reviewed. Postprint (published version).
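The idea of distance-aware work stealing named in the title of this entry can be sketched roughly as follows. This is not the OmpSs scheduler or its API; it is an illustrative victim-selection rule under the assumption that each NUMA node has one task queue and that a symmetric distance matrix between nodes is known. The function names `steal_order` and `steal` and the distance values are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


def steal_order(thief: int, distance: List[List[int]]) -> List[int]:
    """Return the other NUMA nodes sorted from nearest to farthest."""
    others = [n for n in range(len(distance)) if n != thief]
    return sorted(others, key=lambda n: (distance[thief][n], n))


def steal(thief: int, queues: Dict[int, Deque[str]],
          distance: List[List[int]]) -> Optional[str]:
    """Take a task from the nearest non-empty queue, or None if all are empty."""
    for victim in steal_order(thief, distance):
        if queues[victim]:
            # Steal from the left end; an owner would pop from the right.
            return queues[victim].popleft()
    return None


# Assumed distances for a 3-node machine (smaller means closer).
distance = [[10, 20, 40],
            [20, 10, 20],
            [40, 20, 10]]
queues = {0: deque(), 1: deque(["near"]), 2: deque(["far"])}
first = steal(0, queues, distance)    # node 1 is closer to node 0 than node 2
second = steal(0, queues, distance)   # node 1 is now empty, so fall back to node 2
third = steal(0, queues, distance)    # nothing left to steal
```

Preferring the nearest victim keeps stolen tasks close to the data they are likely to touch, which is the data-movement concern the abstract raises.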

UPCommons. Portal del coneixement obert de la UPC

BubbleSched, plate-forme de conception d'ordonnanceurs de threads sur machines hiérarchiques (BubbleSched, a platform for designing thread schedulers on hierarchical machines)

Author: Namyst Raymond
Thibault Samuel
Wacrenier Pierre-André
Publication venue: Lavoisier
Publication date: 01/01/2008

Exploiting the full computational power of hierarchical multiprocessor machines with irregular multithreaded applications requires a very careful distribution of threads and data. To achieve most of the available performance, programmers often have to give up portability and hard-wire ad hoc placement strategies that depend heavily on the architecture. To guarantee the portability of performance, we have defined abstractions called "bubbles" for capturing both the hierarchical structure of the application's parallelism and the hierarchical architecture of the targeted machine. We have defined a set of high-level primitives to ease the implementation of dedicated, efficient and portable schedulers. We show the relevance of our approach and describe the mechanisms we developed for easily implementing such schedulers.
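As a rough illustration of the "bubble" abstraction described above, the following sketch models bubbles as nested groups of threads and places each top-level group on one node of the machine. It does not use the BubbleSched or Marcel API; the `Bubble` class, the `place` function, and the round-robin policy are assumptions chosen only to show how grouping information can drive placement.

```python
from typing import Dict, List, Union


class Bubble:
    """A group of threads (plain strings) and nested bubbles that belong together."""

    def __init__(self, name: str, children: List[Union["Bubble", str]]):
        self.name = name
        self.children = children


def threads_of(item: Union[Bubble, str]) -> List[str]:
    """Flatten a thread or a bubble into the list of threads it contains."""
    if isinstance(item, str):
        return [item]
    return [t for child in item.children for t in threads_of(child)]


def place(bubble: Bubble, nodes: List[str]) -> Dict[str, str]:
    """Burst the top-level bubble and hand each child to one node, round-robin.

    All threads inside a child stay on the same node, so threads that the
    application grouped together also share memory locality.
    """
    placement: Dict[str, str] = {}
    for i, child in enumerate(bubble.children):
        node = nodes[i % len(nodes)]
        for thread in threads_of(child):
            placement[thread] = node
    return placement


app = Bubble("app", [Bubble("b0", ["t0", "t1"]), Bubble("b1", ["t2", "t3"])])
placement = place(app, ["node0", "node1"])
```

A real scheduler would recurse through every level of the machine hierarchy and rebalance at run time; the sketch stops at one level to keep the grouping idea visible.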

INRIA a CCSD electronic archive server

Topology-Aware and Dependence-Aware Scheduling and Memory Allocation for Task-Parallel Languages

Author: Cohen Albert
Drach Nathalie
Drebes Andi
Heydemann Karine
Pop Antoniu
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/10/2014

We present a joint scheduling and memory allocation algorithm for efficient execution of task-parallel programs on non-uniform memory access (NUMA) systems. Task and data placement decisions are based on a static description of the memory hierarchy and on runtime information about intertask communication. Existing locality-aware scheduling strategies for fine-grained tasks have strong limitations: they are specific to some class of machines or applications, they do not handle task dependences, they require manual program annotations, or they rely on fragile profiling schemes. By contrast, our solution makes no assumption about the structure of programs or the layout of data in memory. Experimental results, based on the OpenStream language, show that locality of accesses to main memory of scientific applications can be increased significantly on a 64-core machine, resulting in a speedup of up to 1.63× compared to a state-of-the-art work-stealing scheduler.
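One way to picture a placement decision that combines a static machine description with runtime communication information, as the abstract above describes, is the following sketch. It is not the algorithm of the paper or part of OpenStream; it assumes a hypothetical cost model in which a task is placed on the node that minimises the sum of input bytes weighted by node distance. The name `best_node` and the distance values are illustrative.

```python
from typing import Dict, List


def best_node(input_bytes: Dict[int, int], distance: List[List[int]]) -> int:
    """Pick the node n that minimises sum(bytes on node m * distance[n][m]).

    input_bytes: runtime information, mapping each node to the number of
                 bytes of the task's inputs that currently reside there.
    distance:    static description of the machine (smaller means closer).
    Ties are broken in favour of the lowest node number.
    """
    def cost(n: int) -> int:
        return sum(b * distance[n][m] for m, b in input_bytes.items())

    return min(range(len(distance)), key=lambda n: (cost(n), n))


# Assumed two-node machine: local access costs 10, remote access costs 20.
distance = [[10, 20],
            [20, 10]]
mostly_on_1 = best_node({0: 100, 1: 400}, distance)   # costs: 9000 vs 6000
balanced = best_node({0: 100, 1: 100}, distance)      # costs: 3000 vs 3000
```

Running the task where most of its input already lives reduces remote memory traffic, which is the locality gain the abstract reports.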

INRIA a CCSD electronic archive server

The University of Manchester - Institutional Repository

Simulation models of shared-memory multiprocessor systems

Author: Coe Paul
Publication venue: The University of Edinburgh
Publication date: 01/01/2000

Edinburgh Research Archive