
    A Parallel Adaptive P3M code with Hierarchical Particle Reordering

    We discuss the design and implementation of HYDRA_OMP, a parallel implementation of the Smoothed Particle Hydrodynamics-Adaptive P3M (SPH-AP3M) code HYDRA. The code is designed primarily for conducting cosmological hydrodynamic simulations and is written in Fortran 77 with OpenMP. A number of optimizations for RISC processors and SMP-NUMA architectures have been implemented, the most important being the hierarchical reordering of particles within chaining cells, which greatly improves data locality and thereby removes the cache misses typically associated with linked lists. Parallel scaling is good, with a minimum parallel scaling of 73% achieved on 32 nodes across a variety of modern SMP architectures. We give performance data in terms of the number of particle updates per second, which is a more useful performance metric than raw MFlops. A basic version of the code will be made available to the community in the near future. (Comment: 34 pages, 12 figures, accepted for publication in Computer Physics Communications.)
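
    The key optimization lends itself to a short sketch: if particles are stored sorted by chaining-cell index, neighbour traversals read contiguous memory instead of chasing linked-list pointers. The following C fragment is a minimal illustration of that idea, not the HYDRA_OMP code (which is Fortran 77 with OpenMP); the Particle layout and the counting sort are assumptions.

        /* Minimal illustration of particle reordering: sort particles by their
         * chaining-cell index so that neighbour searches walk contiguous memory
         * instead of chasing linked-list pointers.  Names and the counting sort
         * below are assumptions for illustration, not the HYDRA_OMP code. */
        #include <stdlib.h>
        #include <string.h>

        typedef struct {
            double x, y, z;     /* position            */
            double vx, vy, vz;  /* velocity            */
            int    cell;        /* chaining-cell index */
        } Particle;

        /* Counting sort by cell index: O(n + ncells), stable, cache-friendly. */
        void reorder_by_cell(Particle *p, int n, int ncells)
        {
            int      *count = calloc(ncells + 1, sizeof *count);
            Particle *tmp   = malloc(n * sizeof *tmp);

            for (int i = 0; i < n; i++)        /* histogram of cell occupancy   */
                count[p[i].cell + 1]++;
            for (int c = 0; c < ncells; c++)   /* prefix sum -> cell offsets    */
                count[c + 1] += count[c];
            for (int i = 0; i < n; i++)        /* scatter into sorted positions */
                tmp[count[p[i].cell]++] = p[i];

            memcpy(p, tmp, n * sizeof *p);
            free(tmp);
            free(count);
        }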

    A Flexible Thread Scheduler for Hierarchical Multiprocessor Machines

    With the current trend of multiprocessor machines towards more and more hierarchical architectures, exploiting the full computational power requires careful distribution of execution threads and data so as to limit expensive remote memory accesses. Existing multi-threaded libraries provide only limited facilities to let applications express distribution hints, so programmers end up explicitly distributing tasks according to the underlying architecture, which is difficult and not portable. In this article, we present: (1) a model for dynamically expressing the structure of the computation; (2) a scheduler interpreting this model so as to make judicious hierarchical distribution decisions; (3) an implementation within the Marcel user-level thread library. We evaluated our proposal on a scientific application running on a ccNUMA Bull NovaScale with 16 Intel Itanium II processors; results show a 30% gain compared to a classical scheduler and are similar to what a handmade scheduler achieves in a non-portable way.
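
    For contrast, the fragment below shows the kind of hand-written, architecture-specific thread placement that the paper argues applications should not have to contain: worker teams are pinned to successive NUMA nodes with Linux CPU-affinity calls. The node-to-CPU mapping and all names are assumptions for illustration; Marcel's actual interface is not shown.

        /* Hypothetical, hand-coded distribution of thread teams across NUMA nodes,
         * of the kind the paper's scheduler is meant to make unnecessary.  Assumes
         * a fixed mapping of CPUS_PER_NODE consecutive CPUs per node; Marcel's
         * real interface is not used here. */
        #define _GNU_SOURCE
        #include <pthread.h>
        #include <sched.h>
        #include <stdio.h>

        #define NODES            4
        #define CPUS_PER_NODE    4
        #define THREADS_PER_TEAM 4

        static void *worker(void *arg)
        {
            long team = (long)arg;
            /* ... compute on data allocated on this team's node ... */
            printf("team %ld running on CPU %d\n", team, sched_getcpu());
            return NULL;
        }

        int main(void)
        {
            pthread_t tid[NODES][THREADS_PER_TEAM];

            for (long node = 0; node < NODES; node++) {
                /* Restrict this team's threads to the CPUs of one NUMA node. */
                cpu_set_t set;
                CPU_ZERO(&set);
                for (int c = 0; c < CPUS_PER_NODE; c++)
                    CPU_SET(node * CPUS_PER_NODE + c, &set);

                pthread_attr_t attr;
                pthread_attr_init(&attr);
                pthread_attr_setaffinity_np(&attr, sizeof set, &set);

                for (int t = 0; t < THREADS_PER_TEAM; t++)
                    pthread_create(&tid[node][t], &attr, worker, (void *)node);
                pthread_attr_destroy(&attr);
            }

            for (int n = 0; n < NODES; n++)
                for (int t = 0; t < THREADS_PER_TEAM; t++)
                    pthread_join(tid[n][t], NULL);
            return 0;
        }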

    A Flexible Scheduler for Hierarchical Multiprocessor Machines (French version)

    The evolution of multiprocessor machines towards increasingly hierarchical architectures requires, in order to extract their full potential, that execution flows and data be distributed with extreme care so as to minimize non-local memory accesses. Current multithreading libraries provide very few facilities for expressing distribution directives at the application level, which forces programmers to perform this distribution explicitly according to the underlying architecture, and thus in a non-portable way. In this article we present: (1) a model allowing the program to dynamically express the structure of the computation; (2) a scheduler able to interpret this model in order to make judicious hierarchical placement decisions; (3) an implementation within the Marcel user-level thread library. An experiment was conducted on a scientific application executed on a ccNUMA Bull NovaScale machine with 16 Intel Itanium II processors; the results show a 50% gain over a classical scheduler and are comparable to those obtained by performing the placement by hand, which is not portable.

    Dense matrix computations on NUMA architectures with distance-aware work stealing

    We employ the dynamic runtime system OmpSs to decrease the overhead of data motion in the now ubiquitous non-uniform memory access (NUMA) high-concurrency environment of multicore processors. The dense numerical linear algebra algorithms of Cholesky factorization and symmetric matrix inversion are employed as representative benchmarks. Work stealing occurs within an innovative NUMA-aware scheduling policy to reduce data movement between NUMA nodes. The overall approach achieves separation of concerns by abstracting the complexity of the hardware from the end users so that high productivity can be achieved. Performance results on a large NUMA system show speedups of up to twofold over state-of-the-art implementations for both the Cholesky factorization and the symmetric matrix inversion, while the OmpSs-enabled code remains strongly similar to its original sequential version. The authors would like to thank the National Institute for Computational Sciences for granting us access to the Nautilus system. The KAUST authors acknowledge the support of the Extreme Computing Research Center. The BSC-affiliated authors gratefully acknowledge the support of the European Commission through the HiPEAC-3 Network of Excellence (FP7-ICT 287759), the Intel-BSC Exascale Lab and IBM/BSC Exascale Initiative collaborations, the Spanish Ministry of Education (FPU), Computación de Altas Prestaciones VI (TIN2012-34557), Generalitat de Catalunya (2014-SGR-1051), and grant SEV-2011-00067 of the Severo Ochoa Program.
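
    As a concrete picture of the programming model, the sketch below writes a tiled Cholesky factorization with standard OpenMP task dependences; OmpSs uses annotations that are similar in spirit, and the NUMA-aware, distance-guided work stealing described in the paper is a runtime policy that leaves code like this untouched. The tile count NT, tile size NB, storage layout, and the "first element stands for the whole tile" dependence idiom are assumptions.

        /* Illustrative sketch: a tiled Cholesky factorization written with standard
         * OpenMP task dependences.  Not the paper's OmpSs code; NT, NB and the tile
         * layout are assumed, and each tile is represented in the depend clauses by
         * its first element. */
        #include <cblas.h>
        #include <lapacke.h>

        #define NT 8      /* tiles per row/column (assumed) */
        #define NB 256    /* tile size (assumed)            */

        /* A[i][j] points to the NB x NB column-major tile in block (i, j), i >= j. */
        void tiled_cholesky(double *A[NT][NT])
        {
            #pragma omp parallel
            #pragma omp single
            for (int k = 0; k < NT; k++) {
                /* Factor the diagonal tile. */
                #pragma omp task depend(inout: A[k][k][0])
                LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', NB, A[k][k], NB);

                /* Triangular solves against the panel below the diagonal. */
                for (int i = k + 1; i < NT; i++) {
                    #pragma omp task depend(in: A[k][k][0]) depend(inout: A[i][k][0])
                    cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                                CblasNonUnit, NB, NB, 1.0, A[k][k], NB, A[i][k], NB);
                }

                /* Trailing-matrix update. */
                for (int i = k + 1; i < NT; i++) {
                    #pragma omp task depend(in: A[i][k][0]) depend(inout: A[i][i][0])
                    cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                                NB, NB, -1.0, A[i][k], NB, 1.0, A[i][i], NB);

                    for (int j = k + 1; j < i; j++) {
                        #pragma omp task depend(in: A[i][k][0], A[j][k][0]) depend(inout: A[i][j][0])
                        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                                    NB, NB, NB, -1.0, A[i][k], NB, A[j][k], NB,
                                    1.0, A[i][j], NB);
                    }
                }
            }
        }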

    BubbleSched: A Platform for Designing Thread Schedulers for Hierarchical Machines

    Exploiting the full computational power of hierarchical multiprocessor machines with irregular multithreaded applications requires a very careful distribution of threads and data. To achieve most of the available performance, programmers often have to forget about portability and hard-code ad hoc placement strategies that depend heavily on the architecture. To guarantee portability of performance, we have defined abstractions called "bubbles" for capturing both the hierarchical structure of the application's parallelism and the hierarchical architecture of the targeted machine. We have defined a set of high-level primitives to ease the implementation of dedicated, efficient and portable schedulers. We show the relevance of our approach and describe the mechanisms we developed for easily implementing such schedulers.
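
    A minimal sketch of the idea, under assumed types that are not the Marcel/BubbleSched API: a bubble is a tree whose leaves are threads and whose inner nodes group related work, and a toy scheduler distributes sub-bubbles over the children of each level of a machine-topology tree.

        /* Illustrative sketch of the "bubble" abstraction: a tree whose leaves are
         * threads and whose inner nodes group related work.  A toy scheduler walks
         * the bubble tree and the machine topology together, assigning each
         * sub-bubble to a child of the current topology level.  Types and names
         * are assumptions, not the BubbleSched/Marcel API. */
        #include <stdio.h>

        typedef struct bubble {
            int nchildren;
            struct bubble **children;  /* NULL for a leaf (a single thread) */
            int thread_id;             /* valid only for leaves             */
        } bubble_t;

        typedef struct level {
            const char *name;          /* e.g. "machine", "node", "core"    */
            int nchildren;
            struct level **children;
        } level_t;

        /* Recursively map sub-bubbles onto the children of a topology level. */
        static void distribute(bubble_t *b, level_t *lvl)
        {
            if (b->nchildren == 0) {
                printf("thread %d -> %s\n", b->thread_id, lvl->name);
                return;
            }
            for (int i = 0; i < b->nchildren; i++) {
                level_t *target = lvl->nchildren
                                ? lvl->children[i % lvl->nchildren]
                                : lvl;
                distribute(b->children[i], target);
            }
        }

        int main(void)
        {
            /* Two NUMA nodes under one machine. */
            level_t node0 = { "node0", 0, NULL }, node1 = { "node1", 0, NULL };
            level_t *nodes[] = { &node0, &node1 };
            level_t machine  = { "machine", 2, nodes };

            /* Two bubbles of two threads each, mirroring the application structure. */
            bubble_t t0 = {0, NULL, 0}, t1 = {0, NULL, 1}, t2 = {0, NULL, 2}, t3 = {0, NULL, 3};
            bubble_t *g0[] = { &t0, &t1 }, *g1[] = { &t2, &t3 };
            bubble_t b0 = {2, g0, -1}, b1 = {2, g1, -1};
            bubble_t *top[] = { &b0, &b1 };
            bubble_t root = {2, top, -1};

            distribute(&root, &machine);
            return 0;
        }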

    Topology-Aware and Dependence-Aware Scheduling and Memory Allocation for Task-Parallel Languages

    We present a joint scheduling and memory allocation algorithm for efficient execution of task-parallel programs on non-uniform memory access (NUMA) systems. Task and data placement decisions are based on a static description of the memory hierarchy and on runtime information about inter-task communication. Existing locality-aware scheduling strategies for fine-grained tasks have strong limitations: they are specific to some class of machines or applications, they do not handle task dependences, they require manual program annotations, or they rely on fragile profiling schemes. By contrast, our solution makes no assumption on the structure of programs or on the layout of data in memory. Experimental results, based on the OpenStream language, show that locality of accesses to main memory of scientific applications can be increased significantly on a 64-core machine, resulting in a speedup of up to 1.63× compared to a state-of-the-art work-stealing scheduler.
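
    The placement decision at the heart of such a scheduler can be sketched independently of OpenStream: the hypothetical routine below scores each NUMA node by how many of a task's input bytes are already resident there and selects the best candidate. The input descriptors and residency bookkeeping are assumptions; OpenStream's runtime derives this information from the task dependences themselves.

        /* Illustrative placement heuristic: run a task on the NUMA node that already
         * holds most of its input data.  The input descriptors and per-node
         * residency bookkeeping are hypothetical, not the OpenStream runtime. */
        #include <stddef.h>

        #define MAX_NODES 8

        typedef struct {
            int    home_node;   /* NUMA node where this input buffer resides */
            size_t bytes;       /* size of the input                         */
        } input_t;

        /* Return the node holding the largest share of the task's input bytes. */
        int choose_node(const input_t *inputs, int ninputs)
        {
            size_t resident[MAX_NODES] = {0};
            for (int i = 0; i < ninputs; i++)
                resident[inputs[i].home_node] += inputs[i].bytes;

            int best = 0;
            for (int n = 1; n < MAX_NODES; n++)
                if (resident[n] > resident[best])
                    best = n;
            return best;   /* enqueue the task on this node's worker queue */
        }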

    Simulation models of shared-memory multiprocessor systems
