4 research outputs found

    Dynamic Task and Data Placement over NUMA Architectures: an OpenMP Runtime Perspective

    Exploiting the full computational power of current hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture so as to avoid memory access penalties. Directive-based programming languages such as OpenMP provide programmers with an easy way to structure the parallelism of their application and to transmit this information to the runtime system. Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into "scheduling hints" to solve thread/memory affinity issues. It enables dynamic load distribution guided by application structure and hardware topology, thus helping to achieve performance portability. First experiments show that mixed solutions (migrating threads and data) outperform Next-touch-based data distribution policies and open possibilities for new optimizations.
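    As a generic illustration of the kind of application structure such a runtime can exploit (this is not the authors' API; the node and core counts below are assumptions), a nested OpenMP parallel region in C might look as follows, with the outer team intended for NUMA nodes and the inner teams for the cores of each node:

        /* Generic sketch of nested OpenMP parallelism, assuming a machine with
         * NUM_NODES NUMA nodes of CORES_PER_NODE cores each (both values are
         * illustrative, not taken from the paper). A hierarchy-aware runtime
         * like the one described above could map each outer team member to one
         * NUMA node and the inner threads to the cores of that node. */
        #include <omp.h>
        #include <stdio.h>

        #define NUM_NODES      4   /* assumed number of NUMA nodes */
        #define CORES_PER_NODE 4   /* assumed cores per NUMA node  */

        int main(void)
        {
            omp_set_max_active_levels(2);          /* allow two levels of nesting */

            #pragma omp parallel num_threads(NUM_NODES)
            {
                int node = omp_get_thread_num();   /* outer level: one thread per node */

                #pragma omp parallel num_threads(CORES_PER_NODE)
                {
                    int core = omp_get_thread_num();  /* inner level: threads within one node */
                    printf("outer team %d, inner thread %d\n", node, core);
                }
            }
            return 0;
        }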

    Structuring the execution of OpenMP applications for multicore architectures

    The now commonplace multi-core chips have introduced, by design, a deep hierarchy of memory and cache banks within parallel computers as a tradeoff between the user friendliness of shared memory on the one side, and memory access scalability and efficiency on the other side. However, to get high performance out of such machines requires a dynamic mapping of application tasks and data onto the underlying architecture. Moreover, depending on the application behavior, this mapping should favor cache affinity, memory bandwidth, computation synchrony, or a combination of these. The great challenge is then to perform this hardware-dependent mapping in a portable, abstract way. To meet this need, we propose a new, hierarchical approach to the execution of OpenMP threads onto multicore machines. Our ForestGOMP runtime system dynamically generates structured trees out of OpenMP programs. It collects relationship information about threads and data as well. This information is used together with scheduling hints and hardware counter feedback by the scheduler to select the most appropriate threads and data distribution. ForestGOMP features a high-level platform for developing and tuning portable threads schedulers. We present several applications for which we developed specific scheduling policies that achieve excellent speedups on 16-core machines.
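    The tree-shaped execution structure mentioned above arises naturally from nested OpenMP regions. The following sketch uses only standard OpenMP (the recursion depth and fan-out are illustrative, and it does not use the ForestGOMP API) to show how recursive nesting produces a tree of thread teams that a hierarchical scheduler could map onto the cache and memory hierarchy:

        /* Sketch of how nested OpenMP regions expose a tree of thread teams
         * that a hierarchical runtime can schedule. The recursive structure
         * (depth, fan-out) is purely illustrative. */
        #include <omp.h>
        #include <stdio.h>

        static void work(int depth, int id)
        {
            if (depth == 0) {
                printf("leaf %d handled by thread %d\n", id, omp_get_thread_num());
                return;
            }
            /* Each recursion level opens a new parallel region; the resulting
             * team nesting forms the tree a hierarchical scheduler can map to
             * the machine's cache and memory hierarchy. */
            #pragma omp parallel num_threads(2)
            work(depth - 1, 2 * id + omp_get_thread_num());
        }

        int main(void)
        {
            omp_set_max_active_levels(8);   /* permit deep nesting for the example */
            work(3, 0);                     /* builds a binary tree of teams, depth 3 */
            return 0;
        }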

    ForestGOMP: an efficient OpenMP environment for NUMA architectures

    Exploiting the full computational power of current hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture so as to avoid remote memory access penalties. Directive-based programming languages such as OpenMP can greatly help to perform such a distribution by providing programmers with an easy way to structure the parallelism of their application and to transmit this information to the runtime system. Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into scheduling hints related to thread-memory affinity issues. These hints enable dynamic load distribution guided by application structure and hardware topology, thus helping to achieve performance portability. Several experiments show that mixed solutions (migrating both threads and data) outperform work-stealing based balancing strategies and Next-Touch-based data distribution policies. These techniques provide insights into additional optimizations.
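    For context on the data distribution policies being compared, the sketch below shows the classic first-touch placement pattern that NUMA-aware policies such as next-touch refine; next-touch itself requires operating-system or runtime support that is not shown here. The array size and schedule are illustrative assumptions:

        /* Sketch of the classic "first-touch" page placement pattern that NUMA
         * data distribution policies build upon: each page is allocated on the
         * NUMA node of the thread that writes it first, so initializing the
         * array inside the same kind of parallel loop that later processes it
         * keeps data close to its threads. The problem size N is illustrative. */
        #include <omp.h>
        #include <stdlib.h>

        #define N (1 << 24)   /* illustrative problem size */

        int main(void)
        {
            double *a = malloc(N * sizeof *a);

            /* Parallel initialization: pages are first touched by the thread
             * that will also use them below, so they land on that thread's node. */
            #pragma omp parallel for schedule(static)
            for (long i = 0; i < N; i++)
                a[i] = 0.0;

            /* Same static schedule -> same iteration-to-thread mapping, so each
             * thread mostly accesses locally placed pages. */
            #pragma omp parallel for schedule(static)
            for (long i = 0; i < N; i++)
                a[i] = 2.0 * a[i] + 1.0;

            free(a);
            return 0;
        }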