180 research outputs found

    Kernel-assisted and Topology-aware MPI Collective Communication among Multicore or Many-core Clusters

    Multicore and many-core clusters have become the most prominent form of High Performance Computing (HPC) systems. Hardware complexity and hierarchies exist not only at the inter-node level, i.e., hierarchical networks, but also inside multicore compute nodes, e.g., Non-Uniform Memory Access (NUMA), network-style interconnects, and memory and shared-cache hierarchies. The Message Passing Interface (MPI), the programming model most widely adopted in the HPC community, suffers from decreased performance and portability due to this multi-level hardware complexity. We identified three critical issues specific to collective communication. First, there is a gap between logical collective topologies and the underlying hardware topologies. Second, current MPI implementations lack efficient shared-memory message delivery approaches. Last, on distributed memory machines such as multicore clusters, no single approach can encompass the extreme variations not only in bandwidth and latency, but also in features such as the ability to perform multiple copies concurrently. To bridge the gap between logical collective topologies and hardware topologies, we developed a distance-aware framework that integrates knowledge of hardware distance into collective algorithms in order to dynamically reshape communication patterns to suit the hardware capabilities. Based on process distance information, we used graph partitioning techniques to organize the MPI processes into a multi-level hierarchy that maps onto the hardware characteristics. Meanwhile, we adopted the kernel-assisted one-sided single-copy approach (KNEM) as the default shared-memory delivery method. Via kernel-assisted memory copy, the collective algorithms offload copy tasks onto non-leader/non-root processes to evenly distribute copy workloads among the available cores. Finally, on distributed memory machines, we developed a technique to compose multi-layered collective algorithms into a multi-level algorithm with tight interoperability between the levels. This tight collaboration results in more overlap between inter- and intra-node communication. Experimental results confirm that, by leveraging several techniques together, such as kernel-assisted memory copy, the distance-aware framework, and collective algorithm composition, MPI collectives not only reach their potential maximum performance on a wide variety of platforms, but also deliver a level of performance that is immune to changes in the underlying process-core binding.
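    To make the composition idea concrete, the following is a minimal C/MPI sketch of a two-level broadcast: ranks are split into per-node (shared-memory) communicators plus a communicator of node leaders, and the broadcast is expressed as an inter-node step followed by an intra-node step. It uses only standard MPI calls and illustrates the layering principle only; it is not the dissertation's implementation, which additionally uses distance-aware reshaping and KNEM-based copy offload.

```c
/* Two-level broadcast sketch: split MPI_COMM_WORLD into per-node
 * (shared-memory) communicators plus a communicator of node leaders,
 * then compose an inter-node broadcast with an intra-node one. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Intra-node communicator: all ranks sharing memory on this node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Inter-node communicator: one "leader" (local rank 0) per node. */
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    int value = (world_rank == 0) ? 42 : -1;

    /* Level 1: broadcast among node leaders across the network. */
    if (node_rank == 0)
        MPI_Bcast(&value, 1, MPI_INT, 0, leader_comm);

    /* Level 2: each leader broadcasts inside its node via shared memory. */
    MPI_Bcast(&value, 1, MPI_INT, 0, node_comm);

    printf("rank %d got %d\n", world_rank, value);

    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```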

    Topology and affinity aware hierarchical and distributed load-balancing in Charm++

    The evolution of massively parallel supercomputers makes two issues particularly palpable: load imbalance and the poor management of data locality in applications. With the increase in the number of cores and the drastic decrease in the amount of memory per core, meeting large performance needs requires taking particular care of load balancing and, as much as possible, of data locality. One way to take this locality issue into account relies on the placement of the processing entities, and load-balancing techniques are relevant to improve application performance. With large-scale platforms in mind, we developed a hierarchical and distributed algorithm whose aim is to perform topology-aware load balancing tailored for Charm++ applications. This algorithm is based on LibTopoMap for the network-awareness aspects and on TREEMATCH to determine a relevant placement of the processing entities. We show that the proposed algorithm improves the overall execution time both for real applications and for a synthetic benchmark. For the latter experiment, we show scalability up to one million processing entities.
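    As a rough illustration of the kind of decision such a load balancer makes, the C sketch below greedily assigns tasks, heaviest first, to the core minimizing a mix of projected load and topology distance from the task's current location. The heuristic and all names are hypothetical and deliberately simplistic; this is not the TREEMATCH/LibTopoMap-based strategy described above.

```c
/* Toy greedy placement sketch (not TREEMATCH/LibTopoMap): assign tasks,
 * heaviest first, to the candidate core with the smallest combination of
 * current load and topology distance from the task's present location.
 * dist is a hypothetical hop-count matrix between cores. */
#include <stdlib.h>

typedef struct { int id; double load; int current_core; } task_t;

static int by_load_desc(const void *a, const void *b)
{
    double la = ((const task_t *)a)->load, lb = ((const task_t *)b)->load;
    return (la < lb) - (la > lb);
}

void place_tasks(task_t *tasks, int ntasks,
                 const double *dist, int ncores,   /* dist[ncores*ncores] */
                 double migration_weight,
                 int *assignment)                  /* out: task id -> core */
{
    double *core_load = calloc(ncores, sizeof(double));

    qsort(tasks, ntasks, sizeof(task_t), by_load_desc);

    for (int t = 0; t < ntasks; t++) {
        int best = 0;
        double best_cost = 1e300;
        for (int c = 0; c < ncores; c++) {
            /* Cost mixes projected load with the distance the task data
             * would have to travel if migrated from its current core. */
            double cost = core_load[c] + tasks[t].load
                        + migration_weight
                          * dist[tasks[t].current_core * ncores + c];
            if (cost < best_cost) { best_cost = cost; best = c; }
        }
        assignment[tasks[t].id] = best;
        core_load[best] += tasks[t].load;
    }
    free(core_load);
}
```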

    Topology-Aware Job Mapping

    A Resource and Job Management System (RJMS) is a crucial system software component of the HPC stack. It is responsible for efficiently delivering computing power to applications in supercomputing environments. Its main intelligence relies on resource selection techniques to find the resources best adapted to schedule the users' jobs. This paper introduces a new method that takes into account both the network topology of the machine and the application's communication pattern to determine the best choice among the available nodes of the platform. To validate our approach, we integrate this algorithm as a plugin for Slurm, a well-known and widespread RJMS. We assess our plugin with different optimization schemes by comparing it against the default topology-aware Slurm algorithm, using both emulation and simulation of a large-scale platform, and by carrying out experiments on a real cluster. We show that transparently taking into account a job's communication pattern and the topology allows for relevant performance gains.
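    The selection criterion can be pictured as a simple cost function: given the job's communication matrix and the network distance between candidate nodes, estimate the traffic cost of a candidate mapping and prefer the cheapest node set. The C sketch below only illustrates that idea under assumed inputs; it is not the Slurm plugin's actual code.

```c
/* Scoring sketch for topology-aware node selection (illustrative only):
 * estimate the network traffic cost of mapping rank i onto node
 * node_of_rank[i], given the job's communication matrix and the pairwise
 * distance between nodes. An RJMS can compare this score across candidate
 * node sets and pick the cheapest one. */
double placement_cost(const double *comm, int nranks,      /* comm[nranks*nranks] */
                      const int *node_of_rank,             /* rank -> node index  */
                      const double *node_dist, int nnodes) /* node_dist[nnodes*nnodes] */
{
    double cost = 0.0;
    for (int i = 0; i < nranks; i++)
        for (int j = i + 1; j < nranks; j++) {
            int ni = node_of_rank[i], nj = node_of_rank[j];
            /* Intra-node pairs (ni == nj) contribute zero network cost. */
            cost += comm[i * nranks + j] * node_dist[ni * nnodes + nj];
        }
    return cost;
}
```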

    Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures

    We propose a methodology to address the programmability issues derived from the emergence of new-generation shared-memory NUMA architectures. For this purpose, we employ dense matrix factorizations and matrix inversion (DMFI) as a use case, and we target two modern architectures (AMD Rome and Huawei Kunpeng 920) that exhibit configurable NUMA topologies. Our methodology pursues performance portability across different NUMA configurations by proposing multi-domain implementations for DMFI plus a hybrid task- and loop-level parallelization that configures multi-threaded executions to fix core-to-data binding, exploiting locality at the expense of minor code modifications. In addition, we introduce a generalization of the multi-domain implementations for DMFI that offers support for virtually any NUMA topology in present and future architectures. Our experimentation on the two target architectures for three representative dense linear algebra operations validates the proposal, reveals insights on the necessity of adapting both the codes and their execution to improve data access locality, and reports performance across architectures and inter- and intra-socket NUMA configurations competitive with state-of-the-art message-passing implementations, maintaining the ease of development usually associated with shared-memory programming. This research was sponsored by project PID2019-107255GB of Ministerio de Ciencia, Innovación y Universidades; project S2018/TCS-4423 of Comunidad de Madrid; project 2017-SGR-1414 of the Generalitat de Catalunya and the Madrid Government under the Multiannual Agreement with UCM in the line Program to Stimulate Research for Young Doctors in the context of the V PRICIT, project PR65/19-22445. This project has also received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955558. The JU receives support from the European Union's Horizon 2020 research and innovation programme, and Spain, Germany, France, Italy, Poland, Switzerland, Norway. The work is also supported by grants PID2020-113656RB-C22 and PID2021-126576NB-I00 of MCIN/AEI/10.13039/501100011033 and by ERDF A way of making Europe.
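    A common way to obtain the core-to-data binding mentioned above is first-touch placement: with threads pinned to cores (e.g., OMP_PROC_BIND=close, OMP_PLACES=cores) and a static loop schedule, each thread initializes, and therefore places in its own NUMA domain, the rows it will later compute on. The OpenMP/C sketch below illustrates this generic technique under those assumptions; it is not the DMFI code.

```c
/* First-touch initialization sketch: with pinned threads and a static
 * schedule, each thread touches -- and thus places in its NUMA domain --
 * the block of the matrix it will later work on. The computational loops
 * must reuse the same schedule to preserve the core-to-data binding. */
#include <stdlib.h>

double *alloc_and_place(int n)              /* n x n matrix, row-major */
{
    double *a = malloc((size_t)n * n * sizeof(double));

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[(size_t)i * n + j] = 0.0;     /* page is placed near this thread */

    return a;
}
```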

    A hierarchical model for topology management in MPI applications (Un modèle hiérarchique pour la gestion de la topologie dans les applications MPI)

    The MPI standard is a major contribution in the landscape of parallel programming. Since its inception in the mid-1990s it has ensured portability and performance for parallel applications on a wide spectrum of machines and architectures. With the advent of multicore machines, understanding and taking into account the underlying physical topology and memory hierarchy has become of paramount importance. The MPI standard in its current state, however, and despite recent evolutions, is still unable to offer mechanisms to achieve this. In this paper, we detail several additions to the standard that give the user tools to address hardware topology and data locality issues while improving application performance.

    Hardware topology management in MPI applications through hierarchical communicators

    The MPI standard is a major contribution in the landscape of parallel programming. Since its inception in the mid-1990s it has ensured portability and performance for parallel applications on a wide spectrum of machines and architectures. With the advent of multicore machines, understanding and taking into account the underlying physical topology and memory hierarchy have become of paramount importance. On the other hand, providing abstract mechanisms to manipulate the hardware topology is also fundamental. The MPI standard in its current state, however, and despite recent evolutions, is still unable to offer mechanisms to achieve this. In this paper, we detail several additions to the standard for building new MPI communicators corresponding to hardware hierarchy levels. These additions provide the user with tools to address hardware topology and locality issues while improving application performance.
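    As an example of the kind of interface such additions enable, the sketch below splits a communicator along a hardware level; MPI 4.0 later standardized a related mechanism through MPI_COMM_TYPE_HW_GUIDED. The resource-type string is implementation-dependent (only "mpi_shared_memory" is mandated by the standard), so the "NUMANode" value used here is an assumption to be checked against the MPI library at hand.

```c
/* Sketch of splitting a communicator along a hardware level, in the
 * spirit of the additions described above, using the MPI 4.0
 * hardware-guided split. The "NUMANode" resource type is an assumed,
 * implementation-dependent value. */
#include <mpi.h>

MPI_Comm split_at_numa_level(MPI_Comm comm)
{
    MPI_Info info;
    MPI_Comm numa_comm = MPI_COMM_NULL;

    MPI_Info_create(&info);
    MPI_Info_set(info, "mpi_hw_resource_type", "NUMANode");

    /* All ranks attached to the same NUMA node end up in the same
     * sub-communicator; ordering follows the key (here: rank order). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_HW_GUIDED, 0, info, &numa_comm);

    MPI_Info_free(&info);
    return numa_comm;
}
```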

    Memory Footprint of Locality Information on Many-Core Platforms

    Exploiting the power of HPC platforms requires knowledge of their increasingly complex hardware topologies. Multiple components of the software stack, for instance MPI implementations or OpenMP runtimes, now perform their own topology discovery to find out the available cores and memory, and to better place tasks based on their affinities. We study in this article the impact of this topology discovery in terms of memory footprint. Storing locality information wastes an amount of physical memory that is becoming an issue on many-core platforms on the road to exascale. We demonstrate that this information may be factorized between processes by using a shared-memory region. Our analysis of the physical and virtual memories in supercomputing architectures shows that this shared region can be mapped at the same virtual address in all processes, hence dramatically simplifying the software implementation. Our implementation in hwloc and Open MPI shows a memory footprint that no longer increases with the number of MPI ranks per node. Moreover, the job launch time is decreased by more than a factor of 2 on an Intel Knights Landing Xeon Phi and on a 96-core NUMA platform.
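    The mechanism can be pictured with plain POSIX calls: one process per node creates a shared segment and every process maps it at a pre-agreed virtual address, so pointers stored inside the segment remain valid everywhere. The C sketch below is a toy illustration, not hwloc's actual API, and the fixed address it uses is an arbitrary assumption; finding an address that is free in every process is precisely the difficulty the article analyzes.

```c
/* Toy sketch (not hwloc's API): map a named shared-memory segment at a
 * pre-agreed virtual address in every process on the node, so that the
 * topology data, including internal pointers, can be shared as-is. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define ADDR_HINT ((void *)0x300000000000UL)  /* hypothetical common address */
#define SEG_SIZE  (4UL << 20)                 /* 4 MiB for the topology data */

void *map_topology_segment(const char *name, int create)
{
    int fd = shm_open(name, create ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
    if (fd < 0) return NULL;
    if (create && ftruncate(fd, SEG_SIZE) != 0) { close(fd); return NULL; }

    void *addr = mmap(ADDR_HINT, SEG_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    close(fd);

    /* Without MAP_FIXED the kernel may ignore the hint; a real
     * implementation must detect this and fall back to private copies. */
    if (addr != ADDR_HINT) {
        if (addr != MAP_FAILED) munmap(addr, SEG_SIZE);
        return NULL;
    }
    return addr;
}
```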

    A hierarchical model to manage hardware topology in MPI applications

    The MPI standard is a major contribution in the landscape of parallel programming. Since its inception in the mid-1990s it has ensured portability and performance for parallel applications on a wide spectrum of machines and architectures. With the advent of multicore machines, understanding and taking into account the underlying physical topology and memory hierarchy have become of paramount importance. The MPI standard in its current state, however, and despite recent evolutions, is still unable to offer mechanisms to achieve this. In this paper, we detail several additions to the standard that give the user tools to address the hardware topology and data locality issues while improving application performance.

    Towards an efficient process placement policy for MPI applications in multicore environments

    This paper presents a method to efficiently place MPI processes on multicore machines. Since MPI implementations often feature efficient support for both shared-memory and network communication, an adequate placement policy is a crucial step to improve application performance. As a case study, we show the results obtained for several NAS computing kernels and explain how the policy influences overall performance. In particular, we found that a policy merely increasing the intra-node communication ratio is not enough and that cache utilization is also an influential factor. A more sophisticated policy (e.g., one taking into account the architecture's memory structure) is required to observe performance improvements.
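    The metric in question can be written down directly: the fraction of the communication volume that stays inside a node for a given process-to-node placement. The C sketch below computes it from an assumed communication matrix; the paper's point is that maximizing this ratio alone is not sufficient, because cache and memory effects also matter.

```c
/* Illustrative sketch (not the paper's code): fraction of the total
 * communication volume exchanged between ranks located on the same node,
 * for a given process-to-node placement. */
double intranode_ratio(const double *comm, int nranks,  /* comm[nranks*nranks] */
                       const int *node_of_rank)         /* rank -> node */
{
    double total = 0.0, local = 0.0;
    for (int i = 0; i < nranks; i++)
        for (int j = i + 1; j < nranks; j++) {
            double v = comm[i * nranks + j];
            total += v;
            if (node_of_rank[i] == node_of_rank[j])
                local += v;
        }
    return total > 0.0 ? local / total : 0.0;
}
```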

    Large-scale experiment for topology-aware resource management

    A Resource and Job Management System (RJMS) is a crucial system software part of the HPC stack. It is responsible for efficiently delivering computing power to applications in supercomputing environments, and its main intelligence relies on resource selection techniques to find the resources best adapted to schedule the users' jobs. In [8], we introduced a new topology-aware resource selection algorithm to determine the best choice among the available nodes of the platform, based on their position in the network and on application behaviour (expressed as a communication matrix). We integrated this algorithm as a plugin in Slurm and validated it with several optimization schemes by comparing against the default Slurm algorithm. This paper presents further experiments regarding this selection process.