21 research outputs found

    Better Process Mapping and Sparse Quadratic Assignment

    Communication- and topology-aware process mapping is a powerful approach to reducing communication time in parallel applications with known communication patterns on large distributed-memory systems. We address the problem as a quadratic assignment problem (QAP) and present algorithms to construct initial mappings of processes to processors, as well as fast local search algorithms to improve the mappings further. By exploiting assumptions that typically hold for applications and modern supercomputer systems, such as sparse communication patterns and hierarchically organized communication systems, we arrive at significantly more powerful algorithms for these special QAPs. Our multilevel construction algorithms employ recently developed, perfectly balanced graph partitioning techniques and extensively exploit the given communication system hierarchy. We present improvements to a local search algorithm of Brandfass et al. (2013), decreasing its running time by reducing the time needed to perform swaps in the assignment and by carefully constraining local search neighborhoods. Experiments indicate that our algorithms not only dramatically speed up local search but, thanks to the multilevel approach, also find much better solutions in practice.
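    The QAP formulation can be made concrete in a few lines. The sketch below shows the mapping objective and a naive pairwise-swap local search; the matrices, sizes, and initial mapping are hypothetical toy data, and the full cost recomputation per swap is exactly the kind of overhead the paper's faster swap evaluation is designed to avoid.

        /* Minimal sketch of the QAP objective behind process mapping, plus a
           naive pairwise-swap local search. Illustration only, not the paper's
           multilevel algorithm; comm, dist, and map are made-up toy data. */
        #include <stdio.h>

        #define N 4  /* #processes == #processors (toy size) */

        /* Mapping cost: traffic between each process pair, weighted by the
           distance between the processors they are mapped to. */
        static long cost(const long comm[N][N], const long dist[N][N], const int map[N]) {
            long c = 0;
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    c += comm[i][j] * dist[map[i]][map[j]];
            return c;
        }

        int main(void) {
            long comm[N][N] = {{0,5,0,1},{5,0,2,0},{0,2,0,4},{1,0,4,0}};
            long dist[N][N] = {{0,1,2,2},{1,0,2,2},{2,2,0,1},{2,2,1,0}};
            int map[N] = {0, 2, 1, 3};  /* process i runs on processor map[i] */

            /* Accept any swap that lowers the cost; stop when none is left. */
            int improved = 1;
            while (improved) {
                improved = 0;
                for (int i = 0; i < N; i++)
                    for (int j = i + 1; j < N; j++) {
                        long before = cost(comm, dist, map);
                        int t = map[i]; map[i] = map[j]; map[j] = t;
                        if (cost(comm, dist, map) < before) {
                            improved = 1;                            /* keep swap */
                        } else {
                            t = map[i]; map[i] = map[j]; map[j] = t; /* undo */
                        }
                    }
            }
            printf("final mapping cost: %ld\n", cost(comm, dist, map));
            return 0;
        }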

    High-Quality Hierarchical Process Mapping

    Partitioning graphs into blocks of roughly equal size such that few edges run between blocks is a frequently needed operation when processing graphs on a parallel computer. When the topology of the distributed system is known, an important subsequent task is to map the blocks of the partition onto the processors such that the overall communication cost is reduced. We present novel multilevel algorithms that integrate graph partitioning and process mapping. Important ingredients of our algorithm include fast label propagation, more localized local search, initial partitioning, and a compressed data structure that computes processor distances without storing a distance matrix. Moreover, our algorithms are able to exploit a given hierarchical structure of the distributed system under consideration. Experiments indicate that our algorithms speed up the overall mapping process and, due to the integrated multilevel approach, also find much better solutions in practice. For example, one configuration of our algorithm matches the mapping quality of the previous state of the art for large numbers of partitions while being a factor of 9.3 faster. Compared to Scotch, currently the fastest iterated multilevel mapping algorithm, we obtain 16% better solutions while investing slightly more running time.
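    The distance-without-a-matrix idea can be illustrated directly from a hierarchy description. The sketch below assumes, hypothetically, a machine given as fanouts per level (racks x nodes x cores) with one distance value per level; two processors' distance is then determined by the highest level at which their paths through the hierarchy diverge. This shows the principle only, not the paper's compressed data structure.

        /* Processor distances derived from a hardware hierarchy rather than a
           stored O(P^2) distance matrix. Fanouts and per-level distances are
           hypothetical. */
        #include <stdio.h>

        #define LEVELS 3
        static const int fanout[LEVELS]     = {2, 4, 8};    /* 2 racks x 4 nodes x 8 cores */
        static const int level_dist[LEVELS] = {100, 10, 1}; /* distance when paths diverge here */

        static int proc_dist(int a, int b) {
            if (a == b) return 0;
            int subtree = 1;
            for (int l = 0; l < LEVELS; l++) subtree *= fanout[l];  /* total cores */
            for (int l = 0; l < LEVELS; l++) {
                subtree /= fanout[l];            /* cores per subtree one level down */
                if (a / subtree != b / subtree)  /* a and b part ways at level l */
                    return level_dist[l];
            }
            return 0;  /* unreachable for a != b */
        }

        int main(void) {
            /* same node, same rack, different racks: */
            printf("%d %d %d\n", proc_dist(0, 1), proc_dist(0, 9), proc_dist(0, 33));
            return 0;
        }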

    Improving MPI Applications Performance on Multicore Clusters with Rank Reordering

    Modern hardware architectures featuring multicore processors and a complex memory hierarchy raise challenges that parallel application programmers need to address. It is therefore tempting to adapt an application's communication pattern to the characteristics of the underlying hardware. The MPI standard features several functions that allow the ranks of MPI processes to be reordered according to a graph attached to a newly created communicator. In this paper, we explain how the MPICH2 implementation of the MPI_Dist_graph_create function was modified to reorder the MPI process ranks so that the application communication pattern matches the hardware topology. Experimental results on a multicore cluster show that improvements can be achieved as long as the application communication pattern is expressed by a relevant metric.
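    Passing reorder = 1 to MPI_Dist_graph_create is all an application has to do to opt in to such a remapping; whether ranks are actually reordered is up to the implementation (the paper modifies MPICH2 to do so). A minimal sketch, with a made-up ring-shaped communication graph:

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            /* Each rank declares one edge: itself -> its right neighbor. */
            int source = rank, degree = 1, weight = 1;
            int destination = (rank + 1) % size;

            MPI_Comm ring;
            MPI_Dist_graph_create(MPI_COMM_WORLD, 1, &source, &degree, &destination,
                                  &weight, MPI_INFO_NULL, /* reorder = */ 1, &ring);

            int newrank;
            MPI_Comm_rank(ring, &newrank);
            printf("world rank %d -> graph rank %d\n", rank, newrank);

            MPI_Comm_free(&ring);
            MPI_Finalize();
            return 0;
        }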

    An Efficient Strategy for Process Placement in Multicore Environments

    Exploiting these sophisticated "multicore" machines is a constant challenge for developers of MPI implementations, who must deliver very good performance for network communication as well as for shared-memory communication. An efficient process placement strategy, matched to this hierarchical structure, is essential for applications to reach optimal performance. This article therefore proposes an efficient mechanism for arranging MPI processes on multicore machines, based on a topology discovery tool and a fine-grained analysis of the application's communication pattern. The experiments we carried out demonstrate the influence of placement strategies on overall performance. In particular, we found that a strategy that merely increases intranode communication often degrades performance, whereas careful cache usage appears to be a more influential factor. A much more sophisticated strategy, one taking the memory hierarchy into account, is required to observe performance improvements.
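    The abstract does not name its topology discovery tool; hwloc is a plausible stand-in and is what the sketch below uses to enumerate cores and bind the current process to one of them. The "bind to core 0" policy is deliberately simplistic and purely illustrative.

        #include <hwloc.h>
        #include <stdio.h>

        int main(void) {
            hwloc_topology_t topo;
            hwloc_topology_init(&topo);
            hwloc_topology_load(topo);  /* discover the machine's hierarchy */

            int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
            printf("machine has %d cores\n", ncores);

            /* Bind this process to the first core (a real placement policy
               would choose the core from the communication pattern instead). */
            hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0);
            if (core && hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_PROCESS) == 0)
                printf("bound to core %u\n", core->logical_index);

            hwloc_topology_destroy(topo);
            return 0;
        }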

    A Hierarchical Model for Topology Management in MPI Applications

    The MPI standard is a major contribution to the landscape of parallel programming. Since its inception in the mid-90s, it has ensured portability and performance for parallel applications on a wide spectrum of machines and architectures. With the advent of multicore machines, understanding and taking into account the underlying physical topology and memory hierarchy has become of paramount importance. The MPI standard in its current state, however, and despite recent evolutions, is still unable to offer mechanisms to achieve this. In this paper, we detail several additions to the standard that give the user tools to address hardware topology and data locality issues while improving application performance.
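    The proposed additions are not part of the standard; the closest mechanism MPI already provides is MPI_Comm_split_type with MPI_COMM_TYPE_SHARED, which groups the ranks that share memory (i.e., a node). A minimal sketch of that existing facility:

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int world_rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

            /* One communicator per shared-memory node. */
            MPI_Comm node;
            MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                                MPI_INFO_NULL, &node);

            int node_rank, node_size;
            MPI_Comm_rank(node, &node_rank);
            MPI_Comm_size(node, &node_size);
            printf("world rank %d is rank %d of %d on its node\n",
                   world_rank, node_rank, node_size);

            MPI_Comm_free(&node);
            MPI_Finalize();
            return 0;
        }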

    Hardware topology management in MPI applications through hierarchical communicators

    The MPI standard is a major contribution to the landscape of parallel programming. Since its inception in the mid-90s it has ensured portability and performance for parallel applications on a wide spectrum of machines and architectures. With the advent of multicore machines, understanding and taking into account the underlying physical topology and memory hierarchy have become of paramount importance. Providing abstract mechanisms to manipulate the hardware topology is also fundamental, however, and the MPI standard in its current state, despite recent evolutions, is still unable to offer them. In this paper, we detail several additions to the standard for building new MPI communicators corresponding to hardware hierarchy levels. They provide the user with tools to address hardware topology and locality issues while improving application performance.
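    A two-level approximation of such hierarchy-matched communicators can already be built with existing calls, as sketched below: one communicator per node, plus a "leaders" communicator spanning one rank per node. The paper's actual API additions are not reproduced here.

        #include <mpi.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int world_rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

            /* Level 1: ranks sharing a node. */
            MPI_Comm node;
            MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                                MPI_INFO_NULL, &node);
            int node_rank;
            MPI_Comm_rank(node, &node_rank);

            /* Level 2: rank 0 of each node joins the inter-node communicator;
               the others land in a communicator this sketch simply ignores. */
            MPI_Comm leaders;
            MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : 1, world_rank, &leaders);

            /* ... intra-node work on `node`, inter-node work on `leaders` ... */

            MPI_Comm_free(&leaders);
            MPI_Comm_free(&node);
            MPI_Finalize();
            return 0;
        }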

    Towards an efficient process placement policy for MPI applications in multicore environments

    This paper presents a method for efficiently placing MPI processes on multicore machines. Since MPI implementations often feature efficient support for both shared-memory and network communication, an adequate placement policy is a crucial step in improving application performance. As a case study, we show the results obtained for several NAS computing kernels and explain how the policy influences overall performance. In particular, we found that a policy merely increasing the intranode communication ratio is not enough and that cache utilization is also an influential factor. A more sophisticated policy (e.g., one taking the architecture's memory structure into account) is required to observe performance improvements.
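    The intranode communication ratio such a naive policy maximizes is easy to state in code: the fraction of all traffic exchanged between processes placed on the same node. The matrix and node assignment below are made up for illustration.

        #include <stdio.h>

        #define P 4

        int main(void) {
            long comm[P][P] = {{0,8,1,1},{8,0,1,1},{1,1,0,8},{1,1,8,0}};
            int  node[P]    = {0, 0, 1, 1};  /* process i is placed on node[i] */

            long intra = 0, total = 0;
            for (int i = 0; i < P; i++)
                for (int j = 0; j < P; j++) {
                    total += comm[i][j];
                    if (node[i] == node[j]) intra += comm[i][j];
                }
            printf("intranode ratio: %.2f\n", total ? (double)intra / total : 0.0);
            return 0;
        }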

    Efficient Process-to-Node Mapping Algorithms for Stencil Computations

    Good process-to-compute-node mappings can be decisive for well-performing HPC applications. A special, important class of process-to-node mapping problems is the problem of mapping processes that communicate in a sparse stencil pattern onto Cartesian grids. By thoroughly exploiting the structure inherent in this type of problem, we devise three novel distributed algorithms that are able to handle arbitrary stencil communication patterns effectively. We analyze the expected performance of our algorithms based on an abstract model of inter- and intra-node communication. An extensive experimental evaluation on several HPC machines shows that our algorithms are up to two orders of magnitude faster in running time than a (sequential) high-quality general graph mapping tool, while obtaining similar communication performance. Furthermore, our algorithms also achieve significantly better mapping quality than previous state-of-the-art Cartesian grid mapping algorithms. This results in up to a threefold performance improvement of an MPI_Neighbor_alltoall exchange operation. Our new algorithms can be used to implement the MPI_Cart_create functionality.
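    From the application side, the two MPI operations involved look as follows: MPI_Cart_create with reorder = 1 permits the library to remap ranks onto the grid (the step the paper's algorithms could implement), and MPI_Neighbor_alltoall then exchanges one block with each of the 2 * ndims grid neighbors. A minimal sketch for a hypothetical 4 x 4 torus (requires at least 16 ranks):

        #include <mpi.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);

            int dims[2] = {4, 4}, periods[2] = {1, 1};  /* 4x4 torus (example) */
            MPI_Comm cart;
            MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, /* reorder = */ 1, &cart);

            if (cart != MPI_COMM_NULL) {  /* ranks beyond the grid get MPI_COMM_NULL */
                /* One int to and from each of the 4 grid neighbors. */
                int sendbuf[4] = {0, 1, 2, 3}, recvbuf[4];
                MPI_Neighbor_alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, cart);
                MPI_Comm_free(&cart);
            }
            MPI_Finalize();
            return 0;
        }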

    Process Affinity, Metrics, and Performance Impact: An Experimental Study

    Process placement, also called topology mapping, is a well-known strategy for improving parallel program execution by reducing the communication cost between processes. It requires two inputs: the topology of the target machine and a measure of the affinity between processes. In the literature, the dominant affinity measure is the communication matrix, which describes the amount of communication between processes. The goal of this paper is to study the accuracy of the communication matrix as a measure of affinity. We have run an extensive set of tests on two fat-tree machines and a 3D-torus machine to evaluate several hypotheses that are often made in the literature and to discuss their validity. First, we check the correlation between algorithmic metrics and the performance of the application. Then, we check whether a good generic process placement algorithm never degrades performance. Finally, we examine whether the structure of the communication matrix can be used to predict gains.
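    One common way to obtain such a communication matrix is to intercept point-to-point calls through MPI's standard profiling interface (PMPI). The sketch below wraps only MPI_Send and keeps per-peer byte counts; a real profiler would also cover nonblocking sends and collectives, and would gather the counts across ranks at finalization.

        #include <mpi.h>

        #define MAX_RANKS 1024
        static long long bytes_to[MAX_RANKS];  /* bytes this rank sent to each peer */

        /* Intercept MPI_Send, record the traffic, then call the real send. */
        int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                     int dest, int tag, MPI_Comm comm) {
            int size;
            MPI_Type_size(datatype, &size);
            if (dest >= 0 && dest < MAX_RANKS)
                bytes_to[dest] += (long long)count * size;
            return PMPI_Send(buf, count, datatype, dest, tag, comm);
        }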