21 research outputs found

    Better Process Mapping and Sparse Quadratic Assignment

    Communication- and topology-aware process mapping is a powerful approach to reducing communication time in parallel applications with known communication patterns on large distributed-memory systems. We address the problem as a quadratic assignment problem (QAP) and present algorithms to construct initial mappings of processes to processors, as well as fast local search algorithms to improve the mappings further. By exploiting assumptions that typically hold for applications and modern supercomputer systems, such as sparse communication patterns and hierarchically organized communication systems, we arrive at significantly more powerful algorithms for these special QAPs. Our multilevel construction algorithms employ recently developed, perfectly balanced graph partitioning techniques and extensively exploit the given communication system hierarchy. We present improvements to a local search algorithm of Brandfass et al. (2013), decreasing its running time by reducing the time needed to perform swaps in the assignment and by carefully constraining local search neighborhoods. Experiments indicate that our algorithms not only dramatically speed up local search but, thanks to the multilevel approach, also find much better solutions in practice.
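    The QAP formulation can be made concrete in a few lines. The sketch below shows the mapping objective and a naive pairwise-swap local search; the matrices, sizes, and initial mapping are hypothetical toy data, and the full cost recomputation per swap is exactly the kind of overhead the paper's faster swap evaluation is designed to avoid.

        /* Minimal sketch of the QAP objective behind process mapping, plus a
           naive pairwise-swap local search. Illustration only, not the paper's
           multilevel algorithm; comm, dist, and map are made-up toy data. */
        #include <stdio.h>

        #define N 4  /* #processes == #processors (toy size) */

        /* Mapping cost: traffic between each process pair, weighted by the
           distance between the processors they are mapped to. */
        static long cost(const long comm[N][N], const long dist[N][N], const int map[N]) {
            long c = 0;
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    c += comm[i][j] * dist[map[i]][map[j]];
            return c;
        }

        int main(void) {
            long comm[N][N] = {{0,5,0,1},{5,0,2,0},{0,2,0,4},{1,0,4,0}};
            long dist[N][N] = {{0,1,2,2},{1,0,2,2},{2,2,0,1},{2,2,1,0}};
            int map[N] = {0, 2, 1, 3};  /* process i runs on processor map[i] */

            /* Accept any swap that lowers the cost; stop when none is left. */
            int improved = 1;
            while (improved) {
                improved = 0;
                for (int i = 0; i < N; i++)
                    for (int j = i + 1; j < N; j++) {
                        long before = cost(comm, dist, map);
                        int t = map[i]; map[i] = map[j]; map[j] = t;
                        if (cost(comm, dist, map) < before) {
                            improved = 1;                            /* keep swap */
                        } else {
                            t = map[i]; map[i] = map[j]; map[j] = t; /* undo */
                        }
                    }
            }
            printf("final mapping cost: %ld\n", cost(comm, dist, map));
            return 0;
        }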

    High-Quality Hierarchical Process Mapping

    Partitioning graphs into blocks of roughly equal size such that few edges run between blocks is a frequently needed operation when processing graphs on a parallel computer. When the topology of the distributed system is known, an important subsequent task is to map the blocks of the partition onto the processors such that the overall communication cost is reduced. We present novel multilevel algorithms that integrate graph partitioning and process mapping. Important ingredients of our algorithm include fast label propagation, more localized local search, initial partitioning, and a compressed data structure that computes processor distances without storing a distance matrix. Moreover, our algorithms are able to exploit a given hierarchical structure of the distributed system under consideration. Experiments indicate that our algorithms speed up the overall mapping process and, due to the integrated multilevel approach, also find much better solutions in practice. For example, one configuration of our algorithm matches the mapping quality of the previous state of the art for large numbers of partitions while being a factor of 9.3 faster. Compared to Scotch, currently the fastest iterated multilevel mapping algorithm, we obtain 16% better solutions while investing slightly more running time.
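    The distance-without-a-matrix idea can be illustrated directly from a hierarchy description. The sketch below assumes, hypothetically, a machine given as fanouts per level (racks x nodes x cores) with one distance value per level; two processors' distance is then determined by the highest level at which their paths through the hierarchy diverge. This shows the principle only, not the paper's compressed data structure.

        /* Processor distances derived from a hardware hierarchy rather than a
           stored O(P^2) distance matrix. Fanouts and per-level distances are
           hypothetical. */
        #include <stdio.h>

        #define LEVELS 3
        static const int fanout[LEVELS]     = {2, 4, 8};    /* 2 racks x 4 nodes x 8 cores */
        static const int level_dist[LEVELS] = {100, 10, 1}; /* distance when paths diverge here */

        static int proc_dist(int a, int b) {
            if (a == b) return 0;
            int subtree = 1;
            for (int l = 0; l < LEVELS; l++) subtree *= fanout[l];  /* total cores */
            for (int l = 0; l < LEVELS; l++) {
                subtree /= fanout[l];            /* cores per subtree one level down */
                if (a / subtree != b / subtree)  /* a and b part ways at level l */
                    return level_dist[l];
            }
            return 0;  /* unreachable for a != b */
        }

        int main(void) {
            /* same node, same rack, different racks: */
            printf("%d %d %d\n", proc_dist(0, 1), proc_dist(0, 9), proc_dist(0, 33));
            return 0;
        }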

    Improving MPI Applications Performance on Multicore Clusters with Rank Reordering

    Modern hardware architectures featuring multicore processors and a complex memory hierarchy raise challenges that parallel application programmers need to address. It is therefore tempting to adapt an application's communication pattern to the characteristics of the underlying hardware. The MPI standard features several functions that allow the ranks of MPI processes to be reordered according to a graph attached to a newly created communicator. In this paper, we explain how the MPICH2 implementation of the MPI_Dist_graph_create function was modified to reorder the MPI process ranks so that the application communication pattern matches the hardware topology. Experimental results on a multicore cluster show that improvements can be achieved as long as the application communication pattern is expressed by a relevant metric.
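    Passing reorder = 1 to MPI_Dist_graph_create is all an application has to do to opt in to such a remapping; whether ranks are actually reordered is up to the implementation (the paper modifies MPICH2 to do so). A minimal sketch, with a made-up ring-shaped communication graph:

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            /* Each rank declares one edge: itself -> its right neighbor. */
            int source = rank, degree = 1, weight = 1;
            int destination = (rank + 1) % size;

            MPI_Comm ring;
            MPI_Dist_graph_create(MPI_COMM_WORLD, 1, &source, &degree, &destination,
                                  &weight, MPI_INFO_NULL, /* reorder = */ 1, &ring);

            int newrank;
            MPI_Comm_rank(ring, &newrank);
            printf("world rank %d -> graph rank %d\n", rank, newrank);

            MPI_Comm_free(&ring);
            MPI_Finalize();
            return 0;
        }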

    An Efficient Strategy for Process Placement in Multicore Environments

    Exploiting these sophisticated "multicore" machines is a constant challenge for developers of MPI implementations, who must deliver very good performance for network communication as well as for shared-memory communication. An efficient process placement strategy, matched to this hierarchical structure, is essential for applications to reach optimal performance. This article therefore proposes an efficient mechanism for arranging MPI processes on multicore machines, based on a topology discovery tool and a fine-grained analysis of the application's communication pattern. The experiments we carried out demonstrate the influence of placement strategies on overall performance. In particular, we found that a strategy that merely increases intranode communication often degrades performance, whereas careful cache usage appears to be a more influential factor. A much more sophisticated strategy, one taking the memory hierarchy into account, is required to observe performance improvements.
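    The abstract does not name its topology discovery tool; hwloc is a plausible stand-in and is what the sketch below uses to enumerate cores and bind the current process to one of them. The "bind to core 0" policy is deliberately simplistic and purely illustrative.

        #include <hwloc.h>
        #include <stdio.h>

        int main(void) {
            hwloc_topology_t topo;
            hwloc_topology_init(&topo);
            hwloc_topology_load(topo);  /* discover the machine's hierarchy */

            int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
            printf("machine has %d cores\n", ncores);

            /* Bind this process to the first core (a real placement policy
               would choose the core from the communication pattern instead). */
            hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0);
            if (core && hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_PROCESS) == 0)
                printf("bound to core %u\n", core->logical_index);

            hwloc_topology_destroy(topo);
            return 0;
        }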

    A Hierarchical Model for Topology Management in MPI Applications

    The MPI standard is a major contribution to the landscape of parallel programming. Since its inception in the mid-90s, it has ensured portability and performance for parallel applications on a wide spectrum of machines and architectures. With the advent of multicore machines, understanding and taking into account the underlying physical topology and memory hierarchy has become of paramount importance. The MPI standard in its current state, however, and despite recent evolutions, is still unable to offer mechanisms to achieve this. In this paper, we detail several additions to the standard that give the user tools to address hardware topology and data locality issues while improving application performance.
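    The proposed additions are not part of the standard; the closest mechanism MPI already provides is MPI_Comm_split_type with MPI_COMM_TYPE_SHARED, which groups the ranks that share memory (i.e., a node). A minimal sketch of that existing facility:

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int world_rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

            /* One communicator per shared-memory node. */
            MPI_Comm node;
            MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                                MPI_INFO_NULL, &node);

            int node_rank, node_size;
            MPI_Comm_rank(node, &node_rank);
            MPI_Comm_size(node, &node_size);
            printf("world rank %d is rank %d of %d on its node\n",
                   world_rank, node_rank, node_size);

            MPI_Comm_free(&node);
            MPI_Finalize();
            return 0;
        }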

    Hardware topology management in MPI applications through hierarchical communicators

    The MPI standard is a major contribution to the landscape of parallel programming. Since its inception in the mid-90s it has ensured portability and performance for parallel applications on a wide spectrum of machines and architectures. With the advent of multicore machines, understanding and taking into account the underlying physical topology and memory hierarchy have become of paramount importance. Providing abstract mechanisms to manipulate the hardware topology is also fundamental, however, and the MPI standard in its current state, despite recent evolutions, is still unable to offer them. In this paper, we detail several additions to the standard for building new MPI communicators corresponding to hardware hierarchy levels. They provide the user with tools to address hardware topology and locality issues while improving application performance.
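    A two-level approximation of such hierarchy-matched communicators can already be built with existing calls, as sketched below: one communicator per node, plus a "leaders" communicator spanning one rank per node. The paper's actual API additions are not reproduced here.

        #include <mpi.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int world_rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

            /* Level 1: ranks sharing a node. */
            MPI_Comm node;
            MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                                MPI_INFO_NULL, &node);
            int node_rank;
            MPI_Comm_rank(node, &node_rank);

            /* Level 2: rank 0 of each node joins the inter-node communicator;
               the others land in a communicator this sketch simply ignores. */
            MPI_Comm leaders;
            MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : 1, world_rank, &leaders);

            /* ... intra-node work on `node`, inter-node work on `leaders` ... */

            MPI_Comm_free(&leaders);
            MPI_Comm_free(&node);
            MPI_Finalize();
            return 0;
        }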

    Towards an efficient process placement policy for MPI applications in multicore environments

    This paper presents a method for efficiently placing MPI processes on multicore machines. Since MPI implementations often feature efficient support for both shared-memory and network communication, an adequate placement policy is a crucial step in improving application performance. As a case study, we show the results obtained for several NAS computing kernels and explain how the policy influences overall performance. In particular, we found that a policy merely increasing the intranode communication ratio is not enough and that cache utilization is also an influential factor. A more sophisticated policy (e.g., one taking the architecture's memory structure into account) is required to observe performance improvements.
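    The intranode communication ratio such a naive policy maximizes is easy to state in code: the fraction of all traffic exchanged between processes placed on the same node. The matrix and node assignment below are made up for illustration.

        #include <stdio.h>

        #define P 4

        int main(void) {
            long comm[P][P] = {{0,8,1,1},{8,0,1,1},{1,1,0,8},{1,1,8,0}};
            int  node[P]    = {0, 0, 1, 1};  /* process i is placed on node[i] */

            long intra = 0, total = 0;
            for (int i = 0; i < P; i++)
                for (int j = 0; j < P; j++) {
                    total += comm[i][j];
                    if (node[i] == node[j]) intra += comm[i][j];
                }
            printf("intranode ratio: %.2f\n", total ? (double)intra / total : 0.0);
            return 0;
        }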

    Efficient Process-to-Node Mapping Algorithms for Stencil Computations

    Good process-to-compute-node mappings can be decisive for well-performing HPC applications. A special, important class of process-to-node mapping problems is the problem of mapping processes that communicate in a sparse stencil pattern onto Cartesian grids. By thoroughly exploiting the structure inherent in this type of problem, we devise three novel distributed algorithms that are able to handle arbitrary stencil communication patterns effectively. We analyze the expected performance of our algorithms based on an abstract model of inter- and intra-node communication. An extensive experimental evaluation on several HPC machines shows that our algorithms are up to two orders of magnitude faster in running time than a (sequential) high-quality general graph mapping tool, while obtaining similar communication performance. Furthermore, our algorithms also achieve significantly better mapping quality than previous state-of-the-art Cartesian grid mapping algorithms. This results in up to a threefold performance improvement of an MPI_Neighbor_alltoall exchange operation. Our new algorithms can be used to implement the MPI_Cart_create functionality.
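    From the application side, the two MPI operations involved look as follows: MPI_Cart_create with reorder = 1 permits the library to remap ranks onto the grid (the step the paper's algorithms could implement), and MPI_Neighbor_alltoall then exchanges one block with each of the 2 * ndims grid neighbors. A minimal sketch for a hypothetical 4 x 4 torus (requires at least 16 ranks):

        #include <mpi.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);

            int dims[2] = {4, 4}, periods[2] = {1, 1};  /* 4x4 torus (example) */
            MPI_Comm cart;
            MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, /* reorder = */ 1, &cart);

            if (cart != MPI_COMM_NULL) {  /* ranks beyond the grid get MPI_COMM_NULL */
                /* One int to and from each of the 4 grid neighbors. */
                int sendbuf[4] = {0, 1, 2, 3}, recvbuf[4];
                MPI_Neighbor_alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, cart);
                MPI_Comm_free(&cart);
            }
            MPI_Finalize();
            return 0;
        }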

    Process Affinity, Metrics, and Performance Impact: An Experimental Study

    Process placement, also called topology mapping, is a well-known strategy for improving parallel program execution by reducing the communication cost between processes. It requires two inputs: the topology of the target machine and a measure of the affinity between processes. In the literature, the dominant affinity measure is the communication matrix, which describes the amount of communication between processes. The goal of this paper is to study the accuracy of the communication matrix as a measure of affinity. We have run an extensive set of tests on two fat-tree machines and a 3D-torus machine to evaluate several hypotheses that are often made in the literature and to discuss their validity. First, we check the correlation between algorithmic metrics and the performance of the application. Then, we check whether a good generic process placement algorithm never degrades performance. Finally, we examine whether the structure of the communication matrix can be used to predict gains.
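    One common way to obtain such a communication matrix is to intercept point-to-point calls through MPI's standard profiling interface (PMPI). The sketch below wraps only MPI_Send and keeps per-peer byte counts; a real profiler would also cover nonblocking sends and collectives, and would gather the counts across ranks at finalization.

        #include <mpi.h>

        #define MAX_RANKS 1024
        static long long bytes_to[MAX_RANKS];  /* bytes this rank sent to each peer */

        /* Intercept MPI_Send, record the traffic, then call the real send. */
        int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                     int dest, int tag, MPI_Comm comm) {
            int size;
            MPI_Type_size(datatype, &size);
            if (dest >= 0 && dest < MAX_RANKS)
                bytes_to[dest] += (long long)count * size;
            return PMPI_Send(buf, count, datatype, dest, tag, comm);
        }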