Search CORE

arXiv.org e-Print Archive

Load-Balanced Bottleneck Objectives in Process Mapping

Author: Langguth Johannes
Schlag Sebastian
Schulz Christian
Publication venue
Publication date: 27/01/2020
Field of study

We propose a new problem formulation for graph partitioning that is tailored to the needs of time-critical simulations on modern heterogeneous supercomputers

KITopen

An Overview of Process Mapping Techniques and Algorithms in High-Performance Computing

Author: Hoefler Torsten
Jeannot Emmanuel
Mercier Guillaume
Publication venue: 'Royal College of Obstetricians & Gynaecologists (RCOG)'
Publication date: 01/06/2014
Field of study

International audienceDue to the advent of modern hardware architectures of high-performance comput- ers, the way the parallel applications are laid out is of paramount importance for performance. This chapter surveys several techniques and algorithms that efficiently address this issue: the mapping of the application's virtual topology (for instance its communication pattern) onto the physical topology. Using such strategy enables to improve the application overall execution time significantly. The chapter concludes by listing a series of open issues and problems

Dagstuhl Research Online Publication Server

Better Process Mapping and Sparse Quadratic Assignment

Author: Schulz Christian
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 16th International Symposium on Experimental Algorithms (SEA 2017)
Publication date: 01/01/2017
Field of study

Communication and topology aware process mapping is a powerful approach to reduce communication time in parallel applications with known communication patterns on large, distributed memory systems. We address the problem as a quadratic assignment problem (QAP), and present algorithms to construct initial mappings of processes to processors as well as fast local search algorithms to further improve the mappings. By exploiting assumptions that typically hold for applications and modern supercomputer systems such as sparse communication patterns and hierarchically organized communication systems, we arrive at significantly more powerful algorithms for these special QAPs. Our multilevel construction algorithms employ recently developed, perfectly balanced graph partitioning techniques and excessively exploit the given communication system hierarchy. We present improvements to a local search algorithm of Brandfass et al. (2013), and decrease the running time by reducing the time needed to perform swaps in the assignment as well as by carefully constraining local search neighborhoods. Experiments indicate that our algorithms not only dramatically speed up local search, but due to the multilevel approach also find much better solutions in practice

Efficient Process-to-Node Mapping Algorithms for Stencil Computations

Author: Hunold Sascha
Lehr Markus
Schulz Christian
Träff Jesper Larsson
von Kirchbach Konrad
Publication venue
Publication date: 20/05/2020
Field of study

Good process-to-compute-node mappings can be decisive for well performing HPC applications. A special, important class of process-to-node mapping problems is the problem of mapping processes that communicate in a sparse stencil pattern to Cartesian grids. By thoroughly exploiting the inherently present structure in this type of problem, we devise three novel distributed algorithms that are able to handle arbitrary stencil communication patterns effectively. We analyze the expected performance of our algorithms based on an abstract model of inter- and intra-node communication. An extensive experimental evaluation on several HPC machines shows that our algorithms are up to two orders of magnitude faster in running time than a (sequential) high-quality general graph mapping tool, while obtaining similar results in communication performance. Furthermore, our algorithms also achieve significantly better mapping quality compared to previous state-of-the-art Cartesian grid mapping algorithms. This results in up to a threefold performance improvement of an MPI_Neighbor_alltoall exchange operation. Our new algorithms can be used to implement the MPI_Cart_create functionality.Comment: 18 pages, 9 Figure

arXiv.org e-Print Archive

Crossref

High-Quality Hierarchical Process Mapping

Author: Faraj Marcelo Fonseca
Meyerhenke Henning
Schulz Christian
van der Grinten Alexander
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 18th International Symposium on Experimental Algorithms (SEA 2020)
Publication date: 01/01/2020
Field of study

Partitioning graphs into blocks of roughly equal size such that few edges run between blocks is a frequently needed operation when processing graphs on a parallel computer. When a topology of a distributed system is known, an important task is then to map the blocks of the partition onto the processors such that the overall communication cost is reduced. We present novel multilevel algorithms that integrate graph partitioning and process mapping. Important ingredients of our algorithm include fast label propagation, more localized local search, initial partitioning, as well as a compressed data structure to compute processor distances without storing a distance matrix. Moreover, our algorithms are able to exploit a given hierarchical structure of the distributed system under consideration. Experiments indicate that our algorithms speed up the overall mapping process and, due to the integrated multilevel approach, also find much better solutions in practice. For example, one configuration of our algorithm yields similar solution quality as the previous state-of-the-art in terms of mapping quality for large numbers of partitions while being a factor 9.3 faster. Compared to the currently fastest iterated multilevel mapping algorithm Scotch, we obtain 16% better solutions while investing slightly more running time

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Improving MPI Applications Performance on Multicore Clusters with Rank Reordering

Author: Jeannot Emmanuel
Mercier Guillaume
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 18/09/2011
Field of study

International audienceModern hardware architectures featuring multicores and a complex memory hierarchy raise challenges that need to be addressed by parallel applications programmers. It is therefore tempting to adapt an application communication pattern to the characteristics of the underlying hardware. The MPI standard features several functions that allow the ranks of MPI processes to be reordered according to a graph attached to a newly created communicator. In this paper, we explain how the MPICH2 implementation of the MPI_Dist_graph_create function was modified to reorder the MPI process ranks to create a match between the application communication pattern and the hardware topology. The experimental results on a multicore cluster show that improvements can be achieved as long as the application communication pattern is expressed by a relevant metric

Oskar Bordeaux

A hierarchical model to manage hardware topology in MPI applications

Author: Jeannot Emmanuel
Mansouri Farouk
Mercier Guillaume
Publication venue: HAL CCSD
Publication date: 01/01/2017
Field of study

International audienceThe MPI standard is a major contribution in the landscape of parallel programming. Since its inception in the mid 90's it has ensured portability and performance for parallel applications on a wide spectrum of machines and architectures. With the advent of multicore machines, understanding and taking into account the underlying physical topology and memory hierarchy have become of paramount importance. The MPI standard in its current state, however, and despite recent evolutions is still unable to offer mechanisms to achieve this. In this paper, we detail several additions to the standard that give the user tools to address the hardware topology and data locality issues while improving application performance

Crossref