3 research outputs found

Process Affinity, Metrics and Impact on Performance: an Empirical Study

Authors: Cyril Bordage, Emmanuel Jeannot
Publication venue: HAL CCSD
Publication date: 01/05/2018

Process placement, also called topology mapping, is a well-known strategy to improve parallel program execution by reducing the communication cost between processes. It requires two inputs: the topology of the target machine and a measure of the affinity between processes. In the literature, the dominant affinity measure is the communication matrix, which describes the amount of communication between processes. The goal of this paper is to study the accuracy of the communication matrix as a measure of affinity. We ran an extensive set of tests on two fat-tree machines and a 3D-torus machine to evaluate several hypotheses that are often made in the literature and to discuss their validity. First, we check the correlation between algorithmic metrics and the performance of the application. Then, we check whether a good generic process placement algorithm never degrades performance. Finally, we see whether the structure of the communication matrix can be used to predict gain.

I. INTRODUCTION

We are currently seeing a deepening in the hierarchy of high-performance computing systems. Nodes are composed of multicore processors with different levels of memory (standard DRAM, non-volatile memory, faster but smaller MCDRAM for KNL, etc.), and the network interconnecting these nodes can also be highly intricate, with a complex topology and a high diameter. The consequence of these architectural features is that the performance of parallel applications depends heavily on the nodes allocated for the job as well as on the mapping of these jobs. Process placement (also known as topology mapping) is an active field of research that deals with the development of strategies for improving parallel applications by carefully allocating processes onto the resources [14]. The goal is to reduce the communication cost by mapping the processes that communicate the most close to each other.

The communication time depends on the algorithm implemented in the application, because the algorithm determines the quantity of data to be exchanged. Moreover, since not all computing resources are directly connected, it also depends on the distance between the running processes as well as on the speed of the different links. Figure 1 shows what the distances (in number of hops) between cores can be in a fat-tree machine with 6 nodes of 24 cores each (two processors made of two NUMA nodes with 6 cores each). Blocks of equal distance are clearly visible. Hence, it seems natural to place two processes that communicate a lot closer together to reduce the communication cost. For this purpose, we need to adapt the execution of parallel applications to the target machine according to its specific topology.
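The block structure of the distances and the idea of placing heavily communicating processes close together can be illustrated with a minimal Python sketch. The machine shape (6 nodes, 2 processors per node, 2 NUMA nodes per processor, 6 cores per NUMA node) comes from the text above; everything else is an assumption made for illustration. In particular, the hop values 1 to 4, the function names `hop_distance` and `hop_bytes`, the toy communication matrix, and the two placements are hypothetical and are not taken from the paper or its Figure 1. The cost used here, the sum over process pairs of communication volume times hop distance, is one metric commonly called "hop-bytes" in the topology-mapping literature.

```python
# Illustrative sketch only: hop values and sample data are assumptions,
# not measurements or definitions from the paper.

CORES_PER_NUMA = 6     # from the text: NUMA nodes with 6 cores each
NUMA_PER_SOCKET = 2    # from the text: processors made of two NUMA nodes
SOCKETS_PER_NODE = 2   # from the text: two processors per node
NODES = 6              # from the text: 6 nodes

CORES_PER_SOCKET = CORES_PER_NUMA * NUMA_PER_SOCKET
CORES_PER_NODE = CORES_PER_SOCKET * SOCKETS_PER_NODE
TOTAL_CORES = CORES_PER_NODE * NODES


def hop_distance(core_a, core_b):
    """Return an assumed hop count between two cores numbered 0..TOTAL_CORES-1.

    The deeper the first level of the hierarchy that separates the two
    cores, the larger the distance. This yields blocks of equal distance.
    """
    if core_a == core_b:
        return 0
    if core_a // CORES_PER_NUMA == core_b // CORES_PER_NUMA:
        return 1  # same NUMA node (assumed value)
    if core_a // CORES_PER_SOCKET == core_b // CORES_PER_SOCKET:
        return 2  # same processor, different NUMA node (assumed value)
    if core_a // CORES_PER_NODE == core_b // CORES_PER_NODE:
        return 3  # same node, different processor (assumed value)
    return 4      # different nodes, through the network (assumed value)


def hop_bytes(comm, placement):
    """Sum comm[i][j] * distance over all unordered process pairs.

    comm is a symmetric communication matrix (volume exchanged between
    processes i and j); placement[i] is the core hosting process i.
    """
    total = 0
    n = len(comm)
    for i in range(n):
        for j in range(i + 1, n):
            total += comm[i][j] * hop_distance(placement[i], placement[j])
    return total


# Hypothetical communication matrix: processes 0-1 and 2-3 talk a lot.
comm = [
    [0, 100, 1, 1],
    [100, 0, 1, 1],
    [1, 1, 0, 100],
    [1, 1, 100, 0],
]

scattered = [0, 24, 1, 25]  # heavy pairs split across two nodes
packed = [0, 1, 24, 25]     # heavy pairs share a NUMA node

print("scattered placement cost:", hop_bytes(comm, scattered))
print("packed placement cost:", hop_bytes(comm, packed))
```

Under these assumed distances the packed placement has a much lower cost than the scattered one. Whether such a metric actually predicts application performance is precisely the question the paper investigates, so a lower cost here should not be read as a guaranteed speed-up.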

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Affinité entre les processus, métriques et impact sur les performances : étude expérimentale (Process Affinity, Metrics and Impact on Performance: an Empirical Study)

Authors: Cyril Bordage, Emmanuel Jeannot
Publication venue: HAL CCSD
Publication date: 19/12/2017

Process placement, also called topology mapping, is a well-known strategy to improve parallel program execution by reducing the communication cost between processes. It requires two inputs: the topology of the target machine and a measure of the affinity between processes. In the literature, the dominant affinity measure is the communication matrix, which describes the amount of communication between processes. The goal of this paper is to study the accuracy of the communication matrix as a measure of affinity. We ran an extensive set of tests on two fat-tree machines and a 3D-torus machine to evaluate several hypotheses that are often made in the literature and to discuss their validity. First, we check the correlation between algorithmic metrics and the performance of the application. Then, we check whether a good generic process placement algorithm never degrades performance. Finally, we see whether the structure of the communication matrix can be used to predict gain.

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Optimizing Locality by Topology-aware Placement for a Task Based Programming Model

Authors: Jens Gustedt, Emmanuel Jeannot, Farouk Mansouri
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 13/09/2016

The ordered read-write lock model (ORWL) is a modern framework that provides high-level abstractions for decomposing an application and for managing synchronizations and communications. The implementation of the model achieves high performance thanks to a decentralized, event-based runtime. In this paper, we propose to enrich ORWL with a topology-aware placement module based on the Hardware Locality framework, HWLOC. The aim is twofold: on the one hand, we increase the abstraction and the portability of the framework; on the other hand, we enhance the performance of the model's runtime. We propose a placement policy that takes the characteristics of the application, of the runtime and of the architecture into account. We validate and compare our approach with the Livermore kernel23 benchmarks.

Crossref

INRIA a CCSD electronic archive server

Hal-Diderot