Runtime function instrumentation with EZTrace
High-performance computing relies more and more on complex hardware: multiple computers, multi-processor computers, multi-core processing units, multiple general-purpose graphics processing units... To exploit the power of current computing architectures efficiently, modern applications rely on a high level of parallelism. To analyze and optimize these applications, the software's behavior must be tracked with minimal impact, so as to extract the time consumed by code sections as well as resource usage (e.g., network messages). In this paper, we present a method for instrumenting functions in a binary application. This method collects data at the entry and exit of a function, making it possible to analyze the execution of an application. We implemented this mechanism in EZTrace, and the evaluation shows a significant improvement over other instrumentation tools.
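The entry/exit collection described above can be sketched in a few lines. The sketch below is not EZTrace's actual binary-instrumentation mechanism (EZTrace instruments compiled applications); it uses Python's `sys.setprofile` hook to illustrate the same idea of recording timestamps at every function entry and exit, and the `work` function is purely illustrative.

```python
import sys
import time
from collections import defaultdict

# Accumulated inclusive time per function name.
timings = defaultdict(float)
_starts = {}

def _hook(frame, event, arg):
    """Record a timestamp at function entry; accumulate elapsed time at exit."""
    name = frame.f_code.co_name
    if event == "call":
        _starts[id(frame)] = time.perf_counter()
    elif event == "return":
        t0 = _starts.pop(id(frame), None)
        if t0 is not None:
            timings[name] += time.perf_counter() - t0

def traced(fn, *args, **kwargs):
    """Run fn with the entry/exit hook installed, then remove it."""
    sys.setprofile(_hook)
    try:
        return fn(*args, **kwargs)
    finally:
        sys.setprofile(None)

def work():
    return sum(i * i for i in range(10000))

result = traced(work)
```

After the call, `timings["work"]` holds the inclusive time spent in `work`, which is the kind of per-function data the paper collects at entry and exit.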
EZTrace: a generic framework for performance analysis
Poster session. Modern supercomputers with multi-core nodes enhanced by accelerators, together with hybrid programming models, introduce more complexity into modern applications. Exploiting all the resources efficiently requires a complex analysis of application performance in order to detect time-consuming or idle sections. We present EZTrace, a generic trace-generation framework that aims at providing a simple way to analyze applications. EZTrace is based on plugins that allow it to trace different programming models, such as MPI, pthread, or OpenMP, as well as user-defined libraries or applications. The framework proceeds in two steps: one collects basic information during execution, and a post-mortem step analyzes it. This makes it possible to trace the execution of applications with low overhead, while still refining the analysis after the program has run. We also present a simple script language for EZTrace that lets the user easily define the functions to instrument without modifying the source code of the application.
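The two-step design above (cheap event collection during execution, heavier analysis afterwards) can be sketched as follows. This is a toy illustration, not EZTrace's actual trace format; the event names (`solver`, `mpi_send`) are hypothetical.

```python
import time

events = []  # (timestamp, kind, name) appended with minimal overhead

def record(kind, name):
    """Step 1: during execution, only append a timestamped event."""
    events.append((time.perf_counter(), kind, name))

def analyze(trace):
    """Step 2: post-mortem, match entry/exit pairs into per-function durations."""
    stack, durations = [], {}
    for ts, kind, name in trace:
        if kind == "enter":
            stack.append((name, ts))
        else:
            n, t0 = stack.pop()
            durations[n] = durations.get(n, 0.0) + (ts - t0)
    return durations

# Simulated execution: a solver that performs one MPI send.
record("enter", "solver")
record("enter", "mpi_send")
record("exit", "mpi_send")
record("exit", "solver")

durations = analyze(events)
```

Keeping step 1 to a single append is what keeps runtime overhead low; all matching and aggregation is deferred to step 2, which can be re-run with different analyses on the same trace.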
Using differential execution analysis to identify thread interference
Understanding the performance of a multi-threaded application is difficult. Threads interfere when they access the same shared resource, which slows down their execution. Unfortunately, current profiling tools report which hardware components or synchronization primitives saturate, but they cannot tell whether that saturation is the cause of a performance bottleneck. In this paper, we propose a holistic metric able to pinpoint the blocks of code that suffer most from interference, regardless of the interference's cause. Our metric uses performance variation as a universal indicator of interference problems. In an evaluation of 27 applications, we show that our metric identifies interference problems caused by 6 different kinds of interference in 9 applications. We were able to easily remove 7 of the bottlenecks, leading to a performance improvement of up to 9 times.
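The core of a performance-variation metric can be shown with simple arithmetic: compare per-block execution times between an uncontended baseline run and a contended run, and rank blocks by slowdown. This is a deliberately simplified sketch of the differential idea, not the paper's exact metric, and the block names and timings are invented for illustration.

```python
def interference_scores(baseline, contended):
    """Rank code blocks by slowdown between an uncontended baseline run
    and a contended run; a large ratio suggests the block suffers
    interference, whatever the underlying cause."""
    scores = {}
    for block, t_solo in baseline.items():
        t_cont = contended.get(block, t_solo)
        scores[block] = t_cont / t_solo if t_solo > 0 else 1.0
    # Highest slowdown first.
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

# Per-block execution times in seconds (illustrative numbers).
baseline = {"parse": 1.0, "hash_insert": 0.5, "log_flush": 0.2}
contended = {"parse": 1.1, "hash_insert": 4.0, "log_flush": 0.21}

ranked = interference_scores(baseline, contended)
```

Here `hash_insert` shows an 8x slowdown and would be reported first, regardless of whether the contention comes from a lock, a cache line, or the memory bus, which is the point of a cause-agnostic indicator.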
Placement of parallel applications according to affinity and topology
Computer simulation is one of the pillars of science and industry. Climate simulation, cosmology, and heart modeling are all areas whose computing-power needs are constantly growing. How, then, can these applications be scaled? Parallelization and massively parallel supercomputers are the only way to achieve this. Nevertheless, there is a price to pay, given hardware topologies that are increasingly complex, both in terms of network and of memory hierarchy. The issue of data locality becomes central: how can the distance be reduced between a processing entity and the data it needs to access? Application placement is one of the levers for addressing this problem. In this thesis, we present the TreeMatch algorithm and its application to static mapping, that is to say at application launch time, and to dynamic placement. For this second approach, we propose making a load-balancing algorithm aware of data locality. The different approaches discussed are validated by experiments on both benchmarking codes and real applications.
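Affinity-aware placement can be illustrated with a toy greedy scheme: given a process-to-process communication matrix and a flat node/core topology, co-locate the pairs that communicate most. This is only in the spirit of the problem TreeMatch solves; the actual algorithm works recursively on the hardware tree and is not the greedy heuristic shown here.

```python
import itertools

def greedy_mapping(comm, nodes, cores_per_node):
    """Toy affinity-aware mapping: greedily place the most heavily
    communicating process pairs on the same node (illustrative
    heuristic, not the TreeMatch algorithm)."""
    n = len(comm)
    # Process pairs, heaviest communication first.
    pairs = sorted(itertools.combinations(range(n), 2),
                   key=lambda p: -comm[p[0]][p[1]])
    placement = {}          # process -> node
    load = [0] * nodes      # cores used on each node
    for a, b in pairs:
        for p in (a, b):
            if p not in placement:
                # Prefer the partner's node if it still has a free core.
                other = b if p == a else a
                target = placement.get(other)
                if target is None or load[target] >= cores_per_node:
                    target = min(range(nodes), key=lambda x: load[x])
                placement[p] = target
                load[target] += 1
    return placement

# 4 processes, 2 nodes with 2 cores each; 0<->1 and 2<->3 exchange heavily.
comm = [[0, 10, 1, 1],
        [10, 0, 1, 1],
        [1, 1, 0, 10],
        [1, 1, 10, 0]]
mapping = greedy_mapping(comm, nodes=2, cores_per_node=2)
```

With this matrix, processes 0 and 1 end up on one node and 2 and 3 on the other, so the heavy traffic stays intra-node, which is the locality effect the thesis exploits.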
Continuous and efficient lock profiling for Java on multicore architectures
Today, the processing of large datasets is generally parallelized and performed on computers with many cores. However, locks can serialize the execution of these cores and hurt latency and processing throughput. Spotting these lock-contention issues in vitro (i.e., during the development phase) is complex, because it is difficult to reproduce a production environment, to create a realistic workload representative of the context in which the software is used, and to test every possible deployment configuration in which the software will run. This thesis introduces Free Lunch, a lock profiler that diagnoses phases of high lock contention in vivo (i.e., during the operational phase). Free Lunch is designed around a new metric, the Critical Section Pressure (CSP), which evaluates the impact of lock contention on overall thread progress. Free Lunch is integrated into the Hotspot JVM in order to minimize overhead, and it regularly reports the CSP during execution in order to detect transient lock-related issues. Free Lunch is evaluated on 31 benchmarks from DaCapo 9.12, SPECjvm2008, and SPECjbb2005, as well as on the Cassandra database. We were able to pinpoint phases of lock contention in 6 applications, some of which were not detected by existing profilers. With this information, we improved the performance of Xalan by 15% by rewriting a single line of code, and we identified a phase of high lock contention in Cassandra during the replay of transactions after a node crash. Free Lunch never degraded performance by more than 6%, which makes it suitable for continuous deployment in an operational environment.
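The intuition behind a CSP-style metric can be shown with plain arithmetic: relate the time threads spend blocked waiting for locks to their total time. This is a simplified sketch, not the thesis's exact definition (Free Lunch computes the metric over sampling intervals inside the Hotspot JVM), and the recorded times below are invented.

```python
def critical_section_pressure(threads):
    """Simplified CSP-like ratio: fraction of cumulative thread time
    spent blocked waiting for locks rather than making progress."""
    blocked = sum(t["blocked"] for t in threads)
    total = sum(t["blocked"] + t["running"] for t in threads)
    return blocked / total if total else 0.0

# Per-thread time in seconds over one reporting interval (illustrative).
threads = [
    {"running": 8.0, "blocked": 2.0},
    {"running": 5.0, "blocked": 5.0},
]
csp = critical_section_pressure(threads)
```

Reporting such a ratio periodically, rather than once at the end of the run, is what lets the profiler expose transient contention phases like the post-crash replay phase found in Cassandra.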
Contributions to improving the scalability of parallel simulations of turbulent plasmas
Energy needs around the world keep increasing, while the resources needed to produce fossil energy are depleted year after year. One alternative way to produce energy is nuclear fusion through magnetic confinement. Mastering this reaction is a challenge and an active field of current research. To improve our understanding of the phenomena that occur during a fusion reaction, both experiment and simulation are put to use. Experiments performed in tokamaks provide experimental measurements; the measurement process is highly complex and requires the most advanced available technologies. Currently, these measurements do not give access to all the time and space scales of the physical phenomena. Numerical simulation makes it possible to explore the scales that are still unreachable by experiment. Realistic simulations demand extreme computing power, so High Performance Computing (HPC) is necessary; this means using large computing machines, also known as supercomputers. The work carried out in this thesis focuses on the optimization of the Gysela code, which simulates plasma turbulence. Optimizing a scientific application mainly targets one of the following three points: (i) the simulation of larger meshes, (ii) the reduction of computing time, and (iii) the improvement of computation accuracy. The first part of this manuscript presents the contributions related to simulating larger meshes. As in many simulation codes, obtaining more realistic simulations often amounts to refining the meshes, and the finer the mesh, the larger the memory consumption. Moreover, over the last few years, supercomputers have tended to provide less and less memory per compute core. For these reasons, we developed a library, libMTM (Modeling and Tracing Memory), dedicated to the precise study of the memory consumption of parallel software. The libMTM tools allowed us to reduce the memory consumption of Gysela and to study its scalability. As far as we know, no other tool provides equivalent features for such a precise study of memory scalability. The second part of the manuscript presents the work on optimizing the computation time and improving the accuracy of the gyroaverage operator. This operator is a cornerstone of the gyrokinetic model used by the Gysela application. The improvement in accuracy comes from a change in the computing method: a scheme based on 2D Hermite interpolation replaces the Padé approximation. Although the new version of the gyroaverage operator is more accurate, it is also more expensive in computation time than the former one. To keep simulation times reasonable, different optimizations were performed on the new computing method to make it competitive. Finally, we developed an MPI-parallelized version of the new gyroaverage operator. The good scalability of this new gyroaverage operator will eventually allow a reduction of the MPI communication costs that penalize Gysela.
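The kind of bookkeeping a memory-tracing library like libMTM performs can be sketched with a small allocation tracker that records current and peak consumption per labeled allocation. This is only an illustration of the idea; the real library instruments a parallel production code, and the labels and sizes below are hypothetical.

```python
class MemoryTracker:
    """Toy allocation tracker: record current and peak memory
    attributed to labeled allocations (illustrative sketch of
    the bookkeeping a tool like libMTM performs)."""
    def __init__(self):
        self.current = 0        # bytes currently allocated
        self.peak = 0           # high-water mark in bytes
        self.by_label = {}      # bytes currently held per label

    def allocate(self, label, nbytes):
        self.current += nbytes
        self.by_label[label] = self.by_label.get(label, 0) + nbytes
        self.peak = max(self.peak, self.current)

    def free(self, label, nbytes):
        self.current -= nbytes
        self.by_label[label] -= nbytes

tracker = MemoryTracker()
tracker.allocate("distribution_function", 5 * 2**20)  # 5 MiB grid
tracker.allocate("gyroaverage_buffer", 2 * 2**20)     # 2 MiB workspace
tracker.free("gyroaverage_buffer", 2 * 2**20)
```

The peak (here 7 MiB, even though only 5 MiB remains live) is the quantity that matters when memory per core shrinks, since it determines whether a finer mesh still fits on the node.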