
    Runtime function instrumentation with EZTrace

    High-performance computing relies more and more on complex hardware: multiple computers, multi-processor computers, multi-core processing units, multiple general-purpose graphics processing units... To exploit the power of current computing architectures efficiently, modern applications rely on a high level of parallelism. To analyze and optimize these applications, the software's behavior must be tracked with minimal impact on the software itself, so as to extract the time consumed by code sections as well as resource usage (e.g., network messages). In this paper, we present a method for instrumenting functions in a binary application. This method collects data at the entry and the exit of a function, making it possible to analyze the execution of an application. We implemented this mechanism in EZTrace, and our evaluation shows a significant improvement over other instrumentation tools.
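    The entry/exit data collection described above can be sketched in Python, where `sys.setprofile` delivers "call" and "return" events much as binary instrumentation hooks function entries and exits. This is a hedged analogue, not EZTrace's actual mechanism, and the event-tuple layout is purely illustrative.

```python
import sys
import time

# Record a timestamped event at every function entry ("call") and
# exit ("return"); builtin-function events ("c_call" etc.) are ignored.
events = []

def tracer(frame, event, arg):
    if event in ("call", "return"):
        events.append((event, frame.f_code.co_name, time.perf_counter()))

def compute(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

sys.setprofile(tracer)
compute(1000)
sys.setprofile(None)

# Every instrumented call yields a matching entry/exit pair, from which
# per-function time consumption can be derived.
enters = [e for e in events if e[0] == "call" and e[1] == "compute"]
exits = [e for e in events if e[0] == "return" and e[1] == "compute"]
print(len(enters), len(exits))
```

    Subtracting the entry timestamp from the exit timestamp for each pair gives the time spent in the function, which is exactly the data the analysis step consumes.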

    EZTrace: a generic framework for performance analysis

    Modern supercomputers with multi-core nodes enhanced by accelerators, as well as hybrid programming models, introduce more complexity into modern applications. Exploiting all the resources efficiently requires a complex analysis of application performance in order to detect time-consuming or idle sections. We present EZTrace, a generic trace-generation framework that aims at providing a simple way to analyze applications. EZTrace is based on plugins that allow it to trace different programming models such as MPI, pthread, or OpenMP, as well as user-defined libraries or applications. The framework works in two steps: one that collects basic information during execution, and a post-mortem analysis. This makes it possible to trace the execution of applications with low overhead while allowing the analysis to be refined after the program has run. We also present a simple script language for EZTrace that lets the user easily define the functions to instrument without modifying the source code of the application.
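    The two-step approach can be illustrated with a minimal sketch: step 1 appends cheap raw events during execution, and step 2 pairs them into durations post-mortem. The event layout and helper names here are assumptions for illustration, not EZTrace's API.

```python
import time

# Step 1: record minimal raw events during execution (cheap appends).
trace = []  # (timestamp, "enter"/"leave", section name)

def record(kind, name):
    trace.append((time.perf_counter(), kind, name))

def traced_section(name, fn):
    record("enter", name)
    fn()
    record("leave", name)

traced_section("init", lambda: time.sleep(0.01))
traced_section("solve", lambda: time.sleep(0.02))

# Step 2: post-mortem analysis pairs enter/leave events into durations,
# so heavier processing happens after the program has run.
durations = {}
stack = []
for ts, kind, name in trace:
    if kind == "enter":
        stack.append((name, ts))
    else:
        open_name, t0 = stack.pop()
        durations[open_name] = ts - t0

print(sorted(durations))
```

    Keeping step 1 to raw appends is what keeps the runtime overhead low; any expensive aggregation is deferred to step 2.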

    Using differential execution analysis to identify thread interference

    Understanding the performance of a multi-threaded application is difficult. Threads interfere when they access the same shared resource, which slows down their execution. Unfortunately, current profiling tools report the hardware components or synchronization primitives that saturate, but they cannot tell whether that saturation is the cause of a performance bottleneck. In this paper, we propose a holistic metric able to pinpoint the blocks of code that suffer the most from interference, regardless of its cause. Our metric uses performance variation as a universal indicator of interference problems. In an evaluation of 27 applications, we show that our metric can identify interference problems caused by 6 different kinds of interference in 9 applications. We were able to easily remove 7 of the bottlenecks, which leads to a performance improvement of up to 9 times.
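    The idea of performance variation as an interference indicator can be sketched as follows. The exact formula below (average slowdown relative to the fastest execution of the same code block) is an illustrative assumption, not the paper's actual metric.

```python
# Score a code block from the durations of its repeated executions:
# a block whose executions all take similar time shows little
# interference; large run-to-run variation suggests contention.
def interference_score(durations):
    best = min(durations)
    return sum((d - best) / best for d in durations) / len(durations)

# Hypothetical per-execution timings (seconds) of two code blocks.
quiet = [1.00, 1.01, 1.02, 1.00]
contended = [1.00, 1.80, 2.50, 1.20]

print(round(interference_score(quiet), 4))
print(round(interference_score(contended), 4))
```

    The appeal of such a metric is that it is cause-agnostic: lock contention, cache thrashing, or memory-bus saturation all manifest as variation in the block's execution time.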

    Placement d'applications parallÚles en fonction de l'affinité et de la topologie

    Computer simulation is one of the pillars of science and industry. Climate simulation, cosmology, and heart modeling are all areas in which computing-power needs are constantly growing. How, then, can these applications be scaled up? Parallelization and massively parallel supercomputers are the only way to achieve this. Nevertheless, there is a price to pay, given hardware topologies that are increasingly complex, both in terms of the network and of the memory hierarchy. The issue of data locality becomes central: how can we reduce the distance between a processing entity and the data it needs to access? Application placement is one of the levers for addressing this problem. In this thesis, we present the TreeMatch algorithm and its application to static mapping, that is to say at application launch time, and to dynamic placement. For this second approach, we propose making a load-balancing algorithm aware of data locality. The different approaches discussed are validated by experiments on both benchmarking codes and real applications.
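    A toy illustration of affinity-based placement: group the pairs of processes that exchange the most data onto the same node. This greedy pairing is a drastic simplification introduced here for illustration; TreeMatch itself matches the communication pattern against the whole hierarchical topology tree.

```python
# Greedily co-locate the most-communicating process pairs.
def greedy_pairing(comm, n_procs):
    """comm[(i, j)] = communication volume between processes i and j."""
    pairs = sorted(comm.items(), key=lambda kv: -kv[1])
    placed, nodes = set(), []
    for (i, j), _volume in pairs:
        if i not in placed and j not in placed:
            nodes.append((i, j))  # put i and j on the same node
            placed.update((i, j))
    # Any process left over gets a slot of its own.
    for p in range(n_procs):
        if p not in placed:
            nodes.append((p,))
    return nodes

# Processes 0-1 and 2-3 communicate heavily, so each pair shares a node.
comm = {(0, 1): 100, (2, 3): 90, (0, 2): 5, (1, 3): 1}
print(greedy_pairing(comm, 4))
```

    The intuition carries over to the real algorithm: reducing the topological distance between entities that communicate heavily reduces communication cost.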

    Profilage continu et efficient de verrous pour Java pour les architectures multicƓurs

    Today, the processing of large datasets is generally parallelized and performed on computers with many cores. However, locks can serialize the execution of these cores and hurt both the latency and the throughput of the processing. Spotting these lock contention issues in vitro (i.e., during the development phase) is complex, because it is difficult to reproduce a production environment, to create a realistic workload representative of the context in which the software is used, and to test every possible deployment configuration in which the software will run. This thesis introduces Free Lunch, a lock profiler that diagnoses phases of high lock contention in vivo (i.e., during the operational phase). Free Lunch is designed around a new metric, the Critical Section Pressure (CSP), which aims to evaluate the impact of lock contention on overall thread progress. Free Lunch is integrated into HotSpot in order to minimize the overhead, and it regularly reports the CSP during execution in order to detect temporary lock-related issues. Free Lunch is evaluated over 31 benchmarks from Dacapo 9.12, SpecJVM08, and SpecJBB2005, and over the Cassandra database. We were able to pinpoint phases of lock contention in 6 applications, some of which were not detected by existing profilers. With this information, we improved the performance of Xalan by 15% just by rewriting one line of code, and we identified a phase of high lock contention in Cassandra during the replay of transactions after a node crash. Free Lunch never degraded performance by more than 6%, which makes it suitable for continuous deployment in an operational environment.
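    The core of the CSP idea, the fraction of thread time lost to waiting for locks, can be sketched with a toy computation. The sampled-state representation below is a made-up illustration of the metric; the real Free Lunch measures this inside the HotSpot JVM, not by sampling.

```python
# CSP over a window: the share of thread-time samples spent blocked
# waiting for a lock, out of all samples in the window.
def csp(samples):
    """samples: per-tick thread states, 'running' or 'blocked'."""
    blocked = sum(1 for s in samples if s == "blocked")
    return blocked / len(samples)

# One sample per thread per tick over a window of execution:
# 4 of the 10 samples are spent blocked on locks.
window = ["running", "blocked", "blocked", "running", "running",
          "blocked", "running", "running", "running", "blocked"]
print(csp(window))
```

    Reporting the metric per window, rather than once at program exit, is what lets transient contention phases (such as Cassandra's crash-recovery replay) be detected.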

    Contributions à l'amélioration de l'extensibilité de simulations parallÚles de plasmas turbulents

    Energy needs around the world keep increasing, while the resources needed to produce fossil energy are being depleted year after year. An alternative way to produce energy is nuclear fusion through magnetic confinement. Mastering this reaction is a challenge and an active field of current research. To improve our understanding of the phenomena that occur during a fusion reaction, both experiment and simulation are put to use. Experiments performed in tokamaks provide experimental measurements; the measurement process is highly complex and requires the most advanced technologies available. Currently, these measurements do not give access to all the time and space scales of the physical phenomena. Numerical simulation makes it possible to explore the scales that are still out of reach of experiment. Extreme computing power is mandatory to perform realistic simulations, which means resorting to High Performance Computing (HPC) and thus to large machines known as supercomputers. The work carried out in this thesis focuses on the optimization of the Gysela code, which simulates plasma turbulence. Optimizing a scientific application mainly concerns one of the following three points: (i) simulating larger meshes, (ii) reducing the computation time, and (iii) improving the accuracy of the computation. The first part of this manuscript presents the contributions related to the simulation of larger meshes. As in many simulation codes, making simulations more realistic often amounts to refining the meshes, and the finer the mesh, the larger the memory consumption. Moreover, in recent years, supercomputers have tended to provide less and less memory per compute core. For these reasons, we developed a library, libMTM (Modeling and Tracing Memory), dedicated to studying precisely the memory consumption of parallel software. The libMTM tools allowed us to reduce the memory consumption of Gysela and to study its scalability. As far as we know, no other tool provides equivalent features enabling such a memory-scalability study. The second part of the manuscript presents the work related to optimizing the computation time and improving the accuracy of the gyroaverage operator. This operator is a cornerstone of the gyrokinetic model used by the Gysela application. The improvement in accuracy comes from a change in the computing method: a scheme based on 2D Hermite interpolation replaces the Padé approximation. Although the new version of the gyroaverage operator is more accurate, it is also more expensive in computation time than the former one. In order to keep simulation times reasonable, different optimizations were performed on the new computing method to make it competitive. Finally, we developed an MPI-parallelized version of the new gyroaverage operator. The good scalability of this new operator will eventually allow a reduction of the MPI communication costs, which are a penalty in Gysela.
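    The kind of measurement a memory-tracing library like libMTM provides, the peak memory consumed by a section of code, can be demonstrated with the Python standard library's tracemalloc as a stand-in; this is not libMTM's mechanism, merely an analogous measurement.

```python
import tracemalloc

# Measure the peak heap consumption of a code section: allocate about
# 1 MiB of live buffers and read back current and peak traced memory.
tracemalloc.start()
data = [bytearray(1024) for _ in range(1000)]  # ~1 MiB of live buffers
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(len(data), peak >= 1000 * 1024)
```

    Measuring peaks per code section, rather than a single whole-run figure, is what makes it possible to study how an application's memory footprint scales with the mesh size and the number of cores.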