Runtime function instrumentation with EZTrace
High-performance computing relies more and more on complex hardware: multiple computers, multi-processor computers, multi-core processing units, multiple general-purpose graphics processing units... To exploit the power of current computing architectures efficiently, modern applications rely on a high level of parallelism. To analyze and optimize these applications, the software's behavior must be tracked with minimal impact, so as to extract the time consumed by code sections as well as resource usage (e.g., network messages). In this paper, we present a method for instrumenting functions in a binary application. This method collects data at the entry and exit of a function, making it possible to analyze the execution of an application. We implemented this mechanism in EZTrace, and the evaluation shows a significant improvement over other instrumentation tools.
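The entry/exit collection described above can be sketched in a few lines. The sketch below is not EZTrace's actual binary-instrumentation mechanism (EZTrace instruments compiled applications); it uses Python's `sys.setprofile` hook to illustrate the same idea of recording timestamps at every function entry and exit, and the `work` function is purely illustrative.

```python
import sys
import time
from collections import defaultdict

# Accumulated inclusive time per function name.
timings = defaultdict(float)
_starts = {}

def _hook(frame, event, arg):
    """Record a timestamp at function entry; accumulate elapsed time at exit."""
    name = frame.f_code.co_name
    if event == "call":
        _starts[id(frame)] = time.perf_counter()
    elif event == "return":
        t0 = _starts.pop(id(frame), None)
        if t0 is not None:
            timings[name] += time.perf_counter() - t0

def traced(fn, *args, **kwargs):
    """Run fn with the entry/exit hook installed, then remove it."""
    sys.setprofile(_hook)
    try:
        return fn(*args, **kwargs)
    finally:
        sys.setprofile(None)

def work():
    return sum(i * i for i in range(10000))

result = traced(work)
```

After the call, `timings["work"]` holds the inclusive time spent in `work`, which is the kind of per-function data the paper collects at entry and exit.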
EZTrace: a generic framework for performance analysis
Poster session. Modern supercomputers with multi-core nodes enhanced by accelerators, together with hybrid programming models, introduce more complexity into modern applications. Exploiting all the resources efficiently requires a complex analysis of application performance in order to detect time-consuming or idle sections. We present EZTrace, a generic trace-generation framework that aims at providing a simple way to analyze applications. EZTrace is based on plugins that allow it to trace different programming models, such as MPI, pthread, or OpenMP, as well as user-defined libraries or applications. The framework proceeds in two steps: one collects basic information during execution, and a post-mortem step analyzes it. This makes it possible to trace the execution of applications with low overhead, while still refining the analysis after the program has run. We also present a simple script language for EZTrace that lets the user easily define the functions to instrument without modifying the source code of the application.
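The two-step design above (cheap event collection during execution, heavier analysis afterwards) can be sketched as follows. This is a toy illustration, not EZTrace's actual trace format; the event names (`solver`, `mpi_send`) are hypothetical.

```python
import time

events = []  # (timestamp, kind, name) appended with minimal overhead

def record(kind, name):
    """Step 1: during execution, only append a timestamped event."""
    events.append((time.perf_counter(), kind, name))

def analyze(trace):
    """Step 2: post-mortem, match entry/exit pairs into per-function durations."""
    stack, durations = [], {}
    for ts, kind, name in trace:
        if kind == "enter":
            stack.append((name, ts))
        else:
            n, t0 = stack.pop()
            durations[n] = durations.get(n, 0.0) + (ts - t0)
    return durations

# Simulated execution: a solver that performs one MPI send.
record("enter", "solver")
record("enter", "mpi_send")
record("exit", "mpi_send")
record("exit", "solver")

durations = analyze(events)
```

Keeping step 1 to a single append is what keeps runtime overhead low; all matching and aggregation is deferred to step 2, which can be re-run with different analyses on the same trace.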
Using differential execution analysis to identify thread interference
Understanding the performance of a multi-threaded application is difficult. Threads interfere when they access the same shared resource, which slows down their execution. Unfortunately, current profiling tools report which hardware components or synchronization primitives saturate, but they cannot tell whether that saturation is the cause of a performance bottleneck. In this paper, we propose a holistic metric able to pinpoint the blocks of code that suffer most from interference, regardless of the interference's cause. Our metric uses performance variation as a universal indicator of interference problems. In an evaluation of 27 applications, we show that our metric identifies interference problems caused by 6 different kinds of interference in 9 applications. We were able to easily remove 7 of the bottlenecks, leading to a performance improvement of up to 9 times.
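The core of a performance-variation metric can be shown with simple arithmetic: compare per-block execution times between an uncontended baseline run and a contended run, and rank blocks by slowdown. This is a deliberately simplified sketch of the differential idea, not the paper's exact metric, and the block names and timings are invented for illustration.

```python
def interference_scores(baseline, contended):
    """Rank code blocks by slowdown between an uncontended baseline run
    and a contended run; a large ratio suggests the block suffers
    interference, whatever the underlying cause."""
    scores = {}
    for block, t_solo in baseline.items():
        t_cont = contended.get(block, t_solo)
        scores[block] = t_cont / t_solo if t_solo > 0 else 1.0
    # Highest slowdown first.
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

# Per-block execution times in seconds (illustrative numbers).
baseline = {"parse": 1.0, "hash_insert": 0.5, "log_flush": 0.2}
contended = {"parse": 1.1, "hash_insert": 4.0, "log_flush": 0.21}

ranked = interference_scores(baseline, contended)
```

Here `hash_insert` shows an 8x slowdown and would be reported first, regardless of whether the contention comes from a lock, a cache line, or the memory bus, which is the point of a cause-agnostic indicator.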
Placement of parallel applications according to affinity and topology
Computer simulation is one of the pillars of science and industry. Climate simulation, cosmology, and heart modeling are all areas whose computing-power needs are constantly growing. How, then, can these applications be scaled? Parallelization and massively parallel supercomputers are the only way to achieve this. Nevertheless, there is a price to pay, given hardware topologies that are increasingly complex, both in terms of network and of memory hierarchy. The issue of data locality becomes central: how can the distance be reduced between a processing entity and the data it needs to access? Application placement is one of the levers for addressing this problem. In this thesis, we present the TreeMatch algorithm and its application to static mapping, that is to say at application launch time, and to dynamic placement. For this second approach, we propose making a load-balancing algorithm aware of data locality. The different approaches discussed are validated by experiments on both benchmarking codes and real applications.
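Affinity-aware placement can be illustrated with a toy greedy scheme: given a process-to-process communication matrix and a flat node/core topology, co-locate the pairs that communicate most. This is only in the spirit of the problem TreeMatch solves; the actual algorithm works recursively on the hardware tree and is not the greedy heuristic shown here.

```python
import itertools

def greedy_mapping(comm, nodes, cores_per_node):
    """Toy affinity-aware mapping: greedily place the most heavily
    communicating process pairs on the same node (illustrative
    heuristic, not the TreeMatch algorithm)."""
    n = len(comm)
    # Process pairs, heaviest communication first.
    pairs = sorted(itertools.combinations(range(n), 2),
                   key=lambda p: -comm[p[0]][p[1]])
    placement = {}          # process -> node
    load = [0] * nodes      # cores used on each node
    for a, b in pairs:
        for p in (a, b):
            if p not in placement:
                # Prefer the partner's node if it still has a free core.
                other = b if p == a else a
                target = placement.get(other)
                if target is None or load[target] >= cores_per_node:
                    target = min(range(nodes), key=lambda x: load[x])
                placement[p] = target
                load[target] += 1
    return placement

# 4 processes, 2 nodes with 2 cores each; 0<->1 and 2<->3 exchange heavily.
comm = [[0, 10, 1, 1],
        [10, 0, 1, 1],
        [1, 1, 0, 10],
        [1, 1, 10, 0]]
mapping = greedy_mapping(comm, nodes=2, cores_per_node=2)
```

With this matrix, processes 0 and 1 end up on one node and 2 and 3 on the other, so the heavy traffic stays intra-node, which is the locality effect the thesis exploits.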
Continuous and efficient lock profiling for Java on multicore architectures
Today, the processing of large datasets is generally parallelized and performed on computers with many cores. However, locks can serialize the execution of these cores and hurt latency and processing throughput. Spotting these lock-contention issues in vitro (i.e., during the development phase) is complex, because it is difficult to reproduce a production environment, to create a realistic workload representative of the context in which the software is used, and to test every possible deployment configuration in which the software will run. This thesis introduces Free Lunch, a lock profiler that diagnoses phases of high lock contention in vivo (i.e., during the operational phase). Free Lunch is designed around a new metric, the Critical Section Pressure (CSP), which evaluates the impact of lock contention on overall thread progress. Free Lunch is integrated into the Hotspot JVM in order to minimize overhead, and it regularly reports the CSP during execution in order to detect transient lock-related issues. Free Lunch is evaluated on 31 benchmarks from DaCapo 9.12, SPECjvm2008, and SPECjbb2005, as well as on the Cassandra database. We were able to pinpoint phases of lock contention in 6 applications, some of which were not detected by existing profilers. With this information, we improved the performance of Xalan by 15% by rewriting a single line of code, and we identified a phase of high lock contention in Cassandra during the replay of transactions after a node crash. Free Lunch never degraded performance by more than 6%, which makes it suitable for continuous deployment in an operational environment.
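The intuition behind a CSP-style metric can be shown with plain arithmetic: relate the time threads spend blocked waiting for locks to their total time. This is a simplified sketch, not the thesis's exact definition (Free Lunch computes the metric over sampling intervals inside the Hotspot JVM), and the recorded times below are invented.

```python
def critical_section_pressure(threads):
    """Simplified CSP-like ratio: fraction of cumulative thread time
    spent blocked waiting for locks rather than making progress."""
    blocked = sum(t["blocked"] for t in threads)
    total = sum(t["blocked"] + t["running"] for t in threads)
    return blocked / total if total else 0.0

# Per-thread time in seconds over one reporting interval (illustrative).
threads = [
    {"running": 8.0, "blocked": 2.0},
    {"running": 5.0, "blocked": 5.0},
]
csp = critical_section_pressure(threads)
```

Reporting such a ratio periodically, rather than once at the end of the run, is what lets the profiler expose transient contention phases like the post-crash replay phase found in Cassandra.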
Contributions to improving the scalability of parallel simulations of turbulent plasmas
Energy needs around the world keep increasing, while the resources needed to produce fossil energy are depleted year after year. One alternative way to produce energy is nuclear fusion through magnetic confinement. Mastering this reaction is a challenge and an active field of current research. To improve our understanding of the phenomena that occur during a fusion reaction, both experiment and simulation are put to use. Experiments performed in tokamaks provide experimental measurements; the measurement process is highly complex and requires the most advanced available technologies. Currently, these measurements do not give access to all the time and space scales of the physical phenomena. Numerical simulation makes it possible to explore the scales that are still unreachable by experiment. Realistic simulations demand extreme computing power, so High Performance Computing (HPC) is necessary; this means using large computing machines, also known as supercomputers. The work carried out in this thesis focuses on the optimization of the Gysela code, which simulates plasma turbulence. Optimizing a scientific application mainly targets one of the following three points: (i) the simulation of larger meshes, (ii) the reduction of computing time, and (iii) the improvement of computation accuracy. The first part of this manuscript presents the contributions related to simulating larger meshes. As in many simulation codes, obtaining more realistic simulations often amounts to refining the meshes, and the finer the mesh, the larger the memory consumption. Moreover, over the last few years, supercomputers have tended to provide less and less memory per compute core. For these reasons, we developed a library, libMTM (Modeling and Tracing Memory), dedicated to the precise study of the memory consumption of parallel software. The libMTM tools allowed us to reduce the memory consumption of Gysela and to study its scalability. As far as we know, no other tool provides equivalent features for such a precise study of memory scalability. The second part of the manuscript presents the work on optimizing the computation time and improving the accuracy of the gyroaverage operator. This operator is a cornerstone of the gyrokinetic model used by the Gysela application. The improvement in accuracy comes from a change in the computing method: a scheme based on 2D Hermite interpolation replaces the Padé approximation. Although the new version of the gyroaverage operator is more accurate, it is also more expensive in computation time than the former one. To keep simulation times reasonable, different optimizations were performed on the new computing method to make it competitive. Finally, we developed an MPI-parallelized version of the new gyroaverage operator. The good scalability of this new gyroaverage operator will eventually allow a reduction of the MPI communication costs that penalize Gysela.
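The kind of bookkeeping a memory-tracing library like libMTM performs can be sketched with a small allocation tracker that records current and peak consumption per labeled allocation. This is only an illustration of the idea; the real library instruments a parallel production code, and the labels and sizes below are hypothetical.

```python
class MemoryTracker:
    """Toy allocation tracker: record current and peak memory
    attributed to labeled allocations (illustrative sketch of
    the bookkeeping a tool like libMTM performs)."""
    def __init__(self):
        self.current = 0        # bytes currently allocated
        self.peak = 0           # high-water mark in bytes
        self.by_label = {}      # bytes currently held per label

    def allocate(self, label, nbytes):
        self.current += nbytes
        self.by_label[label] = self.by_label.get(label, 0) + nbytes
        self.peak = max(self.peak, self.current)

    def free(self, label, nbytes):
        self.current -= nbytes
        self.by_label[label] -= nbytes

tracker = MemoryTracker()
tracker.allocate("distribution_function", 5 * 2**20)  # 5 MiB grid
tracker.allocate("gyroaverage_buffer", 2 * 2**20)     # 2 MiB workspace
tracker.free("gyroaverage_buffer", 2 * 2**20)
```

The peak (here 7 MiB, even though only 5 MiB remains live) is the quantity that matters when memory per core shrinks, since it determines whether a finer mesh still fits on the node.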