5,608 research outputs found

    Simulation of MPI applications with time-independent traces

    Analyzing and understanding the performance behavior of parallel applications on parallel computing platforms is a long-standing concern in the High Performance Computing community. When the targeted platforms are not available, simulation is a reasonable approach to obtain objective performance indicators and explore various hypothetical scenarios. In the context of applications implemented with the Message Passing Interface, two simulation methods have been proposed, on-line simulation and off-line simulation, each with its own drawbacks and advantages. In this work we present an off-line simulation framework, i.e., one that simulates the execution of an application based on event traces obtained from an actual execution. The main novelty of this work, when compared to previously proposed off-line simulators, is that the traces that drive the simulation can be acquired on large, distributed, heterogeneous, and non-dedicated platforms. As a result, the scalability of trace acquisition is increased, which is achieved by enforcing that traces contain no time-related information. Moreover, our framework is based on a state-of-the-art scalable, fast, and validated simulation kernel. We introduce the notion of performing off-line simulation from time-independent traces, propose and evaluate several trace acquisition strategies, describe our simulation framework, and assess its quality in terms of trace acquisition scalability, simulation accuracy, and simulation time.

    Evaluation of Profiling Tools for the Acquisition of Time Independent Traces

    In a previous work, we proposed a framework for the off-line simulation of MPI applications. Its main originality with regard to the literature is to rely on time-independent execution traces, which are an original way to estimate the performance of parallel applications. To acquire time-independent traces of the execution of MPI applications, we have to instrument them to log the necessary information. Many profiling tools exist that can instrument an application. In this report we propose a scoring system that corresponds to our framework-specific requirements and use it to evaluate the best-known open-source profiling tools. Furthermore, we introduce an original tool called Minimal Instrumentation that was designed to fulfill the requirements of our framework.

    Assessing the Performance of MPI Applications Through Time-Independent Trace Replay

    Simulation is a popular approach to obtain objective performance indicators on platforms that are not at one's disposal. It may help the dimensioning of compute clusters in large computing centers. In this work we present a framework for the off-line simulation of MPI applications. Its main originality with regard to the literature is to rely on time-independent execution traces. This allows us to completely decouple the acquisition process from the actual replay of the traces in a simulation context. We are then able to acquire traces for large application instances without being limited to an execution on a single compute cluster. Finally, our framework is built on top of a scalable, fast, and validated simulation kernel. In this paper, we present the time-independent trace format used, investigate several acquisition strategies, detail the trace replay tool we developed, and assess the quality of our simulation framework in terms of accuracy, acquisition time, simulation time, and trace size.

    Improving the Accuracy and Efficiency of Time-Independent Trace Replay

    Simulation is a popular approach to obtain objective performance indicators on platforms that are not at one's disposal. It may help the dimensioning of compute clusters in large computing centers. In a previous work, we proposed a framework for the off-line simulation of MPI applications. Its main originality with regard to the literature is to rely on time-independent execution traces. This allows us to completely decouple the acquisition process from the actual replay of the traces in a simulation context. We are then able to acquire traces for large application instances without being limited to an execution on a single compute cluster. Finally, our framework is built on top of a scalable, fast, and validated simulation kernel. In this paper, we detail the performance issues that we encountered with the first implementation of our trace replay framework. We propose several modifications to address these issues and analyze their impact. Results show a clear improvement in accuracy and efficiency with regard to the initial implementation.

    Platform independent profiling of a QCD code

    The supercomputing platforms available for high performance computing based research evolve at a great rate. However, this rapid development of novel technologies requires constant adaptation and optimization of existing codes for each new machine architecture. In such a context, minimizing the time needed to port a code efficiently to a new platform is of crucial importance. A possible solution to this common challenge is to use simulations of the application that can assist in detecting performance bottlenecks. Due to the prohibitive costs of classical cycle-accurate simulators, coarse-grain simulations are more suitable for large parallel and distributed systems. We present a procedure for profiling the openQCD code [1] through simulation, which will reduce the overall cost of profiling and optimizing this code, commonly used in the lattice QCD community. Our approach is based on the well-known SimGrid simulator [2], which allows for fast and accurate performance predictions of HPC codes. Additionally, accurate estimations of the program behavior on some future machines, not yet accessible to us, are anticipated.

    Performance and Power Analysis of HPC Workloads on Heterogeneous Multi-Node Clusters

    Performance analysis tools allow application developers to identify and characterize the inefficiencies that cause performance degradation in their codes, allowing for application optimizations. Due to the increasing interest in the High Performance Computing (HPC) community towards energy-efficiency issues, it is of paramount importance to be able to correlate performance and power figures within the same profiling and analysis tools. For this reason, we present a performance and energy-efficiency study aimed at demonstrating how a single tool can be used to collect most of the relevant metrics. In particular, we show how the same analysis techniques can be applied to different architectures, analyzing the same HPC application on a high-end and a low-power cluster. The former cluster embeds Intel Haswell CPUs and NVIDIA K80 GPUs, while the latter is made up of NVIDIA Jetson TX1 boards, each hosting an Arm Cortex-A57 CPU and an NVIDIA Tegra X1 Maxwell GPU. The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc projects [17], grant agreements n. 288777, 610402 and 671697. E.C. was partially funded by "Contributo 5 per mille assegnato all'Università degli Studi di Ferrara-dichiarazione dei redditi dell'anno 2014". We thank the University of Ferrara and INFN Ferrara for the access to the COKA Cluster. We warmly thank the BSC tools group for supporting the smooth integration and testing of our setup within Extrae and Paraver.