36 research outputs found

    07341 Abstracts Collection -- Code Instrumentation and Modeling for Parallel Performance Analysis

    Get PDF
    From 20th to 24th August 2007, the Dagstuhl Seminar 07341 ``Code Instrumentation and Modeling for Parallel Performance Analysis\u27\u27 was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar as well as abstracts of seminar results and ideas are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided, if available

    Trace-based Performance Analysis for Hardware Accelerators

    Get PDF
    This thesis presents how performance data from hardware accelerators can be included in event logs. It extends the capabilities of trace-based performance analysis to also monitor and record data from this novel parallelization layer. The increasing awareness to power consumption of computing devices has led to an interest in hybrid computing architectures as well. High-end computers, workstations, and mobile devices start to employ hardware accelerators to offload computationally intense and parallel tasks, while at the same time retaining a highly efficient scalar compute unit for non-parallel tasks. This execution pattern is typically asynchronous so that the scalar unit can resume other work while the hardware accelerator is busy. Performance analysis tools provided by the hardware accelerator vendors cover the situation of one host using one device very well. Yet, they do not address the needs of the high performance computing community. This thesis investigates ways to extend existing methods for recording events from highly parallel applications to also cover scenarios in which hardware accelerators aid these applications. After introducing a generic approach that is suitable for any API based acceleration paradigm, the thesis derives a suggestion for a generic performance API for hardware accelerators and its implementation with NVIDIA CUPTI. In a next step the visualization of event logs containing data from execution streams on different levels of parallelism is discussed. In order to overcome the limitations of classic performance profiles and timeline displays, a graph-based visualization using Parallel Performance Flow Graphs (PPFGs) is introduced. This novel technical approach is using program states in order to display similarities and differences between the potentially very large number of event streams and, thus, enables a fast way to spot load imbalances. The thesis concludes with the in-depth analysis of a case-study of PIConGPU---a highly parallel, multi-hybrid plasma physics simulation---that benefited greatly from the developed performance analysis methods.Diese Dissertation zeigt, wie der Ablauf von Anwendungsteilen, die auf Hardwarebeschleuniger ausgelagert wurden, als Programmspur mit aufgezeichnet werden kann. Damit wird die bekannte Technik der Leistungsanalyse von Anwendungen mittels Programmspuren so erweitert, dass auch diese neue Parallelitätsebene mit erfasst wird. Die Beschränkungen von Computersystemen bezüglich der elektrischen Leistungsaufnahme hat zu einer steigenden Anzahl von hybriden Computerarchitekturen geführt. Sowohl Hochleistungsrechner, aber auch Arbeitsplatzcomputer und mobile Endgeräte nutzen heute Hardwarebeschleuniger um rechenintensive, parallele Programmteile auszulagern und so den skalaren Hauptprozessor zu entlasten und nur für nicht parallele Programmteile zu verwenden. Dieses Ausführungsschema ist typischerweise asynchron: der Skalarprozessor kann, während der Hardwarebeschleuniger rechnet, selbst weiterarbeiten. Die Leistungsanalyse-Werkzeuge der Hersteller von Hardwarebeschleunigern decken den Standardfall (ein Host-System mit einem Hardwarebeschleuniger) sehr gut ab, scheitern aber an einer Unterstützung von hochparallelen Rechnersystemen. Die vorliegende Dissertation untersucht, in wie weit auch multi-hybride Anwendungen die Aktivität von Hardwarebeschleunigern aufzeichnen können. Dazu wird die vorhandene Methode zur Erzeugung von Programmspuren für hochparallele Anwendungen entsprechend erweitert. In dieser Untersuchung wird zuerst eine allgemeine Methodik entwickelt, mit der sich für jede API-gestützte Hardwarebeschleunigung eine Programmspur erstellen lässt. Darauf aufbauend wird eine eigene Programmierschnittstelle entwickelt, die es ermöglicht weitere leistungsrelevante Daten aufzuzeichnen. Die Umsetzung dieser Schnittstelle wird am Beispiel von NVIDIA CUPTI darstellt. Ein weiterer Teil der Arbeit beschäftigt sich mit der Darstellung von Programmspuren, welche Aufzeichnungen von den unterschiedlichen Parallelitätsebenen enthalten. Um die Einschränkungen klassischer Leistungsprofile oder Zeitachsendarstellungen zu überwinden, wird mit den parallelen Programmablaufgraphen (PPFGs) eine neue graphenbasisierte Darstellungsform eingeführt. Dieser neuartige Ansatz zeigt eine Programmspur als eine Folge von Programmzuständen mit gemeinsamen und unterchiedlichen Abläufen. So können divergierendes Programmverhalten und Lastimbalancen deutlich einfacher lokalisiert werden. Die Arbeit schließt mit der detaillierten Analyse von PIConGPU -- einer multi-hybriden Simulation aus der Plasmaphysik --, die in großem Maße von den in dieser Arbeit entwickelten Analysemöglichkeiten profiert hat

    Automatic Detection of Performance Anomalies in Task-Parallel Programs

    Full text link
    To efficiently exploit the resources of new many-core architectures, integrating dozens or even hundreds of cores per chip, parallel programming models have evolved to expose massive amounts of parallelism, often in the form of fine-grained tasks. Task-parallel languages, such as OpenStream, X10, Habanero Java and C or StarSs, simplify the development of applications for new architectures, but tuning task-parallel applications remains a major challenge. Performance bottlenecks can occur at any level of the implementation, from the algorithmic level (e.g., lack of parallelism or over-synchronization), to interactions with the operating and runtime systems (e.g., data placement on NUMA architectures), to inefficient use of the hardware (e.g., frequent cache misses or misaligned memory accesses); detecting such issues and determining the exact cause is a difficult task. In previous work, we developed Aftermath, an interactive tool for trace-based performance analysis and debugging of task-parallel programs and run-time systems. In contrast to other trace-based analysis tools, such as Paraver or Vampir, Aftermath offers native support for tasks, i.e., visualization, statistics and analysis tools adapted for performance debugging at task granularity. However, the tool currently does not provide support for the automatic detection of performance bottlenecks and it is up to the user to investigate the relevant aspects of program execution by focusing the inspection on specific slices of a trace file. In this paper, we present ongoing work on two extensions that guide the user through this process.Comment: Presented at 1st Workshop on Resource Awareness and Adaptivity in Multi-Core Computing (Racing 2014) (arXiv:1405.2281

    Advanced Memory Data Structures for Scalable Event Trace Analysis

    Get PDF
    The thesis presents a contribution to the analysis and visualization of computational performance based on event traces with a particular focus on parallel programs and High Performance Computing (HPC). Event traces contain detailed information about specified incidents (events) during run-time of programs and allow minute investigation of dynamic program behavior, various performance metrics, and possible causes of performance flaws. Due to long running and highly parallel programs and very fine detail resolutions, event traces can accumulate huge amounts of data which become a challenge for interactive as well as automatic analysis and visualization tools. The thesis proposes a method of exploiting redundancy in the event traces in order to reduce the memory requirements and the computational complexity of event trace analysis. The sources of redundancy are repeated segments of the original program, either through iterative or recursive algorithms or through SPMD-style parallel programs, which produce equal or similar repeated event sequences. The data reduction technique is based on the novel Complete Call Graph (CCG) data structure which allows domain specific data compression for event traces in a combination of lossless and lossy methods. All deviations due to lossy data compression can be controlled by constant bounds. The compression of the CCG data structure is incorporated in the construction process, such that at no point substantial uncompressed parts have to be stored. Experiments with real-world example traces reveal the potential for very high data compression. The results range from factors of 3 to 15 for small scale compression with minimum deviation of the data to factors > 100 for large scale compression with moderate deviation. Based on the CCG data structure, new algorithms for the most common evaluation and analysis methods for event traces are presented, which require no explicit decompression. By avoiding repeated evaluation of formerly redundant event sequences, the computational effort of the new algorithms can be reduced in the same extent as memory consumption. The thesis includes a comprehensive discussion of the state-of-the-art and related work, a detailed presentation of the design of the CCG data structure, an elaborate description of algorithms for construction, compression, and analysis of CCGs, and an extensive experimental validation of all components.Diese Dissertation stellt einen neuartigen Ansatz für die Analyse und Visualisierung der Berechnungs-Performance vor, der auf dem Ereignis-Tracing basiert und insbesondere auf parallele Programme und das Hochleistungsrechnen (High Performance Computing, HPC) zugeschnitten ist. Ereignis-Traces (Ereignis-Spuren) enthalten detaillierte Informationen über spezifizierte Ereignisse während der Laufzeit eines Programms und erlauben eine sehr genaue Untersuchung des dynamischen Verhaltens, verschiedener Performance-Metriken und potentieller Performance-Probleme. Aufgrund lang laufender und hoch paralleler Anwendungen und dem hohen Detailgrad kann das Ereignis-Tracing sehr große Datenmengen produzieren. Diese stellen ihrerseits eine Herausforderung für interaktive und automatische Analyse- und Visualisierungswerkzeuge dar. Die vorliegende Arbeit präsentiert eine Methode, die Redundanzen in den Ereignis-Traces ausnutzt, um sowohl die Speicheranforderungen als auch die Laufzeitkomplexität der Trace-Analyse zu reduzieren. Die Ursachen für Redundanzen sind wiederholt ausgeführte Programmabschnitte, entweder durch iterative oder rekursive Algorithmen oder durch SPMD-Parallelisierung, die gleiche oder ähnliche Ereignis-Sequenzen erzeugen. Die Datenreduktion basiert auf der neuartigen Datenstruktur der "Vollständigen Aufruf-Graphen" (Complete Call Graph, CCG) und erlaubt eine Kombination von verlustfreier und verlustbehafteter Datenkompression. Dabei können konstante Grenzen für alle Abweichungen durch verlustbehaftete Kompression vorgegeben werden. Die Datenkompression ist in den Aufbau der Datenstruktur integriert, so dass keine umfangreichen unkomprimierten Teile vor der Kompression im Hauptspeicher gehalten werden müssen. Das enorme Kompressionsvermögen des neuen Ansatzes wird anhand einer Reihe von Beispielen aus realen Anwendungsszenarien nachgewiesen. Die dabei erzielten Resultate reichen von Kompressionsfaktoren von 3 bis 5 mit nur minimalen Abweichungen aufgrund der verlustbehafteten Kompression bis zu Faktoren > 100 für hochgradige Kompression. Basierend auf der CCG_Datenstruktur werden außerdem neue Auswertungs- und Analyseverfahren für Ereignis-Traces vorgestellt, die ohne explizite Dekompression auskommen. Damit kann die Laufzeitkomplexität der Analyse im selben Maß gesenkt werden wie der Hauptspeicherbedarf, indem komprimierte Ereignis-Sequenzen nicht mehrmals analysiert werden. Die vorliegende Dissertation enthält eine ausführliche Vorstellung des Stands der Technik und verwandter Arbeiten in diesem Bereich, eine detaillierte Herleitung der neu eingeführten Daten-strukturen, der Konstruktions-, Kompressions- und Analysealgorithmen sowie eine umfangreiche experimentelle Auswertung und Validierung aller Bestandteile

    Sintonización dinámica de aplicaciones MPI

    Get PDF
    En la actualidad, la computación de altas prestaciones está siendo utilizada en multitud de campos científicos donde los distintos problemas estudiados se resuelven mediante aplicaciones paralelas/distribuidas. Estas aplicaciones requieren gran capacidad de cómputo, bien sea por la complejidad de los problemas o por la necesidad de solventar situaciones en tiempo real. Por lo tanto se debe aprovechar los recursos y altas capacidades computacionales de los sistemas paralelos en los que se ejecutan estas aplicaciones con el fin de obtener un buen rendimiento. Sin embargo, lograr este rendimiento en una aplicación ejecutándose en un sistema es una dura tarea que requiere un alto grado de experiencia, especialmente cuando se trata de aplicaciones que presentan un comportamiento dinámico o cuando se usan sistemas heterogéneos. En estos casos actualmente se plantea realizar una mejora de rendimiento automática y dinámica de las aplicaciones como mejor enfoque para el análisis del rendimiento. El presente trabajo de investigación se sitúa dentro de este ámbito de estudio y su objetivo principal es sintonizar dinámicamente mediante MATE (Monitoring, Analysis and Tuning Environment) una aplicación MPI empleada en computación de altas prestaciones que siga un paradigma Master/Worker. Las técnicas de sintonización integradas en MATE han sido desarrolladas a partir del estudio de un modelo de rendimiento que refleja los cuellos de botella propios de aplicaciones situadas bajo un paradigma Master/Worker: balanceo de carga y número de workers. La ejecución de la aplicación elegida bajo el control dinámico de MATE y de la estrategia de sintonización implementada ha permitido observar la adaptación del comportamiento de dicha aplicación a las condiciones actuales del sistema donde se ejecuta, obteniendo así una mejora de su rendimiento.En l'actualitat, la computació d'altes prestacions està sent utilitzada en multitud de camps científics on els diferents problemes estudiats es resolen mitjançant aplicacions paral·leles/distribuïdes. Aquestes aplicacions requereixen gran capacitat de còmput, bé sigui per la complexitat dels problemes o per la necessitat de solucionar situacions en temps real. Per tant s'ha d'aprofitar els recursos i altes capacitats computacionals dels sistemes paral·lels en els quals s'executen aquestes aplicacions amb la finalitat d'obtenir un bon rendiment. No obstant això, assolir aquest rendiment en una aplicació executant-se en un sistema és una tasca complexa que requereix d'un alt grau d'experiència, especialment quan es tracta d'aplicacions que presenten un comportament dinàmic o quan s'usen sistemes heterogenis. En aquests casos actualment es planteja realitzar una millora de rendiment automàtica i dinàmica de les aplicacions com la millor via per l'anàlisi del rendiment. El present treball d'investigació es situa dins d'aquest àmbit d'estudi i el seu objectiu principal és és sintonitzar dinàmicament mitjançant MATE (Monitoring, Analysis and Tuning Environment) una aplicació MPI empleada en computació d'altes prestacions que segueixi un paradigma Master/Worker. Les tècniques de sintonització integrades en MATE han estat desenvolupades a partir de l'estudi d'un model de rendiment que reflecteix els colls d'ampolla propis d'aplicacions situades sota un paradigma Master/Worker: balanceig de càrrega i nombre de workers. L'execució de l'aplicació triada sota el control dinàmic de MATE i de l'estratègia de sintonització implementada ha permès observar l'adaptació del comportament d'aquesta aplicació a les condicions actuals del sistema on s'executa, obtenint així una millora en el seu rendiment.At the present time, high performance computing is used in a multitude of scientific fields, where the problems studied are resolved using parallel/distributed applications. These applications require an enormous computing capacity due to both the complexity of the problems and the necessity to solve them in real time situations. Therefore, the computational capacities and resources of the parallel systems, where these applications are executed, must be taken advantage of to attain this vital high performance. However, achieving high performance in applications executed in parallel systems is a complicated task that requires a high degree of experience, especially when dealing with applications with dynamic behaviour or those running on heterogenous systems. In these cases the use of automatic and dynamic performance improvements is proposed as a better approach to performance analysis. The research presented falls within this field of study and has the principle objective of dynamically tuning, using MATE (Monitoring, Analysis and Tuning Environment), an MPI application which employs high performance computing following the Master/Worker paradigm. The tuning techniques integrated in MATE have been developed following a study of the performance model that reflects the bottlenecks specific to the Master/Worker paradigm: load balancing and the number of workers. The execution of the chosen application under the dynamic control of MATE using the tuning strategies implemented has permitted the observation of the behaviour of said application adapting to the changing conditions in the system where it is being executed, thus obtaining an improvement in the performance

    Concepts for In-memory Event Tracing: Runtime Event Reduction with Hierarchical Memory Buffers

    Get PDF
    This thesis contributes to the field of performance analysis in High Performance Computing with new concepts for in-memory event tracing. Event tracing records runtime events of an application and stores each with a precise time stamp and further relevant metrics. The high resolution and detailed information allows an in-depth analysis of the dynamic program behavior, interactions in parallel applications, and potential performance issues. For long-running and large-scale parallel applications, event-based tracing faces three challenges, yet unsolved: the number of resulting trace files limits scalability, the huge amounts of collected data overwhelm file systems and analysis capabilities, and the measurement bias, in particular, due to intermediate memory buffer flushes prevents a correct analysis. This thesis proposes concepts for an in-memory event tracing workflow. These concepts include new enhanced encoding techniques to increase memory efficiency and novel strategies for runtime event reduction to dynamically adapt trace size during runtime. An in-memory event tracing workflow based on these concepts meets all three challenges: First, it not only overcomes the scalability limitations due to the number of resulting trace files but eliminates the overhead of file system interaction altogether. Second, the enhanced encoding techniques and event reduction lead to remarkable smaller trace sizes. Finally, an in-memory event tracing workflow completely avoids intermediate memory buffer flushes, which minimizes measurement bias and allows a meaningful performance analysis. The concepts further include the Hierarchical Memory Buffer data structure, which incorporates a multi-dimensional, hierarchical ordering of events by common metrics, such as time stamp, calling context, event class, and function call duration. This hierarchical ordering allows a low-overhead event encoding, event reduction and event filtering, as well as new hierarchy-aided analysis requests. An experimental evaluation based on real-life applications and a detailed case study underline the capabilities of the concepts presented in this thesis. The new enhanced encoding techniques reduce memory allocation during runtime by a factor of 3.3 to 7.2, while at the same do not introduce any additional overhead. Furthermore, the combined concepts including the enhanced encoding techniques, event reduction, and a new filter based on function duration within the Hierarchical Memory Buffer remarkably reduce the resulting trace size up to three orders of magnitude and keep an entire measurement within a single fixed-size memory buffer, while still providing a coarse but meaningful analysis of the application. This thesis includes a discussion of the state-of-the-art and related work, a detailed presentation of the enhanced encoding techniques, the event reduction strategies, the Hierarchical Memory Buffer data structure, and a extensive experimental evaluation of all concepts

    EZTrace: a generic framework for performance analysis

    Get PDF
    Poster SessionInternational audienceModern supercomputers with multi-core nodes enhanced by accelerators, as well as hybrid programming models, introduce more complexity in modern applications. Exploiting efficiently all the resources requires a complex analysis of the performance of applications in order to detect time-consuming or idle sections. We present eztrace, a generic trace generation framework that aims at providing a simple way to analyze applications. eztrace is based on plugins that allow it to trace different programming models such as MPI, pthread or OpenMP as well as user-defined libraries or applications. This framework uses two steps: one to collect the basic information during execution and one post-mortem analysis. This permits tracing the execution of applications with low overhead while allowing to refine the analysis after the execution of the program. We also present a simple script language for \eztrace that gives the user the opportunity to easily define the functions to instrument without modifying the source code of the application

    Understanding communication patterns in HPCG

    Get PDF
    Conjugate Gradient (CG) algorithms form a large part of many HPC applications, examples include bioinformatics and weather applications. These algorithms allow numerical solutions to complex linear systems. Understanding how distributed implementations of these algorithms use a network interconnect will allow system designers to gain a deeper insight into their exacting requirements for existing and future applications. This short paper documents our initial investigation into the communication patterns present in the High Performance Conjugate Gradient (HPCG) benchmark. Through our analysis, we identify patterns and features which may warrant further investigation to improve the performance of CG algorithms and applications which make extensive use of them. In this paper, we capture communication traces from runs of the HPCG benchmark at a variety of different processor counts and then examine this data to identify potential performance bottlenecks. Initial results show that there is a fall in the throughput of the network when more processes are communicating with each other, due to network contention
    corecore