
    ScALPEL: A Scalable Adaptive Lightweight Performance Evaluation Library for application performance monitoring

    As supercomputers continue to grow in scale and capability, it is becoming increasingly difficult to isolate processor- and system-level causes of performance degradation. Over the last several years, a significant number of performance analysis and monitoring tools have been built or proposed. However, these tools suffer from several important shortcomings, particularly in distributed environments. In this paper we present ScALPEL, a Scalable Adaptive Lightweight Performance Evaluation Library for application performance monitoring at the function level. Our approach provides several distinct advantages. First, ScALPEL is portable across a wide variety of architectures, and its ability to selectively monitor functions keeps run-time overhead low, enabling its use for large-scale production applications. Second, it is run-time configurable, enabling dynamic selection both of the functions to profile and of the events of interest on a per-function basis. Third, our approach is transparent in that it requires no source code modifications. Finally, ScALPEL is implemented as a pluggable unit by reusing existing performance monitoring frameworks such as Perfmon and PAPI and extending them to support both sequential and MPI applications.
    Comment: 10 pages, 4 figures, 2 tables
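    ScALPEL itself is implemented on top of Perfmon and PAPI, but the core idea of selective, runtime-configurable function monitoring can be sketched in a few lines of Python. The function names, the MONITORED set, and the timing source below are illustrative, not part of the library.

        # Minimal sketch of selective function-level monitoring in the spirit
        # of ScALPEL: only functions named in a runtime-configurable set are
        # timed, keeping overhead low for everything else.
        import sys
        import time
        from collections import defaultdict

        MONITORED = {"compute_kernel"}          # runtime-configurable selection
        stats = defaultdict(lambda: [0, 0.0])   # name -> [call count, seconds]
        _starts = {}

        def _profiler(frame, event, arg):
            name = frame.f_code.co_name
            if name not in MONITORED:
                return
            if event == "call":
                _starts[id(frame)] = time.perf_counter()
            elif event == "return":
                t0 = _starts.pop(id(frame), None)
                if t0 is not None:
                    entry = stats[name]
                    entry[0] += 1
                    entry[1] += time.perf_counter() - t0

        def compute_kernel(n):
            return sum(i * i for i in range(n))

        sys.setprofile(_profiler)
        compute_kernel(100_000)
        sys.setprofile(None)
        print(dict(stats))   # {'compute_kernel': [1, <elapsed seconds>]}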

    Extending the Functionality of Score-P through Plugins: Interfaces and Use Cases

    Performance measurement and runtime tuning tools are both vital in the HPC software ecosystem and use similar techniques: the analyzed application is interrupted at specific events, and information on the current system state is gathered to be either recorded or used for tuning. One of the established performance measurement tools is Score-P. It supports numerous HPC platforms and parallel programming paradigms. To extend Score-P with support for different back-ends, to create a common framework for measurement and tuning of HPC applications, and to enable the re-use of common software components such as implemented instrumentation techniques, this paper makes the following contributions: (I) We describe the Score-P metric plugin interface, which enables programmers to augment the event stream with metric data from supplementary data sources that are otherwise not accessible to Score-P. (II) We introduce the flexible Score-P substrate plugin interface that can be used for custom processing of the event stream according to the specific requirements of either measurement, analysis, or runtime tuning tasks. (III) We provide examples for both interfaces that extend Score-P's functionality for monitoring and tuning purposes.
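    The actual Score-P plugin interfaces are C APIs; the following is only a conceptual Python sketch of the two roles the paper distinguishes, with all class and method names invented for illustration: metric plugins contribute additional values to each event, while substrate plugins consume the resulting event stream.

        # Conceptual sketch (not the real C interface) of the two plugin roles.
        import time

        class MetricPlugin:                      # augments events with extra data
            def sample(self) -> dict: ...

        class SubstratePlugin:                   # processes the event stream
            def on_event(self, event: dict): ...

        class PowerMetric(MetricPlugin):
            def sample(self):
                return {"power_watts": 42.0}     # stand-in for a real sensor read

        class TracingSubstrate(SubstratePlugin):
            def on_event(self, event):
                print("record:", event)          # a real substrate would write a trace

        class Measurement:
            def __init__(self):
                self.metrics, self.substrates = [], []

            def region(self, name, kind):        # called at instrumented events
                event = {"time": time.perf_counter(), "region": name, "kind": kind}
                for m in self.metrics:           # metric plugins enrich the event
                    event.update(m.sample())
                for s in self.substrates:        # substrate plugins consume it
                    s.on_event(event)

        measurement = Measurement()
        measurement.metrics.append(PowerMetric())
        measurement.substrates.append(TracingSubstrate())
        measurement.region("solver", "enter")
        measurement.region("solver", "exit")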

    Performance analysis of parallel Python applications

    Python is progressively consolidating itself within the HPC community with its simple syntax, large standard library, and powerful third-party libraries for scientific computing that are especially attractive to domain scientists. Yet, although Python lowers the bar for accessing parallel computing, efficiently utilizing the capabilities of HPC systems remains a challenging task, and at the moment only a few supporting tools exist, providing merely basic information in the form of summarized profile data. In this paper, we present our efforts in developing event-based tracing support for Python within the performance monitor Extrae to provide detailed information and enable a profound performance analysis. We present concepts to record the complete communication behavior as well as to capture the entry and exit of functions in Python in order to provide the corresponding application context. We evaluate our implementation in Extrae by analyzing the well-established electronic structure simulation package GPAW and demonstrate that the recorded traces provide information equivalent to that of traditional C or Fortran applications, thereby offering the same profound analysis capabilities for Python as well.
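    Python's standard library already exposes the hook this kind of tracing builds on. The sketch below records every function entry and exit as a timestamped event rather than a summarized profile; the record layout is illustrative and not Extrae's actual trace format.

        # Minimal sketch of event-based tracing: each call/return becomes a
        # timestamped record, preserving the full event history.
        import sys
        import time

        trace = []   # list of (timestamp_ns, event, function) records

        def _tracer(frame, event, arg):
            if event in ("call", "return"):
                trace.append((time.perf_counter_ns(), event, frame.f_code.co_name))

        def exchange(data):          # stand-in for a communication call
            return list(reversed(data))

        def solver_step(values):
            return exchange([v * 2 for v in values])

        sys.setprofile(_tracer)
        solver_step([1, 2, 3])
        sys.setprofile(None)

        for ts, ev, fn in trace:
            print(ts, ev, fn)        # the full event history, not just a summary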

    A Comparison of Parallel Graph Processing Implementations

    The rapidly growing number of large network analysis problems has led to the emergence of many parallel and distributed graph processing systems---one survey in 2014 identified over 80. Since then, the landscape has evolved; some packages have become inactive while more are being developed. Determining the best approach for a given problem is infeasible for most developers. To enable easy, rigorous, and repeatable comparison of the capabilities of such systems, we present an approach and associated software for analyzing the performance and scalability of parallel, open-source graph libraries. We demonstrate our approach on five graph processing packages: GraphMat, the Graph500, the Graph Algorithm Platform Benchmark Suite, GraphBIG, and PowerGraph, using synthetic and real-world datasets. We examine previously overlooked aspects of parallel graph processing performance, such as phases of execution and energy usage, for three algorithms: breadth-first search, single-source shortest paths, and PageRank, and compare our results to Graphalytics.
    Comment: 10 pages, 10 figures. Submitted to EuroPar 2017 and rejected; revised and submitted to IEEE Cluster 2017.
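    A minimal illustration of the phase-of-execution measurements described above: timing graph construction separately from the traversal itself for breadth-first search on a synthetic graph. Sizes and the random-graph model are illustrative only.

        # Time two phases of a BFS benchmark separately: building a random
        # graph, then traversing it from a source vertex.
        import random
        import time
        from collections import deque

        random.seed(0)
        n, m = 10_000, 50_000

        t0 = time.perf_counter()
        adj = [[] for _ in range(n)]             # phase 1: build a random graph
        for _ in range(m):
            u, v = random.randrange(n), random.randrange(n)
            adj[u].append(v)
            adj[v].append(u)
        t1 = time.perf_counter()

        dist = [-1] * n                          # phase 2: BFS from vertex 0
        dist[0] = 0
        queue = deque([0])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] == -1:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        t2 = time.perf_counter()

        print(f"build: {t1 - t0:.3f}s  bfs: {t2 - t1:.3f}s  "
              f"reached: {sum(d >= 0 for d in dist)}")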

    A policy compliance detection architecture for data exchange infrastructures

    Data sharing and federation can significantly increase efficiency and lower the cost of digital collaborations. It is important to convince data owners that their outsourced data will be used in a secure and controlled manner. To achieve this goal, constructing a policy that governs concrete data usage rules among all parties is essential. More importantly, we need to establish digital infrastructures that can enforce the policy. In this thesis, we investigate how to select optimal application-tailored infrastructures and enhance policy compliance capabilities. First, we introduce a component linking the policy to infrastructure patterns. The mechanism selects digital infrastructure patterns that satisfy the collaboration request to a maximal degree, using modelling and closeness identification. Second, we present a threat-analysis-driven risk assessment framework. The framework quantitatively assesses the remaining risk of an application delegated to a digital infrastructure. The optimal digital infrastructure for a specific data federation application is the one that can support the requested collaboration model and provides the best security guarantee. Finally, we present a distributed architecture that checks policy compliance while an algorithm executes on the data. A profile and an IDS model are built for each containerized algorithm and are distributed to endpoint execution platforms via a secure channel. Syscall traces are monitored and analysed on the endpoint platforms. The machine-learning-based IDS is retrained periodically to increase generalization. A sanitization algorithm is implemented to filter out malicious samples, further defending the architecture against adversarial machine learning attacks.
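    As a rough sketch of the per-algorithm profiling idea, assuming a simple n-gram model in place of the thesis's machine-learning IDS: build a profile of syscall n-grams from benign runs of a containerized algorithm, then score new traces by the fraction of unseen n-grams. The traces and the scoring rule are illustrative.

        # Build an n-gram profile from benign syscall traces and flag traces
        # that contain many n-grams absent from the profile.
        def ngrams(trace, n=3):
            return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}

        benign_runs = [
            ["open", "read", "read", "close", "write"],
            ["open", "read", "close", "write", "exit"],
        ]
        profile = set().union(*(ngrams(t) for t in benign_runs))

        def anomaly_score(trace):
            grams = ngrams(trace)
            return len(grams - profile) / max(len(grams), 1)   # unseen fraction

        suspect = ["open", "read", "execve", "socket", "connect"]
        print(anomaly_score(suspect))                 # high -> possible violation
        print(anomaly_score(benign_runs[0]))          # 0.0 for a known-benign run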

    An Efficient Execution Model for Reactive Stream Programs

    Stream programming is a paradigm where a program is structured as a set of computational nodes connected by streams. Focusing on data moving between computational nodes via streams, this programming model fits well for applications that process long sequences of data. We call such applications reactive stream programs (RSPs) to distinguish them from stream programs with rather small and finite input data. In stream programming, concurrency is expressed implicitly via communication streams. This helps to reduce the complexity of parallel programming, and for this reason stream programming has gained popularity as a programming model for parallel platforms. However, it is also challenging to analyse and improve the performance without an understanding of the program's internal behaviour. This thesis targets an efficient execution model for deploying RSPs on parallel platforms. This execution model includes a monitoring framework to understand the internal behaviour of RSPs, scheduling strategies for RSPs on uniform shared-memory platforms, and mapping techniques for deploying RSPs on heterogeneous distributed platforms. The foundation of the execution model is a study of the performance of RSPs in terms of throughput and latency. This study includes quantitative formulae for throughput and latency, and the identification of factors that influence these performance metrics. Based on the study of RSP performance, this thesis exploits characteristics of RSPs to derive effective scheduling strategies on uniform shared-memory platforms. Aiming to optimise both throughput and latency, these scheduling strategies are implemented in two heuristic-based schedulers. Both are designed to be centralised in order to provide load balancing for RSPs with dynamic behaviour as well as dynamic structures. The first uses the notion of positive and negative data demands on each stream to determine the scheduling priorities; this scheduler is independent of the runtime system. The second requires the runtime system to provide the position of each computational node in the RSP and uses that information to decide the scheduling priorities. Our experiments show that both schedulers provide similar performance while being significantly better than a reference implementation without dynamic load balancing. Also based on the study of RSP performance, we present two new heuristic partitioning algorithms, Kernighan-Lin Adaptation (KLA) and Congestion Avoidance (CA), which are used to map RSPs onto heterogeneous distributed platforms with the main objective of optimising throughput. This is a multi-parameter optimisation problem where existing graph partitioning algorithms are not applicable. Compared to the generic meta-heuristic Simulated Annealing algorithm, both proposed algorithms achieve equally good or better results. KLA is faster for small benchmarks while slower for large ones; in contrast, CA is always orders of magnitude faster, even for very large benchmarks.
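    A minimal sketch of the first scheduler's priority rule, under the assumption that each stream tracks a demand (items requested downstream) and a backlog (items buffered): the runnable node whose outputs show the largest unmet demand is scheduled next. The node and stream model below is invented for illustration.

        # Demand-driven scheduling priority: nodes whose output streams have
        # unmet demand are preferred; buffered backlog lowers priority.
        from dataclasses import dataclass, field

        @dataclass
        class Stream:
            backlog: int = 0        # items buffered and not yet consumed
            demand: int = 0         # items requested by the consumer

        @dataclass
        class Node:
            name: str
            inputs: list = field(default_factory=list)    # incoming Streams
            outputs: list = field(default_factory=list)   # outgoing Streams

            def priority(self):
                # positive demand on outputs pulls the node forward,
                # backlog on outputs pushes its priority down
                return sum(s.demand - s.backlog for s in self.outputs)

        def pick_next(nodes):
            # a node is runnable when every input stream has buffered data
            runnable = [n for n in nodes if all(s.backlog > 0 for s in n.inputs)]
            return max(runnable, key=Node.priority, default=None)

        a_out, b_out = Stream(backlog=2), Stream(demand=3)
        a = Node("producer", inputs=[], outputs=[a_out])
        b = Node("filter", inputs=[a_out], outputs=[b_out])
        print(pick_next([a, b]).name)   # 'filter': its output has unmet demand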

    Trace-based Performance Analysis for Hardware Accelerators

    This thesis presents how performance data from hardware accelerators can be included in event logs. It extends the capabilities of trace-based performance analysis to also monitor and record data from this novel parallelization layer. Increasing awareness of the power consumption of computing devices has also led to an interest in hybrid computing architectures. High-end computers, workstations, and mobile devices have started to employ hardware accelerators to offload computationally intense and parallel tasks, while at the same time retaining a highly efficient scalar compute unit for non-parallel tasks. This execution pattern is typically asynchronous, so that the scalar unit can resume other work while the hardware accelerator is busy. Performance analysis tools provided by the hardware accelerator vendors cover the situation of one host using one device very well, yet they do not address the needs of the high performance computing community. This thesis investigates ways to extend existing methods for recording events from highly parallel applications to also cover scenarios in which hardware accelerators aid these applications. After introducing a generic approach that is suitable for any API-based acceleration paradigm, the thesis derives a suggestion for a generic performance API for hardware accelerators and its implementation with NVIDIA CUPTI. Next, the visualization of event logs containing data from execution streams on different levels of parallelism is discussed. In order to overcome the limitations of classic performance profiles and timeline displays, a graph-based visualization using Parallel Performance Flow Graphs (PPFGs) is introduced. This novel approach uses program states to display similarities and differences among the potentially very large number of event streams and thus enables load imbalances to be spotted quickly. The thesis concludes with an in-depth case study of PIConGPU, a highly parallel, multi-hybrid plasma physics simulation that benefited greatly from the developed performance analysis methods.
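    The asynchronous recording problem can be sketched without real accelerator hardware: the host enqueues work and continues, while the "device" (emulated here by a thread) logs its own begin/end timestamps, loosely in the style of CUPTI activity records. All names are illustrative.

        # Record host and (emulated) device activity into one event log,
        # mirroring the asynchronous offload pattern described above.
        import threading
        import time

        events = []                       # (stream, name, begin_ns, end_ns)
        lock = threading.Lock()

        def launch(stream, name, work):
            def run():
                begin = time.perf_counter_ns()
                work()
                end = time.perf_counter_ns()
                with lock:
                    events.append((stream, name, begin, end))
            t = threading.Thread(target=run)
            t.start()
            return t                      # host returns at once, like a kernel launch

        kernel = launch("stream0", "saxpy", lambda: time.sleep(0.01))
        host_begin = time.perf_counter_ns()
        time.sleep(0.005)                 # host overlaps its own work with the device
        host_end = time.perf_counter_ns()
        kernel.join()

        events.append(("host", "cpu_work", host_begin, host_end))
        for rec in sorted(events, key=lambda e: e[2]):
            print(rec)                    # merged timeline across both streams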