Search CORE

788 research outputs found

ScalAna: Automating Scaling Loss Detection with Graph Analysis

Author: Hoefler Torsten
Jin Yuyang
Liu Xu
Tang Xiongchao
Wang Haojie
Yu Teng
Zhai Jidong
Publication venue
Publication date: 01/01/2020
Field of study

Scaling a parallel program to modern supercomputers is challenging due to inter-process communication, Amdahl's law, and resource contention. Performance analysis tools for finding such scaling bottlenecks either base on profiling or tracing. Profiling incurs low overheads but does not capture detailed dependencies needed for root-cause analysis. Tracing collects all information at prohibitive overheads. In this work, we design ScalAna that uses static analysis techniques to achieve the best of both worlds - it enables the analyzability of traces at a cost similar to profiling. ScalAna first leverages static compiler techniques to build a Program Structure Graph, which records the main computation and communication patterns as well as the program's control structures. At runtime, we adopt lightweight techniques to collect performance data according to the graph structure and generate a Program Performance Graph. With this graph, we propose a novel approach, called backtracking root cause detection, which can automatically and efficiently detect the root cause of scaling loss. We evaluate ScalAna with real applications. Results show that our approach can effectively locate the root cause of scaling loss for real applications and incurs 1.73% overhead on average for up to 2,048 processes. We achieve up to 11.11% performance improvement by fixing the root causes detected by ScalAna on 2,048 processes.Comment: conferenc

arXiv.org e-Print Archive

Repository for Publications and Research Data

Algorithmic Debugging of Real-World Haskell Programs: Deriving Dependencies from the Cost Centre Stack

Author: ACM
Chitil O.
Faddegon M.
Gill A.
Marlow S.
Naish L.
Nilsson H.
Pareja C.
Pope B.
Sparud J.
Wallace M.
Zeller A.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 03/06/2015
Field of study

Existing algorithmic debuggers for Haskell require a transformation of all modules in a program, even libraries that the user does not want to debug and which may use language features not supported by the debugger. This is a pity, because a promising ap- proach to debugging is therefore not applicable to many real-world programs. We use the cost centre stack from the Glasgow Haskell Compiler profiling environment together with runtime value observations as provided by the Haskell Object Observation Debugger (HOOD) to collect enough information for algorithmic debugging. Program annotations are in suspected modules only. With this technique algorithmic debugging is applicable to a much larger set of Haskell programs. This demonstrates that for functional languages in general a simple stack trace extension is useful to support tasks such as profiling and debugging

Crossref

Kent Academic Repository

Inferring program structure from execution traces

Author: Martínez Vera Juan Francisco
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2018
Field of study

Application structure detection problem have been typical solved by means of sequential pattern mining techniques but they present to be difficultly scalable. In this thesis we propose a new approach for HPC apps facing this problem as a classification problem such that scalability can be improved

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Concepts for In-memory Event Tracing: Runtime Event Reduction with Hierarchical Memory Buffers

Author: Wagner Michael
Publication venue
Publication date: 03/07/2015
Field of study

This thesis contributes to the field of performance analysis in High Performance Computing with new concepts for in-memory event tracing. Event tracing records runtime events of an application and stores each with a precise time stamp and further relevant metrics. The high resolution and detailed information allows an in-depth analysis of the dynamic program behavior, interactions in parallel applications, and potential performance issues. For long-running and large-scale parallel applications, event-based tracing faces three challenges, yet unsolved: the number of resulting trace files limits scalability, the huge amounts of collected data overwhelm file systems and analysis capabilities, and the measurement bias, in particular, due to intermediate memory buffer flushes prevents a correct analysis. This thesis proposes concepts for an in-memory event tracing workflow. These concepts include new enhanced encoding techniques to increase memory efficiency and novel strategies for runtime event reduction to dynamically adapt trace size during runtime. An in-memory event tracing workflow based on these concepts meets all three challenges: First, it not only overcomes the scalability limitations due to the number of resulting trace files but eliminates the overhead of file system interaction altogether. Second, the enhanced encoding techniques and event reduction lead to remarkable smaller trace sizes. Finally, an in-memory event tracing workflow completely avoids intermediate memory buffer flushes, which minimizes measurement bias and allows a meaningful performance analysis. The concepts further include the Hierarchical Memory Buffer data structure, which incorporates a multi-dimensional, hierarchical ordering of events by common metrics, such as time stamp, calling context, event class, and function call duration. This hierarchical ordering allows a low-overhead event encoding, event reduction and event filtering, as well as new hierarchy-aided analysis requests. An experimental evaluation based on real-life applications and a detailed case study underline the capabilities of the concepts presented in this thesis. The new enhanced encoding techniques reduce memory allocation during runtime by a factor of 3.3 to 7.2, while at the same do not introduce any additional overhead. Furthermore, the combined concepts including the enhanced encoding techniques, event reduction, and a new filter based on function duration within the Hierarchical Memory Buffer remarkably reduce the resulting trace size up to three orders of magnitude and keep an entire measurement within a single fixed-size memory buffer, while still providing a coarse but meaningful analysis of the application. This thesis includes a discussion of the state-of-the-art and related work, a detailed presentation of the enhanced encoding techniques, the event reduction strategies, the Hierarchical Memory Buffer data structure, and a extensive experimental evaluation of all concepts

Technische Universität Dresden: Qucosa

Specifications and programs for computer software validation

Author: Browne J. C.
Davis T.
Haller A.
Henneman M.
Kleir R.
Lasseter G. L.
Publication venue
Publication date
Field of study

Three software products developed during the study are reported and include: (1) FORTRAN Automatic Code Evaluation System, (2) the Specification Language System, and (3) the Array Index Validation System

NASA Technical Reports Server

Trace-based Performance Analysis for Hardware Accelerators

Author: Juckeland Guido
Publication venue
Publication date: 05/02/2013
Field of study

This thesis presents how performance data from hardware accelerators can be included in event logs. It extends the capabilities of trace-based performance analysis to also monitor and record data from this novel parallelization layer. The increasing awareness to power consumption of computing devices has led to an interest in hybrid computing architectures as well. High-end computers, workstations, and mobile devices start to employ hardware accelerators to offload computationally intense and parallel tasks, while at the same time retaining a highly efficient scalar compute unit for non-parallel tasks. This execution pattern is typically asynchronous so that the scalar unit can resume other work while the hardware accelerator is busy. Performance analysis tools provided by the hardware accelerator vendors cover the situation of one host using one device very well. Yet, they do not address the needs of the high performance computing community. This thesis investigates ways to extend existing methods for recording events from highly parallel applications to also cover scenarios in which hardware accelerators aid these applications. After introducing a generic approach that is suitable for any API based acceleration paradigm, the thesis derives a suggestion for a generic performance API for hardware accelerators and its implementation with NVIDIA CUPTI. In a next step the visualization of event logs containing data from execution streams on different levels of parallelism is discussed. In order to overcome the limitations of classic performance profiles and timeline displays, a graph-based visualization using Parallel Performance Flow Graphs (PPFGs) is introduced. This novel technical approach is using program states in order to display similarities and differences between the potentially very large number of event streams and, thus, enables a fast way to spot load imbalances. The thesis concludes with the in-depth analysis of a case-study of PIConGPU---a highly parallel, multi-hybrid plasma physics simulation---that benefited greatly from the developed performance analysis methods.Diese Dissertation zeigt, wie der Ablauf von Anwendungsteilen, die auf Hardwarebeschleuniger ausgelagert wurden, als Programmspur mit aufgezeichnet werden kann. Damit wird die bekannte Technik der Leistungsanalyse von Anwendungen mittels Programmspuren so erweitert, dass auch diese neue Parallelitätsebene mit erfasst wird. Die Beschränkungen von Computersystemen bezüglich der elektrischen Leistungsaufnahme hat zu einer steigenden Anzahl von hybriden Computerarchitekturen geführt. Sowohl Hochleistungsrechner, aber auch Arbeitsplatzcomputer und mobile Endgeräte nutzen heute Hardwarebeschleuniger um rechenintensive, parallele Programmteile auszulagern und so den skalaren Hauptprozessor zu entlasten und nur für nicht parallele Programmteile zu verwenden. Dieses Ausführungsschema ist typischerweise asynchron: der Skalarprozessor kann, während der Hardwarebeschleuniger rechnet, selbst weiterarbeiten. Die Leistungsanalyse-Werkzeuge der Hersteller von Hardwarebeschleunigern decken den Standardfall (ein Host-System mit einem Hardwarebeschleuniger) sehr gut ab, scheitern aber an einer Unterstützung von hochparallelen Rechnersystemen. Die vorliegende Dissertation untersucht, in wie weit auch multi-hybride Anwendungen die Aktivität von Hardwarebeschleunigern aufzeichnen können. Dazu wird die vorhandene Methode zur Erzeugung von Programmspuren für hochparallele Anwendungen entsprechend erweitert. In dieser Untersuchung wird zuerst eine allgemeine Methodik entwickelt, mit der sich für jede API-gestützte Hardwarebeschleunigung eine Programmspur erstellen lässt. Darauf aufbauend wird eine eigene Programmierschnittstelle entwickelt, die es ermöglicht weitere leistungsrelevante Daten aufzuzeichnen. Die Umsetzung dieser Schnittstelle wird am Beispiel von NVIDIA CUPTI darstellt. Ein weiterer Teil der Arbeit beschäftigt sich mit der Darstellung von Programmspuren, welche Aufzeichnungen von den unterschiedlichen Parallelitätsebenen enthalten. Um die Einschränkungen klassischer Leistungsprofile oder Zeitachsendarstellungen zu überwinden, wird mit den parallelen Programmablaufgraphen (PPFGs) eine neue graphenbasisierte Darstellungsform eingeführt. Dieser neuartige Ansatz zeigt eine Programmspur als eine Folge von Programmzuständen mit gemeinsamen und unterchiedlichen Abläufen. So können divergierendes Programmverhalten und Lastimbalancen deutlich einfacher lokalisiert werden. Die Arbeit schließt mit der detaillierten Analyse von PIConGPU -- einer multi-hybriden Simulation aus der Plasmaphysik --, die in großem Maße von den in dieser Arbeit entwickelten Analysemöglichkeiten profiert hat

Technische Universität Dresden: Qucosa

Post-Mortem Dynamic analysis for software debugging

Author: WANG TAO
Publication venue
Publication date: 30/03/2008
Field of study

Ph.DDOCTOR OF PHILOSOPH

ScholarBank@NUS

Profilage et débogage par prise de traces efficaces d'applications hybrides multi-threadées HPC

Author: Besnard Jean-Baptiste
Publication venue: HAL CCSD
Publication date: 16/07/2014
Field of study

Supercomputers’ evolution is at the source of both hardware and software challenges. In the quest for the highest computing power, the interdependence in-between simulation components is becoming more and more impacting, requiring new approaches. This thesis is focused on the software development aspect and particularly on the observation of parallel software when being run on several thousand cores. This observation aims at providing developers with the necessary feedback when running a program on an execution substrate which has not been modeled yet because of its complexity. In this purpose, we firstly introduce the development process from a global point of view, before describing developer tools and related work. In a second time, we present our contribution which consists in a trace based profiling and debugging tool and its evolution towards an on-line coupling method which as we will show is more scalable as it overcomes IOs limitations. Our contribution also covers our time-stamp synchronisation algorithm for tracing purposes which relies on a probabilistic approach with quantified error. We also present a tool allowing machine characterisation from the MPI aspect and demonstrate the presence of machine noise for both point to point and collectives, justifying the use of an empirical approach. In summary, this work proposes and motivates an alternative approach to trace based event collection while preserving event granularity and a reduced overheadL’évolution des supercalculateurs est à la source de défis logiciels et architecturaux. Dans la quête de puissance de calcul, l’interdépendance des éléments du processus de simulation devient de plus en plus impactante et requiert de nouvelles approches. Cette thèse se concentre sur le développement logiciel et particulièrement sur l’observation des programmes parallèles s’exécutant sur des milliers de cœurs. Dans ce but, nous décrivons d’abord le processus de développement de manière globale avant de présenter les outils existants et les travaux associés. Dans un second temps, nous détaillons notre contribution qui consiste d’une part en des outils de débogage et profilage par prise de traces, et d’autre part en leur évolution vers un couplage en ligne qui palie les limitations d’entrées–sorties. Notre contribution couvre également la synchronisation des horloges pour la prise de traces avec la présentation d’un algorithme de synchronisation probabiliste dont nous avons quantifié l’erreur. En outre, nous décrivons un outil de caractérisation machine qui couvre l’aspect MPI. Un tel outil met en évidence la présence de bruit aussi bien sur les communications de type point-à-point que de type collective. Enfin, nous proposons et motivons une alternative à la collecte d’événements par prise de traces tout en préservant la granularité des événements et un impact réduit sur les performances, tant sur le volet utilisation CPU que sur les entrées–sortie

Firmware Modification Analysis in Programmable Logic Controllers

Author: Garcia Arturo M.
Publication venue: AFIT Scholar
Publication date: 14/03/2014
Field of study

Incorporating security in supervisory control and data acquisition (SCADA) systems and sensor networks has proven to be a pervasive problem due to the constraints and demands placed on these systems. Both attackers and security professionals seek to uncover the inherent roots of trust in a system to achieve opposing goals. With SCADA systems, a battle is being fought at the cyber -- physical level, specifically the programmable logic controller (PLC). The Stuxnet worm, which became increasingly apparent in the summer of 2010, has shown that modifications to a SCADA system can be discovered on infected engineering workstations on the network, to include the ladder logic found in the PLC. However, certain firmware modifications made to a PLC can go undetected due to the lack of effective techniques available for detecting them. Current software auditing tools give an analyst a singular view of assembly code, and binary difference programs can only show simple differences between assembly codes. Additionally, there appears to be no comprehensive software tool that aids an analyst with evaluating a PLC firmware file for modifications and displaying the resulting effects. Manual analysis is time consuming and error prone. Furthermore, there are not enough talented individuals available in the industrial control system (ICS) community with an in-depth knowledge of assembly language and the inner workings of PLC firmware. This research presents a novel analysis technique that compares a suspected-altered firmware to a known good firmware of a specific PLC and performs a static analysis of differences. This technique includes multiple tests to compare both firmware versions, detect differences in size, and code differences such as removing, adding, or modifying existing functions in the original firmware. A proof-of-concept experiment demonstrates the functionality of the analysis tool using different firmware versions from an Allen-Bradley ControlLogix L61 PLC

AFTI Scholar (Air Force Institute of Technology)