22 research outputs found

    Trace-based Performance Analysis for Hardware Accelerators

    This thesis presents how performance data from hardware accelerators can be included in event logs. It extends the capabilities of trace-based performance analysis to also monitor and record data from this novel parallelization layer. Increasing awareness of the power consumption of computing devices has also led to interest in hybrid computing architectures. High-end computers, workstations, and mobile devices are starting to employ hardware accelerators to offload computationally intense and parallel tasks, while retaining a highly efficient scalar compute unit for non-parallel tasks. This execution pattern is typically asynchronous, so that the scalar unit can resume other work while the hardware accelerator is busy. Performance analysis tools provided by the hardware accelerator vendors cover the situation of one host using one device very well, yet they do not address the needs of the high performance computing community. This thesis investigates ways to extend existing methods for recording events from highly parallel applications to also cover scenarios in which hardware accelerators aid these applications. After introducing a generic approach that is suitable for any API-based acceleration paradigm, the thesis derives a suggestion for a generic performance API for hardware accelerators and its implementation with NVIDIA CUPTI. Next, the visualization of event logs containing data from execution streams on different levels of parallelism is discussed. To overcome the limitations of classic performance profiles and timeline displays, a graph-based visualization using Parallel Performance Flow Graphs (PPFGs) is introduced. This novel approach uses program states to display similarities and differences between the potentially very large number of event streams and thus enables load imbalances to be spotted quickly. The thesis concludes with an in-depth case study of PIConGPU, a highly parallel, multi-hybrid plasma physics simulation, which benefited greatly from the developed performance analysis methods.

    This dissertation shows how the execution of application parts that are offloaded to hardware accelerators can be recorded as part of a program trace, extending the established technique of trace-based performance analysis to also capture this new level of parallelism. The constraints on the electrical power consumption of computer systems have led to a growing number of hybrid computer architectures. High-performance computers as well as workstations and mobile devices today use hardware accelerators to offload compute-intensive, parallel program parts, relieving the scalar main processor so that it is used only for non-parallel program parts. This execution scheme is typically asynchronous: the scalar processor can continue its own work while the hardware accelerator computes. The performance analysis tools provided by hardware accelerator vendors cover the standard case (one host system with one hardware accelerator) very well, but fail to support highly parallel computer systems. This dissertation investigates to what extent multi-hybrid applications can also record the activity of hardware accelerators; to this end, the existing method for generating program traces of highly parallel applications is extended accordingly. The investigation first develops a general methodology with which a program trace can be produced for any API-based hardware acceleration. Building on this, a dedicated programming interface is developed that allows further performance-relevant data to be recorded; its implementation is demonstrated using NVIDIA CUPTI as an example. A further part of the work deals with the visualization of program traces that contain recordings from the different levels of parallelism. To overcome the limitations of classic performance profiles and timeline displays, Parallel Performance Flow Graphs (PPFGs) are introduced as a new graph-based form of presentation. This novel approach shows a program trace as a sequence of program states with common and differing execution paths, so that divergent program behavior and load imbalances can be located much more easily. The thesis concludes with a detailed analysis of PIConGPU, a multi-hybrid simulation from plasma physics, which benefited greatly from the analysis capabilities developed in this work.
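
    To make the idea of an API-based event recording layer concrete, the following C++ sketch shows what a minimal, host-side record of offloaded activities could look like. It is an illustration only: the type and member names (AcceleratorEvent, AcceleratorTrace, and the kernel names) are hypothetical and are neither the generic performance API proposed in the thesis nor the CUPTI interface; in a real measurement the records would be filled from a vendor interface such as CUPTI after the device has reported completion.

        #include <cstdint>
        #include <iostream>
        #include <string>
        #include <vector>

        // Hypothetical record for one offloaded activity (names are illustrative only).
        struct AcceleratorEvent {
            uint32_t    device_id;   // which accelerator executed the activity
            uint32_t    stream_id;   // asynchronous execution stream on that device
            uint64_t    start_ns;    // timestamp when the activity started
            uint64_t    end_ns;      // timestamp when the activity finished
            std::string name;        // e.g. kernel or memory-copy name
        };

        // One event log per measurement, mirroring the host-side trace streams.
        class AcceleratorTrace {
        public:
            void record(const AcceleratorEvent& e) { events_.push_back(e); }

            void dump() const {
                for (const auto& e : events_)
                    std::cout << "dev " << e.device_id << " stream " << e.stream_id
                              << " [" << e.start_ns << ", " << e.end_ns << "] "
                              << e.name << '\n';
            }

        private:
            std::vector<AcceleratorEvent> events_;
        };

        int main() {
            AcceleratorTrace trace;
            // In a real measurement these records would come from the vendor's
            // profiling interface after kernel completion, because completion is
            // reported asynchronously to the host.
            trace.record({0, 1, 1'000, 5'000, "initDensityKernel"});
            trace.record({0, 2, 1'200, 7'500, "pushParticlesKernel"});
            trace.dump();
            return 0;
        }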

    Concepts for In-memory Event Tracing: Runtime Event Reduction with Hierarchical Memory Buffers

    This thesis contributes to the field of performance analysis in High Performance Computing with new concepts for in-memory event tracing. Event tracing records runtime events of an application and stores each with a precise time stamp and further relevant metrics. The high resolution and detailed information allows an in-depth analysis of the dynamic program behavior, interactions in parallel applications, and potential performance issues. For long-running and large-scale parallel applications, event-based tracing faces three as yet unsolved challenges: the number of resulting trace files limits scalability, the huge amounts of collected data overwhelm file systems and analysis capabilities, and measurement bias, in particular due to intermediate memory buffer flushes, prevents a correct analysis. This thesis proposes concepts for an in-memory event tracing workflow. These concepts include new enhanced encoding techniques to increase memory efficiency and novel strategies for runtime event reduction to dynamically adapt the trace size at runtime. An in-memory event tracing workflow based on these concepts meets all three challenges: First, it not only overcomes the scalability limitations due to the number of resulting trace files but eliminates the overhead of file system interaction altogether. Second, the enhanced encoding techniques and event reduction lead to remarkably smaller trace sizes. Finally, an in-memory event tracing workflow completely avoids intermediate memory buffer flushes, which minimizes measurement bias and allows a meaningful performance analysis. The concepts further include the Hierarchical Memory Buffer data structure, which incorporates a multi-dimensional, hierarchical ordering of events by common metrics, such as time stamp, calling context, event class, and function call duration. This hierarchical ordering allows low-overhead event encoding, event reduction, and event filtering, as well as new hierarchy-aided analysis requests. An experimental evaluation based on real-life applications and a detailed case study underline the capabilities of the concepts presented in this thesis. The new enhanced encoding techniques reduce memory allocation during runtime by a factor of 3.3 to 7.2 while introducing no additional overhead. Furthermore, the combined concepts, including the enhanced encoding techniques, event reduction, and a new filter based on function duration within the Hierarchical Memory Buffer, reduce the resulting trace size by up to three orders of magnitude and keep an entire measurement within a single fixed-size memory buffer, while still providing a coarse but meaningful analysis of the application. This thesis includes a discussion of the state of the art and related work, a detailed presentation of the enhanced encoding techniques, the event reduction strategies, and the Hierarchical Memory Buffer data structure, and an extensive experimental evaluation of all concepts.
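
    The following C++ sketch illustrates, under simplifying assumptions, how a hierarchical ordering of events by calling context and duration can support runtime event reduction: events shorter than a threshold are dropped and only counted, while the remaining events are grouped so that later filtering or analysis can address one group at a time. All names and the toy two-level hierarchy are hypothetical and much simpler than the Hierarchical Memory Buffer described in the thesis.

        #include <cstdint>
        #include <iostream>
        #include <map>
        #include <vector>

        // Illustrative event record; field names are not taken from the thesis.
        struct Event {
            uint64_t timestamp;
            uint32_t calling_context;  // e.g. a call-path identifier
            uint64_t duration;         // function call duration
        };

        // Toy hierarchy: events are grouped by calling context first and by a
        // coarse duration class second, so short calls can be dropped or
        // aggregated per group without scanning the whole buffer.
        class HierarchicalBuffer {
        public:
            explicit HierarchicalBuffer(uint64_t min_duration)
                : min_duration_(min_duration) {}

            void insert(const Event& e) {
                if (e.duration < min_duration_) {   // duration-based runtime filter
                    ++filtered_;                    // keep only an aggregate count
                    return;
                }
                int duration_class = e.duration < 1000 ? 0 : (e.duration < 100000 ? 1 : 2);
                buckets_[e.calling_context][duration_class].push_back(e);
            }

            void summary() const {
                std::cout << filtered_ << " short events filtered at runtime\n";
                for (const auto& [ctx, classes] : buckets_)
                    for (const auto& [cls, evs] : classes)
                        std::cout << "context " << ctx << " / duration class " << cls
                                  << ": " << evs.size() << " events kept\n";
            }

        private:
            uint64_t min_duration_;
            uint64_t filtered_ = 0;
            std::map<uint32_t, std::map<int, std::vector<Event>>> buckets_;
        };

        int main() {
            HierarchicalBuffer buf(/*min_duration=*/500);
            buf.insert({10, 1, 120});     // filtered: below the duration threshold
            buf.insert({20, 1, 2500});    // kept in context 1, middle duration class
            buf.insert({30, 2, 400000});  // kept in context 2, long duration class
            buf.summary();
            return 0;
        }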

    Advanced Memory Data Structures for Scalable Event Trace Analysis

    The thesis presents a contribution to the analysis and visualization of computational performance based on event traces, with a particular focus on parallel programs and High Performance Computing (HPC). Event traces contain detailed information about specified incidents (events) during the run-time of programs and allow minute investigation of dynamic program behavior, various performance metrics, and possible causes of performance flaws. Due to long-running and highly parallel programs and very fine detail resolutions, event traces can accumulate huge amounts of data, which become a challenge for interactive as well as automatic analysis and visualization tools. The thesis proposes a method of exploiting redundancy in the event traces in order to reduce the memory requirements and the computational complexity of event trace analysis. The sources of redundancy are repeated segments of the original program, either through iterative or recursive algorithms or through SPMD-style parallel programs, which produce equal or similar repeated event sequences. The data reduction technique is based on the novel Complete Call Graph (CCG) data structure, which allows domain-specific data compression for event traces in a combination of lossless and lossy methods. All deviations due to lossy data compression can be controlled by constant bounds. The compression of the CCG data structure is incorporated in the construction process, such that at no point substantial uncompressed parts have to be stored. Experiments with real-world example traces reveal the potential for very high data compression. The results range from factors of 3 to 15 for small-scale compression with minimum deviation of the data to factors > 100 for large-scale compression with moderate deviation. Based on the CCG data structure, new algorithms for the most common evaluation and analysis methods for event traces are presented, which require no explicit decompression. By avoiding repeated evaluation of formerly redundant event sequences, the computational effort of the new algorithms can be reduced to the same extent as the memory consumption. The thesis includes a comprehensive discussion of the state of the art and related work, a detailed presentation of the design of the CCG data structure, an elaborate description of algorithms for construction, compression, and analysis of CCGs, and an extensive experimental validation of all components.

    This dissertation presents a novel approach to the analysis and visualization of computational performance that is based on event tracing and is tailored in particular to parallel programs and high performance computing (HPC). Event traces contain detailed information about specified events during the runtime of a program and allow a very precise investigation of the dynamic behavior, various performance metrics, and potential performance problems. Due to long-running and highly parallel applications and the high level of detail, event tracing can produce very large volumes of data, which in turn pose a challenge for interactive and automatic analysis and visualization tools. The present work presents a method that exploits redundancies in the event traces in order to reduce both the memory requirements and the runtime complexity of trace analysis. The sources of redundancy are repeatedly executed program sections, caused either by iterative or recursive algorithms or by SPMD-style parallelization, which produce identical or similar event sequences. The data reduction is based on the novel Complete Call Graph (CCG) data structure and allows a combination of lossless and lossy data compression, where constant bounds can be prescribed for all deviations introduced by lossy compression. The compression is integrated into the construction of the data structure, so that no substantial uncompressed parts have to be held in main memory before compression. The enormous compression capability of the new approach is demonstrated on a number of examples from real application scenarios. The results achieved range from compression factors of 3 to 5 with only minimal deviations due to lossy compression up to factors greater than 100 for high-degree compression. Based on the CCG data structure, new evaluation and analysis methods for event traces are also presented that do not require explicit decompression. The runtime complexity of the analysis can thereby be reduced to the same extent as the main memory requirements, since compressed event sequences are not analyzed repeatedly. This dissertation contains a comprehensive presentation of the state of the art and related work in this area, a detailed derivation of the newly introduced data structures and of the construction, compression, and analysis algorithms, as well as an extensive experimental evaluation and validation of all components.
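
    The following C++ fragment sketches the core idea behind lossy, bounded compression of call trees: two subtrees may share one compressed representation if they have the same call structure and all durations differ by no more than a constant bound. It is a deliberately reduced illustration, not the actual CCG construction algorithm, and all identifiers are invented for the example.

        #include <cstdint>
        #include <iostream>
        #include <memory>
        #include <vector>

        // Simplified call-tree node; the real CCG stores richer event data.
        struct CallNode {
            uint32_t function_id;
            uint64_t duration;                      // inclusive duration of this call
            std::vector<std::shared_ptr<CallNode>> children;
        };

        // Two subtrees may be merged if they call the same functions in the same
        // order and all durations differ by at most `bound`, which plays the role
        // of the constant deviation bound of the lossy compression.
        bool mergeable(const CallNode& a, const CallNode& b, uint64_t bound) {
            if (a.function_id != b.function_id) return false;
            uint64_t diff = a.duration > b.duration ? a.duration - b.duration
                                                    : b.duration - a.duration;
            if (diff > bound) return false;
            if (a.children.size() != b.children.size()) return false;
            for (size_t i = 0; i < a.children.size(); ++i)
                if (!mergeable(*a.children[i], *b.children[i], bound)) return false;
            return true;
        }

        int main() {
            // Two iterations of the same loop body with almost identical timing.
            auto iter1 = std::make_shared<CallNode>(CallNode{7, 1000});
            auto iter2 = std::make_shared<CallNode>(CallNode{7, 1004});
            std::cout << "iterations share a node: " << std::boolalpha
                      << mergeable(*iter1, *iter2, /*bound=*/10) << '\n';
            return 0;
        }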

    GTI: A Generic Tools Infrastructure for Event-Based Tools in Parallel Systems

    Abstract not provided.

    Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework

    The efficient parallel execution of scientific applications is a key challenge in high-performance computing (HPC). With growing parallelism and heterogeneity of compute resources as well as increasingly complex software, performance analysis has become an indispensable tool in the development and optimization of parallel programs. This thesis presents a framework for systematic performance analysis of scalable, heterogeneous applications. Based on event traces, it automatically detects the critical path and inefficiencies that result in waiting or idle time, e.g., due to load imbalances between parallel execution streams. As a prerequisite for the analysis of heterogeneous programs, this thesis specifies inefficiency patterns for computation offloading. Furthermore, an essential contribution was made to the development of tool interfaces for OpenACC and OpenMP, which enable portable data acquisition and subsequent analysis for programs with offload directives. These interfaces are already part of the latest OpenACC and OpenMP API specifications. The aforementioned work, existing preliminary work, and established analysis methods are combined into a generic analysis process, which can be applied across programming models. Based on the detection of wait or idle states, which can propagate over several levels of parallelism, the analysis identifies wasted computing resources and their root cause as well as the critical-path share for each program region. Thus, it determines the influence of program regions on the load balancing between execution streams and on the program runtime. The analysis results include a summary of the detected inefficiency patterns and a program trace, enhanced with information about wait states, their cause, and the critical path. In addition, a ranking based on the amount of waiting time a program region caused on the critical path highlights program regions that are relevant for program optimization. The scalability of the proposed performance analysis and its implementation is demonstrated using High-Performance Linpack (HPL), while the analysis results are validated with synthetic programs. A scientific application that uses MPI, OpenMP, and CUDA simultaneously is investigated in order to show the applicability of the analysis.
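
    As a rough illustration of the kind of offloading inefficiency such an analysis looks for, the following C++ sketch computes the host idle time that arises when the host blocks in a synchronization call before the offloaded kernel it waits for has finished. The record layout, the kernel names, and the timestamp values are hypothetical; a real analysis would derive these times from the recorded event trace rather than from hard-coded values.

        #include <cstdint>
        #include <iostream>
        #include <string>
        #include <vector>

        // Illustrative records: when the host starts blocking in a synchronization
        // call and when the offloaded kernel it waits for actually finishes.
        struct OffloadSync {
            std::string kernel;
            uint64_t    host_wait_start_ns;  // host enters e.g. a stream synchronize
            uint64_t    kernel_end_ns;       // device reports kernel completion
        };

        // A simple "wait at offload synchronization" pattern: whenever the kernel
        // finishes after the host started waiting, the difference is idle host time
        // that can be attributed to the kernel (a candidate for optimization).
        int main() {
            std::vector<OffloadSync> syncs = {
                {"fieldSolverKernel",   1'000'000, 1'800'000},  // host idles 0.8 ms
                {"currentDepositKern",  3'000'000, 2'900'000},  // kernel ended first: no wait
            };
            for (const auto& s : syncs) {
                uint64_t wait = s.kernel_end_ns > s.host_wait_start_ns
                                    ? s.kernel_end_ns - s.host_wait_start_ns : 0;
                std::cout << s.kernel << ": host wait " << wait << " ns\n";
            }
            return 0;
        }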

    Profilage et débogage par prise de traces efficaces d'applications hybrides multi-threadées HPC

    The evolution of supercomputers is the source of both hardware and software challenges. In the quest for ever higher computing power, the interdependence between simulation components is becoming more and more significant, requiring new approaches. This thesis focuses on the software development aspect and particularly on the observation of parallel software running on several thousand cores. This observation aims at providing developers with the necessary feedback when running a program on an execution substrate which has not been modeled yet because of its complexity. To this end, we first introduce the development process from a global point of view before describing developer tools and related work. We then present our contribution, which consists of a trace-based profiling and debugging tool and its evolution towards an on-line coupling method which, as we show, is more scalable because it overcomes I/O limitations. Our contribution also covers our time-stamp synchronisation algorithm for tracing purposes, which relies on a probabilistic approach with quantified error. We also present a tool for machine characterisation from the MPI perspective and demonstrate the presence of machine noise for both point-to-point and collective operations, justifying the use of an empirical approach. In summary, this work proposes and motivates an alternative approach to trace-based event collection while preserving event granularity and a reduced overhead.

    The evolution of supercomputers is the source of both software and architectural challenges. In the quest for computing power, the interdependence of the elements of the simulation process is becoming more and more impactful and requires new approaches. This thesis focuses on software development and particularly on the observation of parallel programs executing on thousands of cores. To this end, we first describe the development process as a whole before presenting existing tools and related work. We then detail our contribution, which consists on the one hand of trace-based debugging and profiling tools, and on the other hand of their evolution towards an on-line coupling method that overcomes I/O limitations. Our contribution also covers clock synchronisation for tracing, with the presentation of a probabilistic synchronisation algorithm whose error we have quantified. In addition, we describe a machine characterisation tool that covers the MPI aspect; such a tool reveals the presence of machine noise on both point-to-point and collective communications. Finally, we propose and motivate an alternative to trace-based event collection that preserves event granularity with a reduced impact on performance, both in terms of CPU usage and I/O.
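
    Clock synchronisation for tracing is commonly based on round-trip message exchanges between nodes. The following C++ sketch shows only the classical ping-pong offset estimate and its worst-case error bound, using hypothetical timestamp values; the probabilistic algorithm with quantified error developed in the thesis refines this idea and is not reproduced here.

        #include <cstdint>
        #include <iostream>

        // Classical ping-pong offset estimation between two nodes A and B.
        int main() {
            // t1: send time on node A, t2: receive time on node B,
            // t3: reply time on node B, t4: reply receive time on node A (all ns).
            int64_t t1 = 1'000, t2 = 1'950, t3 = 2'000, t4 = 2'900;

            int64_t round_trip = (t4 - t1) - (t3 - t2);        // network time only
            int64_t offset     = ((t2 - t1) + (t3 - t4)) / 2;  // estimated B - A
            int64_t error      = round_trip / 2;               // worst-case bound

            std::cout << "estimated offset: " << offset
                      << " ns (+/- " << error << " ns)\n";
            return 0;
        }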

    Techniques To Facilitate the Understanding of Inter-process Communication Traces

    High Performance Computing (HPC) systems play an important role in today's heavily digitized world, which constantly demands higher calculation speed and performance. HPC applications are used in multiple domains such as telecommunication, health, scientific research, and more. With the emergence of multi-core and cloud computing platforms, the HPC paradigm is quickly becoming the design of choice of many service providers. HPC systems are also known to be complex to debug and analyze due to the large number of processes they involve and the way these processes communicate with each other to perform specific tasks. As a result, software engineers must spend extensive amounts of time understanding the complex interactions among a system's processes. This is usually done through the analysis of execution traces generated from running the system at hand. Traces, however, are very difficult to work with due to their overwhelming size. The objective of this research is to present a set of techniques that facilitates the understanding of the behaviour of HPC applications through the analysis of system traces. The first technique consists of building an exchange format called MTF (MPI Trace Format) for representing and exchanging traces generated from HPC applications based on the MPI (Message Passing Interface) standard, which is a de facto standard for inter-process communication for high performance computing systems. The design of MTF is validated against well-known requirements for a standard exchange format. The second technique aims to facilitate the understanding of large traces of inter-process communication by automatically extracting communication patterns that characterize their main behaviour. Two algorithms are presented. The first permits the recognition of repeating patterns in traces of MPI applications, whereas the second determines whether a given communication pattern occurs in a trace. Both algorithms are based on the n-gram extraction technique used in natural language processing. Finally, we developed a technique to abstract MPI traces by detecting the different execution phases in a program based on concepts from information theory. Using this approach, software engineers can examine the trace as a sequence of high-level computational phases instead of a mere flow of low-level events. The techniques presented in this thesis have been tested on traces generated from real HPC programs. The results from several case studies demonstrate the usefulness and effectiveness of our techniques.
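
    The following self-contained C++ example illustrates the n-gram idea underlying the pattern-recognition algorithm: the trace is flattened into a sequence of MPI call names, all n-grams of a fixed length are counted, and those occurring more than once are reported as candidate repeating communication patterns. The trace contents are invented for the example, and the actual algorithms operate on much richer MTF trace data.

        #include <iostream>
        #include <map>
        #include <string>
        #include <vector>

        // Toy n-gram extraction over a flattened trace of MPI call names.
        int main() {
            std::vector<std::string> trace = {
                "MPI_Send", "MPI_Recv", "MPI_Allreduce",
                "MPI_Send", "MPI_Recv", "MPI_Allreduce",
                "MPI_Barrier"};
            const size_t n = 3;  // length of the n-grams to extract

            // Count every n-gram occurring in the event sequence.
            std::map<std::string, int> counts;
            for (size_t i = 0; i + n <= trace.size(); ++i) {
                std::string gram;
                for (size_t j = 0; j < n; ++j) gram += trace[i + j] + " ";
                ++counts[gram];
            }

            // n-grams seen more than once are candidates for recurring patterns.
            for (const auto& [gram, count] : counts)
                if (count > 1)
                    std::cout << "repeating pattern: " << gram << "(x" << count << ")\n";
            return 0;
        }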

    Structures de données hautement extensibles pour le stockage sur disque de séries temporelles hétérogÚnes

    Computer systems are becoming more and more complex, and developers need more than ever to understand how the many components of their systems interact. Numerous tools exist to instrument, measure, and analyze the behavior and performance of software. Tracing is a technique that records many points associated with system events, together with the timestamp at which they occurred. Manual analysis of the resulting traces makes it possible to understand various problems, but it becomes tedious when these histories contain very large numbers of points. Software exists to automate these analyses and provide visualizations, but it too can reach its limits when scaling to the largest systems. In previous work, Montplaisir et al. presented an on-disk data structure optimized for storing the results of trace analyses as state intervals. The structure, called the State History Tree (SHT), is a tree in which each node is mapped to a disk block; each node therefore has a fixed capacity for storing intervals and is defined by a time range that must be included in the range of its parent node, while the ranges of two children must not overlap. This structure was more efficient than other, generic solutions, but it could degenerate in cases with a very large number of keys: for a trace with many threads, for example, the depth of the tree became proportional to the number of threads and a great many "empty" nodes were written to disk, wasting space. Moreover, the queries used to extract information from the structure were often the bottleneck for displaying the data. In this work, we analyze the limitations of the current database that cause it to degenerate, and we study the use cases of the queries. We propose structural modifications that eliminate the degenerate cases when the trace contains many attributes, while reducing the on-disk size of the structure for all types of traces. We also add metadata to the tree nodes to reduce the number of nodes read during queries, which shortens query times by 50% in most cases. Next, we optimize the process of inserting intervals into the tree nodes, so that intervals that will be requested by the same query are grouped together, limiting the number of disk blocks that must be read to answer it. The number of intervals taken into account by the optimization can grow with the number of keys, for example, which maintains a balance between the extra time required for the optimization and the gains observed on queries, gains that become more pronounced when the analysis produces many keys. We also introduce a new type of query that takes advantage of these optimizations and returns in a single query a set of intervals that previously required several queries; moreover, this query guarantees that each node is read at most once, whereas the use of several queries meant that some nodes were read several times. We show that using this query in one of the main views of the visualization software considerably increases its responsiveness. We then build on these lessons to improve the scalability of a second data structure of the trace analysis software, which stores objects called "segments". These objects were previously stored in memory, which limited how many we could store. We use a tree structure strongly inspired by the SHT. We show that the on-disk structure is, at worst, an order of magnitude slower than the in-memory structures for reads. Moreover, this structure is particularly efficient for a use case that requires returning sorted segments: we use an algorithm performing lazy evaluation and a partial sort between nodes, which uses less memory than sorting all the segments.

    Computer systems are becoming more and more complex, and developers need more than ever to be able to understand how different components interact. Many tools have been developed for instrumenting, measuring, and analysing the behavior and performance of software. One of these techniques is tracing, which records data points and a timestamp associated with system events. Trace analysis can help solve a number of problems, but manual analysis becomes a daunting task when these traces contain a large number of points. Automated trace analysis software has been developed for this use case, but it too can face difficulties scaling up to the largest systems. In previous work, Montplaisir et al. presented a disk-based data structure optimized for storing the results of trace analysis as state intervals. This State History Tree (SHT) is a tree in which each node is mapped to a block on disk, such that each node has a fixed capacity to store intervals and is defined by a time range that must be included in its parent's range and must not overlap with its siblings' ranges. This structure was demonstrated to be more efficient than other, generic solutions, but could still degenerate for corner cases with many keys, from traces with many threads for example. The tree's depth would then be proportional to the number of threads, and many empty nodes would be written to disk, wasting space. Moreover, queries to extract data from the data structure were often the bottleneck for displaying data. In this work, we analyse the limitations of the current database which cause it to degenerate and study the different use cases for queries. We suggest structural changes to the data structure which eliminate the corner-case behavior for traces with many attributes, while reducing the disk usage for all types of traces. We also add metadata to the nodes to reduce the number of nodes searched during queries, speeding them up by 50%. Then, we look into optimizing the nodes into which intervals are inserted, so that those which will be queried together will be grouped. This helps to reduce the number of disk blocks that must be read to answer the query. The number of intervals and nodes taken into account by the optimization process can increase along with the number of attributes, as they are the main cause of query slowdown. This helps to balance the extra time required for the optimized insertion and the gains provided on the queries. We also introduce a new type of query to benefit from these optimizations and return all desired intervals in a single query instead of the many queries previously required. This single query reads each node at most once, while the previous version with many queries would read some nodes several times. We show that using this query for one of the main views in the trace visualization software makes it considerably more responsive. We apply these lessons to increase the scalability of another internal backend, the segment store. Segments were previously stored in memory, which would strongly limit the maximum capacity. We propose a new tree structure similar to the SHT instead. We show that the disk-based structure is, in the worst case, only an order of magnitude slower for reads than the in-memory structures. Moreover, this structure is especially efficient for a typical segment store use case, which is a sorted segment query. Indeed, by using partial sorts between nodes, memory usage is dramatically reduced compared to sorting all segments in memory.
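
    The following C++ sketch gives a highly simplified picture of the State History Tree discussed above: each node covers a time range nested inside its parent's range and holds a bounded number of state intervals, so a point query only needs to visit the nodes whose range contains the queried time. Node metadata, the on-disk layout, and the structural improvements proposed in this work are omitted, and all identifiers are illustrative.

        #include <cstdint>
        #include <iostream>
        #include <memory>
        #include <vector>

        struct StateInterval {
            uint64_t start, end;
            uint32_t attribute;   // key, e.g. a per-thread state attribute
            int32_t  value;
        };

        // One SHT-like node: covers a time range nested inside its parent's range
        // and holds a bounded number of intervals (one disk block in the real SHT).
        struct SHTNode {
            uint64_t range_start, range_end;
            std::vector<StateInterval> intervals;
            std::vector<std::unique_ptr<SHTNode>> children;
        };

        // Collect every interval that intersects the query time t. Only subtrees
        // whose time range contains t need to be visited, which is what keeps
        // point queries cheap even for long traces.
        void query(const SHTNode& node, uint64_t t, std::vector<StateInterval>& out) {
            if (t < node.range_start || t > node.range_end) return;
            for (const auto& iv : node.intervals)
                if (iv.start <= t && t <= iv.end) out.push_back(iv);
            for (const auto& child : node.children) query(*child, t, out);
        }

        int main() {
            SHTNode root{0, 100};
            root.intervals.push_back({0, 40, /*attribute=*/1, /*value=*/7});
            auto child = std::make_unique<SHTNode>(SHTNode{40, 100});
            child->intervals.push_back({50, 90, /*attribute=*/2, /*value=*/3});
            root.children.push_back(std::move(child));

            std::vector<StateInterval> result;
            query(root, 60, result);
            std::cout << result.size() << " interval(s) cover t=60\n";
            return 0;
        }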

    Performance Optimization Strategies for Transactional Memory Applications

    This thesis presents tools for Transactional Memory (TM) applications that cover multiple TM systems (software, hardware, and hybrid TM) and use information from all layers of the TM software stack. To this end, the thesis addresses a number of challenges in extracting static information, information about the runtime behavior, and expert-level knowledge, and uses them to develop new methods and strategies for the optimization of TM applications.
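
    As a rough illustration of the kind of runtime information a TM performance tool typically collects, the following C++ sketch counts how often an atomic block commits and how often it must retry because of a conflict. The retry loop below is only a stand-in for a real transactional region; it is not taken from the thesis or from any specific TM system, and the counters are hypothetical.

        #include <atomic>
        #include <cstdint>
        #include <iostream>
        #include <thread>
        #include <vector>

        // Toy stand-in for a transactional region: an optimistic retry loop around
        // a compare-and-swap. A TM performance tool would record, per atomic block,
        // how often it retried (aborts) versus how often it committed directly.
        std::atomic<uint64_t> shared_counter{0};
        std::atomic<uint64_t> commits{0}, retries{0};

        void worker(int iterations) {
            for (int i = 0; i < iterations; ++i) {
                uint64_t expected = shared_counter.load();
                while (!shared_counter.compare_exchange_weak(expected, expected + 1))
                    ++retries;              // conflict: the "transaction" re-executes
                ++commits;
            }
        }

        int main() {
            std::vector<std::thread> threads;
            for (int t = 0; t < 4; ++t) threads.emplace_back(worker, 10000);
            for (auto& th : threads) th.join();
            std::cout << "commits: " << commits << ", retries: " << retries << '\n';
            return 0;
        }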

    Verification Witnesses

    Over the last years, witness-based validation of verification results has become an established practice in software verification: an independent validator re-establishes verification results of a software verifier using verification witnesses, which are stored in a standardized exchange format. In addition to validation, such exchangeable information about proofs and alarms found by a verifier can be shared across verification tools, and users can apply independent third-party tools to visualize and explore witnesses to help them comprehend the causes of bugs or the reasons why a given program is correct. To achieve the goal of making verification results more accessible to engineers, it is necessary to consider witnesses as first-class exchangeable objects, stored independently from the source code and checked independently from the verifier that produced them, respecting the important principle of separation of concerns. We present the conceptual principles of verification witnesses, give a description of how to use them, provide a technical specification of the exchange format for witnesses, and perform an extensive experimental study on the application of witness-based result validation, using the validators CPAchecker, UAutomizer, CPA-witness2test, and FShell-witness2test.
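
    The following C++ sketch models, in a purely schematic way, how a validator could replay a reported error path against a violation witness: the witness is viewed as an automaton whose transitions are guarded by program locations and carry assumptions, and the path is accepted if it drives the automaton into the error state. The standardized exchange format mentioned in the abstract is considerably richer; the structures, line numbers, and assumptions below are invented for illustration.

        #include <iostream>
        #include <map>
        #include <string>
        #include <vector>

        // Schematic witness transition: guarded by a program location and carrying
        // an assumption that constrains the replayed run (shown here as text only).
        struct Transition {
            int         source_line;   // which program line must be executed
            std::string assumption;    // e.g. "input == -1"
            std::string target_state;
        };

        int main() {
            // Witness: from the initial state, reaching line 12 under the assumption
            // "input == -1" leads to the error state, i.e. the verifier claims a bug
            // is reachable along this path.
            std::map<std::string, std::vector<Transition>> witness = {
                {"init", {{12, "input == -1", "error"}}}};

            // A candidate error path (sequence of executed lines) produced while
            // re-checking the program; the validator tests whether the witness
            // accepts it.
            std::vector<int> observed_path = {3, 7, 12};

            std::string state = "init";
            for (int line : observed_path)
                for (const auto& t : witness[state])
                    if (t.source_line == line) { state = t.target_state; break; }

            std::cout << (state == "error" ? "witness confirms the alarm\n"
                                           : "witness could not be validated\n");
            return 0;
        }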