
    Software Performance Engineering using Virtual Time Program Execution

    In this thesis we introduce a novel approach to software performance engineering that is based on the execution of code in virtual time. Virtual time execution models the timing-behaviour of unmodified applications by scaling observed method times or replacing them with results acquired from performance model simulation. This facilitates the investigation of "what-if" performance predictions of applications comprising an arbitrary combination of real code and performance models. The ability to analyse code and models in a single framework enables performance testing throughout the software lifecycle, without the need to extract performance models from code. This is accomplished by forcing thread scheduling decisions to take into account the hypothetical time-scaling or model-based performance specifications of each method. The virtual time execution of I/O operations or multicore targets is also investigated. We explore these ideas using a Virtual EXecution (VEX) framework, which provides performance predictions for multi-threaded applications. The language-independent VEX core is driven by an instrumentation layer that notifies it of thread state changes and method profiling events; it is then up to VEX to control the progress of application threads in virtual time on top of the operating system scheduler. We also describe a Java Instrumentation Environment (JINE), demonstrating the challenges involved in virtual time execution at the JVM level. We evaluate the VEX/JINE tools by executing client-side Java benchmarks in virtual time and identifying the causes of deviations from observed real times. Our results show that VEX and JINE transparently provide predictions for the response time of unmodified applications with typically good accuracy (within 5-10%) and low simulation overheads (25-50% additional time). We conclude this thesis with a case study that shows how models and code can be integrated, thus illustrating our vision of how virtual time execution can support performance testing throughout the software lifecycle.
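
    The core idea above, scaling observed method durations into a per-thread virtual clock, can be illustrated with a small sketch. The following Python snippet is not the VEX/JINE implementation (which controls real thread scheduling beneath the JVM); it only shows, under assumed names such as scaled and hot_method, how a measured method time could be multiplied by a hypothetical speed-up factor and charged to a virtual clock to answer a "what-if" question.

        # Minimal sketch (not the VEX implementation): charge a scaled version of the
        # observed method time to a per-thread virtual clock.
        import time
        import threading

        _virtual_clock = threading.local()  # hypothetical per-thread virtual time

        def virtual_time():
            return getattr(_virtual_clock, "t", 0.0)

        def scaled(factor):
            """Decorator: charge factor * observed duration to the caller's virtual clock."""
            def wrap(fn):
                def inner(*args, **kwargs):
                    start = time.perf_counter()
                    result = fn(*args, **kwargs)
                    observed = time.perf_counter() - start
                    _virtual_clock.t = virtual_time() + factor * observed
                    return result
                return inner
            return wrap

        @scaled(0.5)          # hypothesis: an optimized version would run twice as fast
        def hot_method():
            sum(range(100000))

        hot_method()
        print(f"virtual time charged: {virtual_time():.6f}s")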

    Foo's To Blame: Techniques For Mapping Performance Data To Program Variables

    Traditional methods of performance analysis offer a code centric view, presenting performance data in terms of blocks of contiguous code (statement, basic block, loop, function, etc.). Existing data centric techniques allow various program properties to be mapped directly to variables. Our approach extends these data centric mappings. Just as code centric techniques allow lower level objects like source lines to be mapped up to functions, our inclusive technique allows low level data centric operations like computations on scalars to be mapped up to complex data structures like those found in scientific frameworks. Our system utilizes static analysis to collect information about the program that can be combined with runtime information to perform data centric program analysis. By pushing most of the analysis to pre-run and post-mortem stages, we can minimize the amount of data collected at runtime. This allows us to perform less instrumentation and also minimizes program perturbation. It also allows us to collect information that would not be possible with existing techniques. We present two applications of this analysis. The first is targeted at mapping performance data to high level data structures with multiple levels of abstraction. We create extended data centric mappings, which we call variable blame, that relate data centric information to these variables. The second application is a method for mapping cache miss information to variables. Existing approaches for this analysis rely on explicit hardware support and extensive program instrumentation. By utilizing our analysis and applying software heuristics, we are able to lessen those requirements. We apply both of these analyses to applications and show what performance information can be provided by our analysis that cannot currently be determined. We also discuss how we can use that information to improve program performance.
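
    As a rough illustration of the "inclusive" data centric mapping described above, the sketch below rolls costs attributed to low level scalar fields up into their enclosing data structures, much as line-level costs are rolled up into functions. It is a hypothetical Python toy, not the authors' system: the sample names (mesh.vertices.x, solver.residual) and cost units are invented, and a real implementation would derive the structure hierarchy from static analysis rather than from dotted names.

        # Hypothetical rollup: fold costs observed on scalar fields upward into the
        # enclosing data structures, from the field to the root object.
        samples = {
            "mesh.vertices.x": 120,   # assumed cost units per low-level variable
            "mesh.vertices.y": 95,
            "mesh.cells.volume": 60,
            "solver.residual": 40,
        }

        def rollup(samples):
            totals = {}
            for name, cost in samples.items():
                parts = name.split(".")
                # charge every enclosing structure on the path to the scalar field
                for depth in range(1, len(parts) + 1):
                    prefix = ".".join(parts[:depth])
                    totals[prefix] = totals.get(prefix, 0) + cost
            return totals

        for name, cost in sorted(rollup(samples).items(), key=lambda kv: -kv[1]):
            print(f"{cost:6d}  {name}")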

    Control of Cable Robots for Construction Applications


    On the efficiency of reductions in μ-SIMD media extensions

    Many important multimedia applications contain a significant fraction of reduction operations. Although, in general, multimedia applications are characterized by high amounts of Data Level Parallelism, reductions and accumulations are difficult to parallelize and show a poor tolerance to increases in the latency of the instructions. This is especially significant for μ-SIMD extensions such as MMX or AltiVec. To overcome the problem of reductions in μ-SIMD ISAs, designers tend to include more and more complex instructions able to deal with the most common forms of reductions in multimedia. As the number of processor pipeline stages grows, the number of cycles needed to execute these multimedia instructions increases with every processor generation, severely compromising performance. The paper presents an in-depth discussion of how reductions/accumulations are performed in current μ-SIMD architectures and evaluates the performance trade-offs for near-future highly aggressive superscalar processors with three different styles of μ-SIMD extensions. We compare an MMX-like alternative to an MDMX-like extension that has packed accumulators to attack the reduction problem, and we also compare it to MOM, a matrix register ISA. We show that while packed accumulators present several advantages, they introduce artificial recurrences that severely degrade performance for processors with a high number of registers and long latency operations. On the other hand, the paper demonstrates that longer SIMD media extensions such as MOM can take great advantage of accumulators by exploiting the associative parallelism implicit in reductions.
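
    A small, language-neutral sketch (in Python/NumPy, not the paper's hardware evaluation) of the underlying point: a scalar accumulation forms a serial dependence chain, whereas keeping several independent partial accumulators, which is the intuition behind packed or wider accumulators, exposes the associative parallelism in a reduction. The lane count of 4 and the data are illustrative assumptions.

        # Illustrative only: serial reduction vs. independent partial accumulators.
        import numpy as np

        data = np.arange(1, 1_000_001, dtype=np.int64)

        # serial reduction: every addition depends on the previous one
        acc = 0
        for x in data[:8]:           # first few elements, just to show the chain
            acc = acc + x

        # "packed accumulator" style: 4 independent lanes, combined once at the end
        lanes = data.reshape(-1, 4).sum(axis=0)   # each lane accumulates independently
        total = lanes.sum()

        assert total == data.sum()
        print(total)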

    Statistical framework for video decoding complexity modeling and prediction

    Video decoding complexity modeling and prediction is an increasingly important issue for efficient resource utilization in a variety of applications, including task scheduling, receiver-driven complexity shaping, and adaptive dynamic voltage scaling. In this paper we present a novel view of this problem from a statistical framework perspective. We explore the statistical structure (clustering) of the execution time required by each video decoder module (entropy decoding, motion compensation, etc.) in conjunction with complexity features that are easily extractable at encoding time (representing the properties of each module's input source data). For this purpose, we employ Gaussian mixture models (GMMs) and an expectation-maximization algorithm to estimate the joint execution-time and feature probability density function (PDF). A training set of typical video sequences is used for this purpose in an offline estimation process. The obtained GMM representation is used in conjunction with the complexity features of new video sequences to predict the execution time required for decoding these sequences. Several prediction approaches are discussed and compared. The potential mismatch between the training set and new video content is addressed by adaptive online joint-PDF re-estimation. An experimental comparison is performed to evaluate the different approaches and to compare the proposed prediction scheme with related resource prediction schemes from the literature. The usefulness of the proposed complexity-prediction approaches is demonstrated in an application of rate-distortion-complexity optimized decoding.
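
    A hedged sketch of the general approach, not the paper's exact estimator: fit a GMM to joint (complexity feature, decoding time) samples offline, then predict the decoding time of new content as the conditional expectation of time given the observed feature under that mixture. The training data below is synthetic and the single scalar feature is an assumption; scikit-learn's GaussianMixture stands in for whatever EM implementation the authors used.

        # Fit a 2-D GMM to (feature, time) and predict E[time | feature].
        import numpy as np
        from sklearn.mixture import GaussianMixture
        from scipy.stats import norm

        rng = np.random.default_rng(0)
        # synthetic training data: feature (e.g. coefficient count), time in ms (assumed)
        feat = rng.uniform(0, 100, 2000)
        time_ms = 0.05 * feat + rng.normal(0, 1 + feat / 50, 2000) + 2.0
        X = np.column_stack([feat, time_ms])

        gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

        def predict_time(f):
            """E[time | feature=f]: mix each component's conditional mean by its posterior weight."""
            means, covs, w = gmm.means_, gmm.covariances_, gmm.weights_
            # posterior responsibility of each component given only the feature dimension
            lik = np.array([w[k] * norm.pdf(f, means[k, 0], np.sqrt(covs[k, 0, 0]))
                            for k in range(len(w))])
            lik /= lik.sum()
            cond_means = np.array([means[k, 1] + covs[k, 1, 0] / covs[k, 0, 0] * (f - means[k, 0])
                                   for k in range(len(w))])
            return float(lik @ cond_means)

        print(predict_time(40.0))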

    Dynamic Analysis of Embedded Software

    Most embedded applications are constructed with multiple threads to handle concurrent events. For optimization and debugging of the programs, dynamic program analysis is widely used to collect execution information while the program is running. Unfortunately, the non-deterministic behavior of multithreaded embedded software makes dynamic analysis difficult. In addition, instrumentation overhead for gathering execution information may change the execution of a program and lead to distorted analysis results, i.e., the probe effect. This thesis presents a framework that tackles the non-determinism and probe effect incurred in dynamic analysis of embedded software. The thesis consists largely of three parts. First, we discuss a deterministic replay framework that provides reproducible execution. Once a program execution is recorded, software instrumentation can be safely applied during replay without the probe effect. Second, the probe effect is discussed and a simulation-based analysis is proposed to detect execution changes of a program caused by instrumentation overhead. The simulation-based analysis examines whether the recording instrumentation changes the original program execution. Lastly, the thesis discusses data race detection algorithms that help to remove data races for correctness of the replay and the simulation-based analysis. The focus is on making the detection efficient for C/C++ programs and on increasing the scalability of the detection on multi-core machines.
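
    The record/replay idea can be illustrated with a minimal, hypothetical sketch: record the order in which threads reach a synchronization point, then enforce that same order during replay, so instrumentation added at replay time cannot perturb the interleaving. This Python toy is not the thesis framework (which targets C/C++ embedded software and records far more than a single sync point); class and function names are invented.

        # Record the order of arrivals at a sync point, then replay that exact order.
        import threading, random, time

        class ScheduleLog:
            """Record mode logs arrival order; replay mode blocks threads until their turn."""
            def __init__(self, recorded=None):
                self.cond = threading.Condition()
                self.recorded = recorded   # None => record mode, list => replay mode
                self.log = []
                self.next = 0

            def sync_point(self, tid):
                with self.cond:
                    if self.recorded is None:
                        self.log.append(tid)                     # record
                    else:
                        while self.recorded[self.next] != tid:   # replay: wait for our turn
                            self.cond.wait()
                        self.next += 1
                        self.cond.notify_all()

        def worker(sched, tid):
            time.sleep(random.random() / 100)   # nondeterministic arrival
            sched.sync_point(tid)

        def run(sched):
            threads = [threading.Thread(target=worker, args=(sched, t)) for t in range(4)]
            for t in threads: t.start()
            for t in threads: t.join()

        record = ScheduleLog()
        run(record)
        replay = ScheduleLog(recorded=record.log)  # same interleaving on every replay
        run(replay)
        print("recorded order:", record.log)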

    Performance Evaluation of Block Acquisition and Tracking Algorithms Using an Open Source GPS Receiver Platform

    Location technologies have many applications in wireless communications, military and space missions, and more. The US Global Positioning System (GPS) and other existing and emerging Global Navigation Satellite Systems (GNSS) are expected to provide accurate location information to enable such applications. While these systems perform very well in strong signal conditions, their operation in many urban, indoor, and space applications is not robust or even impossible due to weak signals and strong distortions. The search for less costly, faster and more sensitive receivers is still in progress. As the research community addresses more and more complicated phenomena, there is a demand for flexible multimode reference receivers, associated SDKs, and development platforms that can accelerate and facilitate the research. One such concept is the software GPS/GNSS receiver (GPS SDR), which permits easy access to algorithmic libraries and makes it possible to integrate more advanced algorithms without hardware or major software updates. The GNU-SDR and GPS-SDR open source receiver platforms are popular examples. This paper evaluates the performance of recently proposed block-correlator techniques for acquisition and tracking of GPS signals using the open source GPS-SDR platform.
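
    As background on the block (parallel code phase) acquisition being evaluated, the sketch below shows the usual FFT-based trick of correlating a received block against a local code replica at every code phase at once and picking the peak. It is a simplified, assumption-laden toy: a random ±1 sequence stands in for a real C/A PRN code, and the Doppler search and tracking loops of an actual GPS-SDR receiver are omitted.

        # FFT-based circular correlation: search all code phases of one block at once.
        import numpy as np

        rng = np.random.default_rng(1)
        code = rng.choice([-1.0, 1.0], size=1023)          # stand-in for a 1023-chip PRN code
        true_phase = 357
        received = np.roll(code, true_phase) + rng.normal(0, 0.5, code.size)  # delayed + noisy

        # circular correlation over every code phase in one shot
        corr = np.fft.ifft(np.fft.fft(received) * np.conj(np.fft.fft(code)))
        est_phase = int(np.argmax(np.abs(corr)))

        print("estimated code phase:", est_phase, "true:", true_phase)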

    Trace-based Performance Analysis for Hardware Accelerators

    This thesis presents how performance data from hardware accelerators can be included in event logs. It extends the capabilities of trace-based performance analysis to also monitor and record data from this novel parallelization layer. The increasing awareness of the power consumption of computing devices has led to an interest in hybrid computing architectures as well. High-end computers, workstations, and mobile devices are starting to employ hardware accelerators to offload computationally intense and parallel tasks, while at the same time retaining a highly efficient scalar compute unit for non-parallel tasks. This execution pattern is typically asynchronous, so that the scalar unit can resume other work while the hardware accelerator is busy. Performance analysis tools provided by the hardware accelerator vendors cover the situation of one host using one device very well, yet they do not address the needs of the high performance computing community. This thesis investigates ways to extend existing methods for recording events from highly parallel applications to also cover scenarios in which hardware accelerators aid these applications. After introducing a generic approach that is suitable for any API-based acceleration paradigm, the thesis derives a suggestion for a generic performance API for hardware accelerators and its implementation with NVIDIA CUPTI. Next, the visualization of event logs containing data from execution streams on different levels of parallelism is discussed. In order to overcome the limitations of classic performance profiles and timeline displays, a graph-based visualization using Parallel Performance Flow Graphs (PPFGs) is introduced. This novel approach uses program states to display similarities and differences between the potentially very large number of event streams and thus enables a fast way to spot load imbalances. The thesis concludes with an in-depth case study of PIConGPU, a highly parallel, multi-hybrid plasma physics simulation that benefited greatly from the developed performance analysis methods.
    This dissertation shows how the execution of application parts that have been offloaded to hardware accelerators can also be recorded as an event trace. This extends the established technique of trace-based application performance analysis so that this new level of parallelism is captured as well. The constraints placed on computer systems by their electrical power consumption have led to a growing number of hybrid computer architectures. High-performance computers as well as workstations and mobile devices now use hardware accelerators to offload compute-intensive, parallel program parts, relieving the scalar main processor so that it handles only the non-parallel program parts. This execution scheme is typically asynchronous: the scalar processor can continue working while the hardware accelerator computes. The performance analysis tools supplied by hardware accelerator vendors cover the standard case (one host system with one hardware accelerator) very well, but fail to support highly parallel computer systems. This dissertation investigates to what extent multi-hybrid applications can also record the activity of hardware accelerators; to this end, the existing method for generating event traces of highly parallel applications is extended accordingly. The work first develops a general methodology with which an event trace can be produced for any API-based hardware acceleration. Building on this, a dedicated programming interface is developed that makes it possible to record further performance-relevant data; its implementation is demonstrated using NVIDIA CUPTI. A further part of the work deals with the visualization of event traces that contain recordings from the different levels of parallelism. To overcome the limitations of classic performance profiles and timeline displays, Parallel Performance Flow Graphs (PPFGs) are introduced as a new graph-based form of presentation. This novel approach shows an event trace as a sequence of program states with shared and differing control flows, so divergent program behaviour and load imbalances can be located much more easily. The work concludes with a detailed analysis of PIConGPU, a multi-hybrid plasma physics simulation, which has benefited greatly from the analysis capabilities developed in this work.
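
    To make the event-log idea concrete, here is a minimal, hypothetical sketch of trace records for asynchronous accelerator activity: host threads and device streams each own a sequence of timestamped begin/end events, so a kernel that overlaps the launching host call can later be laid out on its own timeline. This is an invented toy structure, not the OTF2/CUPTI-based implementation described in the thesis.

        # Toy event log: separate locations for host threads and accelerator streams.
        import time
        from dataclasses import dataclass, field

        @dataclass
        class Event:
            location: str      # e.g. "host:thread0" or "device:stream0"
            name: str
            begin: float
            end: float

        @dataclass
        class EventLog:
            events: list = field(default_factory=list)

            def record(self, location, name, begin, end):
                self.events.append(Event(location, name, begin, end))

        log = EventLog()
        t0 = time.perf_counter()
        log.record("host:thread0",   "launch_kernel", t0,          t0 + 0.0001)  # async launch returns quickly
        log.record("device:stream0", "kernel<add>",   t0 + 0.0002, t0 + 0.0050)  # device work overlaps host
        log.record("host:thread0",   "compute_other", t0 + 0.0001, t0 + 0.0040)

        for e in sorted(log.events, key=lambda e: e.begin):
            print(f"{e.location:16s} {e.name:15s} {e.end - e.begin:.4f}s")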