2,051 research outputs found

    BarrierPoint: sampled simulation of multi-threaded applications

    Sampling is a well-known technique to speed up architectural simulation of long-running workloads while maintaining accurate performance predictions. A number of sampling techniques have recently been developed that extend well-known single-threaded techniques to allow sampled simulation of multi-threaded applications. Unfortunately, prior work is limited to non-synchronizing applications (e.g., server throughput workloads); requires the functional simulation of the entire application using a detailed cache hierarchy, which limits the overall simulation speedup potential; leads to different units of work across different processor architectures, which complicates performance analysis; or requires massive machine resources to achieve reasonable simulation speedups. In this work, we propose BarrierPoint, a sampling methodology to accelerate simulation by leveraging globally synchronizing barriers in multi-threaded applications. BarrierPoint collects microarchitecture-independent code and data signatures to determine the most representative inter-barrier regions, called barrierpoints. BarrierPoint estimates total application execution time (and other performance metrics of interest) through detailed simulation of these barrierpoints only, leading to substantial simulation speedups. Barrierpoints can be simulated in parallel, use fewer simulation resources, and define fixed units of work to be used in performance comparisons across processor architectures. Our evaluation of BarrierPoint using NPB and Parsec benchmarks reports average simulation speedups of 24.7x (and up to 866.6x) with an average simulation error of 0.9% and 2.9% at most. On average, BarrierPoint reduces the number of simulation machine resources needed by 78x.
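    As a rough illustration of the selection-and-extrapolation step described above (clustering per-region signatures and weighting each representative by its cluster size), here is a minimal Python sketch. The signature format, the use of KMeans, and the function names are assumptions made for illustration, not BarrierPoint's actual implementation.

```python
# Illustrative sketch (not the authors' code): pick representative inter-barrier
# regions by clustering per-region signature vectors, then extrapolate runtime.
import numpy as np
from sklearn.cluster import KMeans

def select_barrierpoints(signatures, n_clusters):
    """signatures: (n_regions, dim) array of microarchitecture-independent
    vectors, one per inter-barrier region (assumed input format)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(signatures)
    reps, weights = [], []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        # representative = region closest to the cluster centroid
        dists = np.linalg.norm(signatures[members] - km.cluster_centers_[c], axis=1)
        reps.append(int(members[np.argmin(dists)]))
        weights.append(len(members))          # each rep stands in for its cluster
    return reps, weights

def estimate_total_time(rep_times, weights):
    # total = sum over clusters of (detailed time of representative) * (cluster size)
    return float(np.dot(rep_times, weights))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sigs = rng.random((40, 16))               # 40 inter-barrier regions, toy data
    reps, w = select_barrierpoints(sigs, n_clusters=5)
    simulated = rng.random(len(reps))         # stand-in for detailed simulation times
    print(reps, estimate_total_time(simulated, w))
```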

    A User Friendly Phase Detection Methodology for HPC Systems' Analysis

    A wide array of today's high performance computing (HPC) applications exhibits recurring behaviours or execution phases throughout their run-time. Accurate detection of program phases allows reconfiguring the system for a better power/performance trade-off and can reduce the simulation time of programs by identifying regions of code whose performance is critical to the entire program. Program phases are also reflected in the different behaviours the system goes through, or system phases, which can be used as an alternative means of program phase detection for users lacking expertise. In this paper, we present execution vector based (EV-based) phase detection, an on-line methodology for detecting phases in the behaviour of an HPC system and determining execution points that correspond to these phases. We also present a methodology for defining a small set of EVs representative of the system's behaviour over a fixed period of time and show that EV-based phase detection identifies recurring phases. Our methodology is illustrated with benchmarks and a real-life application.
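    The on-line detection idea can be pictured with a small sketch: each incoming execution vector is compared against the phases seen so far and reuses a phase label when it falls within a distance threshold of a known phase. The Manhattan distance, the threshold value, and the vector contents below are illustrative assumptions, not the paper's method.

```python
# Minimal sketch of on-line, vector-based phase detection; an "execution vector"
# here is simply a normalized vector of sampled system/hardware counters.
import numpy as np

def detect_phases(exec_vectors, threshold=0.3):
    """Label each execution vector with a phase id; recurring behaviour is
    mapped back to the phase it matches instead of opening a new one."""
    centroids, labels = [], []
    for ev in exec_vectors:
        ev = np.asarray(ev, dtype=float)
        dists = [np.abs(ev - c).sum() for c in centroids]
        if dists and min(dists) <= threshold:
            phase = int(np.argmin(dists))           # matches a known phase
        else:
            phase = len(centroids)                  # new phase
            centroids.append(ev.copy())
        labels.append(phase)
    return labels

if __name__ == "__main__":
    trace = [[0.1, 0.9]] * 5 + [[0.8, 0.2]] * 5 + [[0.1, 0.9]] * 5
    print(detect_phases(trace))   # the recurring behaviour reuses phase id 0
```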

    A Survey of Phase Classification Techniques for Characterizing Variable Application Behavior

    Adaptable computing is an increasingly important paradigm that specializes system resources to variable application requirements, environmental conditions, or user requirements. Adapting computing resources to variable application requirements (or application phases) is otherwise known as phase-based optimization. Phase-based optimization takes advantage of application phases, or execution intervals of an application that behave similarly, to enable effective and beneficial adaptability. In order for phase-based optimization to be effective, the phases must first be classified to determine when application phases begin and end, and to ensure that system resources are accurately specialized. In this paper, we present a survey of phase classification techniques that have been proposed to exploit the advantages of adaptable computing through phase-based optimization. We focus on recent techniques and classify them with respect to several factors in order to highlight their similarities and differences. We divide the techniques by their major defining characteristics: online/offline and serial/parallel. In addition, we discuss other characteristics such as prediction and detection techniques, the characteristics used for prediction, interval type, etc. We also identify gaps in the state of the art and discuss future research directions to enable and fully exploit the benefits of adaptable computing. Comment: To appear in IEEE Transactions on Parallel and Distributed Systems (TPDS).
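    As a generic example of the prediction techniques such surveys cover, the sketch below shows a first-order Markov (table-based) phase predictor; it is not drawn from any specific technique in the survey, and all names are hypothetical.

```python
# Hedged illustration: predict the next interval's phase from the transition
# observed most often so far, falling back to the last observed phase.
from collections import Counter, defaultdict

class MarkovPhasePredictor:
    def __init__(self):
        self.transitions = defaultdict(Counter)   # phase -> Counter of next phases
        self.last = None

    def predict(self):
        """Most frequent successor of the current phase (last-value fallback)."""
        if self.last is None:
            return None
        nxt = self.transitions[self.last]
        return nxt.most_common(1)[0][0] if nxt else self.last

    def update(self, observed_phase):
        if self.last is not None:
            self.transitions[self.last][observed_phase] += 1
        self.last = observed_phase

if __name__ == "__main__":
    pred, hits, phases = MarkovPhasePredictor(), 0, [0, 0, 1, 0, 0, 1, 0, 0, 1]
    for p in phases:
        hits += (pred.predict() == p)
        pred.update(p)
    print(f"correct predictions: {hits}/{len(phases)}")
```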

    A survey of machine learning techniques applied to self-organizing cellular networks

    In this paper, a survey of the literature of the past fifteen years involving Machine Learning (ML) algorithms applied to self-organizing cellular networks is performed. In order for future networks to overcome the current limitations and address the issues of current cellular systems, it is clear that more intelligence needs to be deployed, so that a fully autonomous and flexible network can be enabled. This paper focuses on the learning perspective of Self-Organizing Networks (SON) solutions and provides not only an overview of the most common ML techniques encountered in cellular networks but also a classification of each paper in terms of its learning solution, together with examples. The authors also classify each paper in terms of its self-organizing use-case and discuss how each proposed solution performed. In addition, a comparison between the most commonly found ML algorithms in terms of certain SON metrics is performed, and general guidelines on when to choose each ML algorithm for each SON function are proposed. Lastly, this work also provides future research directions and new paradigms that the use of more robust and intelligent algorithms, together with data gathered by operators, can bring to the cellular networks domain, fully enabling the concept of SON in the near future.

    A Runtime Framework for Energy Efficient HPC Systems Without a Priori Knowledge of Applications

    The rising computing demands of scientific endeavours often require the creation and management of High Performance Computing (HPC) systems for running experiments and processing vast amounts of data. These HPC systems generally operate at peak performance, consuming a large quantity of electricity, even though their workload varies over time. Understanding the behavioural patterns (i.e., phases) of HPC systems during their use is key to adjusting performance to resource demand and hence improving energy efficiency. In this paper, we describe (i) a method to detect phases of an HPC system based on its workload, and (ii) a partial phase recognition technique that works cooperatively with on-the-fly dynamic management. We implement a prototype that guides the use of energy saving capabilities to demonstrate the benefits of our approach. Experimental results reveal the effectiveness of the phase detection method under real-life workloads and benchmarks. A comparison with baseline unmanaged execution shows that the partial phase recognition technique saves up to 15% of energy with less than 1% performance degradation.
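    A minimal sketch of the partial-recognition idea follows, assuming a library of previously observed phases with remembered power settings; the matching rule, the threshold, and the profile names are illustrative assumptions rather than the paper's implementation.

```python
# Hedged sketch: match the prefix of the phase currently being observed against
# stored phases and reuse that phase's power setting before the phase completes.
import numpy as np

def prefix_distance(prefix, stored):
    n = min(len(prefix), len(stored))
    return float(np.mean(np.abs(np.asarray(prefix[:n]) - np.asarray(stored[:n]))))

def recognise_partial(prefix, phase_library, min_prefix=3, threshold=0.1):
    """phase_library: {phase_id: (vector_sequence, power_profile)} (assumed)."""
    if len(prefix) < min_prefix:
        return None                                   # too early to decide
    best_id, best_d = None, threshold
    for pid, (seq, _profile) in phase_library.items():
        d = prefix_distance(prefix, seq)
        if d < best_d:
            best_id, best_d = pid, d
    return None if best_id is None else phase_library[best_id][1]

if __name__ == "__main__":
    library = {"idle": ([0.1, 0.1, 0.1, 0.1], "lowest-frequency"),
               "compute": ([0.9, 0.8, 0.9, 0.9], "highest-frequency")}
    print(recognise_partial([0.12, 0.09, 0.11], library))   # -> 'lowest-frequency'
```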

    Doctor of Philosophy

    Computer programs have complex interactions with their underlying hardware, exhibiting complex behaviors as a result. It is critical to understand these programs, as they serve an important role: researchers use them to express new ideas in computer science, while many others derive production value from them. In both cases, program understanding leads to mastery over these functions, adding value to human endeavors. Memory behavior is one of the hallmarks of general program behavior: it represents the critical function of retrieving data for the program to work on; it often reflects the overall actions taken by the program, providing a signature of program behavior; and it is often an important performance bottleneck, as the memory subsystem is typically much slower than the processor. These reasons justify an investigation into the memory behavior of programs. A memory reference trace is a list of memory transactions performed by a program at runtime, a rich data source capturing the whole of a program's interaction with the memory subsystem, and a clear starting point for investigating program memory behavior. However, such a trace is extremely difficult to interpret by mere inspection, as it consists solely of many, many addresses and operation codes, without any further structure or context. This dissertation proposes to use visualization to construct images and animations of the data within a reference trace, thereby visually conveying structures and events as encoded in the trace. These visualization approaches are designed with different focuses, meant to expose various aspects of the trace. For instance, the time dimension of the reference trace can be handled either with animation, showing events as they occur, or by laying time out in a spatial dimension, giving a view of the entire history of the trace at once. The approaches also vary in their level of abstraction from the hardware: some are concretely connected to representations of the memory itself, while others are more free-form, using more abstract metaphors to highlight general behaviors and patterns, which in turn characterize the program behavior. Each approach delivers its own set of insights, as demonstrated in this dissertation.
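    One of the simpler views described, laying time out in a spatial dimension, can be sketched as a scatter of reference index versus address; the trace format and the colouring below are assumptions made for illustration, not the dissertation's tooling.

```python
# Illustrative sketch: plot a memory reference trace with time on the x-axis
# (reference index) and address on the y-axis, coloured by operation type.
import matplotlib.pyplot as plt

def plot_trace(trace):
    """trace: list of (op, address) tuples, op in {'R', 'W'} (assumed format)."""
    for op, colour in (("R", "tab:blue"), ("W", "tab:red")):
        xs = [i for i, (o, _) in enumerate(trace) if o == op]
        ys = [a for o, a in trace if o == op]
        plt.scatter(xs, ys, s=2, c=colour, label="read" if op == "R" else "write")
    plt.xlabel("reference index (time)")
    plt.ylabel("address")
    plt.legend()
    plt.show()

if __name__ == "__main__":
    # toy trace: a strided read sweep followed by writes to a small hot region
    toy = [("R", 0x1000 + 64 * i) for i in range(200)] + \
          [("W", 0x8000 + (i % 16) * 8) for i in range(200)]
    plot_trace(toy)
```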

    Pac-Sim: Simulation of Multi-threaded Workloads using Intelligent, Live Sampling

    High-performance, multi-core processors are the key to accelerating workloads in several application domains. To continue to scale performance at the limit of Moore's Law and Dennard scaling, software and hardware designers have turned to dynamic solutions that adapt to the needs of applications in a transparent, automatic way. For example, modern hardware improves its performance and power efficiency by changing the hardware configuration, such as the frequency and voltage of cores, according to a number of parameters such as the technology used, the workload running, etc. With this level of dynamism, it is essential to simulate next-generation multi-core processors in a way that can both respond to system changes and accurately determine system performance metrics. Currently, no sampled simulation platform can achieve these goals of dynamic, fast, and accurate simulation of multi-threaded workloads. In this work, we propose a solution that allows for fast, accurate simulation in the presence of both hardware and software dynamism. To accomplish this goal, we present Pac-Sim, a novel methodology for fast, accurate sampled simulation that requires no upfront analysis of the workload. With our proposed methodology, it is now possible to simulate long-running, dynamically scheduled multi-threaded programs with significant simulation speedups, even in the presence of dynamic hardware events. We evaluate Pac-Sim using the multi-threaded SPEC CPU2017, NPB, and PARSEC benchmarks with both static and dynamic thread scheduling. The experimental results show that Pac-Sim achieves a very low sampling error of 1.63% and 3.81% on average for statically and dynamically scheduled benchmarks, respectively. Pac-Sim also demonstrates significant simulation speedups as high as 523.5× (210.3× on average) for the train input set of SPEC CPU2017. Comment: 14 pages, 14 figures.
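    The general flavour of live sampling can be sketched as follows: per interval, decide on the fly whether its signature matches behaviour already simulated in detail and reuse that result, otherwise fall back to detailed simulation. This is a generic illustration under assumed signature and IPC-reuse rules, not Pac-Sim's algorithm.

```python
# Hedged sketch of "live sampling": no upfront profiling pass; behaviour is
# classified as intervals arrive, and detailed results are reused when possible.
import numpy as np

def live_sample(intervals, simulate_detailed, threshold=0.2):
    """intervals: iterable of per-interval signature vectors (assumed input).
    simulate_detailed(i): stand-in for detailed simulation, returns IPC."""
    seen, est_cycles, n_detailed = [], 0.0, 0         # seen: (signature, ipc) pairs
    for i, sig in enumerate(intervals):
        sig = np.asarray(sig, dtype=float)
        dists = [np.abs(sig - s).sum() for s, _ in seen]
        if dists and min(dists) <= threshold:
            ipc = seen[int(np.argmin(dists))][1]      # reuse an earlier result
        else:
            ipc = simulate_detailed(i)                # new behaviour: go detailed
            seen.append((sig, ipc))
            n_detailed += 1
        est_cycles += 1.0 / ipc                       # cycles per (unit) instruction
    return est_cycles, n_detailed

if __name__ == "__main__":
    sigs = [[0.2, 0.8]] * 10 + [[0.7, 0.3]] * 10 + [[0.2, 0.8]] * 10
    cycles, detailed = live_sample(sigs, lambda i: 1.5)
    print(cycles, detailed)   # only a few intervals needed detailed simulation
```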