
    On the accuracy and usefulness of analytic energy models for contemporary multicore processors

    This paper presents refinements to the execution-cache-memory performance model and to a previously published power model for multicore processors. Combining the two enables a very accurate prediction of the performance and energy consumption of contemporary multicore processors as a function of relevant parameters such as the number of active cores and the core and Uncore frequencies. Model validation is performed on the Sandy Bridge-EP and Broadwell-EP microarchitectures. Production-related variations in chip quality are demonstrated through a statistical analysis of the fit parameters obtained on one hundred Broadwell-EP CPUs of the same model. Insights from the models are used to explain the performance- and energy-related behavior of the processors for scalable as well as saturating (i.e., memory-bound) codes. In the process, we demonstrate the models' capability to identify optimal operating points with respect to highest performance, lowest energy-to-solution, and lowest energy-delay product, and we identify a set of best practices for energy-efficient execution.

    Propagation and Decay of Injected One-Off Delays on Clusters: A Case Study

    Analytic, first-principles performance modeling of distributed-memory applications is difficult due to a wide spectrum of random disturbances caused by the application and the system. These disturbances (commonly called "noise") destroy the assumptions of regularity that one usually employs when constructing simple analytic models. Despite numerous efforts to quantify, categorize, and reduce such effects, a comprehensive quantitative understanding of their performance impact is not available, especially for long delays that have global consequences for the parallel application. In this work, we investigate various traces collected from synthetic benchmarks that mimic real applications on simulated and real message-passing systems in order to pinpoint the mechanisms behind delay propagation. We analyze how the propagation speed of idle waves emanating from injected delays depends on the execution and communication properties of the application, and we study how such delays decay under increased noise levels and how they interact with each other. We also show how fine-grained noise can make a system immune to the adverse effects of propagating idle waves. Our results contribute to a better understanding of the collective phenomena that manifest themselves in distributed-memory parallel applications. Comment: 10 pages, 9 figures; title change

    Analytic Performance Modeling and Analysis of Detailed Neuron Simulations

    Big science initiatives are trying to reconstruct and model the brain by simulating brain tissue at larger scales and with more biological detail than previously thought possible. The exponential growth of parallel computer performance has supported these developments, and at the same time maintainers of neuroscientific simulation code have strived to exploit new hardware features efficiently. Current state-of-the-art software for the simulation of biological networks has so far been developed using performance engineering practices, but a thorough analysis and modeling of the computational and performance characteristics, especially in the case of morphologically detailed neuron simulations, is lacking. Other computational sciences have successfully used analytic performance engineering and modeling methods to gain insight into the computational properties of simulation kernels, aid developers in performance optimizations, and eventually drive co-design efforts, but to our knowledge a model-based performance analysis of neuron simulations has not yet been conducted. We present a detailed study of the shared-memory performance of morphologically detailed neuron simulations based on the Execution-Cache-Memory (ECM) performance model. We demonstrate that this model can deliver accurate predictions of the runtime of almost all the kernels that constitute the neuron models under investigation. The insight gained is used to identify the main mechanisms underlying performance bottlenecks in the simulation. The implications of this analysis for the optimization of neural simulation software and eventually for the co-design of future hardware architectures are discussed. In this sense, our work represents a valuable conceptual and quantitative contribution to understanding the performance properties of biological network simulations. Comment: 18 pages, 6 figures, 15 tables

    Mechanistic analytical modeling of superscalar in-order processor performance

    Superscalar in-order processors form an interesting alternative to out-of-order processors because of their energy efficiency and lower design complexity. However, despite the reduced design complexity, it is nontrivial to obtain performance estimates or insight into the application-microarchitecture interaction without running slow, detailed cycle-level simulations, because performance depends strongly on the order of instructions within the application's dynamic instruction stream: in-order processors stall on inter-instruction dependences and functional unit contention. To limit the number of detailed cycle-level simulations needed during design space exploration, we propose a mechanistic analytical performance model that is built from an understanding of the internal mechanisms of the processor. The mechanistic performance model for superscalar in-order processors is shown to be accurate, with an average performance prediction error of 3.2% compared to detailed cycle-accurate simulation using gem5. We also validate the model against hardware, using the ARM Cortex-A8 processor, and show that it is accurate within 10% on average. We further demonstrate the usefulness of the model through three case studies: (1) design space exploration, identifying the optimum number of functional units for achieving a given performance target; (2) program-machine interactions, providing insight into microarchitecture bottlenecks; and (3) compiler-architecture interactions, visualizing the impact of compiler optimizations on performance.

    The Ecological Impact of High-performance Computing in Astrophysics

    The importance of computing in astronomy continues to increase, and so does its impact on the environment. When analyzing data or performing simulations, most researchers raise concerns about the time to reach a solution rather than its impact on the environment. Luckily, a reduced time-to-solution due to faster hardware or optimizations in the software generally also leads to a smaller carbon footprint. This is not the case when the reduced wall-clock time is achieved by overclocking the processor, or when using supercomputers. The increasing popularity of interpreted scripting languages and the general availability of high-performance workstations pose a considerable threat to the environment. A similar concern can be raised about the trend of running single-core instead of adopting efficient many-core programming paradigms. In astronomy, computing is among the top producers of greenhouse gases, surpassing telescope operations. Here I hope to raise awareness of the environmental impact of running non-optimized code on overpowered computer hardware. Comment: Originated at EAS 2020 conference, sustainability session by https://astronomersforplanet.earth - published in Nature Astronomy, September 202

    Chapter One – An Overview of Architecture-Level Power- and Energy-Efficient Design Techniques

    Power dissipation and energy consumption have become the primary design constraints for almost all computer systems over the last 15 years. Both computer architects and circuit designers intend to reduce power and energy (without performance degradation) at all design levels, as power is currently the main obstacle to further scaling according to Moore's law. The aim of this survey is to provide a comprehensive overview of state-of-the-art power- and energy-efficient techniques. We classify techniques by the component to which they apply, which is the most natural way from a designer's point of view. We further divide the techniques by the component of power/energy they optimize (static or dynamic), thereby covering the complete low-power design flow at the architectural level. Finally, we conclude that only a holistic approach that assumes optimizations at all design levels can lead to significant savings. Peer Reviewed. Postprint (published version)

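    The Discontinuous Galerkin Method for Free Surface and Subsurface Flows in Geophysical Applications (original title: Das unstetige Galerkinverfahren für Strömungen mit freier Oberfläche und im Grundwasserbereich in geophysikalischen Anwendungen)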

    Free surface flows and subsurface flows appear in a broad range of geophysical applications, and many environmental settings even require the coupling of the two. Many of these application scenarios are characterized by large domain sizes and long simulation times. Hence, they require considerable computational work to achieve accurate solutions, and efficient algorithms and high-performance computing resources are mandatory to obtain results within a reasonable time frame. Discontinuous Galerkin methods are a class of numerical methods for solving differential equations that share characteristics with methods from the finite volume and finite element frameworks. They feature high approximation orders, offer a large degree of flexibility, and are well suited for parallel computing. This thesis consists of eight articles and an extended summary that describe the application of discontinuous Galerkin methods to mathematical models including free surface and subsurface flow scenarios, with a strong focus on computational aspects. It covers discretization and implementation aspects, the parallelization of the method, and a discrete stability analysis of the coupled model.