On the accuracy and usefulness of analytic energy models for contemporary multicore processors
This paper presents refinements to the execution-cache-memory performance
model and a previously published power model for multicore processors. The
combination of both enables a very accurate prediction of performance and
energy consumption of contemporary multicore processors as a function of
relevant parameters such as number of active cores as well as core and Uncore
frequencies. Model validation is performed on the Sandy Bridge-EP and
Broadwell-EP microarchitectures. Production-related variations in chip quality
are demonstrated through a statistical analysis of the fit parameters obtained
on one hundred Broadwell-EP CPUs of the same model. Insights from the models
are used to explain the performance- and energy-related behavior of the
processors for scalable as well as saturating (i.e., memory-bound) codes. In
the process we demonstrate the models' capability to identify optimal operating
points with respect to highest performance, lowest energy-to-solution, and
lowest energy-delay product and identify a set of best practices for
energy-efficient execution.
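The search for optimal operating points described in this abstract can be illustrated with a toy model. The sketch below is not the paper's model: the quadratic power law, the saturating performance function, every coefficient, and all function names are assumptions chosen only to show how runtime, energy-to-solution, and energy-delay product can be compared across core counts and clock frequencies.

```python
from itertools import product


def power_watts(n_cores, f_ghz, p_base=20.0, p_lin=1.0, p_quad=1.5):
    """Hypothetical chip power: a baseline plus a per-core part that grows with frequency."""
    return p_base + n_cores * (p_lin * f_ghz + p_quad * f_ghz ** 2)


def performance(n_cores, f_ghz, per_core=1.0, saturation=8.0):
    """Hypothetical performance (work units per second).

    Scales linearly with cores and frequency until it hits a ceiling that
    stands in for a memory-bandwidth limit (a "saturating" code).
    """
    return min(n_cores * per_core * f_ghz, saturation)


def metrics(n_cores, f_ghz, work=100.0):
    """Return (runtime in s, energy-to-solution in J, energy-delay product in J*s)."""
    runtime = work / performance(n_cores, f_ghz)
    energy = power_watts(n_cores, f_ghz) * runtime
    return runtime, energy, energy * runtime


def best_operating_point(cores, freqs, key):
    """Pick the (cores, frequency) pair that minimizes the chosen metric."""
    index = {"runtime": 0, "energy": 1, "edp": 2}[key]
    return min(product(cores, freqs), key=lambda nf: metrics(*nf)[index])


if __name__ == "__main__":
    cores = range(1, 11)
    freqs = [1.2, 1.6, 2.0, 2.4]
    for key in ("runtime", "energy", "edp"):
        print(key, best_operating_point(cores, freqs, key))
```

In a toy model of this shape, the three metrics need not select the same operating point, which is why they are evaluated separately.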
Propagation and Decay of Injected One-Off Delays on Clusters: A Case Study
Analytic, first-principles performance modeling of distributed-memory
applications is difficult due to a wide spectrum of random disturbances caused
by the application and the system. These disturbances (commonly called "noise")
destroy the assumptions of regularity that one usually employs when
constructing simple analytic models. Despite numerous efforts to quantify,
categorize, and reduce such effects, a comprehensive quantitative understanding
of their performance impact is not available, especially for long delays that
have global consequences for the parallel application. In this work, we
investigate various traces collected from synthetic benchmarks that mimic real
applications on simulated and real message-passing systems in order to pinpoint
the mechanisms behind delay propagation. We analyze the dependence of the
propagation speed of idle waves emanating from injected delays with respect to
the execution and communication properties of the application, study how such
delays decay under increased noise levels, and how they interact with each
other. We also show how fine-grained noise can make a system immune against the
adverse effects of propagating idle waves. Our results contribute to a better
understanding of the collective phenomena that manifest themselves in
distributed-memory parallel applications.
Comment: 10 pages, 9 figures; title change
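How an injected one-off delay can spread as an idle wave may be easier to see in a minimal simulation. The sketch below assumes a noise-free chain of ranks in which each rank can start its next iteration only after it and its two direct neighbors have finished the current one. This communication pattern, the function names, and the parameters are illustrative assumptions, not the benchmarks or simulator used in the study.

```python
def simulate(n_ranks, n_iters, t_exec, delay_rank, delay):
    """Return finish[k][i], the time at which rank i completes iteration k.

    Assumed toy model: open chain of ranks, blocking exchange with direct
    neighbors after every iteration, one extra delay injected on
    `delay_rank` in iteration 0.
    """
    prev = [0.0] * n_ranks
    finish = []
    for k in range(n_iters):
        cur = []
        for i in range(n_ranks):
            # Rank i waits for itself and its direct neighbors.
            start = max(prev[max(i - 1, 0): i + 2])
            extra = delay if (k == 0 and i == delay_rank) else 0.0
            cur.append(start + t_exec + extra)
        finish.append(cur)
        prev = cur
    return finish


def delayed_ranks(finish, t_exec):
    """For each iteration, list the ranks that lag behind the undisturbed schedule."""
    return [
        [i for i, t in enumerate(row) if t > (k + 1) * t_exec + 1e-12]
        for k, row in enumerate(finish)
    ]


if __name__ == "__main__":
    finish = simulate(n_ranks=9, n_iters=4, t_exec=1.0, delay_rank=4, delay=5.0)
    for k, ranks in enumerate(delayed_ranks(finish, 1.0)):
        print(f"iteration {k}: delayed ranks {ranks}")
```

In this idealized setting the set of delayed ranks grows by one rank in each direction per iteration and never shrinks, because the toy model contains no noise. Decay under noise and the interaction of several delays, which the abstract describes, are outside the scope of the sketch.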
Analytic Performance Modeling and Analysis of Detailed Neuron Simulations
Big science initiatives are trying to reconstruct and model the brain by
attempting to simulate brain tissue at larger scales and with increasingly more
biological detail than previously thought possible. The exponential growth of
parallel computer performance has been supporting these developments, and at
the same time maintainers of neuroscientific simulation code have strived to
optimally and efficiently exploit new hardware features. Current state-of-the-art
software for the simulation of biological networks has so far been
developed using performance engineering practices, but a thorough analysis and
modeling of the computational and performance characteristics, especially in
the case of morphologically detailed neuron simulations, is lacking. Other
computational sciences have successfully used analytic performance engineering
and modeling methods to gain insight into the computational properties of
simulation kernels, aid developers in performance optimizations and eventually
drive co-design efforts, but to our knowledge a model-based performance
analysis of neuron simulations has not yet been conducted.
We present a detailed study of the shared-memory performance of
morphologically detailed neuron simulations based on the Execution-Cache-Memory
(ECM) performance model. We demonstrate that this model can deliver accurate
predictions of the runtime of almost all the kernels that constitute the neuron
models under investigation. The gained insight is used to identify the main
governing mechanisms underlying performance bottlenecks in the simulation. The
implications of this analysis on the optimization of neural simulation software
and eventually co-design of future hardware architectures are discussed. In
this sense, our work represents a valuable conceptual and quantitative
contribution to understanding the performance properties of biological network
simulations.
Comment: 18 pages, 6 figures, 15 tables
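As a rough illustration of the kind of runtime prediction an ECM-style model produces, the sketch below uses a commonly cited single-core form for Intel server processors: in-core work that overlaps with data transfers runs concurrently with the sum of the non-overlapping in-core work and the transfer times between adjacent memory hierarchy levels. That form, the simple saturation estimate, the function names, and all cycle counts are assumptions for illustration and are not taken from the study above.

```python
import math


def ecm_cycles(t_ol, t_nol, transfers):
    """Single-core runtime prediction in cycles per unit of work.

    t_ol      -- in-core cycles that overlap with data transfers
    t_nol     -- in-core cycles that do not overlap (e.g. loads and stores)
    transfers -- cycles for each transfer stage, e.g. [L1-L2, L2-L3, L3-memory]
    """
    return max(t_ol, t_nol + sum(transfers))


def saturation_cores(t_ecm, t_mem):
    """Estimated core count at which a shared memory bottleneck saturates."""
    return math.ceil(t_ecm / t_mem)


if __name__ == "__main__":
    # Hypothetical kernel: all numbers are made up for the example.
    t_ecm = ecm_cycles(t_ol=8, t_nol=4, transfers=[5, 5, 10])
    print("predicted cycles per unit of work:", t_ecm)
    print("estimated saturation point (cores):", saturation_cores(t_ecm, t_mem=10))
```

The inputs would normally come from static analysis of the kernel and from measured or documented bandwidths. Here they are placeholders.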
Mechanistic analytical modeling of superscalar in-order processor performance
Superscalar in-order processors form an interesting alternative to out-of-order processors because of their energy efficiency and lower design complexity. However, despite the reduced design complexity, it is nontrivial to get performance estimates or insight into the application–microarchitecture interaction without running slow, detailed cycle-level simulations, because performance depends strongly on the order of instructions within the application’s dynamic instruction stream: in-order processors stall on inter-instruction dependences and functional unit contention. To limit the number of detailed cycle-level simulations needed during design space exploration, we propose a mechanistic analytical performance model that is built from an understanding of the internal mechanisms of the processor.
The mechanistic performance model for superscalar in-order processors is shown to be accurate, with an average performance prediction error of 3.2% compared to detailed cycle-accurate simulation using gem5. We also validate the model against hardware, using the ARM Cortex-A8 processor, and show that it is accurate within 10% on average. We further demonstrate the usefulness of the model through three case studies: (1) design space exploration, identifying the optimum number of functional units for achieving a given performance target; (2) program–machine interactions, providing insight into microarchitecture bottlenecks; and (3) compiler–architecture interactions, visualizing the impact of compiler optimizations on performance.
The Ecological Impact of High-performance Computing in Astrophysics
The importance of computing in astronomy continues to increase, and so does its
impact on the environment. When analyzing data or performing simulations, most
researchers raise concerns about the time to reach a solution rather than its
impact on the environment. Luckily, a reduced time-to-solution due to faster
hardware or optimizations in the software generally also leads to a smaller
carbon footprint. This is not the case when the reduced wall-clock time is
achieved by overclocking the processor, or when using supercomputers.
The increase in the popularity of interpreted scripting languages, and the
general availability of high-performance workstations form a considerable
threat to the environment. A similar concern can be raised about the trend of
running single-core instead of adopting efficient many-core programming
paradigms.
In astronomy, computing is among the top producers of greenhouse gases,
surpassing telescope operations. Here I hope to raise awareness of the
environmental impact of running non-optimized code on overpowered computer
hardware.
Comment: Originated at EAS 2020 conference, sustainability session by
https://astronomersforplanet.earth - published in Nature Astronomy, September
202
Chapter One – An Overview of Architecture-Level Power- and Energy-Efficient Design Techniques
Power dissipation and energy consumption have become the primary design constraints for almost all computer systems over the last 15 years. Both computer architects and circuit designers intend to reduce power and energy (without performance degradation) at all design levels, as this is currently the main obstacle to further scaling according to Moore's law. The aim of this survey is to provide a comprehensive overview of state-of-the-art power- and energy-efficient techniques. We classify techniques by the component to which they apply, which is the most natural way from a designer's point of view. We further divide the techniques by the component of power/energy they optimize (static or dynamic), thereby covering the complete low-power design flow at the architectural level. We conclude that only a holistic approach with optimizations at all design levels can lead to significant savings. Peer reviewed. Postprint (published version)
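The Discontinuous Galerkin Method for Free Surface and Subsurface Flows in Geophysical Applications (German original: Das unstetige Galerkinverfahren für Strömungen mit freier Oberfläche und im Grundwasserbereich in geophysikalischen Anwendungen)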
Free surface flows and subsurface flows appear in a broad range of geophysical applications, and in many environmental settings situations arise that even require the coupling of free surface and subsurface flows. Many of these application scenarios are characterized by large domain sizes and long simulation times. Hence, they need considerable amounts of computational work to achieve accurate solutions, and the use of efficient algorithms and high performance computing resources is mandatory to obtain results within a reasonable time frame.
Discontinuous Galerkin methods are a class of numerical methods for solving differential equations that share characteristics with methods from the finite volume and finite element frameworks. They feature high approximation orders, offer a large degree of flexibility, and are well-suited for parallel computing.
This thesis consists of eight articles and an extended summary that describe the application of discontinuous Galerkin methods to mathematical models including free surface and subsurface flow scenarios with a strong focus on computational aspects. It covers discretization and implementation aspects, the parallelization of the method, and discrete stability analysis of the coupled model.