16 research outputs found

    Towards an Application-Specific Thermal Energy Model of Current Processors

    Get PDF
    Recent developments of high-end processors recognize temperature monitoring and tuning as one of the main challenges towards achieving higher performance given the growing power and temperature constraints. To address this challenge, one needs both suitable thermal energy abstraction and corresponding instrumentation. Our model is based on application-specific parameters such as power consumption, execution time, and asymptotic temperature as well as hardware-specific parameters such as half time for thermal rise or fall. As observed with our out-of-band instrumentation and monitoring infrastructure, the temperature changes follow a relatively slow capacitor-style charge-discharge process. Therefore, we use the lumped thermal model that initiates an exponential process whenever there is a change in processor’s power consumption. Initial experiments with two codes – Firestarter and Nekbone – validate our thermal energy model and demonstrate its use for analyzing and potentially improving the application-specific balance between temperature, power, and performance

    Application-Specific Energy Modeling of Multi-Core Processors

    Get PDF
    Recent developments of high-end processors recognize energy monitoring and tuning as one of the main challenges towards achieving higher performance given the growing power and temperature constraints. Our thermal energy model is based on application-specific parameters such as consumed power, execution time, and equilibrium temperature as well as hardware-specific parameters such as half time for thermal rise or fall. As observed with the out-of-band instrumentation and monitoring infrastructure on our experimental cluster with air cooling, the temperature changes follow a relatively slow capacitor-style charge-discharge process. Therefore, we use the lumped thermal model that initiates an exponential process whenever there is a change in processor’s power consumption. Experiments with two codes – Firestarter and Nekbone – validate our approach and demonstrate its use for analyzing and potentially improving the application-specific balance between temperature, power, and performance

    Rapid and accurate energy models through calibration with IPMI and RAPL

    Get PDF
    Energy consumption in Cloud and High Performance Computing platforms is a significant issue and affects aspects such as the cost of energy and the cooling of the data center. Host level monitoring and prediction provides the groundwork for improving energy efficiency through the placement of workloads. Monitoring must be fast and efficient without unnecessary overhead, to enable scalability. This precludes the use of Watt meters attached per host, requiring alternative approaches such as integrated measurements and models. IPMI and RAPL are subject to error and partial measurement, which may be mitigated. Models allow for prediction and more responsive measures of power consumption, but require calibrating. The causes of calibration error are discussed, along with mitigation strategies, without overly complicating the underlying model. An outcome is a Watt meter emulator that provides hosts level power measurement along with estimated power consumption for a given workload, with an average error of 0.20W

    The Power-Optimised Software Envelope

    Get PDF
    Advances in processor design have delivered performance improvements for decades. As physical limits are reached, refinements to the same basic technologies are beginning to yield diminishing returns. Unsustainable increases in energy consumption are forcing hardware manufacturers to prioritise energy efficiency in their designs. Research suggests that software modifications may be needed to exploit the resulting improvements in current and future hardware. New tools are required to capitalise on this new class of optimisation. In this paper, we present the Power Optimised Software Envelope (POSE) model, which allows developers to assess the potential benefits of power optimisation for their applications. The POSE model is metric agnostic and in this paper we provide derivations using the established Energy-Delay Product metric and the novel Energy-Delay Sum and Energy-Delay Distance metrics that we believe are more appropriate for energy-aware optimisation efforts. We demonstrate POSE on three platforms by studying the optimisation characteristics of applications from the Mantevo benchmark suite. Our results show that the Pathfinder application has very little scope for power optimisation while TeaLeaf has the most, with all other applications in the benchmark suite falling between the two. Finally, we extend our POSE model with a formulation known as System Summary POSE – a meta-heuristic that allows developers to assess the scope a system has for energy-aware software optimisation independent of the code being run

    Energy Measurements of High Performance Computing Systems: From Instrumentation to Analysis

    Get PDF
    Energy efficiency is a major criterion for computing in general and High Performance Computing in particular. When optimizing for energy efficiency, it is essential to measure the underlying metric: energy consumption. To fully leverage energy measurements, their quality needs to be well-understood. To that end, this thesis provides a rigorous evaluation of various energy measurement techniques. I demonstrate how the deliberate selection of instrumentation points, sensors, and analog processing schemes can enhance the temporal and spatial resolution while preserving a well-known accuracy. Further, I evaluate a scalable energy measurement solution for production HPC systems and address its shortcomings. Such high-resolution and large-scale measurements present challenges regarding the management of large volumes of generated metric data. I address these challenges with a scalable infrastructure for collecting, storing, and analyzing metric data. With this infrastructure, I also introduce a novel persistent storage scheme for metric time series data, which allows efficient queries for aggregate timelines. To ensure that it satisfies the demanding requirements for scalable power measurements, I conduct an extensive performance evaluation and describe a productive deployment of the infrastructure. Finally, I describe different approaches and practical examples of analyses based on energy measurement data. In particular, I focus on the combination of energy measurements and application performance traces. However, interweaving fine-grained power recordings and application events requires accurately synchronized timestamps on both sides. To overcome this obstacle, I develop a resilient and automated technique for time synchronization, which utilizes crosscorrelation of a specifically influenced power measurement signal. Ultimately, this careful combination of sophisticated energy measurements and application performance traces yields a detailed insight into application and system energy efficiency at full-scale HPC systems and down to millisecond-range regions.:1 Introduction 2 Background and Related Work 2.1 Basic Concepts of Energy Measurements 2.1.1 Basics of Metrology 2.1.2 Measuring Voltage, Current, and Power 2.1.3 Measurement Signal Conditioning and Analog-to-Digital Conversion 2.2 Power Measurements for Computing Systems 2.2.1 Measuring Compute Nodes using External Power Meters 2.2.2 Custom Solutions for Measuring Compute Node Power 2.2.3 Measurement Solutions of System Integrators 2.2.4 CPU Energy Counters 2.2.5 Using Models to Determine Energy Consumption 2.3 Processing of Power Measurement Data 2.3.1 Time Series Databases 2.3.2 Data Center Monitoring Systems 2.4 Influences on the Energy Consumption of Computing Systems 2.4.1 Processor Power Consumption Breakdown 2.4.2 Energy-Efficient Hardware Configuration 2.5 HPC Performance and Energy Analysis 2.5.1 Performance Analysis Techniques 2.5.2 HPC Performance Analysis Tools 2.5.3 Combining Application and Power Measurements 2.6 Conclusion 3 Evaluating and Improving Energy Measurements 3.1 Description of the Systems Under Test 3.2 Instrumentation Points and Measurement Sensors 3.2.1 Analog Measurement at Voltage Regulators 3.2.2 Instrumentation with Hall Effect Transducers 3.2.3 Modular Instrumentation of DC Consumers 3.2.4 Optimal Wiring for Shunt-Based Measurements 3.2.5 Node-Level Instrumentation for HPC Systems 3.3 Analog Signal Conditioning and Analog-to-Digital Conversion 3.3.1 Signal Amplification 3.3.2 Analog Filtering and Analog-To-Digital Conversion 3.3.3 Integrated Solutions for High-Resolution Measurement 3.4 Accuracy Evaluation and Calibration 3.4.1 Synthetic Workloads for Evaluating Power Measurements 3.4.2 Improving and Evaluating the Accuracy of a Single-Node Measuring System 3.4.3 Absolute Accuracy Evaluation of a Many-Node Measuring System 3.5 Evaluating Temporal Granularity and Energy Correctness 3.5.1 Measurement Signal Bandwidth at Different Instrumentation Points 3.5.2 Retaining Energy Correctness During Digital Processing 3.6 Evaluating CPU Energy Counters 3.6.1 Energy Readouts with RAPL 3.6.2 Methodology 3.6.3 RAPL on Intel Sandy Bridge-EP 3.6.4 RAPL on Intel Haswell-EP and Skylake-SP 3.7 Conclusion 4 A Scalable Infrastructure for Processing Power Measurement Data 4.1 Requirements for Power Measurement Data Processing 4.2 Concepts and Implementation of Measurement Data Management 4.2.1 Message-Based Communication between Agents 4.2.2 Protocols 4.2.3 Application Programming Interfaces 4.2.4 Efficient Metric Time Series Storage and Retrieval 4.2.5 Hierarchical Timeline Aggregation 4.3 Performance Evaluation 4.3.1 Benchmark Hardware Specifications 4.3.2 Throughput in Symmetric Configuration with Replication 4.3.3 Throughput with Many Data Sources and Single Consumers 4.3.4 Temporary Storage in Message Queues 4.3.5 Persistent Metric Time Series Request Performance 4.3.6 Performance Comparison with Contemporary Time Series Storage Solutions 4.3.7 Practical Usage of MetricQ 4.4 Conclusion 5 Energy Efficiency Analysis 5.1 General Energy Efficiency Analysis Scenarios 5.1.1 Live Visualization of Power Measurements 5.1.2 Visualization of Long-Term Measurements 5.1.3 Integration in Application Performance Traces 5.1.4 Graphical Analysis of Application Power Traces 5.2 Correlating Power Measurements with Application Events 5.2.1 Challenges for Time Synchronization of Power Measurements 5.2.2 Reliable Automatic Time Synchronization with Correlation Sequences 5.2.3 Creating a Correlation Signal on a Power Measurement Channel 5.2.4 Processing the Correlation Signal and Measured Power Values 5.2.5 Common Oversampling of the Correlation Signals at Different Rates 5.2.6 Evaluation of Correlation and Time Synchronization 5.3 Use Cases for Application Power Traces 5.3.1 Analyzing Complex Power Anomalies 5.3.2 Quantifying C-State Transitions 5.3.3 Measuring the Dynamic Power Consumption of HPC Applications 5.4 Conclusion 6 Summary and Outloo

    Exploiting variability for energy optimization of parallel programs

    Full text link
    In this paper we present optimizations that use DVFS mechanisms to reduce the total energy usage in scientific applications. Our main insight is that noise is intrinsic to large scale parallel executions and it appears whenever shared resources are contended. The presence of noise allows us to identify and manipulate any program regions amenable to DVFS. When compared to previous energy optimizations that make per core decisions using predictions of the running time, our scheme uses a qualitative approach to recognize the signature of executions amenable to DVFS. By recognizing the "shape of variability" we can optimize codes with highly dynamic behavior, which pose challenges to all existing DVFS techniques. We validate our approach using offline and online analyses for one-sided and two-sided communication paradigms. We have applied our methods to NWChem, and we show best case improvements in energy use of 12% at no loss in performance when using online optimizations running on 720 Haswell cores with one-sided communication. With NWChem on MPI two-sided and offline analysis, capturing the initialization, we find energy savings of up to 20%, with less than 1% performance cost

    Radio-Astronomical Imaging on Accelerators

    Get PDF
    Imaging is considered the most compute-intensive and therefore most challenging part of a radio-astronomical data-processing pipeline. To reach the high dynamic ranges imposed by the high sensitivity and large field of view of the new generation of radio telescopes such as the Square Kilometre Array (SKA), we need to be able to correct for direction-independent effects (DIEs) such as the curvature of the earth as well as for direction-dependent time-varying effects (DDEs) such as those caused by the ionosphere during imaging. The novel Image-Domain gridding (IDG) algorithm was designed to avoid the performance bottlenecks of traditional imaging algorithms. We implement, optimize, and analyze the performance and energy efficiency of IDG on a variety of hardware platforms to find which platform is the best for IDG. We analyze traditional CPUs, as well as several accelerators architectures. IDG alleviates the limitations of traditional imaging algorithms while it enables the advantages of GPU acceleration: better performance at lower power consumption. The hardware-software co-design has resulted in a highly efficient imager. This makes IDG on GPUs an ideal candidate for meeting the computational and energy efficiency constraints of the SKA. IDG has been integrated with a widely-used astronomical imager (WSClean) and is now being used in production by a variety of different radio observatories such as LOFAR and the MWA. It is not only faster and more energy-efficient than its competitors, but it also produces better quality images

    Energy Efficiency Models for Scientific Applications on Supercomputers

    Get PDF

    Evaluating technologies and techniques for transitioning hydrodynamics applications to future generations of supercomputers

    Get PDF
    Current supercomputer development trends present severe challenges for scientific codebases. Moore’s law continues to hold, however, power constraints have brought an end to Dennard scaling, forcing significant increases in overall concurrency. The performance imbalance between the processor and memory sub-systems is also increasing and architectures are becoming significantly more complex. Scientific computing centres need to harness more computational resources in order to facilitate new scientific insights and maintaining their codebases requires significant investments. Centres therefore have to decide how best to develop their applications to take advantage of future architectures. To prevent vendor "lock-in" and maximise investments, achieving portableperformance across multiple architectures is also a significant concern. Efficiently scaling applications will be essential for achieving improvements in science and the MPI (Message Passing Interface) only model is reaching its scalability limits. Hybrid approaches which utilise shared memory programming models are a promising approach for improving scalability. Additionally PGAS (Partitioned Global Address Space) models have the potential to address productivity and scalability concerns. Furthermore, OpenCL has been developed with the aim of enabling applications to achieve portable-performance across a range of heterogeneous architectures. This research examines approaches for achieving greater levels of performance for hydrodynamics applications on future supercomputer architectures. The development of a Lagrangian-Eulerian hydrodynamics application is presented together with its utility for conducting such research. Strategies for improving application performance, including PGAS- and hybrid-based approaches are evaluated at large node-counts on several state-of-the-art architectures. Techniques to maximise the performance and scalability of OpenMP-based hybrid implementations are presented together with an assessment of how these constructs should be combined with existing approaches. OpenCL is evaluated as an additional technology for implementing a hybrid programming model and improving performance-portability. To enhance productivity several tools for automatically hybridising applications and improving process-to-topology mappings are evaluated. Power constraints are starting to limit supercomputer deployments, potentially necessitating the use of more energy efficient technologies. Advanced processor architectures are therefore evaluated as future candidate technologies, together with several application optimisations which will likely be necessary. An FPGA-based solution is examined, including an analysis of how effectively it can be utilised via a high-level programming model, as an alternative to the specialist approaches which currently limit the applicability of this technology
    corecore