36 research outputs found
A reference model for integrated energy and power management of HPC systems
Optimizing a computer for highest performance dictates the efficient use of its limited resources.
Computers as a whole are rather complex. It is therefore not sufficient to optimize hardware and software components independently. Instead, a holistic view that manages the interactions of all components is essential to achieving system-wide efficiency.
For High Performance Computing (HPC) systems, the major limiting resources today are energy and power. The hardware mechanisms to measure and control energy and power are exposed to software. The software systems using these mechanisms range from firmware and operating systems through system software to tools and applications. Efforts to improve the energy and power efficiency of HPC systems and of the infrastructure of HPC centers yield continual advances. In isolation, however, these efforts cannot cope with the rising energy and power demands of large-scale systems. A systematic way to integrate multiple optimization strategies that build on complementary, interacting hardware and software systems is missing.
This work provides a reference model for integrated energy and power management of HPC systems: the Open Integrated Energy and Power (OIEP) reference model. The goal is to enable the implementation, setup, and maintenance of modular, system-wide energy and power management solutions. The proposed model goes beyond current practices, which focus on individual HPC centers or implementations, in that it can universally describe any hierarchical energy and power management system with a multitude of requirements. The model lays solid foundations: it is understandable and verifiable, guarantees stable interaction of hardware and software components, and establishes a known and trusted chain of command. This work identifies the main building blocks of the OIEP reference model, describes their abstract setup, and shows concrete instances thereof. A principal aspect is how the individual components are connected and interface in a hierarchical manner, and thus can optimize for the global policy pursued as a computing center's operating strategy. In addition to the reference model itself, a method for applying the reference model is presented. This method is used to show the practicality of the reference model and its application.
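The following is a minimal sketch, not the OIEP specification, of the hierarchical delegation the abstract describes: each component receives a power budget from its parent and splits it among its children, so control flows along a single chain of command. The names Component and delegate, and the even split, are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    """One node in a hierarchical energy/power management tree."""
    name: str
    children: list["Component"] = field(default_factory=list)
    budget_w: float = 0.0

    def delegate(self, budget_w: float) -> None:
        """Accept a budget from the parent and pass shares downward."""
        self.budget_w = budget_w
        if self.children:
            share = budget_w / len(self.children)  # even split for simplicity
            for child in self.children:
                child.delegate(share)

# Example: a center-level policy of 1 MW delegated down to compute nodes.
rack = Component("rack0", [Component("node0"), Component("node1")])
system = Component("system", [rack])
system.delegate(1_000_000.0)
print(rack.budget_w, rack.children[0].budget_w)  # 1000000.0 500000.0
```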
For future research in energy and power management of HPC systems, the OIEP reference model forms a cornerstone for realizing --- planning, developing, and integrating --- innovative energy and power management solutions. For HPC systems themselves, it supports transparent management of current systems with their inherent complexity, allows novel solutions to be integrated into existing setups, and enables new systems to be designed from scratch. In fact, the OIEP reference model represents a basis for holistic, efficient optimization.
Energy Measurements of High Performance Computing Systems: From Instrumentation to Analysis
Energy efficiency is a major criterion for computing in general and High Performance Computing in particular. When optimizing for energy efficiency, it is essential to measure the underlying metric: energy consumption. To fully leverage energy measurements, their quality needs to be well understood. To that end, this thesis provides a rigorous evaluation of various energy measurement techniques. I demonstrate how the deliberate selection of instrumentation points, sensors, and analog processing schemes can enhance the temporal and spatial resolution while preserving well-defined accuracy. Further, I evaluate a scalable energy measurement solution for production HPC systems and address its shortcomings.
Such high-resolution and large-scale measurements present challenges regarding the management of large volumes of generated metric data. I address these challenges with a scalable infrastructure for collecting, storing, and analyzing metric data. With this infrastructure, I also introduce a novel persistent storage scheme for metric time series data, which allows efficient queries for aggregate timelines.
To ensure that it satisfies the demanding requirements for scalable power measurements, I conduct an extensive performance evaluation and describe a productive deployment of the infrastructure.
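The storage scheme below is a minimal sketch of the pre-aggregation idea behind such efficient aggregate-timeline queries: raw samples are folded into intervals of growing width with min/max/sum/count, so a query reads a few coarse rows instead of millions of raw points. The actual persistent scheme in this thesis (MetricQ, Sections 4.2.4 and 4.2.5 of the table of contents) is more elaborate; all names here are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Aggregate:
    """Pre-computed summary of one fixed-width interval of raw samples."""
    count: int = 0
    minimum: float = float("inf")
    maximum: float = float("-inf")
    total: float = 0.0

    def add(self, value: float) -> None:
        self.count += 1
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        self.total += value

    @property
    def mean(self) -> float:
        return self.total / self.count

def build_levels(samples: list[float], base: int = 10, levels: int = 3):
    """Fold raw samples into pyramid levels of width base**1, base**2, ..."""
    pyramid = []
    for level in range(1, levels + 1):
        width = base ** level
        aggs = []
        for start in range(0, len(samples), width):
            agg = Aggregate()
            for v in samples[start:start + width]:
                agg.add(v)
            aggs.append(agg)
        pyramid.append(aggs)
    return pyramid

# Example: 1 kHz power samples; a per-second timeline touches only level-3 rows.
levels = build_levels([100.0 + (i % 7) for i in range(10_000)])
print(levels[2][0].mean, levels[2][0].maximum)
```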
Finally, I describe different approaches to and practical examples of analyses based on energy measurement data. In particular, I focus on the combination of energy measurements and application performance traces. However, interweaving fine-grained power recordings and application events requires accurately synchronized timestamps on both sides. To overcome this obstacle, I develop a resilient and automated technique for time synchronization, which utilizes cross-correlation of a deliberately influenced power measurement signal (see the sketch following the table of contents below). Ultimately, this careful combination of sophisticated energy measurements and application performance traces yields detailed insight into application and system energy efficiency on full-scale HPC systems, down to millisecond-range regions.
1 Introduction
2 Background and Related Work
2.1 Basic Concepts of Energy Measurements
2.1.1 Basics of Metrology
2.1.2 Measuring Voltage, Current, and Power
2.1.3 Measurement Signal Conditioning and Analog-to-Digital Conversion
2.2 Power Measurements for Computing Systems
2.2.1 Measuring Compute Nodes using External Power Meters
2.2.2 Custom Solutions for Measuring Compute Node Power
2.2.3 Measurement Solutions of System Integrators
2.2.4 CPU Energy Counters
2.2.5 Using Models to Determine Energy Consumption
2.3 Processing of Power Measurement Data
2.3.1 Time Series Databases
2.3.2 Data Center Monitoring Systems
2.4 Influences on the Energy Consumption of Computing Systems
2.4.1 Processor Power Consumption Breakdown
2.4.2 Energy-Efficient Hardware Configuration
2.5 HPC Performance and Energy Analysis
2.5.1 Performance Analysis Techniques
2.5.2 HPC Performance Analysis Tools
2.5.3 Combining Application and Power Measurements
2.6 Conclusion
3 Evaluating and Improving Energy Measurements
3.1 Description of the Systems Under Test
3.2 Instrumentation Points and Measurement Sensors
3.2.1 Analog Measurement at Voltage Regulators
3.2.2 Instrumentation with Hall Effect Transducers
3.2.3 Modular Instrumentation of DC Consumers
3.2.4 Optimal Wiring for Shunt-Based Measurements
3.2.5 Node-Level Instrumentation for HPC Systems
3.3 Analog Signal Conditioning and Analog-to-Digital Conversion
3.3.1 Signal Amplification
3.3.2 Analog Filtering and Analog-To-Digital Conversion
3.3.3 Integrated Solutions for High-Resolution Measurement
3.4 Accuracy Evaluation and Calibration
3.4.1 Synthetic Workloads for Evaluating Power Measurements
3.4.2 Improving and Evaluating the Accuracy of a Single-Node Measuring System
3.4.3 Absolute Accuracy Evaluation of a Many-Node Measuring System
3.5 Evaluating Temporal Granularity and Energy Correctness
3.5.1 Measurement Signal Bandwidth at Different Instrumentation Points
3.5.2 Retaining Energy Correctness During Digital Processing
3.6 Evaluating CPU Energy Counters
3.6.1 Energy Readouts with RAPL
3.6.2 Methodology
3.6.3 RAPL on Intel Sandy Bridge-EP
3.6.4 RAPL on Intel Haswell-EP and Skylake-SP
3.7 Conclusion
4 A Scalable Infrastructure for Processing Power Measurement Data
4.1 Requirements for Power Measurement Data Processing
4.2 Concepts and Implementation of Measurement Data Management
4.2.1 Message-Based Communication between Agents
4.2.2 Protocols
4.2.3 Application Programming Interfaces
4.2.4 Efficient Metric Time Series Storage and Retrieval
4.2.5 Hierarchical Timeline Aggregation
4.3 Performance Evaluation
4.3.1 Benchmark Hardware Specifications
4.3.2 Throughput in Symmetric Configuration with Replication
4.3.3 Throughput with Many Data Sources and Single Consumers
4.3.4 Temporary Storage in Message Queues
4.3.5 Persistent Metric Time Series Request Performance
4.3.6 Performance Comparison with Contemporary Time Series Storage Solutions
4.3.7 Practical Usage of MetricQ
4.4 Conclusion
5 Energy Efficiency Analysis
5.1 General Energy Efficiency Analysis Scenarios
5.1.1 Live Visualization of Power Measurements
5.1.2 Visualization of Long-Term Measurements
5.1.3 Integration in Application Performance Traces
5.1.4 Graphical Analysis of Application Power Traces
5.2 Correlating Power Measurements with Application Events
5.2.1 Challenges for Time Synchronization of Power Measurements
5.2.2 Reliable Automatic Time Synchronization with Correlation Sequences
5.2.3 Creating a Correlation Signal on a Power Measurement Channel
5.2.4 Processing the Correlation Signal and Measured Power Values
5.2.5 Common Oversampling of the Correlation Signals at Different Rates
5.2.6 Evaluation of Correlation and Time Synchronization
5.3 Use Cases for Application Power Traces
5.3.1 Analyzing Complex Power Anomalies
5.3.2 Quantifying C-State Transitions
5.3.3 Measuring the Dynamic Power Consumption of HPC Applications
5.4 Conclusion
6 Summary and Outlook
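As a minimal sketch of the synchronization idea referenced in the abstract (not the thesis implementation): a known pseudo-random on/off load pattern is executed on the node, and cross-correlating it against the recorded power signal recovers the offset between the two clocks. The function name and signal shapes are illustrative assumptions.

```python
import numpy as np

def estimate_offset(pattern: np.ndarray, power: np.ndarray,
                    sample_period_s: float) -> float:
    """Return the time offset (s) at which `pattern` best aligns in `power`."""
    # Remove the mean so the correlation responds to shape, not absolute level.
    p = pattern - pattern.mean()
    q = power - power.mean()
    corr = np.correlate(q, p, mode="valid")  # slide pattern over power trace
    lag = int(np.argmax(corr))               # sample index of best alignment
    return lag * sample_period_s

# Example: a 50-sample pseudo-random load pattern buried in a noisy power trace.
rng = np.random.default_rng(0)
pattern = rng.integers(0, 2, 50).astype(float) * 30.0     # 0 W / 30 W toggles
power = np.full(1000, 100.0) + rng.normal(0, 1.0, 1000)   # baseline + noise
power[400:450] += pattern
print(estimate_offset(pattern, power, 0.001))  # ~0.4 s at 1 kHz sampling
```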
Computer Science 2019 APR Self-Study & Documents
UNM Computer Science APR self-study report and review team report for Spring 2019, fulfilling requirements of the Higher Learning Commission
Performance Analysis of Complex Shared Memory Systems
Systems for high performance computing are getting increasingly complex. On the one hand, the number of processors is increasing. On the other hand, the individual processors are getting more and more powerful. In recent years, the latter is to a large extent achieved by increasing the number of cores per processor. Unfortunately, scientific applications often fail to fully utilize the available computational performance. Therefore, performance analysis tools that help to localize and fix performance problems are indispensable. Large scale systems for high performance computing typically consist of multiple compute nodes that are connected via network. Performance analysis tools that analyze performance problems that arise from using multiple nodes are readily available. However, the increasing number of cores per processor that can be observed within the last decade represents a major change in the node architecture. Therefore, this work concentrates on the analysis of the node performance.
The goal of this thesis is to improve the understanding of the achieved application performance on existing hardware. It can be observed that the scaling of parallel applications on multi-core processors differs significantly from the scaling on multiple processors. Therefore, the properties of shared resources in contemporary multi-core processors as well as remote accesses in multi-processor systems are investigated and their respective impact on the application performance is analyzed. As a first step, a comprehensive suite of highly optimized micro-benchmarks is developed. These benchmarks are able to determine the performance of memory accesses depending on the location and coherence state of the data. They are used to perform an in-depth analysis of the characteristics of memory accesses in contemporary multi-processor systems, which identifies potential bottlenecks. However, in order to localize performance problems, it also has to be determined to what extent the application performance is limited by certain resources.
Therefore, a methodology to derive metrics for the utilization of individual components in the memory hierarchy as well as waiting times caused by memory accesses is developed in the second step. The approach is based on hardware performance counters, which count the occurrences of specific hardware events. The developed micro-benchmarks are used to selectively stress individual components; this makes it possible to identify the events that provide a reasonable assessment of the utilization of the respective component and of the time spent waiting for memory accesses to complete. Finally, the knowledge gained from this process is used to implement a visualization of memory-related performance issues in existing performance analysis tools.
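Below is a minimal sketch, assuming a Linux system with perf(1) available, of the general counter-based approach described above: run a workload while counting hardware events, then derive a utilization-style metric from the raw counts. The chosen events and the derived miss ratio are illustrative; the thesis identifies suitable events per component via targeted micro-benchmarks.

```python
import subprocess

def count_events(cmd: list[str], events: list[str]) -> dict[str, int]:
    """Run cmd under `perf stat` and return the raw count of each event."""
    result = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", ",".join(events), "--", *cmd],
        capture_output=True, text=True, check=True,
    )
    counts: dict[str, int] = {}
    for line in result.stderr.strip().splitlines():  # perf reports on stderr
        fields = line.split(",")  # CSV layout: value,unit,event,...
        if len(fields) >= 3:
            try:
                counts[fields[2]] = int(fields[0])
            except ValueError:
                continue  # skip '<not counted>' and similar placeholders
    return counts

# Example: estimate last-level cache pressure of an arbitrary command.
counts = count_events(["sleep", "1"], ["LLC-loads", "LLC-load-misses"])
if counts.get("LLC-loads"):
    print("LLC miss ratio:", counts["LLC-load-misses"] / counts["LLC-loads"])
```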
The results of the micro-benchmarks reveal that the increasing number of cores per processor and the usage of multiple processors per node lead to complex systems with vastly different performance characteristics of memory accesses depending on the location of the accessed data. Furthermore, it can be observed that the aggregated throughput of shared resources in multi-core processors does not necessarily scale linearly with the number of cores that access them concurrently, which limits the scalability of parallel applications. It is shown that the proposed methodology for the identification of meaningful hardware performance counters yields useful metrics for the localization of memory-related performance limitations.
Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications
Energy efficiency is becoming increasingly important for computing systems, in particular for large-scale HPC facilities. In this work we evaluate, from a user perspective, the use of Dynamic Voltage and Frequency Scaling (DVFS) techniques, assisted by the power and energy monitoring capabilities of modern processors, to tune applications for energy efficiency. We run selected kernels and a full HPC application on two high-end processors widely used in the HPC context, namely an NVIDIA K80 GPU and an Intel Haswell CPU. We evaluate the available trade-offs between energy-to-solution and time-to-solution, attempting a function-by-function frequency tuning. We finally estimate the benefits obtainable by running the full code on an HPC multi-GPU node, with respect to default clock frequency governors. We instrument our code to accurately monitor power consumption and execution time without the need for any additional hardware, and we enable it to change CPU and GPU clock frequencies while running. We analyze our results on the different architectures using a simple energy-performance model, and derive a number of energy-saving strategies which can be easily adopted on recent high-end HPC systems for generic applications.
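The following is a minimal sketch, assuming a Linux machine with root access and an Intel CPU, of the user-space mechanisms such studies rely on: reading package energy from the RAPL powercap interface and capping the CPU clock via the cpufreq sysfs files. The paths follow the standard Linux interfaces; GPU application clocks (e.g., on a K80) would be set analogously, e.g., with `nvidia-smi -ac`, not shown here.

```python
import time
from pathlib import Path

RAPL = Path("/sys/class/powercap/intel-rapl:0/energy_uj")
CPUFREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq")

def read_energy_uj() -> int:
    # Note: this RAPL counter wraps around; long runs need wrap handling.
    return int(RAPL.read_text())

def set_max_freq_khz(khz: int) -> None:
    (CPUFREQ / "scaling_max_freq").write_text(str(khz))

def energy_to_solution(workload, khz: int) -> tuple[float, float]:
    """Return (seconds, joules) for one run of `workload` at a frequency cap."""
    set_max_freq_khz(khz)
    e0, t0 = read_energy_uj(), time.monotonic()
    workload()
    e1, t1 = read_energy_uj(), time.monotonic()
    return t1 - t0, (e1 - e0) / 1e6  # energy_uj counts microjoules

# Example: compare the energy/time trade-off at two frequency caps.
busy = lambda: sum(i * i for i in range(10_000_000))
for khz in (1_200_000, 2_400_000):
    t, e = energy_to_solution(busy, khz)
    print(f"{khz} kHz cap: {t:.2f} s, {e:.1f} J")
```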
Exploiting variability for energy optimization of parallel programs
In this paper we present optimizations that use DVFS mechanisms to reduce the total energy usage of scientific applications. Our main insight is that noise is intrinsic to large-scale parallel executions, and that it appears whenever shared resources are contended. The presence of noise allows us to identify and manipulate program regions amenable to DVFS. Compared to previous energy optimizations that make per-core decisions using predictions of the running time, our scheme uses a qualitative approach to recognize the signature of executions amenable to DVFS. By recognizing the "shape of variability" we can optimize codes with highly dynamic behavior, which pose challenges to all existing DVFS techniques. We validate our approach using offline and online analyses for one-sided and two-sided communication paradigms. We have applied our methods to NWChem, and we show best-case improvements in energy use of 12% at no loss in performance when using online optimizations running on 720 Haswell cores with one-sided communication. With NWChem on MPI two-sided and offline analysis, capturing the initialization, we find energy savings of up to 20%, with less than 1% performance cost.
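As a minimal sketch of the qualitative idea above (not the paper's implementation): regions whose duration varies strongly across ranks or iterations contain slack from contention, so lowering the frequency there should cost little time. The threshold and region names are illustrative assumptions.

```python
import statistics

def dvfs_candidates(region_times: dict[str, list[float]],
                    cv_threshold: float = 0.15) -> list[str]:
    """Flag regions whose coefficient of variation exceeds the threshold."""
    flagged = []
    for region, times in region_times.items():
        mean = statistics.mean(times)
        cv = statistics.stdev(times) / mean if mean > 0 else 0.0
        if cv > cv_threshold:
            flagged.append(region)  # high variability => likely slack
    return flagged

# Example: timings (s) gathered over iterations; 'wait_recv' exhibits the
# "shape of variability" of a contended region and would run at low clocks.
times = {
    "dgemm_block": [1.00, 1.01, 0.99, 1.00],
    "wait_recv":   [0.20, 0.55, 0.05, 0.70],
}
print(dvfs_candidates(times))  # ['wait_recv']
```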
Contributions à la modélisation avec un système multi agent du transfert technologique en Green IT
Over the past 5 to 10 years, there has been abundant research on energy reduction in IT (mainly on reducing electricity consumption). Several studies have alerted stakeholders and environmental agencies to the urgency of the energy consumption of large-scale infrastructures, such as data centres, clouds, or simply companies running servers and lots of IT equipment. Energy has thus moved from a minor concern to a major constraint on the operation of these infrastructures. In some cases, operational costs reach investment costs, calling for new methodologies to reduce costs and ecological impact. As of today, new hardware is developed by equipment manufacturers to decrease these costs, while only a few basic techniques are offered out of the box at the software and middleware levels. In laboratories, however, some techniques have proven able, on synthetic data, dedicated workflows, or selected applications, to save energy during the lifetime of an infrastructure in several contexts, from Cloud to HPC in particular. Unfortunately, the transfer (or even the knowledge of the existence) of these techniques to industry is limited to project partners, innovative companies, or large private research centres able to invest time (and thus money) on this topic. In my thesis, I investigate the reasons restraining the wide adoption of several research results, from the simpler ones to the more elaborate, and I model the ties and interactions between the actors of technology transfer. The target field has been restricted to Green IT, but the methodology and the developed models can be extended to other domains as well. The idea is to identify, on the scale of technical maturity, what is missing for wider adoption and how to increase the speed of the transfer of scientific knowledge. The methodology follows this path: first, identifying the actors involved in the process of technology transfer and understanding their motivations, their means of action, and their limitations. After a study of the state of the art in innovation diffusion and innovation management, this phase involved the production and analysis of a dedicated survey targeting researchers and companies of different sizes and turnovers, restricted to those working in the Green IT field. Identifying each actor is not sufficient, since they all interact; therefore, in a second phase, their links and the potential of these links for technology transfer were also studied carefully, with the same methodology as for the actors' identification, so as to identify the most important ones. From these two phases, a multi-agent system (MAS) has been designed.
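Below is a toy sketch, not the thesis's MAS, of agent-based technology transfer in this spirit: company agents adopt a technique with a probability that grows with the number of adopters they are linked to. All agent types, links, and rates are invented for illustration.

```python
import random

random.seed(1)

class Company:
    """An agent that may adopt a Green IT technique via its links."""
    def __init__(self, name: str):
        self.name = name
        self.adopted = False
        self.partners: list["Company"] = []

    def step(self, base_rate: float = 0.02, peer_boost: float = 0.15) -> None:
        if self.adopted:
            return
        peers = sum(p.adopted for p in self.partners)
        if random.random() < base_rate + peer_boost * peers:
            self.adopted = True  # transfer succeeded via a direct or peer link

companies = [Company(f"c{i}") for i in range(20)]
for c in companies:  # random partnership links between companies
    c.partners = random.sample([x for x in companies if x is not c], 3)
companies[0].adopted = True  # e.g., a project partner of the research lab

for t in range(30):
    for c in companies:
        c.step()
print(sum(c.adopted for c in companies), "of 20 adopted after 30 steps")
```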
Diseño de un sistema de comunicaciones para virtualización remota de aceleradores gráficos sobre sistemas heterogéneos
Energy consumption is one of the main concerns in the design of any HPC system and has recently been recognized as one of the grand challenges for reaching the next milestone in supercomputer performance: one EXAFLOPS. To achieve this ambitious goal, it is necessary to design ever more energy-efficient supercomputers without losing sight of performance.
In this context, the incorporation of graphics accelerators into current HPC systems has given rise to clusters of multi-core machines where each node is equipped with its own accelerator. In principle, this has increased the energy efficiency of these configurations. However, the accelerators can remain idle for much of the time, during which they continue to consume a significant amount of energy. To achieve more efficient use of GPUs, several GPU virtualization technologies have been developed that allow GPU-accelerated applications to run while accessing a graphics accelerator installed in a remote node. At present, the solution that stands out for its robustness, flexibility, and efficiency is rCUDA.
Another strategy for increasing the energy efficiency of clusters consists of replacing the nodes that contain general-purpose processors, with their high energy consumption, by a larger number of platforms with cores of lower compute capability but low electrical power consumption. However, these configurations increase the execution time of HPC applications, which in the long run can result in higher energy consumption.
This research work addresses the design, implementation, and evaluation of a communication system for remote GPU virtualization based on rCUDA, using high-performance networks on heterogeneous systems. Specifically, the proposals developed in this thesis make it possible to exploit the energy-saving opportunities that arise when applying GPU virtualization in a heterogeneous cluster comprising nodes based on general-purpose processors, low-power multi-core platforms, and hybrid (CPU-GPU) architectures, interconnected by high-performance networks that support the RDMA protocol. The experimental evaluation of performance and energy consumption is carried out with a set of applications accelerated with remote GPUs. The experimental framework covers several configurations representative of future HPC systems, characterized by heterogeneous architectures aimed at increasing compute power while taking energy efficiency into account. The results obtained demonstrate the potential of the proposals developed in this work to increase the energy efficiency of the rCUDA virtualization solution.