1,471 research outputs found

    A Unified Infrastructure for Monitoring and Tuning the Energy Efficiency of HPC Applications

    Get PDF
    High Performance Computing (HPC) has become an indispensable tool for the scientific community to perform simulations on models whose complexity would exceed the limits of a standard computer. An unfortunate trend concerning HPC systems is that their power consumption under high-demanding workloads increases. To counter this trend, hardware vendors have implemented power saving mechanisms in recent years, which has increased the variability in power demands of single nodes. These capabilities provide an opportunity to increase the energy efficiency of HPC applications. To utilize these hardware power saving mechanisms efficiently, their overhead must be analyzed. Furthermore, applications have to be examined for performance and energy efficiency issues, which can give hints for optimizations. This requires an infrastructure that is able to capture both, performance and power consumption information concurrently. The mechanisms that such an infrastructure would inherently support could further be used to implement a tool that is able to do both, measuring and tuning of energy efficiency. This thesis targets all steps in this process by making the following contributions: First, I provide a broad overview on different related fields. I list common performance measurement tools, power measurement infrastructures, hardware power saving capabilities, and tuning tools. Second, I lay out a model that can be used to define and describe energy efficiency tuning on program region scale. This model includes hardware and software dependent parameters. Hardware parameters include the runtime overhead and delay for switching power saving mechanisms as well as a contemplation of their scopes and the possible influence on application performance. Thus, in a third step, I present methods to evaluate common power saving mechanisms and list findings for different x86 processors. Software parameters include their performance and power consumption characteristics as well as the influence of power-saving mechanisms on these. To capture software parameters, an infrastructure for measuring performance and power consumption is necessary. With minor additions, the same infrastructure can later be used to tune software and hardware parameters. Thus, I lay out the structure for such an infrastructure and describe common components that are required for measuring and tuning. Based on that, I implement adequate interfaces that extend the functionality of contemporary performance measurement tools. Furthermore, I use these interfaces to conflate performance and power measurements and further process the gathered information for tuning. I conclude this work by demonstrating that the infrastructure can be used to manipulate power-saving mechanisms of contemporary x86 processors and increase the energy efficiency of HPC applications

    Q-Learning Inspired Self-Tuning for Energy Efficiency in HPC

    Full text link
    System self-tuning is a crucial task to lower the energy consumption of computers. Traditional approaches decrease the processor frequency in idle or synchronisation periods. However, in High-Performance Computing (HPC) this is not sufficient: if the executed code is load balanced, there are neither idle nor synchronisation phases that can be exploited. Therefore, alternative self-tuning approaches are needed, which allow exploiting different compute characteristics of HPC programs. The novel notion of application regions based on function call stacks, introduced in the Horizon 2020 Project READEX, allows us to define such a self-tuning approach. In this paper, we combine these regions with the Q-Learning typical state-action maps, which save information about available states, possible actions to take, and the expected rewards. By exploiting the existing processor power interface, we are able to provide direct feedback to the learning process. This approach allows us to save up to 15% energy, while only adding a minor runtime overhead.Comment: 4 pages short paper, HPCS 2019, AHPC 2019, READEX, HAEC, Horizon2020, H2020 grant agreement number 671657, DFG, CRC 91

    From Facility to Application Sensor Data: Modular, Continuous and Holistic Monitoring with DCDB

    Full text link
    Today's HPC installations are highly-complex systems, and their complexity will only increase as we move to exascale and beyond. At each layer, from facilities to systems, from runtimes to applications, a wide range of tuning decisions must be made in order to achieve efficient operation. This, however, requires systematic and continuous monitoring of system and user data. While many insular solutions exist, a system for holistic and facility-wide monitoring is still lacking in the current HPC ecosystem. In this paper we introduce DCDB, a comprehensive monitoring system capable of integrating data from all system levels. It is designed as a modular and highly-scalable framework based on a plugin infrastructure. All monitored data is aggregated at a distributed noSQL data store for analysis and cross-system correlation. We demonstrate the performance and scalability of DCDB, and describe two use cases in the area of energy management and characterization.Comment: Accepted at the The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC) 201

    HPC Cloud for Scientific and Business Applications: Taxonomy, Vision, and Research Challenges

    Full text link
    High Performance Computing (HPC) clouds are becoming an alternative to on-premise clusters for executing scientific applications and business analytics services. Most research efforts in HPC cloud aim to understand the cost-benefit of moving resource-intensive applications from on-premise environments to public cloud platforms. Industry trends show hybrid environments are the natural path to get the best of the on-premise and cloud resources---steady (and sensitive) workloads can run on on-premise resources and peak demand can leverage remote resources in a pay-as-you-go manner. Nevertheless, there are plenty of questions to be answered in HPC cloud, which range from how to extract the best performance of an unknown underlying platform to what services are essential to make its usage easier. Moreover, the discussion on the right pricing and contractual models to fit small and large users is relevant for the sustainability of HPC clouds. This paper brings a survey and taxonomy of efforts in HPC cloud and a vision on what we believe is ahead of us, including a set of research challenges that, once tackled, can help advance businesses and scientific discoveries. This becomes particularly relevant due to the fast increasing wave of new HPC applications coming from big data and artificial intelligence.Comment: 29 pages, 5 figures, Published in ACM Computing Surveys (CSUR

    Extending the Functionality of Score-P through Plugins: Interfaces and Use Cases

    Get PDF
    Performance measurement and runtime tuning tools are both vital in the HPC software ecosystem and use similar techniques: the analyzed application is interrupted at specific events and information on the current system state is gathered to be either recorded or used for tuning. One of the established performance measurement tools is Score-P. It supports numerous HPC platforms and parallel programming paradigms. To extend Score-P with support for different back-ends, create a common framework for measurement and tuning of HPC applications, and to enable the re-use of common software components such as implemented instrumentation techniques, this paper makes the following contributions: (I) We describe the Score-P metric plugin interface, which enables programmers to augment the event stream with metric data from supplementary data sources that are otherwise not accessible for Score-P. (II) We introduce the flexible Score-P substrate plugin interface that can be used for custom processing of the event stream according to the specific requirements of either measurement, analysis, or runtime tuning tasks. (III) We provide examples for both interfaces that extend Score-P’s functionality for monitoring and tuning purposes

    A reference model for integrated energy and power management of HPC systems

    Get PDF
    Optimizing a computer for highest performance dictates the efficient use of its limited resources. Computers as a whole are rather complex. Therefore, it is not sufficient to consider optimizing hardware and software components independently. Instead, a holistic view to manage the interactions of all components is essential to achieve system-wide efficiency. For High Performance Computing (HPC) systems, today, the major limiting resources are energy and power. The hardware mechanisms to measure and control energy and power are exposed to software. The software systems using these mechanisms range from firmware, operating system, system software to tools and applications. Efforts to improve energy and power efficiency of HPC systems and the infrastructure of HPC centers achieve perpetual advances. In isolation, these efforts are unable to cope with the rising energy and power demands of large scale systems. A systematic way to integrate multiple optimization strategies, which build on complementary, interacting hardware and software systems is missing. This work provides a reference model for integrated energy and power management of HPC systems: the Open Integrated Energy and Power (OIEP) reference model. The goal is to enable the implementation, setup, and maintenance of modular system-wide energy and power management solutions. The proposed model goes beyond current practices, which focus on individual HPC centers or implementations, in that it allows to universally describe any hierarchical energy and power management systems with a multitude of requirements. The model builds solid foundations to be understandable and verifiable, to guarantee stable interaction of hardware and software components, for a known and trusted chain of command. This work identifies the main building blocks of the OIEP reference model, describes their abstract setup, and shows concrete instances thereof. A principal aspect is how the individual components are connected, interface in a hierarchical manner and thus can optimize for the global policy, pursued as a computing center's operating strategy. In addition to the reference model itself, a method for applying the reference model is presented. This method is used to show the practicality of the reference model and its application. For future research in energy and power management of HPC systems, the OIEP reference model forms a cornerstone to realize --- plan, develop and integrate --- innovative energy and power management solutions. For HPC systems themselves, it supports to transparently manage current systems with their inherent complexity, it allows to integrate novel solutions into existing setups, and it enables to design new systems from scratch. In fact, the OIEP reference model represents a basis for holistic efficient optimization.Computer auf höchstmögliche Rechenleistung zu optimieren bedingt Effizienzmaximierung aller limitierenden Ressourcen. Computer sind komplexe Systeme. Deshalb ist es nicht ausreichend, Hardware und Software isoliert zu betrachten. Stattdessen ist eine Gesamtsicht des Systems notwendig, um die Interaktionen aller Einzelkomponenten zu organisieren und systemweite Optimierungen zu ermöglichen. FĂŒr Höchstleistungsrechner (HLR) ist die limitierende Ressource heute ihre Leistungsaufnahme und der resultierende Gesamtenergieverbrauch. In aktuellen HLR-Systemen sind Energie- und Leistungsaufnahme programmatisch auslesbar als auch direkt und indirekt steuerbar. Diese Mechanismen werden in diversen Softwarekomponenten von Firmware, Betriebssystem, Systemsoftware bis hin zu Werkzeugen und Anwendungen genutzt und stetig weiterentwickelt. Durch die KomplexitĂ€t der interagierenden Systeme ist eine systematische Optimierung des Gesamtsystems nur schwer durchfĂŒhrbar, als auch nachvollziehbar. Ein methodisches Vorgehen zur Integration verschiedener OptimierungsansĂ€tze, die auf komplementĂ€re, interagierende Hardware- und Softwaresysteme aufbauen, fehlt. Diese Arbeit beschreibt ein Referenzmodell fĂŒr integriertes Energie- und Leistungsmanagement von HLR-Systemen, das „Open Integrated Energy and Power (OIEP)“ Referenzmodell. Das Ziel ist ein Referenzmodell, dass die Entwicklung von modularen, systemweiten energie- und leistungsoptimierenden Sofware-Verbunden ermöglicht und diese als allgemeines hierarchisches Managementsystem beschreibt. Dies hebt das Modell von bisherigen AnsĂ€tzen ab, welche sich auf Einzellösungen, spezifischen Software oder die BedĂŒrfnisse einzelner Rechenzentren beschrĂ€nken. Dazu beschreibt es Grundlagen fĂŒr ein planbares und verifizierbares Gesamtsystem und erlaubt nachvollziehbares und sicheres Delegieren von Energie- und Leistungsmanagement an Untersysteme unter Aufrechterhaltung der Befehlskette. Die Arbeit liefert die Grundlagen des Referenzmodells. Hierbei werden die Einzelkomponenten der Software-Verbunde identifiziert, deren abstrakter Aufbau sowie konkrete Instanziierungen gezeigt. Spezielles Augenmerk liegt auf dem hierarchischen Aufbau und der resultierenden Interaktionen der Komponenten. Die allgemeine Beschreibung des Referenzmodells erlaubt den Entwurf von Systemarchitekturen, welche letztendlich die Effizienzmaximierung der Ressource Energie mit den gegebenen Mechanismen ganzheitlich umsetzen können. HierfĂŒr wird ein Verfahren zur methodischen Anwendung des Referenzmodells beschrieben, welches die Modellierung beliebiger Energie- und Leistungsverwaltungssystemen ermöglicht. FĂŒr Forschung im Bereich des Energie- und Leistungsmanagement fĂŒr HLR bildet das OIEP Referenzmodell Eckstein, um Planung, Entwicklung und Integration von innovativen Lösungen umzusetzen. FĂŒr die HLR-Systeme selbst unterstĂŒtzt es nachvollziehbare Verwaltung der komplexen Systeme und bietet die Möglichkeit, neue Beschaffungen und Entwicklungen erfolgreich zu integrieren. Das OIEP Referenzmodell bietet somit ein Fundament fĂŒr gesamtheitliche effiziente Systemoptimierung

    TEXTAROSSA: Towards EXtreme scale Technologies and Accelerators for euROhpc hw/Sw Supercomputing Applications for exascale

    Get PDF
    To achieve high performance and high energy efficiency on near-future exascale computing systems, three key technology gaps needs to be bridged. These gaps include: energy efficiency and thermal control; extreme computation efficiency via HW acceleration and new arithmetics; methods and tools for seamless integration of reconfigurable accelerators in heterogeneous HPC multi-node platforms. TEXTAROSSA aims at tackling this gap through a co-design approach to heterogeneous HPC solutions, supported by the integration and extension of HW and SW IPs, programming models and tools derived from European research.This work is supported by the TEXTAROSSA project G.A. n.956831, as part of the EuroHPC initiative.Peer ReviewedArticle signat per 51 autors/es: Giovanni Agosta, Daniele Cattaneo, William Fornaciari, Andrea Galimberti, Giuseppe Massari, Federico Reghenzani, Federico Terraneo, Davide Zoni, Carlo Brandolese (DEIB – Politecnico di Milano, Italy, [email protected]) | Massimo Celino, Francesco Iannone, Paolo Palazzari, Giuseppe Zummo (ENEA, Italy, [email protected]) | Massimo Bernaschi, Pasqua D’Ambra (Istituto per le Applicazioni del Calcolo (IAC) - CNR, Italy, [email protected]) | Sergio Saponara, Marco Danelutto, Massimo Torquati (University of Pisa, Italy, [email protected]) | Marco Aldinucci, Yasir Arfat, Barbara Cantalupo, Iacopo Colonnelli, Roberto Esposito, Alberto R. Martinelli, Gianluca Mittone (University of Torino, Italy, [email protected]) | Olivier Beaumont, Berenger Bramas, Lionel Eyraud-Dubois, Brice Goglin, Abdou Guermouche, Raymond Namyst, Samuel Thibault (Inria - France, [email protected]) | Antonio Filgueras, Miquel Vidal, Carlos Alvarez, Xavier Martorell (BSC - Spain, [email protected]) | Ariel Oleksiak, Michal Kulczewski (PSNC, Poland, [email protected], [email protected]) | Alessandro Lonardo, Piero Vicini, Francesca Lo Cicero, Francesco Simula, Andrea Biagioni, Paolo Cretaro, Ottorino Frezza, Pier Stanislao Paolucci, Matteo Turisini (INFN Sezione di Roma - Italy, [email protected]) | Francesco Giacomini (INFN CNAF - Italy, [email protected]) | Tommaso Boccali (INFN Sezione di Pisa - Italy, [email protected]) | Simone Montangero (University of Padova and INFN Sezione di Padova - Italy, [email protected]) | Roberto Ammendola (INFN Sezione di Roma Tor Vergata - Italy, [email protected])Postprint (author's final draft

    Computing server power modeling in a data center: survey,taxonomy and performance evaluation

    Full text link
    Data centers are large scale, energy-hungry infrastructure serving the increasing computational demands as the world is becoming more connected in smart cities. The emergence of advanced technologies such as cloud-based services, internet of things (IoT) and big data analytics has augmented the growth of global data centers, leading to high energy consumption. This upsurge in energy consumption of the data centers not only incurs the issue of surging high cost (operational and maintenance) but also has an adverse effect on the environment. Dynamic power management in a data center environment requires the cognizance of the correlation between the system and hardware level performance counters and the power consumption. Power consumption modeling exhibits this correlation and is crucial in designing energy-efficient optimization strategies based on resource utilization. Several works in power modeling are proposed and used in the literature. However, these power models have been evaluated using different benchmarking applications, power measurement techniques and error calculation formula on different machines. In this work, we present a taxonomy and evaluation of 24 software-based power models using a unified environment, benchmarking applications, power measurement technique and error formula, with the aim of achieving an objective comparison. We use different servers architectures to assess the impact of heterogeneity on the models' comparison. The performance analysis of these models is elaborated in the paper
    • 

    corecore