5 research outputs found

    Resiliency in numerical algorithm design for extreme scale simulations

    Get PDF
    This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’ held March 1–6, 2020, at Schloss Dagstuhl, that was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of involving an enormous amount of resources and costs. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 1023 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.Peer Reviewed"Article signat per 36 autors/es: Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer, Luc Giraud, Dominik G ̈oddeke, Marco Heisig, Fabienne Jezequel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S. Quintana-Ortiz, Francesco Rizzi, Ulrich Rude, Martin Schulz, Fred Fung, Robert Speck, Linda Stals, Keita Teranishi, Samuel Thibault, Dominik Thonnes, Andreas Wagner and Barbara Wohlmuth"Postprint (author's final draft

    ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms

    Get PDF
    Fault-tolerant distributed applications require mechanisms to recover data lost via a process failure. On modern cluster systems it is typically impractical to request replacement resources after such a failure. Therefore, applications have to continue working with the remaining resources. This requires redistributing the workload and that the non-failed processes reload data. We present an algorithmic framework and its C++ library implementation ReStore for MPI programs that enables recovery of data after process failures. By storing all required data in memory via an appropriate data distribution and replication, recovery is substantially faster than with standard checkpointing schemes that rely on a parallel file system. As the application developer can specify which data to load, we also support shrinking recovery instead of recovery using spare compute nodes. We evaluate ReStore in both controlled, isolated environments and real applications. Our experiments show loading times of lost input data in the range of milliseconds on up to 24576 processors and a substantial speedup of the recovery time for the fault-tolerant version of a widely used bioinformatics application

    Enabling Scalability: Graph Hierarchies and Fault Tolerance

    Get PDF
    In this dissertation, we explore approaches to two techniques for building scalable algorithms. First, we look at different graph problems. We show how to exploit the input graph\u27s inherent hierarchy for scalable graph algorithms. The second technique takes a step back from concrete algorithmic problems. Here, we consider the case of node failures in large distributed systems and present techniques to quickly recover from these. In the first part of the dissertation, we investigate how hierarchies in graphs can be used to scale algorithms to large inputs. We develop algorithms for three graph problems based on two approaches to build hierarchies. The first approach reduces instance sizes for NP-hard problems by applying so-called reduction rules. These rules can be applied in polynomial time. They either find parts of the input that can be solved in polynomial time, or they identify structures that can be contracted (reduced) into smaller structures without loss of information for the specific problem. After solving the reduced instance using an exponential-time algorithm, these previously contracted structures can be uncontracted to obtain an exact solution for the original input. In addition to a simple preprocessing procedure, reduction rules can also be used in branch-and-reduce algorithms where they are successively applied after each branching step to build a hierarchy of problem kernels of increasing computational hardness. We develop reduction-based algorithms for the classical NP-hard problems Maximum Independent Set and Maximum Cut. The second approach is used for route planning in road networks where we build a hierarchy of road segments based on their importance for long distance shortest paths. By only considering important road segments when we are far away from the source and destination, we can substantially speed up shortest path queries. In the second part of this dissertation, we take a step back from concrete graph problems and look at more general problems in high performance computing (HPC). Here, due to the ever increasing size and complexity of HPC clusters, we expect hardware and software failures to become more common in massively parallel computations. We present two techniques for applications to recover from failures and resume computation. Both techniques are based on in-memory storage of redundant information and a data distribution that enables fast recovery. The first technique can be used for general purpose distributed processing frameworks: We identify data that is redundantly available on multiple machines and only introduce additional work for the remaining data that is only available on one machine. The second technique is a checkpointing library engineered for fast recovery using a data distribution method that achieves balanced communication loads. Both our techniques have in common that they work in settings where computation after a failure is continued with less machines than before. This is in contrast to many previous approaches that---in particular for checkpointing---focus on systems that keep spare resources available to replace failed machines. Overall, we present different techniques that enable scalable algorithms. While some of these techniques are specific to graph problems, we also present tools for fault tolerant algorithms and applications in a distributed setting. To show that those can be helpful in many different domains, we evaluate them for graph problems and other applications like phylogenetic tree inference

    A reference model for integrated energy and power management of HPC systems

    Get PDF
    Optimizing a computer for highest performance dictates the efficient use of its limited resources. Computers as a whole are rather complex. Therefore, it is not sufficient to consider optimizing hardware and software components independently. Instead, a holistic view to manage the interactions of all components is essential to achieve system-wide efficiency. For High Performance Computing (HPC) systems, today, the major limiting resources are energy and power. The hardware mechanisms to measure and control energy and power are exposed to software. The software systems using these mechanisms range from firmware, operating system, system software to tools and applications. Efforts to improve energy and power efficiency of HPC systems and the infrastructure of HPC centers achieve perpetual advances. In isolation, these efforts are unable to cope with the rising energy and power demands of large scale systems. A systematic way to integrate multiple optimization strategies, which build on complementary, interacting hardware and software systems is missing. This work provides a reference model for integrated energy and power management of HPC systems: the Open Integrated Energy and Power (OIEP) reference model. The goal is to enable the implementation, setup, and maintenance of modular system-wide energy and power management solutions. The proposed model goes beyond current practices, which focus on individual HPC centers or implementations, in that it allows to universally describe any hierarchical energy and power management systems with a multitude of requirements. The model builds solid foundations to be understandable and verifiable, to guarantee stable interaction of hardware and software components, for a known and trusted chain of command. This work identifies the main building blocks of the OIEP reference model, describes their abstract setup, and shows concrete instances thereof. A principal aspect is how the individual components are connected, interface in a hierarchical manner and thus can optimize for the global policy, pursued as a computing center's operating strategy. In addition to the reference model itself, a method for applying the reference model is presented. This method is used to show the practicality of the reference model and its application. For future research in energy and power management of HPC systems, the OIEP reference model forms a cornerstone to realize --- plan, develop and integrate --- innovative energy and power management solutions. For HPC systems themselves, it supports to transparently manage current systems with their inherent complexity, it allows to integrate novel solutions into existing setups, and it enables to design new systems from scratch. In fact, the OIEP reference model represents a basis for holistic efficient optimization.Computer auf höchstmögliche Rechenleistung zu optimieren bedingt Effizienzmaximierung aller limitierenden Ressourcen. Computer sind komplexe Systeme. Deshalb ist es nicht ausreichend, Hardware und Software isoliert zu betrachten. Stattdessen ist eine Gesamtsicht des Systems notwendig, um die Interaktionen aller Einzelkomponenten zu organisieren und systemweite Optimierungen zu ermöglichen. Für Höchstleistungsrechner (HLR) ist die limitierende Ressource heute ihre Leistungsaufnahme und der resultierende Gesamtenergieverbrauch. In aktuellen HLR-Systemen sind Energie- und Leistungsaufnahme programmatisch auslesbar als auch direkt und indirekt steuerbar. Diese Mechanismen werden in diversen Softwarekomponenten von Firmware, Betriebssystem, Systemsoftware bis hin zu Werkzeugen und Anwendungen genutzt und stetig weiterentwickelt. Durch die Komplexität der interagierenden Systeme ist eine systematische Optimierung des Gesamtsystems nur schwer durchführbar, als auch nachvollziehbar. Ein methodisches Vorgehen zur Integration verschiedener Optimierungsansätze, die auf komplementäre, interagierende Hardware- und Softwaresysteme aufbauen, fehlt. Diese Arbeit beschreibt ein Referenzmodell für integriertes Energie- und Leistungsmanagement von HLR-Systemen, das „Open Integrated Energy and Power (OIEP)“ Referenzmodell. Das Ziel ist ein Referenzmodell, dass die Entwicklung von modularen, systemweiten energie- und leistungsoptimierenden Sofware-Verbunden ermöglicht und diese als allgemeines hierarchisches Managementsystem beschreibt. Dies hebt das Modell von bisherigen Ansätzen ab, welche sich auf Einzellösungen, spezifischen Software oder die Bedürfnisse einzelner Rechenzentren beschränken. Dazu beschreibt es Grundlagen für ein planbares und verifizierbares Gesamtsystem und erlaubt nachvollziehbares und sicheres Delegieren von Energie- und Leistungsmanagement an Untersysteme unter Aufrechterhaltung der Befehlskette. Die Arbeit liefert die Grundlagen des Referenzmodells. Hierbei werden die Einzelkomponenten der Software-Verbunde identifiziert, deren abstrakter Aufbau sowie konkrete Instanziierungen gezeigt. Spezielles Augenmerk liegt auf dem hierarchischen Aufbau und der resultierenden Interaktionen der Komponenten. Die allgemeine Beschreibung des Referenzmodells erlaubt den Entwurf von Systemarchitekturen, welche letztendlich die Effizienzmaximierung der Ressource Energie mit den gegebenen Mechanismen ganzheitlich umsetzen können. Hierfür wird ein Verfahren zur methodischen Anwendung des Referenzmodells beschrieben, welches die Modellierung beliebiger Energie- und Leistungsverwaltungssystemen ermöglicht. Für Forschung im Bereich des Energie- und Leistungsmanagement für HLR bildet das OIEP Referenzmodell Eckstein, um Planung, Entwicklung und Integration von innovativen Lösungen umzusetzen. Für die HLR-Systeme selbst unterstützt es nachvollziehbare Verwaltung der komplexen Systeme und bietet die Möglichkeit, neue Beschaffungen und Entwicklungen erfolgreich zu integrieren. Das OIEP Referenzmodell bietet somit ein Fundament für gesamtheitliche effiziente Systemoptimierung
    corecore