Search CORE

70 research outputs found

CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

Author: Hager Georg
Kreutzer Moritz
Shahzad Faisal
Thies Jonas
Wellein Gerhard
Zeiser Thomas
Publication venue
Publication date: 07/08/2017
Field of study

In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency but it takes a lot of implementation effort. This work presents the implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic Fault Tolerance), which serves two purposes. First, it provides an extendable library that significantly eases the implementation of application-level checkpointing. The most basic and frequently used checkpoint data types are already part of CRAFT and can be directly used out of the box. The library can be easily extended to add more data types. As means of overhead reduction, the library offers a build-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node level checkpointing. Second, CRAFT provides an easier interface for User-Level Failure Mitigation (ULFM) based dynamic process recovery, which significantly reduces the complexity and effort of failure detection and communication recovery mechanism. By utilizing both functionalities together, applications can write application-level checkpoints and recover dynamically from process failures with very limited programming effort. This work presents the design and use of our library in detail. The associated overheads are thoroughly analyzed using several benchmarks

arXiv.org e-Print Archive

Institute of Transport Research:Publications

Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications

Author: González Patricia
Losada Nuria
Martín María J.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017
Field of study

This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-016-1863-z[Abstract] The Message Passing Interface (MPI) standard is the most popular parallel programming model for distributed systems. However, it lacks fault-tolerance support and, traditionally, failures are addressed with stop-and-restart checkpointing solutions. The proposal of User Level Failure Mitigation (ULFM) for the inclusion of resilience capabilities in the MPI standard provides new opportunities in this field, allowing the implementation of resilient MPI applications, i.e., applications that are able to detect and react to failures without stopping their execution. This work compares the performance of a traditional stop-and-restart checkpointing solution with its equivalent resilience proposal. Both approaches are built on top of ComPiler for Portable Checkpoiting (CPPC) an application-level checkpointing tool for MPI applications, and they allow to transparently obtain fault-tolerant MPI applications from generic MPI Single Program Multiple Data (SPMD). The evaluation is focused on the scalability of the two solutions, comparing both proposals using up to 3072 cores.Ministerio de Economía y Competitividad; TIN2013-42148-PMinisterio de Economía y Competitividad; BES-2014-068066Galicia.Consellería de Cultura, Educación e Ordenación Universitaria; GRC2013/05

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Fault tolerance of MPI applications in exascale systems: The ULFM solution

Author: Bosilca George
Bouteiller Aurelien
González Patricia
Losada Nuria
Martín María J.
Teranishi Keita
Publication venue: 'Elsevier BV'
Publication date: 01/05/2020
Field of study

[Abstract] The growth in the number of computational resources used by high-performance computing (HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become essential for long-running applications executing in future exascale systems, not only to ensure the completion of their execution in these systems but also to improve their energy consumption. Although the Message Passing Interface (MPI) is the most popular programming model for distributed-memory HPC systems, as of now, it does not provide any fault-tolerant construct for users to handle failures. Thus, the recovery procedure is postponed until the application is aborted and re-spawned. The proposal of the User Level Failure Mitigation (ULFM) interface in the MPI forum provides new opportunities in this field, enabling the implementation of resilient MPI applications, system runtimes, and programming language constructs able to detect and react to failures without aborting their execution. This paper presents a global overview of the resilience interfaces provided by the ULFM specification, covers archetypal usage patterns and building blocks, and surveys the wide variety of application-driven solutions that have exploited them in recent years. The large and varied number of approaches in the literature proves that ULFM provides the necessary flexibility to implement efficient fault-tolerant MPI applications. All the proposed solutions are based on application-driven recovery mechanisms, which allows reducing the overhead and obtaining the required level of efficiency needed in the future exascale platforms.Ministerio de Economía y Competitividad and FEDER; TIN2016-75845-PXunta de Galicia; ED431C 2017/04National Science Foundation of the United States; NSF-SI2 #1664142Exascale Computing Project; 17-SC-20-SCHoneywell International, Inc.; DE-NA000352

Repositorio da Universidade da Coruña

Implicit Actions and Non-blocking Failure Recovery with MPI

Author: Bosilca George
Bouteiller Aurelien
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 16/12/2022
Field of study

Scientific applications have long embraced the MPI as the environment of choice to execute on large distributed systems. The User-Level Failure Mitigation (ULFM) specification extends the MPI standard to address resilience and enable MPI applications to restore their communication capability after a failure. This works builds upon the wide body of experience gained in the field to eliminate a gap between current practice and the ideal, more asynchronous, recovery model in which the fault tolerance activities of multiple components can be carried out simultaneously and overlap. This work proposes to: (1) provide the required consistency in fault reporting to applications (i.e., enable an application to assess the success of a computational phase without incurring an unacceptable performance hit); (2) bring forward the building blocks that permit the effective scoping of fault recovery in an application, so that independent components in an application can recover without interfering with each other, and separate groups of processes in the application can recover independently or in unison; and (3) overlap recovery activities necessary to restore the consistency of the system (e.g., eviction of faulty processes from the communication group) with application recovery activities (e.g., dataset restoration from checkpoints).Comment: Accepted in FTXS'22 https://sites.google.com/view/ftxs202

arXiv.org e-Print Archive

Resilient MPI applications using an application-level checkpointing framework and ULFM

Author: Cores González Iván
González Patricia
Losada Nuria
Martín María J.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017
Field of study

This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-016-1629-7[Abstract] Future exascale systems, formed by millions of cores, will present high failure rates, and long-running applications will need to make use of new fault tolerance techniques to ensure successful execution completion. The Fault Tolerance Working Group, within the MPI forum, has presented the User Level Failure Mitigation (ULFM) proposal, providing new functionalities for the implementation of resilient MPI applications. In this work, the CPPC checkpointing framework is extended to exploit the new ULFM functionalities. The proposed solution transparently obtains resilient MPI applications by instrumenting the original application code. Besides, a multithreaded multilevel checkpointing, in which the checkpoint files are saved in different memory levels, improves the scalability of the solution. The experimental evaluation shows a low overhead when tolerating failures in one or several MPI processes.Ministerio de Economía y Competitividad; TIN2013-42148-PMinisterio de Economía y Competitividad; TIN2014-53522-REDTMinisterio de Economía y Competitividad; BES-2014-068066Galicia. Consellería de Cultura, Educación e Ordenación Universitaria; GRC2013/05

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Resilience of Parallel Applications

Author: González Patricia
Losada Nuria
Martín María J.
Publication venue
Publication date: 01/01/2016
Field of study

Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016) Timisoara, Romania. February 8-11, 2016.Future exascale systems are predicted to be formed by millions of cores. This is a great opportunity for HPC applications, however, it is also a hazard for the completion of their execution. Even if one computation node presents a failure every one century, a machine with 100.000 nodes will encounter a failure every 9 hours. Thus, HPC applications need to make use of fault tolerance techniques to ensure they successfully finish their execution. This PhD thesis is focused on fault tolerance solutions for generic parallel applications, more specifically in checkpointing solutions. We have extended CPPC, an MPI application-level portable checkpointing tool developed in our research group, to work with OpenMP applications, and hybrid MPI-OpenMP applications. Currently, we are working on transparently obtaining resilient MPI applications, that is, applications that are able to recover themselves from failures without stopping their execution.European Cooperation in Science and Technology. COSTThis research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Project TIN2013-42148-P, and the predoctoral grant of Nuria Losada ref. BES-2014-068066) and by EU under the COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS)

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Universidad Carlos III de Madrid e-Archivo