88 research outputs found

    Local Rollback for Resilient Mpi Applications With Application-Level Checkpointing and Message Logging

    Get PDF
    [Abstract] The resilience approach generally used in high-performance computing (HPC) relies on coordinated checkpoint/restart, a global rollback of all the processes that are running the application. However, in many instances, the failure has a more localized scope and its impact is usually restricted to a subset of the resources being used. Thus, a global rollback would result in unnecessary overhead and energy consumption, since all processes, including those unaffected by the failure, discard their state and roll back to the last checkpoint to repeat computations that were already done. The User Level Failure Mitigation (ULFM) interface – the last proposal for the inclusion of resilience features in the Message Passing Interface (MPI) standard – enables the deployment of more flexible recovery strategies, including localized recovery. This work proposes a local rollback approach that can be generally applied to Single Program, Multiple Data (SPMD) applications by combining ULFM, the ComPiler for Portable Checkpointing (CPPC) tool, and the Open MPI VProtocol system-level message logging component. Only failed processes are recovered from the last checkpoint, while consistency before further progress in the execution is achieved through a two-level message logging process. To further optimize this approach point-to-point communications are logged by the Open MPI VProtocol component, while collective communications are optimally logged at the application level—thereby decoupling the logging protocol from the particular collective implementation. This spatially coordinated protocol applied by CPPC reduces the log size, the log memory requirements and overall the resilience impact on the applications.This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Projects TIN2016-75845-P and the predoctoral grants of Nuria Losada ref. BES-2014-068066 and ref. EEBB-I-17-12005); by EU under the COST Program Action IC1305 Network for Sustainable Ultrascale Computing (NESUS) and a HiPEAC Collaboration Grant and by the Galician Government (Xunta de Galicia) under the Consolidation Program of Competitive Research (ref. ED431C 2017/04). We gratefully thank Galicia Supercomputing Center for providing access to the FinisTerrae-II supercomputer. This material is also based upon work supported by the US National Science Foundation, Office of Advanced Cyberinfrastructure , under Grants No. #1664142 and #1339763Xunta de Galicia; ED431C 2017/04US National Science Foundation, Office of Advanced Cyberinfrastructure; 1664142US National Science Foundation, Office of Advanced Cyberinfrastructure; 133976

    Application-level Fault Tolerance and Resilience in HPC Applications

    Get PDF
    Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01[Resumo] As necesidades computacionais das distintas ramas da ciencia medraron enormemente nos últimos anos, o que provocou un gran crecemento no rendemento proporcionado polos supercomputadores. Cada vez constrúense sistemas de computación de altas prestacións de maior tamaño, con máis recursos hardware de distintos tipos, o que fai que as taxas de fallo destes sistemas tamén medren. Polo tanto, o estudo de técnicas de tolerancia a fallos eficientes é indispensábel para garantires que os programas científicos poidan completar a súa execución, evitando ademais que se dispare o consumo de enerxía. O checkpoint/restart é unha das técnicas máis populares. Sen embargo, a maioría da investigación levada a cabo nas últimas décadas céntrase en estratexias stop-and-restart para aplicacións de memoria distribuída tralo acontecemento dun fallo-parada. Esta tese propón técnicas checkpoint/restart a nivel de aplicación para os modelos de programación paralela roáis populares en supercomputación. Implementáronse protocolos de checkpointing para aplicacións híbridas MPI-OpenMP e aplicacións heteroxéneas baseadas en OpenCL, en ámbolos dous casos prestando especial coidado á portabilidade e maleabilidade da solución. En canto a aplicacións de memoria distribuída, proponse unha solución de resiliencia que pode ser empregada de forma xenérica en aplicacións MPI SPMD, permitindo detectar e reaccionar a fallos-parada sen abortar a execución. Neste caso, os procesos fallidos vólvense a lanzar e o estado da aplicación recupérase cunha volta atrás global. A maiores, esta solución de resiliencia optimizouse implementando unha volta atrás local, na que só os procesos fallidos volven atrás, empregando un protocolo de almacenaxe de mensaxes para garantires a consistencia e o progreso da execución. Por último, propónse a extensión dunha librería de checkpointing para facilitares a implementación de estratexias de recuperación ad hoc ante conupcións de memoria. En moitas ocasións, estos erros poden ser xestionados a nivel de aplicación, evitando desencadear un fallo-parada e permitindo unha recuperación máis eficiente.[Resumen] El rápido aumento de las necesidades de cómputo de distintas ramas de la ciencia ha provocado un gran crecimiento en el rendimiento ofrecido por los supercomputadores. Cada vez se construyen sistemas de computación de altas prestaciones mayores, con más recursos hardware de distintos tipos, lo que hace que las tasas de fallo del sistema aumenten. Por tanto, el estudio de técnicas de tolerancia a fallos eficientes resulta indispensable para garantizar que los programas científicos puedan completar su ejecución, evitando además que se dispare el consumo de energía. La técnica checkpoint/restart es una de las más populares. Sin embargo, la mayor parte de la investigación en este campo se ha centrado en estrategias stop-and-restart para aplicaciones de memoria distribuida tras la ocurrencia de fallos-parada. Esta tesis propone técnicas checkpoint/restart a nivel de aplicación para los modelos de programación paralela más populares en supercomputación. Se han implementado protocolos de checkpointing para aplicaciones híbridas MPI-OpenMP y aplicaciones heterogéneas basadas en OpenCL, prestando en ambos casos especial atención a la portabilidad y la maleabilidad de la solución. Con respecto a aplicaciones de memoria distribuida, se propone una solución de resiliencia que puede ser usada de forma genérica en aplicaciones MPI SPMD, permitiendo detectar y reaccionar a fallosparada sin abortar la ejecución. En su lugar, se vuelven a lanzar los procesos fallidos y se recupera el estado de la aplicación con una vuelta atrás global. A mayores, esta solución de resiliencia ha sido optimizada implementando una vuelta atrás local, en la que solo los procesos fallidos vuelven atrás, empleando un protocolo de almacenaje de mensajes para garantizar la consistencia y el progreso de la ejecución. Por último, se propone una extensión de una librería de checkpointing para facilitar la implementación de estrategias de recuperación ad hoc ante corrupciones de memoria. Muchas veces, este tipo de errores puede gestionarse a nivel de aplicación, evitando desencadenar un fallo-parada y permitiendo una recuperación más eficiente.[Abstract] The rapid increase in the computational demands of science has lead to a pronounced growth in the performance offered by supercomputers. As High Performance Computing (HPC) systems grow larger, including more hardware components of different types, the system's failure rate becomes higher. Efficient fault tolerance techniques are essential not only to ensure the execution completion but also to save energy. Checkpoint/restart is one of the most popular fault tolerance techniques. However, most of the research in this field is focused on stop-and-restart strategies for distributed-memory applications in the event of fail-stop failures. Thís thesis focuses on the implementation of application-level checkpoint/restart solutions for the most popular parallel programming models used in HPC. Hence, we have implemented checkpointing solutions to cope with fail-stop failures in hybrid MPI-OpenMP applications and OpenCL-based programs. Both strategies maximize the restart portability and malleability, ie., the recovery can take place on machines with different CPU / accelerator architectures, and/ or operating systems, and can be adapted to the available resources (number of cores/accelerators). Regarding distributed-memory applications, we propose a resilience solution that can be generally applied to SPMD MPI programs. Resilient applications can detect and react to failures without aborting their execution upon fail-stop failures. Instead, failed processes are re-spawned, and the application state is recovered through a global rollback. Moreover, we have optimized this resilience proposal by implementing a local rollback protocol, in which only failed processes rollback to a previous state, while message logging enables global consistency and further progress of the computation. Finally, we have extended a checkpointing library to facilitate the implementation of ad hoc recovery strategies in the event of soft errors) caused by memory corruptions. Many times, these errors can be handled at the software-Ievel, tIms, avoiding fail-stop failures and enabling a more efficient recovery

    Fault tolerance of MPI applications in exascale systems: The ULFM solution

    Get PDF
    [Abstract] The growth in the number of computational resources used by high-performance computing (HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become essential for long-running applications executing in future exascale systems, not only to ensure the completion of their execution in these systems but also to improve their energy consumption. Although the Message Passing Interface (MPI) is the most popular programming model for distributed-memory HPC systems, as of now, it does not provide any fault-tolerant construct for users to handle failures. Thus, the recovery procedure is postponed until the application is aborted and re-spawned. The proposal of the User Level Failure Mitigation (ULFM) interface in the MPI forum provides new opportunities in this field, enabling the implementation of resilient MPI applications, system runtimes, and programming language constructs able to detect and react to failures without aborting their execution. This paper presents a global overview of the resilience interfaces provided by the ULFM specification, covers archetypal usage patterns and building blocks, and surveys the wide variety of application-driven solutions that have exploited them in recent years. The large and varied number of approaches in the literature proves that ULFM provides the necessary flexibility to implement efficient fault-tolerant MPI applications. All the proposed solutions are based on application-driven recovery mechanisms, which allows reducing the overhead and obtaining the required level of efficiency needed in the future exascale platforms.Ministerio de Economía y Competitividad and FEDER; TIN2016-75845-PXunta de Galicia; ED431C 2017/04National Science Foundation of the United States; NSF-SI2 #1664142Exascale Computing Project; 17-SC-20-SCHoneywell International, Inc.; DE-NA000352

    Algorithmic Based Fault Tolerance Applied to High Performance Computing

    Full text link
    We present a new approach to fault tolerance for High Performance Computing system. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the need of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flip) on the fly of a computation. To assess the viability of our approach, we have developed a fault tolerant matrix-matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix-matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result while one process failure has happened. This represents 65% of the machine peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly

    Soft-fault recovery in MPI applications

    Get PDF
    [Abstract] Current high-performance computing (HPC) systems are comprised of thousands of CPU cores, and this number is expected to grow into the millions in the near future. With such an elevated number of processors, the mean time between failures (MTBF) can become so small that most scientific applications will not have time to complete their execution before a failure occurs. It is therefore critical to develop fault tolerance and resilience mechanisms in order to guarantee the completion and integrity of massively parallel applications. University of A Coruña’s Computer Architecture Group (GAC) proposed a solution (Controller/comPiler for Portable Checkpointing - CPPC) in order to transparently convert generic MPI applications into fault tolerant applications, based on a checkpoint-restart scheme. CPPC was extended by Nuria Losada into CPPC-resilience in order to make resilient MPI applications, that is, those that are capable of detecting and reacting to failures without aborting the application, such that survivor processes don’t have to be restarted. This was accomplished by means of a logging protocol and the usage of a proposed fault tolerance interface addition to the MPI standard (User Level Failure Mitigation). However, this system cannot handle soft errors efficiently, since it kills and respawns the failed processes entirely when it is not necessary, as these errors are transient in nature. The object of this project is to extend and adapt CPPCresilience in order to handle soft errors in a more efficient manner, without having to respawn the failed processes. This proposal has been evaluated using 3 MPI applications with different characteristics, achieving a decrease in recovery times after a soft error ranging from 2 to 44 percent, depending on the total number of processes involved.[Resumo] Os sistemas actuais de computación de altas prestacións (HPC) están formados por miles de núcleos de procesadores, e espérase que este número aumente ata os millóns nun futuro cercano. Cun número tan elevado de procesadores, o tempo medio entre fallos (MTBF) pode chegar a reducirse tanto que a maioría de computacións científicas non terían tempo de completar a súa execución antes de que ocorrese un fallo. Polo tanto, é crítico o desenvolvemento de sistemas tolerantes e resilientes a fallos, para garantizar a finalización e integridade das aplicacións masivamente paralelas. O Grupo de Arquitectura de Computadores (GAC) da UDC propuxo unha solución (Controller/comPiler for Portable Checkpointing - CPPC) para convertir de xeito transparente aplicatcións MPI xenéricas en aplicacións tolerantes a fallos, basándose nun esquema de checkpointing e reinicio. CPPC foi posteriormente extendido por Nuria Losada en CPPC-resilience coa finalidade de crear aplicacións MPI resilientes, é dicir, aquelas que son capaces de detectar e reaccionar a fallos sen abortar a aplicación, de xeito que os procesos superviventes non necesitan ser reiniciados. Isto logrouse mediante un protocolo de logging de mensaxes e o uso dunha interfaz de tolerancia a fallos, ULFM (User Level Failure Mitigation), proposta para adición ao estándar MPI. Sen embargo, este sistema non xestiona os errores soft de maneira eficiente, xa que mata e reinicia os procesos fallados por completo cando non é necesario, xa que este tipo de errores teñen natureza transitoria. A meta deste TFG é extender e adaptar CPPC-resilience para poder manexar os errores soft eficientemente, sen ter que reiniciar os procesos fallados. Esta proposta foi evaluada utilizando 3 aplicacións MPI con diferentes características, conseguindo unha redución nos tempos de recuperación tras un erro soft de entre un 2 e un 44 por cento, dependendo do número total de procesos involucrados

    Extensions of Task-based Runtime for High Performance Dense Linear Algebra Applications

    Get PDF
    On the road to exascale computing, the gap between hardware peak performance and application performance is increasing as system scale, chip density and inherent complexity of modern supercomputers are expanding. Even if we put aside the difficulty to express algorithmic parallelism and to efficiently execute applications at large scale, other open questions remain. The ever-growing scale of modern supercomputers induces a fast decline of the Mean Time To Failure. A generic, low-overhead, resilient extension becomes a desired aptitude for any programming paradigm. This dissertation addresses these two critical issues, designing an efficient unified linear algebra development environment using a task-based runtime, and extending a task-based runtime with fault tolerant capabilities to build a generic framework providing both soft and hard error resilience to task-based programming paradigm. To bridge the gap between hardware peak performance and application perfor- mance, a unified programming model is designed to take advantage of a lightweight task-based runtime to manage the resource-specific workload, and to control the data ow and parallel execution of tasks. Under this unified development, linear algebra tasks are abstracted across different underlying heterogeneous resources, including multicore CPUs, GPUs and Intel Xeon Phi coprocessors. Performance portability is guaranteed and this programming model is adapted to a wide range of accelerators, supporting both shared and distributed-memory environments. To solve the resilient challenges on large scale systems, fault tolerant mechanisms are designed for a task-based runtime to protect applications against both soft and hard errors. For soft errors, three additions to a task-based runtime are explored. The first recovers the application by re-executing minimum number of tasks, the second logs intermediary data between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re- execution. For hard errors, we propose two generic approaches, which augment the data logging mechanism for soft errors. The first utilizes non-volatile storage device to save logged data, while the second saves local logged data on a remote node to protect against node failure. Experimental results have confirmed that our soft and hard error fault tolerant mechanisms exhibit the expected correctness and efficiency

    Unified fault-tolerance framework for hybrid task-parallel message-passing applications

    Get PDF
    We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the task that experienced the error and transparently handles any message passing interface calls inside the task. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks. Secondly, we develop a mathematical model to unify task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme.The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the FI-DGR 2013 scholarship and the European Community’s Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402 and TIN2015-65316-P.Peer ReviewedPostprint (author's final draft

    A fault-tolerance protocol for parallel applications with communication imbalance

    Get PDF
    ArticuloThe predicted failure rates of future supercomputers loom the groundbreaking research large machines are expected to foster. Therefore, resilient extreme-scale applications are an absolute necessity to effectively use the new generation of supercomputers. Rollback-recovery techniques have been traditionally used in HPC to provide resilience. Among those techniques, message logging provides the appealing features of saving energy, accelerating recovery, and having low performance penalty. Its increased memory consumption is, however, an important downside. This paper introduces memory-constrained message logging (MCML), a general framework for decreasing the memory footprint of message-logging protocols. In particular, we demonstrate the effectiveness of MCML in maintaining message logging feasible for applications with substantial communication imbalance. This type of applications appear in many scientific fields. We present experimental results with several parallel codes running on up to 4,096 cores. Using those results and an analytical model, we predict MCML can reduce execution time up to 25% and energy consumption up to 15%, at extreme scale
    • …
    corecore