11 research outputs found

    Fault-tolerance and malleability in parallel message-passing applications

    Get PDF
    [Resumo] Esta tese explora solucións para tolerancia a fallos e maleabilidade baseadas en técnicas de checkpoint e reinicio para aplicacións de pase de mensaxes. No campo da tolerancia a fallos, esta tese contribúe melloraudo o factor que máis incrementa a sobrecarga, o custo de E/S no envorcado dos ficheiros de estado, propoñendo diferentes técnicas para reducir o tamaño dos ficheiros de checkpoint. Ademais, tamén se propón un mecanismo de migración de procesos baseado en checkpointing. Esto permite a migración proactiva de procesos desde nodos que están a piques de fallar, evitando un reinicio completo da execución e melloraudo a resistencia a fallos da aplicación. Finalmente, esta tese presenta unha proposta para transformar de forma transparente aplicacións MPI en traballos maleables. Esto é, programas paralelos que en tempo de execución son capaces de adaptarse so número de procesadores dispoñibles no sistema, conseguindo beneficios, como maior productividade, mellor tempo de resposta ou maior resistencia a fallos nos nodos. Todas as solucióru; propostas nesta tese foron implementadas a nivel de aplicación, e son independentes da arquitectura hardware, o sistema operativo, a implementación MPI usada, e de calquera framework de alto nivel, como os utilizados para o envío de traballos.[Resumen] Esta tesis explora soluciones de tolerancia a fallos y maleabilidad basadas en técnicas de checkpoint y reinicio para aplicaciones de pase de mensajes. En el campo de la tolerancia a fallos, contribuye mejorando el factor que más incrementa la sobrecarga, el coste de E/S en el volcado de los ficheros de estado, proponiendo diferentes técnicas para reducir el tamaño de los ficheros de checkpoint. Ademós, también se propone nn mecanismo de migración de procesos basado en checkpointing. Esto permite la migración proactiva de procesos desde nodos que están a punto de fallar, evitando un reinicio completo de la ejecución y mejorando la resistencia a fallos de la aplicación. Finalmente, se presenta una propuesta para transformar de forma transparente aplicaciones MPI en trabajos maleables. Esto es, programas paralelos que en tiempo de ejecución son capaces de adaptarse al número de procesadores disponibles en el sistema, consiguiendo beneficios, como mayor productividad, mejor tiempo de respuesta y mayor resistencia a fallos en los nodos. Todas las soluciones propuestas han sido implementadas a nivel de aplicación, siendo independientes de la arquitectura hardware, el sistema operativo, la implementación MPI usada y de cualquier framework de alto nivel, como los utilizados para el envío de trabajos.[Abstract] This Thesis focuses on exploring fault-tolerant and malleability solutions, based on checkpoint and restart techniques, for parallel message-passing applications. In the fault-tolerant field, tbis Thesis contributes to improving the most important overhead factor in checkpointing perfonnance, that is, the I/O cost of the state file dumping, through the proposal of different techniques to reduce the checkpoint file size. In addition, a process migration based on checkpointing is also proposed, that allows for proactively migrating processes fram nades that are about to fail, avoiding the complete restart of the execution and, thus, improving the application resilience. Finally, this Thesis also includes a proposal to transparently transform MPI applications into malleable jobs, that is, parallel programs that are able to adapt their execution to the number of available processors at runtime, which provides important benefits for the end users and the whole system, such as higher productivity and a better response time, or a greater resilience to node failures. All the solutions proposed in this Thesis have been implemented at the application-level, and they are independent of the hardware architecture, the operating system, or the MPI implementation used, and of any higher-level frameworks, such as job submission frameworks

    Failure Avoidance in MPI Applications Using an Application-Level Approach

    Get PDF
    [Abstract] Execution times of large-scale computational science and engineering parallel applications are usually longer than the mean-time-between-failures. For this reason, hardware failures must be tolerated by the applications to ensure that not all computation done is lost on machine failures. Checkpointing and rollback recovery is one of the most popular techniques to provide fault tolerance support to parallel applications. However, when a failure occurs, most checkpointing mechanisms require a complete restart of the parallel application from the last checkpoint. New advances in the prediction of hardware failures have led to the development of proactive process migration approaches, where tasks are migrated in a preventive way when node failures are anticipated, avoiding the restart of the whole application. The work presented in this paper extends an application-level checkpointing framework to proactively migrate message passing interface (MPI) processes when impending failures are notified, without having to restart the entire application. The main features of the proposed solution are: low overhead in failure-free executions, avoiding the checkpoint dumping associated to rolling back strategies; low overhead at migration time, by means of the design of a light and asynchronous protocol to achieve a consistent global state; transparency for the user, thanks to the use of a compiler tool and a runtime library and portability, as it is not locked into a particular architecture, operating system or MPI implementation.Ministerio de Ciencia e Innovación; TIN2010-16735Galicia. Consellería de Economía e Industria; 10PXIB105180P

    Reducing the overhead of an MPI application-level migration approach

    Get PDF
    [Abstract] Process migration provides many benefits for parallel environments including dynamic load balance, data access locality, or fault tolerance. This work proposes a solution that reduces the memory and I/O overhead in an application-level checkpoint-based migration approach. The proposal splits the checkpoint files in order to overlap the writing of the state in the terminating processes with the read and restarting operation in the newly spawned processes. It has been tested using the MPI NAS Parallel Benchmarks, showing encouraging results, both in terms of memory consumption and I/O migration times.Ministerio de Economía y Competitividad; TIN2013-42148-PGalicia. Consellería de Cultura, Educación e Ordenación Universitaria; GRC2013/05

    In-memory application-level checkpoint-based migration for MPI programs

    Get PDF
    This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-014-1120-2[Abstract] Process migration provides many benefits for parallel environments including dynamic load balancing, data access locality or fault tolerance. This paper describes an in-memory application-level checkpoint-based migration solution for MPI codes that uses the Hierarchical Data Format 5 (HDF5) to write the checkpoint files. The main features of the proposed solution are transparency for the user, achieved through the use of CPPC (ComPiler for Portable Checkpointing); portability, as the application-level approach makes the solution adequate for any MPI implementation and operating system, and the use of the HDF5 file format enables the restart on different architectures; and high performance, by saving the checkpoint files to memory instead of to disk through the use of the HDF5 in-memory files. Experimental results prove that the in-memory approach reduces significantly the I/O cost of the migration process.Ministerio de Ciencia e Innovación; TIN2010-16735Galicia. Consellería de Economía e Industria; 10PXIB105180P

    Resilient MPI applications using an application-level checkpointing framework and ULFM

    Get PDF
    This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-016-1629-7[Abstract] Future exascale systems, formed by millions of cores, will present high failure rates, and long-running applications will need to make use of new fault tolerance techniques to ensure successful execution completion. The Fault Tolerance Working Group, within the MPI forum, has presented the User Level Failure Mitigation (ULFM) proposal, providing new functionalities for the implementation of resilient MPI applications. In this work, the CPPC checkpointing framework is extended to exploit the new ULFM functionalities. The proposed solution transparently obtains resilient MPI applications by instrumenting the original application code. Besides, a multithreaded multilevel checkpointing, in which the checkpoint files are saved in different memory levels, improves the scalability of the solution. The experimental evaluation shows a low overhead when tolerating failures in one or several MPI processes.Ministerio de Economía y Competitividad; TIN2013-42148-PMinisterio de Economía y Competitividad; TIN2014-53522-REDTMinisterio de Economía y Competitividad; BES-2014-068066Galicia. Consellería de Cultura, Educación e Ordenación Universitaria; GRC2013/05

    Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes

    Get PDF
    This is a post-peer-review, pre-copyedit version of an article published in New Generation Computing. The final authenticated version is available online at: https://doi.org/10.1007/s00354-013-0302-4[Abstract] The execution times of large-scale parallel applications on nowadays multi/many-core systems are usually longer than the mean time between failures. Therefore, parallel applications must tolerate hardware failures to ensure that not all computation done is lost on machine failures. Checkpointing and rollback recovery is one of the most popular techniques to implement fault-tolerant applications. However, checkpointing parallel applications is expensive in terms of computing time, network utilization and storage resources. Thus, current checkpoint-recovery techniques should minimize these costs in order to be useful for large scale systems. In this paper three different and complementary techniques to reduce the size of the checkpoints generated by application-level checkpointing are proposed and implemented. Detailed experimental results obtained on a multicore cluster show the effectiveness of the proposed methods to reduce checkpointing cost.Ministerio de Ciencia e Innovación; TIN2010-16735Galicia. Consellería de Economía e Industria; 10PXIB105180P

    An application-level solution for the dynamic reconfiguration of MPI applications

    Get PDF
    International audienceCurrent parallel environments aggregate large numbers of computational resources with a high rate of change in their availability and load conditions. In order to obtain the best performance in this type of infrastructures, parallel applications must be able to adapt to these changing conditions. This paper presents an application-level proposal to automatically and transparently adapt MPI applications to the available resources. Experimental results show a good degree of adaptability and a good performance in di↵erent availability scenarios
    corecore