Resilient MPI applications using an application-level checkpointing framework and ULFM

Abstract

This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-016-1629-7

Future exascale systems, formed by millions of cores, will present high failure rates, and long-running applications will need new fault tolerance techniques to ensure that they run to completion. The Fault Tolerance Working Group, within the MPI Forum, has presented the User Level Failure Mitigation (ULFM) proposal, which provides new functionalities for implementing resilient MPI applications. In this work, the CPPC checkpointing framework is extended to exploit the new ULFM functionalities. The proposed solution transparently obtains resilient MPI applications by instrumenting the original application code. In addition, a multithreaded multilevel checkpointing scheme, in which the checkpoint files are saved at different memory levels, improves the scalability of the solution. The experimental evaluation shows a low overhead when tolerating failures in one or several MPI processes.

Funding:
Ministerio de Economía y Competitividad; TIN2013-42148-P
Ministerio de Economía y Competitividad; TIN2014-53522-REDT
Ministerio de Economía y Competitividad; BES-2014-068066
Galicia. Consellería de Cultura, Educación e Ordenación Universitaria; GRC2013/05
