247 research outputs found

    CPPC: a compiler‐assisted tool for portable checkpointing of message‐passing applications

    Get PDF
    This is the peer reviewed version of the following article: Rodríguez, G. , Martín, M. J., González, P. , Touriño, J. and Doallo, R. (2010), CPPC: a compiler‐assisted tool for portable checkpointing of message‐passing applications. Concurrency Computat.: Pract. Exper., 22: 749-766. doi:10.1002/cpe.1541, which has been published in final form at https://doi.org/10.1002/cpe.1541. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions.[Abstract] With the evolution of high‐performance computing toward heterogeneous, massively parallel systems, parallel applications have developed new checkpoint and restart necessities. Whether due to a failure in the execution or to a migration of the application processes to different machines, checkpointing tools must be able to operate in heterogeneous environments. However, some of the data manipulated by a parallel application are not truly portable. Examples of these include opaque state (e.g. data structures for communications support) or diversity of interfaces for a single feature (e.g. communications, I/O). Directly manipulating the underlying ad hoc representations renders checkpointing tools unable to work on different environments. Portable checkpointers usually work around portability issues at the cost of transparency: the user must provide information such as what data need to be stored, where to store them, or where to checkpoint. CPPC (ComPiler for Portable Checkpointing) is a checkpointing tool designed to feature both portability and transparency. It is made up of a library and a compiler. The CPPC library contains routines for variable level checkpointing, using portable code and protocols. The CPPC compiler helps to achieve transparency by relieving the user from time‐consuming tasks, such as data flow and communications analyses and adding instrumentation code. This paper covers both the operation of the CPPC library and its compiler support. Experimental results using benchmarks and large‐scale real applications are included, demonstrating usability, efficiency, and portability.Miniesterio de Educación y Ciencia; TIN2007‐67537‐C03Xunta de Galicia; 2006/

    Checkpoint-based forward recovery using lookahead execution and rollback validation in parallel and distributed systems

    Get PDF
    This thesis studies a forward recovery strategy using checkpointing and optimistic execution in parallel and distributed systems. The approach uses replicated tasks executing on different processors for forwared recovery and checkpoint comparison for error detection. To reduce overall redundancy, this approach employs a lower static redundancy in the common error-free situation to detect error than the standard N Module Redundancy scheme (NMR) does to mask off errors. For the rare occurrence of an error, this approach uses some extra redundancy for recovery. To reduce the run-time recovery overhead, look-ahead processes are used to advance computation speculatively and a rollback process is used to produce a diagnosis for correct look-ahead processes without rollback of the whole system. Both analytical and experimental evaluation have shown that this strategy can provide a nearly error-free execution time even under faults with a lower average redundancy than NMR

    Compiler-Assisted Checkpointing of Parallel Codes: The Cetus and LLVM Experience

    Get PDF
    This is a post-peer-review, pre-copyedit version of an article published in International Journal of Parallel Programming. The final authenticated version is available online at: https://doi.org/10.1007/s10766-012-0231-8[Abstract] With the evolution of high-performance computing, parallel applications have developed an increasing necessity for fault tolerance, most commonly provided by checkpoint and restart techniques. Checkpointing tools are typically implemented at one of two different abstraction levels: at the system level or at the application level. The latter has become an interesting alternative due to its flexibility and the possibility of operating in different environments. However, application-level checkpointing tools often require the user to manually insert checkpoints in order to ensure that certain requirements are met (e.g. forcing checkpoints to be taken at the user code and not inside kernel routines). This paper examines the transformations required to enable automatic checkpointing of parallel applications in the CPPC application-level checkpointing framework. These transformations have been implemented on two very different compiler infrastructures: Cetus and LLVM. Cetus is a Java-based compiler infrastructure aiming to provide an easy to use and clean IR and API for program transformation. LLVM is a low-level, SSA-based toolchain. The fundamental differences of both approaches are analyzed from the structural, behavioral and performance perspectives.Galicia. Consellería de Economía e Industria; 10PXIB105180PRMinisterio de Ciencia e Innovación; TIN2010-1673

    A Heuristic Approach for the Automatic Insertion of Checkpoints in Message-Passing Codes

    Get PDF
    [Abstract] Checkpointing tools may be typically implemented at two different abstraction levels: at the system level or at the application level. The latter has become a more popular alternative due to its flexibility and the possibility of operating in different environments. However, application-level checkpointing tools often require the user to manually insert checkpoints in order to ensure that certain requirements are met (e.g. forcing checkpoints to be taken at the user code and not inside kernel routines). The approach presented in this work is twofold. First, a spatial coordination protocol for checkpointing parallel SPMD applications is proposed, based on forcing checkpoints to be taken at the same places in the application code by all processes. Thus, global consistency is achieved without adding any new runtime communications or piggybacked data, and without the need to use specific fault-tolerant message-passing implementations. Second, the paper also introduces a compilation technique for the automatic insertion of checkpoints using the spatial coordination protocol, based on a static analysis of communications and a heuristic analysis of computational load. These analyses can also be used to achieve automatic checkpoint insertion in approaches based on classical protocols, such as uncoordinated checkpointing or distributed snapshots.Ministerio de Ciencia e Innovación; TIN-2007-67537-C03-0

    Compiler-Assisted Multiple Instruction Rollback Recovery Using a Read Buffer

    Get PDF
    Multiple instruction rollback (MIR) is a technique to provide rapid recovery from transient processor failures and was implemented in hardware by researchers and slow in mainframe computers. Hardware-based MIR designs eliminate rollback data hazards by providing data redundancy implemented in hardware. Compiler-based MIR designs were also developed which remove rollback data hazards directly with data flow manipulations, thus eliminating the need for most data redundancy hardware. Compiler-assisted techniques to achieve multiple instruction rollback recovery are addressed. It is observed that data some hazards resulting from instruction rollback can be resolved more efficiently by providing hardware redundancy while others are resolved more efficiently with compiler transformations. A compiler-assisted multiple instruction rollback scheme is developed which combines hardware-implemented data redundancy with compiler-driven hazard removal transformations. Experimental performance evaluations were conducted which indicate improved efficiency over previous hardware-based and compiler-based schemes. Various enhancements to the compiler transformations and to the data redundancy hardware developed for the compiler-assisted MIR scheme are described and evaluated. The final topic deals with the application of compiler-assisted MIR techniques to aid in exception repair and branch repair in a speculative execution architecture

    Analysis of Performance-impacting Factors on Checkpointing Frameworks: The CPPC Case Study

    Get PDF
    This is a post-peer-review, pre-copyedit version of an article published in The Computer Journal. The final authenticated version is available online at: https://doi.org/10.1093/comjnl/bxr018[Abstract] This paper focuses on the performance evaluation of Compiler for Portable Checkpointing (CPPC), a tool for the checkpointing of parallel message-passing applications. Its performance and the factors that impact it are transparently and rigorously identified and assessed. The tests were performed on a public supercomputing infrastructure, using a large number of very different applications and showing excellent results in terms of performance and effort required for integration into user codes. Statistical analysis techniques have been used to better approximate the performance of the tool. Quantitative and qualitative comparisons with other rollback-recovery approaches to fault tolerance are also included. All these data and comparisons are then discussed in an effort to extract meaningful conclusions about the state-of-the-art and future research trends in the rollback-recovery field.Minsiterio de Ciencia e Innovación; TIN2010-1673

    Extending an Application-Level Checkpointing Tool to Provide Fault Tolerance Support to OpenMP Applications

    Get PDF
    [Abstract] Despite the increasing popularity of shared-memory systems, there is a lack of tools for providing fault tolerance support to shared-memory applications. CPPC (ComPiler for Portable Checkpointing) is an application-level checkpointing tool focused on the insertion of fault tolerance into long-running MPI applications. This paper presents an extension to CPPC to allow the checkpointing of OpenMP applications. The proposed solution maintains the main characteristics of CPPC: portability and reduced checkpoint file size. The performance of the proposal is evaluated using the OpenMP NAS Parallel Benchmarks showing that most of the applications present small checkpoint overheads.Ministerio de Economía y Competitividad; TIN2013-42148-

    Compiler-assisted multiple instruction rollback recovery using a read buffer

    Get PDF
    Multiple instruction rollback (MIR) is a technique that has been implemented in mainframe computers to provide rapid recovery from transient processor failures. Hardware-based MIR designs eliminate rollback data hazards by providing data redundancy implemented in hardware. Compiler-based MIR designs have also been developed which remove rollback data hazards directly with data-flow transformations. This paper describes compiler-assisted techniques to achieve multiple instruction rollback recovery. We observe that some data hazards resulting from instruction rollback can be resolved efficiently by providing an operand read buffer while others are resolved more efficiently with compiler transformations. The compiler-assisted scheme presented consists of hardware that is less complex than shadow files, history files, history buffers, or delayed write buffers, while experimental evaluation indicates performance improvement over compiler-based schemes

    Compiler assisted chekpointing of message-passing applications in heterogeneous environments

    Get PDF
    [Resumen] With the evolution of high performance computing towards heterogeneous, massively parallel systems, parallel applications have developed new checkpoint and restart necessities, Whether due to a failure in the execution or to a migration of the processes to different machines, checkpointing tools must be able to operate in heterogeneous environments. However, some of the data manipulated by a parallel application are not truly portable. Examples of these include opaque state (e.g. data structures for communications support) or diversity of interfaces for a single feature (e.g. communications, I/O). Directly manipulating the underlying ad-hoc representations renders checkpointing tools incapable of working on different environments. Portable checkpointers usually work around portability issues at the cost of transparency: the user must provide information such as what data needs to be stored, where to store it, or where to checkpoint. CPPC (ComPiler for Portable Checkpointing) is a checkpointing tool designed to feature both portability and transparency, while preserving the scalability of the executed applications. It is made up of a library and a compiler. The CPPC library contains routines for variable level checkpointing, using portable code and protocols. The CPPC compiler achieves transparency by relieving the user from time-consuming tasks, such as performing code analyses and adding instrumentation code
    corecore