369 research outputs found

    Checkpointing of parallel applications in a Grid environment

    Get PDF
    The Grid environment is generic, heterogeneous, and dynamic with lots of unreliable resources making it very exposed to failures. The environment is unreliable because it is geographically dispersed involving multiple autonomous administrative domains and it is composed of a large number of components. Examples of failures in the Grid environment can be: application crash, Grid node crash, network failures, and Grid system component failures. These types of failures can affect the execution of parallel/distributed application in the Grid environment and so, protections against these faults are crucial. Therefore, it is essential to develop efficient fault tolerant mechanisms to allow users to successfully execute Grid applications. One of the research challenges in Grid computing is to be able to develop a fault tolerant solution that will ensure Grid applications are executed reliably with minimum overhead incurred. While checkpointing is the most common method to achieve fault tolerance, there is still a lot of work to be done to improve the efficiency of the mechanism. This thesis provides an in-depth description of a novel solution for checkpointing parallel applications executed on a Grid. The checkpointing mechanism implemented allows to checkpoint an application at regions where there is no interprocess communication involved and therefore reducing the checkpointing overhead and checkpoint size

    RADIC II : a fault tolerant architecture with flexible dynamic redundancy

    Get PDF
    The demand for computational power has been leading the improvement of the High Performance Computing (HPC) area, generally represented by the use of distributed systems like clusters of computers running parallel applications. In this area, fault tolerance plays an important role in order to provide high availability isolating the application from the faults effects. Performance and availability form an undissociable binomial for some kind of applications. Therefore, the fault tolerant solutions must take into consideration these two constraints when it has been designed. In this dissertation, we present a few side-effects that some fault tolerant solutions may presents when recovering a failed process. These effects may causes degradation of the system, affecting mainly the overall performance and availability. We introduce RADIC-II, a fault tolerant architecture for message passing based on RADIC (Redundant Array of Distributed Independent Fault Tolerance Controllers) architecture. RADIC-II keeps as maximum as possible the RADIC features of transparency, decentralization, flexibility and scalability, incorporating a flexible dynamic redundancy feature, allowing to mitigate or to avoid some recovery side-effects.La demanda de computadores más veloces ha provocado el incremento del área de computación de altas prestaciones, generalmente representado por el uso de sistemas distribuidos como los clusters de computadores ejecutando aplicaciones paralelas. En esta área, la tolerancia a fallos juega un papel muy importante a la hora de proveer alta disponibilidad, aislando los efectos causados por los fallos. Prestaciones y disponibilidad componen un binomio indisociable para algunos tipos de aplicaciones. Por eso, las soluciones de tolerancia a fallos deben tener en consideración estas dos restricciones desde el momento de su diseño. En esta disertación, presentamos algunos efectos colaterales que se puede presentar en ciertas soluciones tolerantes a fallos cuando recuperan un proceso fallado. Estos efectos pueden causar una degradación del sistema, afectando las prestaciones y disponibilidad finales. Presentamos RADIC-II, una arquitectura tolerante a fallos para paso de mensajes basada en la arquitectura RADIC (Redundant Array of Distributed Independent Fault Tolerance Controllers). RADIC-II mantiene al máximo posible las características de transparencia, descentralización, flexibilidad y escalabilidad existentes en RADIC, e incorpora una flexible funcionalidad de redundancia dinámica, que permite mitigar o evitar algunos efectos colaterales en la recuperación

    An approach to rollback recovery of collaborating mobile agents

    Get PDF
    Fault-tolerance is one of the main problems that must be resolved to improve the adoption of the agents' computing paradigm. In this paper, we analyse the execution model of agent platforms and the significance of the faults affecting their constituent components on the reliable execution of agent-based applications, in order to develop a pragmatic framework for agent systems fault-tolerance. The developed framework deploys a communication-pairs independent check pointing strategy to offer a low-cost, application-transparent model for reliable agent- based computing that covers all possible faults that might invalidate reliable agent execution, migration and communication and maintains the exactly-one execution property

    Rollback recovery with low overhead for fault tolerance in mobile ad hoc networks

    Get PDF
    AbstractMobile ad hoc networks (MANETs) have significantly enhanced the wireless networks by eliminating the need for any fixed infrastructure. Hence, these are increasingly being used for expanding the computing capacity of existing networks or for implementation of autonomous mobile computing Grids. However, the fragile nature of MANETs makes the constituent nodes susceptible to failures and the computing potential of these networks can be utilized only if they are fault tolerant. The technique of checkpointing based rollback recovery has been used effectively for fault tolerance in static and cellular mobile systems; yet, the implementation of existing protocols for MANETs is not straightforward. The paper presents a novel rollback recovery protocol for handling the failures of mobile nodes in a MANET using checkpointing and sender based message logging. The proposed protocol utilizes the routing protocol existing in the network for implementing a low overhead recovery mechanism. The presented recovery procedure at a node is completely domino-free and asynchronous. The protocol is resilient to the dynamic characteristics of the MANET; allowing a distributed application to be executed independently without access to any wired Grid or cellular network access points. We also present an algorithm to record a consistent global snapshot of the MANET

    A Survey of Checkpointing Algorithms in Mobile Ad Hoc Network

    Get PDF
    Checkpoint is defined as a fault tolerant technique that is a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. If there is a failure, computation may be restarted from the current checkpoint instead of repeating the computation from beginning. Checkpoint based rollback recovery is one of the widely used technique used in various areas like scientific computing, database, telecommunication and critical applications in distributed and mobile ad hoc network. The mobile ad hoc network architecture is one consisting of a set of self configure mobile hosts capable of communicating with each other without the assistance of base stations. The main problems of this environment are insufficient power and limited storage capacity, so the checkpointing is major challenge in mobile ad hoc network. This paper presents the review of the algorithms, which have been reported for checkpointing approaches in mobile ad hoc network

    Towards Ad Hoc Recovery for Soft Errors

    Get PDF
    The coming exascale era is a great opportunity for high performance computing (HPC) applications. However, high failure rates on these systems will hazard the successful completion of their execution. Bit-flip errors in dynamic random access memory (DRAM) account for a noticeable share of the failures in supercomputers. Hardware mechanisms, such as error correcting code (ECC), can detect and correct single-bit errors and can detect some multi-bit errors while others can go undiscovered. Unfortunately, detected multi-bit errors will most of the time force the termination of the application and lead to a global restart. Thus, other strategies at the software level are needed to tolerate these type of faults more efficiently and to avoid a global restart. In this work, we extend the FTI checkpointing library to facilitate the implementation of custom recovery strategies for MPI applications, minimizing the overhead introduced when coping with soft errors. The new functionalities are evaluated by implementing local forward recovery on three HPC benchmarks with different reliability requirements. Our results demonstrate a reduction on the recovery times by up to 14%.This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 708566 (DURO). This research is also supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Projects TIN2016-75845-P and the predoctoral grant of Nuria Losada ref. BES-2014-068066), and by the Galician Government (Xunta de Galicia) under the Consolidation Program of Competitive Research (ref. ED431C 2017/04).Peer ReviewedPostprint (author's final draft
    • …
    corecore