169 research outputs found

    CIC : an integrated approach to checkpointing in mobile agent systems

    Get PDF
    Internet and Mobile Computing Lab (in Department of Computing)Refereed conference paper2006-2007 > Academic research: refereed > Refereed conference paperVersion of RecordPublishe

    A Survey of Checkpointing Algorithms in Mobile Ad Hoc Network

    Get PDF
    Checkpoint is defined as a fault tolerant technique that is a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. If there is a failure, computation may be restarted from the current checkpoint instead of repeating the computation from beginning. Checkpoint based rollback recovery is one of the widely used technique used in various areas like scientific computing, database, telecommunication and critical applications in distributed and mobile ad hoc network. The mobile ad hoc network architecture is one consisting of a set of self configure mobile hosts capable of communicating with each other without the assistance of base stations. The main problems of this environment are insufficient power and limited storage capacity, so the checkpointing is major challenge in mobile ad hoc network. This paper presents the review of the algorithms, which have been reported for checkpointing approaches in mobile ad hoc network

    RADIC II : a fault tolerant architecture with flexible dynamic redundancy

    Get PDF
    The demand for computational power has been leading the improvement of the High Performance Computing (HPC) area, generally represented by the use of distributed systems like clusters of computers running parallel applications. In this area, fault tolerance plays an important role in order to provide high availability isolating the application from the faults effects. Performance and availability form an undissociable binomial for some kind of applications. Therefore, the fault tolerant solutions must take into consideration these two constraints when it has been designed. In this dissertation, we present a few side-effects that some fault tolerant solutions may presents when recovering a failed process. These effects may causes degradation of the system, affecting mainly the overall performance and availability. We introduce RADIC-II, a fault tolerant architecture for message passing based on RADIC (Redundant Array of Distributed Independent Fault Tolerance Controllers) architecture. RADIC-II keeps as maximum as possible the RADIC features of transparency, decentralization, flexibility and scalability, incorporating a flexible dynamic redundancy feature, allowing to mitigate or to avoid some recovery side-effects.La demanda de computadores más veloces ha provocado el incremento del área de computación de altas prestaciones, generalmente representado por el uso de sistemas distribuidos como los clusters de computadores ejecutando aplicaciones paralelas. En esta área, la tolerancia a fallos juega un papel muy importante a la hora de proveer alta disponibilidad, aislando los efectos causados por los fallos. Prestaciones y disponibilidad componen un binomio indisociable para algunos tipos de aplicaciones. Por eso, las soluciones de tolerancia a fallos deben tener en consideración estas dos restricciones desde el momento de su diseño. En esta disertación, presentamos algunos efectos colaterales que se puede presentar en ciertas soluciones tolerantes a fallos cuando recuperan un proceso fallado. Estos efectos pueden causar una degradación del sistema, afectando las prestaciones y disponibilidad finales. Presentamos RADIC-II, una arquitectura tolerante a fallos para paso de mensajes basada en la arquitectura RADIC (Redundant Array of Distributed Independent Fault Tolerance Controllers). RADIC-II mantiene al máximo posible las características de transparencia, descentralización, flexibilidad y escalabilidad existentes en RADIC, e incorpora una flexible funcionalidad de redundancia dinámica, que permite mitigar o evitar algunos efectos colaterales en la recuperación

    A Recovery Scheme for Cluster Federations Using Sender-based Message Logging

    Get PDF
    A cluster federation is a union of clusters and is heterogeneous. Each cluster contains a certain number of processes. An application running in such a computing environment is divided into communicating modules so that these modules can run on different clusters. To achieve fault-tolerance different clusters may employ different check pointing schemes. For example, some may use coordinated schemes, while some other may use communication-induced schemes. It may complicate the recovery process. In this paper, we have addressed the complex problem of recovery for cluster computing environment. The proposed approach handles both inter cluster orphan and lost messages unlike the existing works in this area. We first propose an algorithm to determine a recovery line so that there does not exist any inter cluster orphan message between any pair of the cluster level check points belonging to the recovery line. The main feature of the proposed algorithm is that it can be executed simultaneously by all clusters in the cluster federation. Next we apply the sender-based message logging idea to effectively handle all inter cluster lost messages to ensure correctness of computation

    Study and Design of Global Snapshot Compilation Protocols for Rollback-Recovery in Mobile Distributed System

    Get PDF
    Checkpoint is characterized as an assigned place in a program at which ordinary process is intruded on particularly to protect the status data important to permit resumption of handling at a later time. A conveyed framework is an accumulation of free elements that participate to tackle an issue that can't be separately comprehended. A versatile figuring framework is a dispersed framework where some of procedures are running on portable hosts (MHs). The presence of versatile hubs in an appropriated framework presents new issues that need legitimate dealing with while outlining a checkpointing calculation for such frameworks. These issues are portability, detachments, limited power source, helpless against physical harm, absence of stable stockpiling and so forth. As of late, more consideration has been paid to giving checkpointing conventions to portable frameworks. Least process composed checkpointing is an alluring way to deal with present adaptation to internal failure in portable appropriated frameworks straightforwardly. This approach is without domino, requires at most two recovery_points of a procedure on stable stockpiling, and powers just a base number of procedures to recovery_point. In any case, it requires additional synchronization messages, hindering of the basic calculation or taking some futile recovery_points. In this paper, we complete the writing review of some Minimum-process Coordinated Checkpointing Algorithms for Mobile Computing System

    Integrated analysis of error detection and recovery

    Get PDF
    An integrated modeling and analysis of error detection and recovery is presented. When fault latency and/or error latency exist, the system may suffer from multiple faults or error propagations which seriously deteriorate the fault-tolerant capability. Several detection models that enable analysis of the effect of detection mechanisms on the subsequent error handling operations and the overall system reliability were developed. Following detection of the faulty unit and reconfiguration of the system, the contaminated processes or tasks have to be recovered. The strategies of error recovery employed depend on the detection mechanisms and the available redundancy. Several recovery methods including the rollback recovery are considered. The recovery overhead is evaluated as an index of the capabilities of the detection and reconfiguration mechanisms

    Using Rollback Avoidance to Mitigate Failures in Next-Generation Extreme-Scale Systems

    Get PDF
    High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in many important physical systems. The next major milestone in the development of HPC systems is the construction of the first supercomputer capable executing more than an exaflop, 10^18 floating point operations per second. On systems of this scale, failures will occur much more frequently than on current systems. As a result, resilience is a key obstacle to building next-generation extreme-scale systems. Coordinated checkpointing is currently the most widely-used mechanism for handling failures on HPC systems. Although coordinated checkpointing remains effective on current systems, increasing the scale of today\u27s systems to build next-generation systems will increase the cost of fault tolerance as more and more time is taken away from the application to protect against or recover from failure. Rollback avoidance techniques seek to mitigate the cost of checkpoint/restart by allowing an application to continue its execution rather than rolling back to an earlier checkpoint when failures occur. These techniques include failure prediction and preventive migration, replicated computation, fault-tolerant algorithms, and software-based memory fault correction. In this thesis, I examine how rollback avoidance techniques can be used to address failures on extreme-scale systems. Using a combination of analytic modeling and simulation, I evaluate the potential impact of rollback avoidance on these systems. I then present a novel rollback avoidance technique that exploits similarities in application memory. Finally, I examine the feasibility of using this technique to protect against memory faults in kernel memory

    Reliable distributed data stream management in mobile environments

    Get PDF
    The proliferation of sensor technology, especially in the context of embedded systems, has brought forward novel types of applications that make use of streams of continuously generated sensor data. Many applications like telemonitoring in healthcare or roadside traffic monitoring and control particularly require data stream management (DSM) to be provided in a distributed, yet reliable way. This is even more important when DSM applications are deployed in a failure-prone distributed setting including resource-limited mobile devices, for instance in applications which aim at remotely monitoring mobile patients. In this paper, we introduce a model for distributed and reliable DSM. The contribution of this paper is threefold. First, in analogy to the SQL isolation levels, we define levels of reliability and describe necessary consistency constraints for distributed DSM that specify the tolerated loss, delay, or re-ordering of data stream elements, respectively. Second, we use this model to design and analyze an algorithm for reliable distributed DSM, namely efficient coordinated operator checkpointing (ECOC). We show that ECOC provides lossless and delay-limited reliable data stream management and thus can be used in critical application domains such as healthcare, where the loss of data stream elements can not be tolerated. Third, we present detailed performance evaluations of the ECOC algorithm running on mobile, resource-limited devices. In particular, we can show that ECOC provides a high level of reliability while, at the same time, featuring good performance characteristics with moderate resource consumption
    corecore