50 research outputs found

    Study and Design of Global Snapshot Compilation Protocols for Rollback-Recovery in Mobile Distributed System

    A checkpoint is a designated place in a program at which normal processing is interrupted specifically to preserve the status information needed to allow resumption of processing at a later time. A distributed system is a collection of independent entities that cooperate to solve a problem that cannot be solved individually. A mobile computing system is a distributed system in which some of the processes run on mobile hosts (MHs). The presence of mobile nodes in a distributed system introduces new issues that need proper handling when designing a checkpointing algorithm for such systems. These issues include mobility, disconnections, a finite power source, vulnerability to physical damage, and lack of stable storage. Recently, more attention has been paid to providing checkpointing protocols for mobile systems. Minimum-process coordinated checkpointing is an attractive approach for introducing fault tolerance in mobile distributed systems transparently. This approach is domino-free, requires at most two checkpoints of a process on stable storage, and forces only a minimum number of processes to checkpoint. However, it requires extra synchronization messages, blocking of the underlying computation, or taking some useless checkpoints. In this paper, we carry out a literature survey of some minimum-process coordinated checkpointing algorithms for mobile computing systems.
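
The phrase "forces only a minimum number of processes to checkpoint" can be illustrated with a minimal sketch. It assumes one common formulation of the minimum set: the initiator plus every process it depends on, directly or transitively, through messages received since the last checkpoint. The function name `minimum_checkpoint_set` and the `received_from` mapping are hypothetical and are not taken from the surveyed algorithms.

```python
from collections import deque
from typing import Dict, Hashable, Set


def minimum_checkpoint_set(
    initiator: Hashable,
    received_from: Dict[Hashable, Set[Hashable]],
) -> Set[Hashable]:
    """Return the processes that must checkpoint together with the initiator.

    received_from[p] is assumed to hold the processes from which p has
    received at least one message since p's last checkpoint.
    """
    must_checkpoint = {initiator}
    queue = deque([initiator])
    while queue:
        process = queue.popleft()
        # Every sender that this process depends on must also checkpoint,
        # otherwise the received message would become an orphan on rollback.
        for sender in received_from.get(process, ()):
            if sender not in must_checkpoint:
                must_checkpoint.add(sender)
                queue.append(sender)
    return must_checkpoint


# Hypothetical example: P1 received from P2, P2 from P3, P4 from P1.
example_dependencies = {"P1": {"P2"}, "P2": {"P3"}, "P4": {"P1"}}
example_result = minimum_checkpoint_set("P1", example_dependencies)
```

In this example P4 is left out of the set: it received a message from P1, but P1 does not depend on anything P4 sent, so P4 is not forced to checkpoint.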

    Analysis of check pointing protocols for mobile distributed systems

    Mobile Distributed Systems (MDS) are susceptible to faults. It is not easy to predict whether the system will continue to operate throughout, or until an agreed time. Checkpointing-based fault tolerance enables a system to continue operating properly in the event of a failure. A checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information needed to allow resumption of processing at a later time in case of a failure. Checkpointing algorithms for mobile distributed systems face new issues such as mobility, low bandwidth of wireless channels, disconnections, limited battery power, and lack of reliable stable storage on mobile nodes. This paper gives a summary of checkpointing strategies for mobile networks, categorized on the basis of the QoS of wireless networks, the use of mobile agents, the mobility of MHs, and the transmission of checkpoints.

    An Efficient Synchronous Checkpointing Protocol for Mobile Distributed Systems

    Recent years have witnessed the rapid development of mobile communications, which have become part of everyday life for most people. To add fault tolerance transparently to mobile distributed systems, minimum-process coordinated checkpointing is preferable, but it may require blocking of processes, extra synchronization messages, or taking some useless checkpoints. All-process checkpointing may lead to exceedingly high checkpointing overhead. In order to balance the checkpointing overhead and the loss of computation on recovery, we propose a hybrid checkpointing algorithm, wherein an all-process coordinated checkpoint is taken after the minimum-process coordinated checkpointing algorithm has been executed a fixed number of times. In the minimum-process coordinated checkpointing algorithm, an effort has been made to optimize the number of useless checkpoints and the blocking of processes by using a probabilistic approach and by computing an interacting set of processes at the beginning. We try to reduce the loss of checkpointing effort when any process fails to take its checkpoint in coordination with others. We reduce the size of the checkpoint sequence number piggybacked on each computation message
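
The alternation described in this abstract, a fixed number of minimum-process rounds followed by one all-process round, can be sketched as a simple schedule. This is only an illustration of the ordering; the names `hybrid_schedule` and `first_kinds` and the labels "min-process" and "all-process" are assumptions, and the sketch says nothing about how either kind of checkpoint is actually coordinated.

```python
from itertools import islice
from typing import Iterator, List


def hybrid_schedule(min_rounds_per_cycle: int) -> Iterator[str]:
    """Yield the kind of checkpoint to take on each successive initiation."""
    if min_rounds_per_cycle < 0:
        raise ValueError("min_rounds_per_cycle must be non-negative")
    while True:
        # A fixed number of minimum-process rounds...
        for _ in range(min_rounds_per_cycle):
            yield "min-process"
        # ...followed by a single all-process round.
        yield "all-process"


def first_kinds(min_rounds_per_cycle: int, count: int) -> List[str]:
    """Return the first `count` entries of the schedule as a list."""
    return list(islice(hybrid_schedule(min_rounds_per_cycle), count))
```

For instance, `first_kinds(3, 8)` gives three "min-process" entries, one "all-process" entry, and then the same pattern again.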

    Resource management for extreme scale high performance computing systems in the presence of failures

    2018 Summer. Includes bibliographical references. High performance computing (HPC) systems, such as data centers and supercomputers, coordinate the execution of large-scale computation of applications over tens or hundreds of thousands of multicore processors. Unfortunately, as the size of HPC systems continues to grow towards exascale complexities, these systems experience an exponential growth in the number of failures occurring in the system. These failures reduce performance and increase energy use, reducing the efficiency and effectiveness of emerging extreme-scale HPC systems. Applications executing in parallel on individual multicore processors also suffer from decreased performance and increased energy use as a result of being forced to share resources; in particular, contention from multiple application threads sharing the last-level cache causes performance degradation. These challenges make it increasingly important to characterize and optimize the performance and behavior of applications that execute in these systems. To address these challenges, in this dissertation we propose a framework for intelligently characterizing and managing extreme-scale HPC system resources. We devise various techniques to mitigate the negative effects of failures and resource contention in HPC systems. In particular, we develop new HPC resource management techniques for intelligently utilizing system resources through (a) the optimal scheduling of applications to HPC nodes and (b) the optimal configuration of fault resilience protocols. These resource management techniques employ information obtained from historical analysis as well as theoretical and machine learning methods for predictions. We use these data to characterize system performance, energy use, and application behavior when operating under the uncertainty of performance degradation from both system failures and resource contention.
We investigate how to better characterize and model the negative effects from system failures as well as application co-location on large-scale HPC computing systems. Our analysis of application and system behavior also investigates: the interrelated effects of network usage of applications and fault resilience protocols; checkpoint interval selection and its sensitivity to system parameters for various checkpoint-based fault resilience protocols; and performance comparisons of various promising strategies for fault resilience in exascale-sized systems
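
To make "checkpoint interval selection and its sensitivity to system parameters" concrete, here is a minimal sketch using Young's classical first-order approximation, in which the interval is the square root of twice the checkpoint cost times the mean time between failures. The abstract does not say which models the dissertation uses, so this formula is a stand-in chosen for illustration; it is usually considered reasonable only when the checkpoint cost is much smaller than the mean time between failures. The function name and parameter names are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate checkpoint interval in seconds: sqrt(2 * C * MTBF)."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Hypothetical numbers: a 50 s checkpoint on a system with a 10,000 s MTBF.
baseline = young_interval(50.0, 10_000.0)
# Sensitivity: quadrupling the MTBF only doubles the suggested interval.
more_reliable = young_interval(50.0, 40_000.0)
```

Because of the square root, the suggested interval reacts slowly to changes in either parameter, which is one simple way to see why sensitivity to system parameters is worth studying.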

    Towards Scalable Parallel Fibonacci Heap Implementation

    With the advancement of multiprocessor hardware, sequential algorithms are being investigated and gradually replaced by their concurrent equivalents to exploit parallel architectures effectively. Parallel algorithms improve performance by dividing the task into a number of processes (or threads) that can be scheduled and executed simultaneously on independent processing units. Efficient parallel counterparts of various well-known basic algorithms and data structures have been explored and published as popular libraries. However, advanced data structures and algorithms have not seen similar investigation, mainly because they involve many optimization steps backed by a large amount of state, and finding a safe and efficient parallel implementation is not an easy endeavor. Safety is of utmost importance for shared-memory parallel implementations, as it provides the basis for the consistency of any data structure or algorithm. Well-known tools such as locks, semaphores, and atomic operations assist in building safe parallel implementations, but using them effectively and with well-defined synchronization is a key factor in the overall performance of any data structure or algorithm. This paper explores an advanced data structure, the Fibonacci Heap, and its operations, and evaluates its implementation using two different synchronization mechanisms: coarse-grained and fine-grained. The analysis in this paper shows that a fine-grained synchronized Fibonacci Heap implementation with certain relaxed semantics scales better with growing concurrency than the coarse-grained synchronized Fibonacci Heap implementation.
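
As a rough picture of what coarse-grained synchronization means here, the sketch below guards every operation of a priority queue with a single lock. It deliberately swaps in Python's binary heap (`heapq`) for the Fibonacci Heap studied in the paper, so it shows only the locking style, not the paper's data structure or its fine-grained variant. The class and function names are hypothetical.

```python
import heapq
import threading
from typing import List, Optional


class CoarseLockedHeap:
    """Min-heap in which one lock serializes every operation."""

    def __init__(self) -> None:
        self._items: List[int] = []
        self._lock = threading.Lock()

    def insert(self, key: int) -> None:
        with self._lock:
            heapq.heappush(self._items, key)

    def extract_min(self) -> Optional[int]:
        with self._lock:
            return heapq.heappop(self._items) if self._items else None


def demo(n_threads: int = 4, per_thread: int = 100) -> List[int]:
    """Insert disjoint key ranges from several threads, then drain the heap."""
    heap = CoarseLockedHeap()

    def worker(start: int) -> None:
        for key in range(start, start + per_thread):
            heap.insert(key)

    threads = [
        threading.Thread(target=worker, args=(i * per_thread,))
        for i in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    drained = []
    while (key := heap.extract_min()) is not None:
        drained.append(key)
    return drained
```

The single lock keeps the structure consistent but allows only one thread inside the heap at a time, which is the scalability limit that a fine-grained design tries to relax.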

    Locality-driven checkpoint and recovery

    Checkpoint and recovery are important fault-tolerance techniques for distributed systems. The two categories of existing strategies incur unacceptable performance cost either at run time or upon failure recovery, when applied to large-scale distributed systems. In particular, the large number of messages and processes in these systems causes either considerable checkpointing and logging overhead, or a catastrophic system-wide recovery effect. This thesis proposes a locality-driven strategy for efficiently checkpointing and recovering such systems with both affordable runtime cost and controllable failure recoverability. Messages establish dependencies between distributed processes, which can be either preserved by coordinated checkpoints or removed via logging. Existing strategies enforce a uniform handling policy for all message dependencies, and hence gain an advantage at one end but bear a disadvantage at the other. In this thesis, a generic theory of Quasi-Atomic Recovery has been formulated to accommodate message handling requirements of both kinds, and to allow using different message handling methods together. Quasi-atomicity of recovery blocks implies proper confinement of recoveries, and thus enables localization of checkpointing and recovery around such a block and consequently a hybrid strategy with combined advantages from both ends. A strategy of group checkpointing with selective logging has been proposed, based on the observation of message localization around 'locality regions' in distributed systems. In essence, a group-wise coordinated checkpoint is created around such a region and only the few inter-region messages are logged subsequently. Runtime overhead is optimized due to largely reduced logging efforts, and recovery spread is as localized as region-wise. Various protocols have been developed to provide trade-offs between flexibility and performance.
Also proposed is the idea of process clone that can be used to effectively remove program-order recovery dependencies among successive group checkpoints and thus to stop inter-group recovery spread. Distributed executions exhibit locality of message interactions. Such locality originates from resolving distributed dependency localization via message passing, and appears as a hierarchical 'region-transition' pattern. A bottom-up approach has been proposed to identify those regions, by detecting popular recurrence patterns from individual processes as 'locality intervals', and then composing them into 'locality regions' based on their tight message coupling relations with each other. Experiments conducted on real-life applications have shown the existence of hierarchical locality regions and have justified the feasibility of this approach. Performance optimization of group checkpoint strategies has to do with their uses of locality. An abstract performance measure has been proposed to properly integrate both runtime overhead and failure recoverability in a region-wise manner. Taking this measure as the optimization objective, a greedy heuristic has been introduced to decompose a given distributed execution into optimized regions. Analysis implies that an execution pattern with good locality leads to good optimized performance, and the locality pattern itself can serve as a good candidate for the optimal decomposition. Consequently, checkpoint protocols have been developed to efficiently identify optimized regions in such an execution, with assistance of either design-time or runtime knowledge.
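
The selective-logging idea in this abstract, where only inter-region messages are logged, can be sketched as a filter over a message trace. The representation below (messages as sender and receiver pairs, and a `region_of` mapping from process to region) is an assumption made for illustration and is not the thesis's protocol; identifying the regions themselves is outside this sketch.

```python
from typing import Dict, List, Tuple

Message = Tuple[str, str]  # (sender process, receiver process)


def messages_to_log(
    messages: List[Message],
    region_of: Dict[str, str],
) -> List[Message]:
    """Keep only messages whose sender and receiver are in different regions."""
    return [
        (sender, receiver)
        for sender, receiver in messages
        if region_of[sender] != region_of[receiver]
    ]


# Hypothetical trace: P1 and P2 share region A, P3 and P4 share region B.
example_regions = {"P1": "A", "P2": "A", "P3": "B", "P4": "B"}
example_trace = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P4", "P1")]
example_logged = messages_to_log(example_trace, example_regions)
```

In this trace, two of the four messages stay inside a region and would be covered by that region's group checkpoint, so only the two crossing messages are logged.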

    Fault Tolerance for High-Performance Applications Using Structured Parallelism Models

    In recent years parallel computing has increasingly exploited the high-level models of structured parallel programming, an example of which are algorithmic skeletons. This trend has been motivated by the properties of structured parallelism models, which can be used to derive several (static and dynamic) optimizations at various implementation levels. In this thesis we study the properties of structured parallel models that are useful for providing fault tolerance support oriented towards High-Performance applications. This issue has traditionally been addressed in two ways: (i) in the context of unstructured parallelism models (e.g. MPI), whose computation model is essentially based on a distributed set of processes communicating through message-passing, with an approach based on checkpointing and rollback recovery or software replication; (ii) in the context of high-level models, based on a specific parallelism model (e.g. data-flow) and/or an implementation model (e.g. master-slave), by introducing specific techniques based on the properties of the programming and computation models themselves. In this thesis we take a step towards a more abstract viewpoint and we highlight the properties of structured parallel models that are of interest for fault tolerance purposes. We consider two classes of parallel programs (namely task parallel and data parallel) and we introduce a fault tolerance support based on checkpointing and rollback recovery. The support is derived according to the high-level properties of the parallel models: we call this derivation specialization of fault tolerance techniques, highlighting the difference with classical solutions supporting structure-unaware computations. As a consequence of this specialization, the introduced fault tolerance techniques can be configured and optimized to meet specific needs at different implementation levels.
That is, the supports we present do not target a single computing platform or a specific class of them. Indeed the specializations are the mechanism to target specific issues of the exploited environment and of the implemented applications, as proper choices of the protocols and their configurations

    Efficient Passive Clustering and Gateways selection MANETs

    Passive clustering does not employ control packets to collect topological information in ad hoc networks. In our proposal, we avoid making frequent changes in cluster architecture due to repeated election and re-election of cluster heads and gateways. Our primary objective has been to make Passive Clustering more practical by employing an optimal number of gateways and reducing the number of rebroadcast packets.