Algorithmic Based Fault Tolerance Applied to High Performance Computing
We present a new approach to fault tolerance for High Performance Computing
systems. Our approach is based on a careful adaptation of the Algorithmic Based
Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel
distributed computation. We obtain a strongly scalable mechanism for fault
tolerance that can also detect and correct errors (bit flips) on the fly during
a computation. To assess the viability of our approach, we have developed a
fault-tolerant matrix-matrix multiplication subroutine, and we propose models to
predict its running time. Our parallel fault-tolerant matrix-matrix
multiplication achieves 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov)
and returns a correct result even when one process failure occurs. This
represents 65% of the machine's peak performance and less than 12% overhead with
respect to the fastest failure-free implementation. We predict (and have
observed) that, as the processor count increases, the overhead of the fault
tolerance drops significantly.
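The checksum idea behind ABFT can be illustrated with a minimal single-error sketch: augment the inputs with sum rows/columns, multiply, and use checksum mismatches to locate and repair one corrupted entry of the product. This is only a toy serial illustration of the encoding, not the paper's parallel implementation.

```python
import numpy as np

def encode(A, B):
    # Append a checksum row to A (column sums) and a checksum column to B
    # (row sums); the product then carries its own checksums.
    Ac = np.vstack([A, A.sum(axis=0)])
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])
    return Ac, Br

def detect_and_correct(Cf):
    # Cf is the (m+1) x (n+1) checksum-augmented product; data is Cf[:-1,:-1].
    C = Cf[:-1, :-1]
    row_err = Cf[:-1, -1] - C.sum(axis=1)   # per-row checksum mismatch
    col_err = Cf[-1, :-1] - C.sum(axis=0)   # per-column checksum mismatch
    rows = np.nonzero(~np.isclose(row_err, 0))[0]
    cols = np.nonzero(~np.isclose(col_err, 0))[0]
    if len(rows) == 1 and len(cols) == 1:
        # A single faulty entry lies at the intersection; add the discrepancy back.
        C = C.copy()
        C[rows[0], cols[0]] += row_err[rows[0]]
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((4, 3)), rng.standard_normal((3, 5))
Ac, Br = encode(A, B)
Cf = Ac @ Br
Cf[2, 1] += 7.5                 # inject a single bit-flip-style error
C = detect_and_correct(Cf)
print(np.allclose(C, A @ B))    # True
```

A single erroneous entry perturbs exactly one row checksum and one column checksum, which is what makes both detection and correction possible without recomputation.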
An Efficient Minimum-process Coordinated Checkpointing Scheme for Deterministic Mobile Distributed Computing Systems
Coordinated checkpointing is one of the commonly used approaches to provide fault tolerance in distributed computing systems (DCSs), so that the system can continue to operate even if one or more components have failed. However, mobile DCSs are constrained by low transmission bandwidth, mobility, lack of stable storage, frequent disconnections, and limited battery life. Hence, checkpointing schemes that take a reduced number of checkpoints are preferred in mobile environments. In this paper, we propose a minimum-process coordinated checkpointing scheme for mobile DCSs. We eliminate useless checkpoints as well as blocking of processes during checkpointing, at the cost of logging the anti-messages of only a few messages during checkpointing. We also make an effort to reduce the loss of checkpointing work when any process fails to store its checkpoint in an initiation; in this way, we handle multiple failures during checkpointing.
 
Study and Design of Global Snapshot Compilation Protocols for Rollback-Recovery in Mobile Distributed System
A checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to permit resumption of processing at a later time. A distributed system is a collection of independent entities that cooperate to solve a problem that cannot be solved individually. A mobile computing system is a distributed system in which some of the processes run on mobile hosts (MHs). The presence of mobile nodes in a distributed system introduces new issues that need proper handling while designing a checkpointing algorithm for such systems: mobility, disconnections, limited power sources, vulnerability to physical damage, lack of stable storage, and so on. Recently, more attention has been paid to providing checkpointing protocols for mobile systems. Minimum-process coordinated checkpointing is an attractive approach to introduce fault tolerance in mobile distributed systems transparently. This approach is domino-free, requires at most two recovery points of a process on stable storage, and forces only a minimum number of processes to take recovery points. However, it requires extra synchronization messages, blocking of the underlying computation, or taking some useless recovery points. In this paper, we carry out a literature survey of minimum-process coordinated checkpointing algorithms for mobile computing systems.
Exploiting operating system services to efficiently checkpoint parallel applications in GENESIS
Recent research efforts in parallel processing on non-dedicated clusters have focused on high execution performance, parallelism management, transparent access to resources, and making clusters easy to use. However, as a collection of independent computers used by multiple users, clusters are susceptible to failure. This paper describes the development of a coordinated checkpointing facility for the GENESIS cluster operating system. This facility was developed by exploiting existing operating system services. High performance and low overheads are achieved by allowing the processes of a parallel application to continue executing during the creation of checkpoints, while maintaining low demands on cluster resources by using coordinated checkpointing.
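Letting an application keep executing while a checkpoint is written is commonly achieved with fork-based copy-on-write snapshotting. The following POSIX-only Python sketch illustrates that general technique; it is not GENESIS's actual mechanism, and the file path is a made-up example.

```python
import os, pickle, tempfile

CKPT = os.path.join(tempfile.gettempdir(), "genesis_demo.ckpt")  # hypothetical path

def checkpoint(state, path):
    # POSIX-only sketch: fork() gives the child a copy-on-write snapshot of
    # memory, so it can serialize a consistent state to disk while the
    # parent keeps computing without pausing.
    pid = os.fork()
    if pid == 0:                          # child: persist the snapshot, exit
        with open(path, "wb") as f:
            pickle.dump(state, f)
        os._exit(0)
    return pid                            # parent: continues immediately

def restore(path):
    with open(path, "rb") as f:
        return pickle.load(f)

state = {"iteration": 10, "partial_sum": 55}
pid = checkpoint(state, CKPT)
state["iteration"] += 1                   # parent works on while the child writes
os.waitpid(pid, 0)                        # wait only so the demo is deterministic
print(restore(CKPT))                      # {'iteration': 10, 'partial_sum': 55}
```

Because the child sees memory as it was at the instant of the fork, the saved state is unaffected by the parent's subsequent update, which is exactly the property that lets checkpoint creation overlap with computation.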
Network Multicomputing Using Recoverable Distributed Shared Memory
A network multicomputer is a multiprocessor in which the processors are connected by general-purpose networking technology, in contrast to current distributed memory multiprocessors where a dedicated special-purpose interconnect is used. The advent of high-speed general-purpose networks provides the impetus for a new look at the network multiprocessor model, by removing the bottleneck of current slow networks. However, major software issues remain unsolved. It is pointed out that a convenient machine abstraction must be developed that hides from the application programmer low-level details such as message passing or machine failures. Use is made of distributed shared memory as a programming abstraction, and rollback recovery through consistent checkpointing to provide fault tolerance. Measurements of the authors' implementations of distributed shared memory and consistent checkpointing show that these abstractions can be implemented efficiently.
Performance Evaluation of Checkpoint/Restart Techniques
Distributed applications running in a large cluster environment, such as cloud
instances, achieve shorter execution times. However, an application might
suffer sudden termination due to unpredicted compute-node failures, thus losing
the whole computation. Checkpoint/restart is a fault tolerance technique used
to solve this problem. In this work we evaluated the performance of two of the
most commonly used checkpoint/restart techniques: Distributed Multithreaded
Checkpointing (DMTCP) and the Berkeley Lab Checkpoint/Restart library (BLCR)
integrated into the OpenMPI framework. We aimed to test their validity and
evaluate their performance in both local and Amazon Elastic Compute Cloud (EC2)
environments. The experiments were conducted on Amazon EC2 as a well-known
proprietary cloud computing service provider. The results obtained were
reported and compared to evaluate checkpoint and restart times, data
scalability, and compute-process scalability. The findings showed that DMTCP
performs better than BLCR in the checkpoint/restart speed, data scalability,
and compute-process scalability experiments.
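The checkpoint/restart pattern these tools automate can be sketched at the application level: periodically persist the computation's state, and on restart resume from the latest saved state instead of from the beginning. This toy Python loop illustrates the idea only; DMTCP and BLCR checkpoint transparently at the process level, and the file path here is a made-up example.

```python
import os, pickle, tempfile

CKPT = os.path.join(tempfile.gettempdir(), "cr_demo.ckpt")  # hypothetical path
if os.path.exists(CKPT):
    os.remove(CKPT)                       # start the demo from a clean slate

def run(n, interval, fail_at=None):
    # Resume from the latest checkpoint if one exists, else start fresh.
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            i, total = pickle.load(f)
    else:
        i, total = 0, 0
    while i < n:
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated node failure")
        total += i
        i += 1
        if i % interval == 0:             # periodic checkpoint to "stable storage"
            with open(CKPT, "wb") as f:
                pickle.dump((i, total), f)
    return total

try:
    run(100, interval=10, fail_at=73)     # crash partway through the run
except RuntimeError:
    pass
print(run(100, interval=10))              # resumes at i=70 and prints 4950
```

The checkpoint interval is the usual tuning knob: shorter intervals lose less work on failure but pay more checkpointing overhead, which is exactly the trade-off the evaluated experiments measure.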
A survey of checkpointing algorithms for parallel and distributed computers
A checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Checkpointing is the process of saving this status information. This paper surveys the algorithms which have been reported in the literature for checkpointing parallel/distributed systems. It has been observed that most of the algorithms published for checkpointing in message-passing systems are based on the seminal article by Chandy and Lamport. A large number of articles have been published in this area by relaxing the assumptions made in that paper and by extending it to minimise the overheads of coordination and context saving. Checkpointing algorithms for shared memory systems primarily extend cache coherence protocols to maintain a consistent memory; all of them assume that the main memory is safe for storing the context. Recently, algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems but also include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume that there is no knowledge about the programs being executed. It is felt, however, that in developing parallel programs the user already does a fair amount of work in distributing tasks, and this information can be used effectively to simplify checkpointing and rollback recovery.
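The marker-based snapshot algorithm of Chandy and Lamport, which the survey identifies as the basis of most message-passing checkpointing work, can be sketched in a toy two-process simulation: on initiating (or first receiving) a marker, a process records its state and starts logging incoming channels, so messages in transit at snapshot time are captured as channel state. This is an illustrative sketch only, with FIFO channels modeled as deques.

```python
from collections import deque

MARKER = "MARKER"
channels = {("p", "q"): deque(), ("q", "p"): deque()}   # FIFO channels
state = {"p": 100, "q": 200}                            # local "balances"
snapshot, in_flight, recording = {}, {}, {}

def send(src, dst, amount):
    state[src] -= amount
    channels[(src, dst)].append(amount)

def record(proc):
    # Chandy-Lamport rule: record local state, emit a marker on every
    # outgoing channel, and start recording every incoming channel.
    snapshot[proc] = state[proc]
    for ch in channels:
        if ch[0] == proc:
            channels[ch].append(MARKER)
        if ch[1] == proc:
            in_flight[ch], recording[ch] = [], True

def deliver(ch):
    msg, proc = channels[ch].popleft(), ch[1]
    if msg == MARKER:
        if proc not in snapshot:       # first marker seen: take local snapshot
            record(proc)
        recording[ch] = False          # channel state for ch is now complete
    else:
        state[proc] += msg
        if recording.get(ch):          # message was in flight at snapshot time
            in_flight[ch].append(msg)

send("q", "p", 7)      # 7 is still in transit when the snapshot starts
record("p")            # p initiates the global snapshot
while any(channels.values()):
    for ch in list(channels):
        if channels[ch]:
            deliver(ch)

print(snapshot)                     # {'p': 100, 'q': 193}
print(in_flight[("q", "p")])        # [7]  -> captured as channel state
print(sum(snapshot.values()) + 7)   # 300, matches the pre-snapshot total
```

The recorded process states plus the logged in-flight messages form a consistent global state: the conserved total (300) is recoverable even though no process ever observed it directly.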