
39 research outputs found

Algorithmic Based Fault Tolerance Applied to High Performance Computing

Author: Bosilca George
Delmas Remi
Dongarra Jack
Langou Julien
Publication date: 01/01/2008

We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flips) on the fly during a computation. To assess the viability of our approach, we have developed a fault-tolerant matrix-matrix multiplication subroutine, and we propose some models to predict its running time. Our parallel fault-tolerant matrix-matrix multiplication achieves 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result even when one process failure has occurred. This represents 65% of the machine peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly.
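
The checksum idea behind Algorithmic Based Fault Tolerance can be illustrated with a minimal single-process sketch. This is not the authors' parallel subroutine: it only shows the classic encoding in which a column-checksum row is appended to A and a row-checksum column to B, so that a single corrupted entry of the product can be located and corrected. The function names are hypothetical, and integer matrices are assumed so that checksums can be compared exactly (floating-point data would need a tolerance).

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append one column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]

def locate_error(cf):
    """Find a single corrupted data entry in a full-checksum product.

    Returns (row, col, delta), where delta is the amount by which the
    entry is too large, or None if all checksums agree.
    """
    n, m = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(n) if sum(cf[i][:m]) != cf[i][m]]
    bad_cols = [j for j in range(m)
                if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        return i, j, sum(cf[i][:m]) - cf[i][m]
    raise ValueError("checksum pattern does not match a single data error")

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# The product of the encoded matrices carries row and column checksums.
cf = matmul(add_column_checksum(a), add_row_checksum(b))

cf[0][1] += 16  # simulate a bit flip in one entry of the result

err = locate_error(cf)
if err is not None:
    i, j, delta = err
    cf[i][j] -= delta  # correct the entry in place

# Strip the checksum row and column to recover the plain product.
c = [row[:-1] for row in cf[:-1]]
print(c)
```

The row and column whose checksums disagree intersect at the faulty entry, and the size of the mismatch gives the correction.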

arXiv.org e-Print Archive

CiteSeerX

MIMS EPrints

The University of Manchester - Institutional Repository

Operative Merest-undertaking Impeccable Reclamation Line Accretion Ordering for Deterministic Mobile Distributed Computing Systems

Author: S. P. Singh
Ruchi Ohri
Publication venue: Auricle Global Society of Education and Research
Publication date: 31/12/2023

Impeccable-RL-accretion (Impeccable Reclamation Line accretion) is one of the commonly used approaches for providing failure resilience in a Distributed Computing setup (DCS), so that the setup can keep operating even if one or more components have failed. However, mobile DCSs are constrained by low transmission capacity, mobility, lack of stable storage, recurrent disruptions, and limited battery life. Hence, Impeccable-RL-accretion orderings that take fewer reestablishment-dots are favored in mobile environments. In this paper, we consider a merest-undertaking synchronous ordering for Impeccable-RL-accretion for mobile DCSs. We eliminate useless reestablishment-dots as well as stalling of undertakings during reestablishment-dots, at the cost of logging contra-dispatches of very few dispatches during Impeccable-RL-accretion. We also attempt to reduce the loss of Impeccable-RL-accretion work when any undertaking fails to store its reestablishment-dot in an initiation. In this way, we handle frequent failures during Impeccable-RL-accretion.

International Journal on Recent and Innovation Trends in Computing and Communication

Study and Design of Global Snapshot Compilation Protocols for Rollback-Recovery in Mobile Distributed System

Author: Hamid Khan Rais Abdul
Kumar Dr. Praveen
Publication venue: IJRE Publisher

A checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information needed to allow resumption of processing at a later time. A distributed system is a collection of independent entities that cooperate to solve a problem that cannot be solved individually. A mobile computing system is a distributed system in which some of the processes run on mobile hosts (MHs). The presence of mobile nodes in a distributed system introduces new issues that need proper handling when designing a checkpointing algorithm for such systems. These issues include mobility, disconnections, a limited power source, vulnerability to physical damage, and lack of stable storage. Recently, more attention has been paid to providing checkpointing protocols for mobile systems. Minimum-process coordinated checkpointing is an attractive approach for introducing fault tolerance into mobile distributed systems transparently. This approach is domino-free, requires at most two checkpoints of a process on stable storage, and forces only a minimum number of processes to checkpoint. However, it requires extra synchronization messages, blocking of the underlying computation, or taking some useless checkpoints. In this paper, we carry out a literature survey of some Minimum-process Coordinated Checkpointing Algorithms for Mobile Computing System.

International Journal of Research and Engineering

Exploiting operating system services to efficiently checkpoint parallel applications in GENESIS

Author: Goscinski Andrzej
Rough Justin
Publication venue: IEEE Xplore
Publication date: 01/01/2002

Recent research efforts of parallel processing on non-dedicated clusters have focused on high execution performance, parallelism management, transparent access to resources, and making clusters easy to use. However, as a collection of independent computers used by multiple users, clusters are susceptible to failure. This paper shows the development of a coordinated checkpointing facility for the GENESIS cluster operating system. This facility was developed by exploiting existing operating system services. High performance and low overheads are achieved by allowing the processes of a parallel application to continue executing during the creation of checkpoints, while maintaining low demands on cluster resources by using coordinated checkpointing.

Deakin Research Online

Network Multicomputing Using Recoverable Distributed Shared Memory

Author: Carter John B.
Cox Alan L.
Dwarkadas Sandhya
Elnozahy Elmootazbellah N.
Johnson David B.
Keleher Pete
Zwaenepoel Willy
Publication date: 20/10/2005

A network multicomputer is a multiprocessor in which the processors are connected by general-purpose networking technology, in contrast to current distributed memory multiprocessors where a dedicated special-purpose interconnect is used. The advent of high-speed general-purpose networks provides the impetus for a new look at the network multiprocessor model, by removing the bottleneck of current slow networks. However, major software issues remain unsolved. It is pointed out that a convenient machine abstraction must be developed that hides from the application programmer low-level details such as message passing or machine failures. Use is made of distributed shared memory as a programming abstraction, and rollback recovery through consistent checkpointing to provide fault tolerance. Measurements of the authors' implementations of distributed shared memory and consistent checkpointing show that these abstractions can be implemented efficiently.

Infoscience - École polytechnique fédérale de Lausanne

A Low-Overhead Non-Block Check Pointing and Recovery Approach for Mobile Computing Environment

Author: Bidyut Gupta
Sindoora Koneru
Ziping Liu
Publication venue: IntechOpen
Publication date: 30/03/2012

IntechOpen

Performance Evaluation of Checkpoint/Restart Techniques

Author: Azeem Basma Abdel
Helal Manal
Publication date: 29/11/2023

Distributed applications running on a large cluster environment, such as cloud instances, will have shorter execution times. However, an application might suffer sudden termination due to unpredicted computing node failures, thus losing the whole computation. Checkpoint/restart is a fault tolerance technique used to solve this problem. In this work we evaluated the performance of two of the most commonly used checkpoint/restart techniques: Distributed Multithreaded Checkpointing (DMTCP) and the Berkeley Lab Checkpoint/Restart library (BLCR) integrated into the OpenMPI framework. We aimed to test their validity and evaluate their performance in both local and Amazon Elastic Compute Cloud (EC2) environments. The experiments were conducted on Amazon EC2 as a well-known proprietary cloud computing service provider. Results obtained were reported and compared to evaluate checkpoint and restart time values, data scalability, and compute process scalability. The findings showed that DMTCP performs better than BLCR in the checkpoint and restart speed, data scalability, and compute process scalability experiments.

arXiv.org e-Print Archive

A survey of checkpointing algorithms for parallel and distributed computers

Author: Kalaiselvi S.
Rajaraman V.
Publication venue: Springer Science and Business Media LLC
Publication date: 01/10/2000

A checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Checkpointing is the process of saving the status information. This paper surveys the algorithms which have been reported in the literature for checkpointing parallel/distributed systems. It has been observed that most of the algorithms published for checkpointing in message passing systems are based on the seminal article by Chandy and Lamport. A large number of articles have been published in this area by relaxing the assumptions made in that article and by extending it to minimise the overheads of coordination and context saving. Checkpointing for shared memory systems primarily extends cache coherence protocols to maintain a consistent memory. All of them assume that the main memory is safe for storing the context. Recently, algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems. They, however, also include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume that there is no knowledge about the programs being executed. It is however felt that in the development of parallel programs the user has to do a fair amount of work in distributing tasks, and this information can be effectively used to simplify checkpointing and rollback recovery.
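
The definition of a checkpoint given above can be made concrete with a small single-process sketch. It is not one of the surveyed algorithms and involves no coordination between processes; it only shows state being saved at designated points and a later run resuming from the last saved state. The names `save_checkpoint`, `load_checkpoint`, and `run`, the file name, and the `fail_at` parameter used to simulate a crash are all hypothetical, and a local temporary file stands in for stable storage.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    """Write state to a temporary file, then atomically replace the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # the previous checkpoint stays valid until this succeeds

def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None

def run(path, n_steps, every, fail_at=None):
    """Sum 0..n_steps-1, checkpointing after every `every` completed steps."""
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]

with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "state.ckpt")
    try:
        run(ckpt, 10, every=3, fail_at=7)  # first attempt crashes at step 7
    except RuntimeError:
        pass
    resumed_from = load_checkpoint(ckpt)["step"]  # last step saved before the crash
    result = run(ckpt, 10, every=3)               # restart resumes from the checkpoint

print(resumed_from, result)
```

Work done after the last checkpoint and before the failure is recomputed on restart, which is the rollback cost that checkpoint placement trades against the overhead of saving state.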