373 research outputs found

FADI: a fault-tolerant environment for open distributed computing

Author: A. Bargiela
T. Osman
Publication venue: 'Institution of Engineering and Technology (IET)'
Publication date: 01/01/2000

FADI is a complete programming environment that supports the reliable execution of distributed application programs. FADI encompasses all aspects of modern fault-tolerant distributed computing. The built-in user-transparent error detection mechanism covers processor node crashes and hardware transient failures. The mechanism also integrates user-assisted error checks into the system failure model. The nucleus non-blocking checkpointing mechanism, combined with a novel selective message logging technique, delivers an efficient, low-overhead backup and recovery mechanism for distributed processes. FADI also provides means for remote automatic process allocation on the distributed system nodes.

Crossref

Nottingham Trent Institutional Repository (IRep)

Scalable group-based checkpoint/restart for large-scale message-passing systems

Author: Ho JCY
Lau FCM
Wang CL
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2008

The ever-increasing number of processors used in parallel computers is making fault tolerance support in large-scale parallel systems more and more important. We discuss the inadequacies of existing system-level checkpointing solutions for message-passing applications as the system scales up. We analyze the coordination cost and blocking behavior of two current MPI implementations with checkpointing support. A group-based solution combining coordinated checkpointing and message logging is then proposed. Experimental results demonstrate better performance and scalability than LAM/MPI and MPICH-VCL. To assist group formation, a method to analyze the communication behavior of the application is proposed. ©2008 IEEE.

HKU Scholars Hub

Algorithmic Based Fault Tolerance Applied to High Performance Computing

Author: Bosilca George
Delmas Remi
Dongarra Jack
Langou Julien
Publication date: 01/01/2008

We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flips) on the fly during a computation. To assess the viability of our approach, we have developed a fault-tolerant matrix-matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix-matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result even though one process failure has occurred. This represents 65% of the machine peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly.
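The checksum idea behind Algorithmic Based Fault Tolerance can be illustrated with a minimal sketch. It assumes small integer matrices held as lists of rows, and it covers only the classic single-error detect-and-correct step, not the parallel adaptation or the process-failure recovery described in the abstract. All function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append to each row of b one extra entry holding the row sum."""
    return [row + [sum(row)] for row in b]


def locate_and_correct(c_full):
    """Repair a single corrupted data entry of a full-checksum matrix.

    c_full has a checksum row (last row) and a checksum column (last column).
    Returns the (row, col) that was corrected, or None if all checksums agree.
    Assumption: at most one data entry is wrong and the checksums are intact.
    """
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("not a single data error; cannot correct")
    i, j = bad_rows[0], bad_cols[0]
    # Rebuild the entry from the row checksum minus the other row entries.
    c_full[i][j] = c_full[i][m] - sum(c_full[i][k] for k in range(m) if k != j)
    return (i, j)
```

Multiplying the column-checksum form of A by the row-checksum form of B yields a product that carries both checksums, so an inconsistent row and an inconsistent column together pinpoint the corrupted entry. With floating-point data the equality tests would need a tolerance.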

arXiv.org e-Print Archive

CiteSeerX

MIMS EPrints

The University of Manchester - Institutional Repository

A Survey of Checkpointing Algorithms in Mobile Ad Hoc Network

Author: Dr. Ajay Khunteta
Publication venue: Global Journals Inc. (US)
Publication date: 02/06/2012

A checkpoint is a fault-tolerance technique: a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. If there is a failure, computation may be restarted from the current checkpoint instead of repeating the computation from the beginning. Checkpoint-based rollback recovery is one of the most widely used techniques in areas such as scientific computing, databases, telecommunications, and critical applications in distributed and mobile ad hoc networks. A mobile ad hoc network consists of a set of self-configuring mobile hosts capable of communicating with each other without the assistance of base stations. The main problems of this environment are insufficient power and limited storage capacity, so checkpointing is a major challenge in mobile ad hoc networks. This paper reviews the algorithms that have been reported for checkpointing in mobile ad hoc networks.
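The checkpoint-and-restart idea defined above can be sketched for a single process as follows. This is an illustrative example only, not one of the surveyed mobile ad hoc algorithms. The function names, the `interval` parameter, and the `fail_at` crash simulation are assumptions made for the sketch.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_sum(n, path, interval=10, fail_at=None):
    """Sum 0..n-1, saving a checkpoint every `interval` steps.

    fail_at (hypothetical test hook) simulates a crash at that step.
    On restart the loop resumes from the last saved checkpoint.
    """
    state = load_checkpoint(path) or {"i": 0, "total": 0}
    while state["i"] < n:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run_sum` again with the same `path` after a failure repeats only the steps taken since the last checkpoint. A shorter `interval` reduces the lost work but writes to storage more often, which is the trade-off the abstract raises for power- and storage-constrained hosts.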

Global Journal of Computer Science and Technology (GJCST)

A Low-Overhead Non-Block Check Pointing and Recovery Approach for Mobile Computing Environment

Author: Bidyut Gupta
Sindoora Koneru
Ziping Liu
Publication venue: 'IntechOpen'
Publication date: 30/03/2012

IntechOpen

Automated Application-level Checkpointing of MPI Programs

Author: Bronevetsky Greg
Marques Daniel
Pingali Keshav
Stodghill Paul
Publication venue: 'SAGE Publications'
Publication date: 12/02/2003

Because of increasing hardware and software complexity, the running time of many computational science applications is now more than the mean-time-to-failure of high-performance computing platforms. Therefore, computational science applications need to tolerate hardware failures. In this paper, we focus on the stopping failure model in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in the literature are not suitable for implementing this approach. In this paper, we present a suitable protocol, and show how it can be used with a precompiler that instruments C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation. We present experimental results that argue that the overhead of using our system can be small.

VU Research Portal

Crossref

eCommons@Cornell

The cost of recovery protocols in web-based database systems

Author: Casey John
Hammadi Adil M.
Zhou Wanlei
Publication venue: IEEE Computer Society
Publication date: 01/01/2004

The cost of recovery protocols is important with respect to system performance during normal operation and failure, in terms of overhead and the time taken to recover failed transactions. The cost of recovery protocols for web database systems has not been addressed much. In this paper, we present a quantitative study of the cost of recovery protocols. For this purpose, we use an experimental setup to evaluate the performance of two recovery algorithms, namely the two-phase commit algorithm and the log-based algorithm. Our work is a step towards building reliable protocols for web database systems.
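For readers unfamiliar with the two-phase commit algorithm named above, the following is a minimal in-memory sketch of its voting and decision phases. It is not the experimental setup of the paper. Timeouts, logging, and network failures are omitted, and the `Participant` class and its `can_commit` flag are hypothetical.

```python
class Participant:
    """A resource manager that votes on, then applies, a global decision."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # hypothetical flag: how this node votes
        self.state = "init"

    def prepare(self):
        """Phase 1: vote yes (and become 'prepared') or vote no."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator: commit only if every participant votes yes."""
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "abort"
    # Phase 2: broadcast the single global decision to everyone.
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision
```

Each transaction costs two message rounds between the coordinator and every participant, which is the kind of normal-operation overhead the abstract sets out to measure.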

Deakin Research Online