
5 research outputs found

Checkpointing of parallel applications in a Grid environment

Author: Sajadah K.
Publication date: 01/01/2011

The Grid environment is generic, heterogeneous, and dynamic, with many unreliable resources, which leaves it highly exposed to failures. The environment is unreliable because it is geographically dispersed across multiple autonomous administrative domains and is composed of a large number of components. Examples of failures in the Grid environment include application crashes, Grid node crashes, network failures, and Grid system component failures. These failures can affect the execution of parallel and distributed applications in the Grid environment, so protection against them is crucial. It is therefore essential to develop efficient fault-tolerant mechanisms that allow users to execute Grid applications successfully. One of the research challenges in Grid computing is to develop a fault-tolerant solution that ensures Grid applications are executed reliably with minimal overhead. While checkpointing is the most common method of achieving fault tolerance, much work remains to improve the efficiency of the mechanism. This thesis provides an in-depth description of a novel solution for checkpointing parallel applications executed on a Grid. The checkpointing mechanism implemented allows an application to be checkpointed at regions where no interprocess communication is involved, thereby reducing the checkpointing overhead and checkpoint size.

WestminsterResearch

Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems

Author: Wang Yi-Min

Checkpointing and rollback recovery are techniques that can provide efficient recovery from transient process failures. In a message-passing system, the rollback of a message sender may cause the rollback of the corresponding receiver, and the system needs to roll back to a consistent set of checkpoints called the recovery line. If the processes are allowed to take uncoordinated checkpoints, this rollback propagation may result in the domino effect, which prevents recovery line progression. Traditionally, only obsolete checkpoints before the global recovery line can be discarded, and the necessary and sufficient condition for identifying all garbage checkpoints has remained an open problem. A necessary and sufficient condition for achieving optimal garbage collection is derived, and it is proved that the number of useful checkpoints is bounded by N(N+1)/2, where N is the number of processes. The approach is based on the maximum-sized antichain model of consistent global checkpoints and the technique of recovery line transformation and decomposition. It is also shown that, for systems requiring message logging to record in-transit messages, the same approach can be used to achieve optimal message log reclamation. As a final topic, a unifying framework is described by considering checkpoint coordination and exploiting piecewise determinism as mechanisms for bounding rollback propagation, and the applicability of the optimal garbage collection algorithm to domino-free recovery protocols is demonstrated.
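The rollback propagation described in this abstract can be illustrated with a minimal sketch. It is not the garbage collection algorithm from the report; it only shows the textbook fixed-point computation of a recovery line and the domino effect. The `Message` type, the interval numbering (interval i follows checkpoint i), and the function names are assumptions made for this illustration.

```python
from typing import List, NamedTuple


class Message(NamedTuple):
    """Hypothetical message record used only for this sketch."""
    sender: int         # index of the sending process
    send_interval: int  # sent after the sender's checkpoint with this index
    receiver: int       # index of the receiving process
    recv_interval: int  # received after the receiver's checkpoint with this index


def recovery_line(latest: List[int], messages: List[Message]) -> List[int]:
    """Roll processes back until no checkpoint records an orphan message.

    latest[p] is the index of the most recent checkpoint of process p.
    Checkpoint c of a process records every event in intervals 0..c-1.
    """
    line = list(latest)
    changed = True
    while changed:
        changed = False
        for m in messages:
            send_recorded = m.send_interval < line[m.sender]
            recv_recorded = m.recv_interval < line[m.receiver]
            # Orphan: the receipt is recorded but the send is not, so the
            # receiver must roll back to the checkpoint before the receipt.
            if recv_recorded and not send_recorded:
                line[m.receiver] = m.recv_interval
                changed = True
    return line


def max_useful_checkpoints(n: int) -> int:
    """Bound N(N+1)/2 on useful checkpoints quoted in the abstract."""
    return n * (n + 1) // 2


# Two processes with uncoordinated checkpoints and zigzagging messages:
# every rollback orphans another message, so both processes end up at
# their initial checkpoints (the domino effect).
domino = [
    Message(0, 2, 1, 1),
    Message(1, 1, 0, 1),
    Message(0, 1, 1, 0),
    Message(1, 0, 0, 0),
]
print(recovery_line([2, 2], domino))
print(max_useful_checkpoints(4))
```

With only the first message present, the same function rolls back just the receiver by one checkpoint, which shows that the cascade comes from the message pattern rather than from checkpointing itself.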

NASA Technical Reports Server

A method for the recovery of data after a computer system failure: the development of Constant Random Access Memory (CRAM) Recovery System

Author: Brevett Renford Adolphus Benito
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/1992

An experimental design study was done to investigate three research questions: (1) Can a software system be developed that will provide recovery from a system failure? (2) What problems exist in achieving a software-only recovery system? (3) What is the degradation in application program performance when utilizing a software recovery system? Software was developed for recovering data that was in memory before a catastrophic failure. It allows for memory retrieval after unfortunate incidents such as keyboard lock-up, software failures, and power outages. The software, named CRAM (Constant Random Access Memory), operates by using undocumented DOS functions, memory management tools, disk management, context switching, and the timely backup of the system to the hard disk. The main task of CRAM is to operate in the background of the computer, transferring the computer system's memory to disk at specified intervals of time, with limited interruption of the foreground process. Most of the coding was done in the high-level language C. Some code was written in assembly language to access low-level interfaces to DOS that were either not available in C or provided better data access speed. The most interesting and challenging part of the project was context switching during restoration of the system's memory. Restoration was accomplished through the ingenious use of the information stored in each program's PSP and data in the DOS Swappable Data Area. Saving and restoring of data was accomplished by utilizing DOS hardware and software interrupts and replacing some of these routines with new code that performs operations specific to CRAM while also allowing other programs to access the original routines. The major interrupts used by CRAM are the keyboard interrupt (9h), the clock interrupt (1Ch), and the DOS idle interrupt (28h). The software was tested and analyzed for conflict by executing nine commercial programs. It was noted that the system was restored about 75% of the time and that 33% were full restorations. Another analysis was done for speed degradation. A degradation of 3.3% for the sieve numerical calculation, 1.3% for random number generation, 5.2% for the disk I/O write operation, and 10.6% for the video display operation was observed. A change of less than 1% was noted for most of the other operations, except for times when CRAM's presence may delay the clock interrupt by 0.05 seconds.
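The idea of copying state to disk at fixed intervals while a foreground task keeps running can be sketched in a few lines. This is only an analogy in Python using a background thread; it is not CRAM's DOS interrupt-based mechanism, and the function names, the pickle file format, and the `get_state` callback are assumptions made for this illustration.

```python
import os
import pickle
import tempfile
import threading


def save_snapshot(state, path):
    """Write state to path atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # os.replace swaps the file in one step, so a crash during the write
    # leaves the previous snapshot intact.
    os.replace(tmp_path, path)


def load_snapshot(path):
    """Read back the most recent snapshot after a failure."""
    with open(path, "rb") as f:
        return pickle.load(f)


def start_periodic_snapshots(get_state, path, interval_s):
    """Save get_state() every interval_s seconds on a background thread.

    Returns (stop_event, thread); call stop_event.set() to end the loop.
    """
    stop_event = threading.Event()

    def worker():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop_event.wait(interval_s):
            save_snapshot(get_state(), path)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return stop_event, thread
```

A caller would pass a function that returns a copy of the data to protect, for example `start_periodic_snapshots(lambda: dict(app_state), "state.bin", 30.0)`, where `app_state` and the file name are placeholders. A shorter interval loses less work after a failure but costs more foreground time, which is the trade-off behind the degradation figures reported in the abstract.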

Digital Repository @ Iowa State University (ISU)

Transparent recovery in distributed systems

Author: David F Bacon
Publication date: 01/01/1990

We are investigating transparent optimistic solutions to problems in distributed systems such as recovery [6], replication [3], parallelization [2], and concurrent competing alternatives [4]. By a transparent solution to such a problem we mean that a program is transformed automatically, and tha

CiteSeerX

Crossref

Transparent recovery in distributed systems (position paper)

Author: BACON D. F.
David F. Bacon
GOLDBERG A. P.
SMITH J. M.
STROM R. E.
YEMINI S. A.
Publication venue: Association for Computing Machinery (ACM)

Crossref