Checkpointing and message logging are popular, general-purpose tools for providing fault tolerance in distributed systems. Diskless checkpointing schemes enable frequent checkpointing without a performance penalty. The present work extends James S. Plank's diskless checkpointing scheme (N+1 Parity) by introducing a 'Timeout' mechanism for checkpointing programs with high locality of reference. This mechanism enables such applications to take checkpoints periodically. A limitation of the N+1 Parity scheme is that all processes freeze their computation while taking synchronous checkpoints. The proposed scheme solves this problem by introducing a new message logging technique, partial message logging, which allows asynchronous checkpointing at both sender and receiver. Correctness of the scheme is established through a set of proofs. The paper also evaluates the performance of the proposed scheme on a distributed simulator testbed. The results indicate that the proposed scheme outperforms the N+1 Parity scheme.
INTRODUCTION
Distributed systems are gaining importance with the advances in network technology. They provide opportunities for developing high-performance parallel and distributed applications. The vast computing potential of these systems is often hampered by their susceptibility to failures. In case of a failure, a distributed application has to be restarted, resulting in the loss of several hours or days of computation. Therefore, providing fault tolerance is an important issue in distributed computing. One effective way to recover from distributed system failures is to use checkpointing and rollback recovery [5, 6]. Checkpointing refers to periodically saving the address space and state of processes to stable storage. On detection of a failure, each process rolls back to its latest checkpoint and resumes execution from that point [1]. Checkpointing schemes are classified into two categories based on the "storage medium" used for storing the checkpoints: i) disk-based checkpointing and ii) diskless checkpointing.
In a disk-based checkpointing scheme, each checkpoint is saved to stable storage that is implemented on disk. Though disk-based checkpointing is the most widely used approach, applications that need frequent checkpointing incur many disk accesses, which results in a performance bottleneck. In diskless checkpointing, the stable storage is replaced with main memory and processor redundancy, i.e. the checkpoints are taken in main memory. By eliminating stable storage, diskless checkpointing removes the main source of overhead in checkpointing [3, 8, 10]. The failure coverage of diskless checkpointing is less than that of disk-based checkpointing. Moreover, diskless checkpointing incurs memory, processor and network overheads that are absent in disk-based schemes.
DESIGN OF PROPOSED SCHEME
In this paper, we present an Efficient Diskless Checkpointing and Log-based Recovery Scheme and demonstrate that it can perform better than the N+1 Parity scheme proposed by James S. Plank [4]. In a diskless checkpointing scheme, a collection of processors with disjoint memories coordinate to take a checkpoint of the global system state. A consistent global state consists of the checkpoints of each processor in the system plus a log of the messages in transit at the time of checkpointing. In the proposed scheme, messages are logged using the partial message logging scheme (discussed in detail in Section 3). The message log is part of the checkpoint information of the individual processors. The checkpointing and rollback recovery procedures for diskless checkpointing are described below.
Consider a multicomputer/distributed system which consists of n+2 processors/nodes: $P_1, P_2, \ldots, P_n$, $P_c$ and $P_b$. Each node contains its own physical memory and communication devices. Processors $P_i$, $1 \le i \le n$, are called application processors, and the processors $P_c$ and $P_b$ are called the "checkpoint processor" and "backup processor" respectively. A computing task is partitioned into n subtasks, such that each subtask is executed on a distinct application processor $P_i$ in an asynchronous manner. These subtasks communicate with each other by passing messages via the underlying interconnection network. The consistent global state is maintained cooperatively by the application processors $P_1, P_2, \ldots, P_n$, the checkpoint processor $P_c$, and the backup processor $P_b$ using N+1 parity [4, 7]. Specifically, each application processor has a copy of its own local checkpoint in physical memory. The checkpoint processor has a copy of the "parity checkpoint", which is constructed as follows. Let $S_i$ denote the size of the main memory containing the local checkpoint at processor $P_i$, $1 \le i \le n$. The checkpoint processor $P_c$ reserves a bank of memory $M_c$ of size $S_c$, which is equal to the maximum of the $S_i$, i.e. $S_c = \max\{S_1, S_2, \ldots, S_n\}$. Let $b_{i,j}$ be the j-th byte of $P_i$'s checkpoint if $j \le S_i$, and 0 otherwise. The j-th byte $b_{c,j}$ of the checkpoint processor holds the parity obtained by exclusive-oring the j-th bytes of the n application processors, that is,
$$b_{c,j} = b_{1,j} \oplus b_{2,j} \oplus \cdots \oplus b_{n,j}$$
One copy of the above parity checkpoint, which is the exclusive-or of the checkpoints of all application processors, is stored on the backup processor; this backup is useful when the checkpoint processor wants to update its contents. When any application processor, say $P_f$, fails, each non-failed processor restores its state to its local checkpoint. The failed processor's checkpoint can be obtained by exclusive-oring the checkpoints of the non-failed application processors with the parity checkpoint held by the checkpoint processor:
$$b_{f,j} = b_{c,j} \oplus b_{1,j} \oplus \cdots \oplus b_{f-1,j} \oplus b_{f+1,j} \oplus \cdots \oplus b_{n,j}$$
If the checkpoint processor fails, it restores its state from the backup processor, or by recalculating the parity checkpoint from scratch. The backup processor may be restored similarly. The above scheme tolerates a single failure with two additional processors (i.e. $P_c$ and $P_b$). This idea can be extended to the simultaneous failure of "n" nodes, in which case the system needs "2*n" additional/replacement processors.
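The parity construction and single-failure recovery described above can be illustrated with a minimal sketch. The function names, the sample checkpoint contents, and the assumption that the failed processor's checkpoint size is known at recovery time are illustrative only and are not taken from the scheme's implementation.

```python
def parity_checkpoint(checkpoints):
    """Byte-wise XOR of all checkpoints, zero-padded to the longest one."""
    size = max(len(c) for c in checkpoints)
    parity = bytearray(size)          # starts as all zeros
    for ckpt in checkpoints:
        for j, byte in enumerate(ckpt):
            parity[j] ^= byte         # bytes beyond len(ckpt) count as 0
    return bytes(parity)


def recover(failed_index, checkpoints, parity, failed_size):
    """Rebuild the failed processor's checkpoint from survivors plus parity."""
    survivors = [c for i, c in enumerate(checkpoints) if i != failed_index]
    rebuilt = parity_checkpoint(survivors + [parity])
    return rebuilt[:failed_size]      # drop the zero padding (size assumed known)


# Hypothetical local checkpoints of three application processors.
local = [b"alpha", b"be", b"gamma-state"]
parity = parity_checkpoint(local)
restored = recover(1, local, parity, failed_size=len(local[1]))
```

Because exclusive-or is its own inverse, XOR-ing the parity with every surviving checkpoint cancels their contributions and leaves only the bytes of the failed processor.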
Initially, each application processor $P_i$, $1 \le i \le n$, takes checkpoint 0 by simply setting the access mode of all of its memory pages to read-only. The content of the memory is then sent to the checkpoint processor $P_c$. $P_c$ calculates the parity of the pages byte by byte and stores the parity data in the corresponding location of $M_c$. In addition, each application processor clears its extra memory. This space is split in half, and each half is used as a checkpointing buffer. We call them the primary and secondary checkpointing buffers, as shown in Figure 1. Application processors start executing their programs after their initialization phase is completed. Read operations proceed as usual. However, a page fault is generated when a processor attempts to write to a read-only page. Once a page fault occurs, the content of the accessed page is copied to the primary checkpointing buffer, and the page's access mode is set to read-write so that it can be written. If any application processor fails during this time, the system can be restored to the most recent checkpoint by copying the pages back from the buffer, reprotecting them as read-only, and then restarting. If the checkpoint processor fails during this time, it can be restored from the backup processor, and vice versa.
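The copy-on-write behaviour described above can be simulated in a few lines. This is only a sketch: the class `AppProcessor`, its fields, and the use of a Python dictionary in place of real page protection and page-fault handling are assumptions made for illustration.

```python
class AppProcessor:
    """Toy model of one application processor's pages and primary buffer."""

    def __init__(self, pages, buffer_capacity):
        self.pages = list(pages)        # current memory contents, one item per page
        self.writable = set()           # pages whose access mode is read-write
        self.primary = {}               # page index -> contents at the last checkpoint
        self.buffer_capacity = buffer_capacity

    def write(self, index, value):
        if index not in self.writable:  # simulated page fault on a read-only page
            if len(self.primary) >= self.buffer_capacity:
                raise MemoryError("primary buffer full: a new checkpoint is needed")
            self.primary[index] = self.pages[index]   # save the old contents first
            self.writable.add(index)                  # then allow writes
        self.pages[index] = value

    def rollback(self):
        """Restore the most recent checkpoint and reprotect all pages."""
        for index, saved in self.primary.items():
            self.pages[index] = saved
        self.primary.clear()
        self.writable.clear()


proc = AppProcessor(["a", "b", "c"], buffer_capacity=2)
proc.write(0, "x")
proc.write(0, "y")      # second write to the same page causes no new fault
saved_pages = len(proc.primary)
proc.rollback()
```

Only the first write to a page consumes buffer space, which is why a program that keeps touching the same few pages fills the primary buffer slowly.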
Timeout Mechanism
Whenever the space in a processor's primary checkpointing buffer is used up, or when it is time to take a new checkpoint (which we simply call Timeout), the processor must start a new checkpoint. In other words, if the last completed checkpoint was checkpoint number c, it starts checkpoint c+1. The processor requests authorization from the checkpoint processor to take the checkpoint. The checkpoint processor in turn checks the global checkpoint number; if the difference is not greater than one, it grants permission; otherwise it holds the request until the global checkpoint number is incremented.
The significance of Timeout is as follows. If a user is executing a program with high locality of reference, the chance of the primary buffer getting filled is low, or the buffer gets filled only after a significant amount of computation time. If the processor fails during the execution of such a program, more work is lost, because a checkpoint is taken only after a long interval, and sometimes the application cannot take a checkpoint at all. With the Timeout mechanism added to N+1 parity, the application takes checkpoints periodically, so the maximum work lost is bounded by the checkpoint interval.
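The effect of the Timeout trigger on a high-locality program can be seen in a small simulation. The tick-based clock, the parameter names, and the workload model (a fixed number of newly dirtied pages per tick) are assumptions chosen for illustration, not part of the original scheme.

```python
def should_checkpoint(buffer_used, buffer_capacity, elapsed, timeout):
    """Start a checkpoint when the primary buffer is full or the timeout expires."""
    return buffer_used >= buffer_capacity or elapsed >= timeout


def checkpoint_times(new_dirty_pages_per_tick, buffer_capacity, timeout, ticks):
    """Return the ticks at which a checkpoint is started."""
    times, used, last = [], 0, 0
    for tick in range(1, ticks + 1):
        used = min(buffer_capacity, used + new_dirty_pages_per_tick)
        if should_checkpoint(used, buffer_capacity, tick - last, timeout):
            times.append(tick)
            used, last = 0, tick      # buffer is emptied, timer restarts
    return times


# High locality: the program keeps rewriting pages it has already dirtied.
without_timeout = checkpoint_times(0, 8, float("inf"), 20)
with_timeout = checkpoint_times(0, 8, 5, 20)
# Low locality: the buffer fills before the timeout expires.
buffer_driven = checkpoint_times(4, 8, 5, 6)
```

In this model the high-locality run never checkpoints when only the buffer-full condition is used, while the timeout forces a checkpoint every five ticks.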
PARTIAL MESSAGE LOGGING SCHEME
Message handling plays a significant role in checkpointing and recovery of a distributed system [9]. A consistent state of the system cannot be assumed without proper message handling. The major concern is the messages that are in transit at the time of checkpointing.
The N+1 parity scheme (i.e. the diskless checkpointing scheme) proposed by J. S. Plank is based on the assumption of a zero message state, where all processors halt/freeze their computation until the in-transit messages are delivered to the destination process. Our scheme proposes a partial message logging technique which allows asynchronous checkpointing at both sender and receiver and therefore does not incur any blocking/halting overhead. Since the in-transit message data is saved in the in-memory checkpoint, the proposed message logging technique does not introduce any additional overhead. Further, the proposed scheme logs messages only partially, by always ensuring that the difference among the checkpoint states of individual application processors is never greater than 1. Therefore, at any time, the message log of each process contains only the messages that were sent/received during the present and previous checkpoint intervals.
The example in Figure 2 shows the occurrence of orphan and lost messages when message logging is not done. A system is strongly consistent if and only if there are no orphan and lost messages [15].
Fig. 2. Lost and Orphan messages
Processes $P_1$ and $P_2$ both take in-memory checkpoint "c" asynchronously. When a failure occurs, they roll back to checkpoint c. Process $P_1$ resends message $m_1$ to process $P_2$, which makes $m_1$ an orphan message. Process $P_2$ does not replay message $m_2$, which makes $m_2$ a lost message. Logging of messages introduces memory and computational overhead. Therefore, the proposed scheme logs messages only partially, by always ensuring that the difference among the checkpoint states of individual application processors is never greater than 1. At any time, the message log of each process contains the messages that were sent/received during the present and previous checkpoint intervals. Each application processor hosts one or more processes running in its address space [17].
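The two anomalies can be stated compactly in terms of which events a pair of checkpoints records. The sketch below uses the standard definitions (an orphan message has its receipt recorded but not its sending; a lost message has its sending recorded but not its receipt); the numeric timestamps are hypothetical and merely mimic the situation of Figure 2.

```python
def classify(send_time, recv_time, sender_ckpt, receiver_ckpt):
    """Classify a message relative to the checkpoints of its sender and receiver."""
    send_recorded = send_time <= sender_ckpt      # send happened before sender's checkpoint
    recv_recorded = recv_time <= receiver_ckpt    # receipt happened before receiver's checkpoint
    if recv_recorded and not send_recorded:
        return "orphan"
    if send_recorded and not recv_recorded:
        return "lost"
    return "consistent"


# Hypothetical times: P1 checkpoints at t=2, P2 checkpoints at t=5.
p1_ckpt, p2_ckpt = 2, 5
m1 = classify(send_time=3, recv_time=4, sender_ckpt=p1_ckpt, receiver_ckpt=p2_ckpt)  # P1 -> P2
m2 = classify(send_time=1, recv_time=3, sender_ckpt=p2_ckpt, receiver_ckpt=p1_ckpt)  # P2 -> P1
```

Here `m1` is sent after the checkpoint of $P_1$ but received before the checkpoint of $P_2$, so it is classified as an orphan, while `m2` is classified as lost.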
Log Structure
Every process maintains two message log tables, the sent table and the receipt table, in the address space of the application processor on which it runs.
Sent Table: The sent table is shown in Table 1. The message logging mechanism changes all message slots with ack status 1 to status 2 at the time of taking an in-memory checkpoint. When the application processor rolls back to the previous checkpoint during failure recovery, all processes resend the messages in the sent table with ack status 2, thus eliminating lost messages.
Receipt Table: The receipt table is shown in Table 2, where Seq. number is the sequence number contained in the received message and Source is the id of the process from which the message was received.
A message in the receipt table can have one of two ack states:
1 - The message is received and an acknowledgement is sent.
2 - The message was received during the previous checkpoint interval.
The message logging mechanism makes an entry for every received message and marks it with its ack status. This ensures that at any point of time the message log holds information only about the messages of the present and previous checkpoint intervals, thus reducing the memory and lookup overhead. As the processes maintain the message log information in the address space of the application processes, this partial message logging scheme works for both diskless and disk-based checkpointing schemes.
Fig. 3. Message exchanges between processes $P_1$ and $P_2$
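A minimal sketch of the two log tables and their ageing at checkpoint time is given below. The class name, the dictionary layout, and the reading of status 1 as "logged in the present interval" are assumptions; the deletion of status-2 entries at the next checkpoint follows the behaviour described in the proof of Theorem 1.

```python
CURRENT, PREVIOUS = 1, 2    # ack states 1 and 2 from the text


class MessageLog:
    """Sent and receipt tables of one process (layout assumed for illustration)."""

    def __init__(self):
        self.sent = {}        # (destination, seq) -> ack state
        self.received = {}    # (source, seq) -> ack state

    def log_sent(self, destination, seq):
        self.sent[(destination, seq)] = CURRENT

    def log_received(self, source, seq):
        self.received[(source, seq)] = CURRENT

    def on_checkpoint(self):
        """Delete entries already marked old, then mark the remaining ones old."""
        for table in (self.sent, self.received):
            for key in [k for k, state in table.items() if state == PREVIOUS]:
                del table[key]
            for key in table:
                table[key] = PREVIOUS


log = MessageLog()
log.log_sent("P2", 1)
log.on_checkpoint()           # message 1 is now marked as old (state 2)
log.log_sent("P2", 2)
log.on_checkpoint()           # message 1 is deleted, message 2 becomes old
```

Since every entry survives at most two checkpoints, garbage collection happens as a side effect of checkpointing and the tables stay small.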
An Example of Partial Message Logging
Figure 3 shows the message exchange between two processes. For ease of understanding, we assume that the two processes $P_1$ and $P_2$ are running on separate application processors. First, both application processors take an in-memory checkpoint almost concurrently at time $t_1$, and the sent table of process $P_2$ is as given in Table 3. At time $t_2$, process $P_1$ takes in-memory checkpoint c+1. At this moment the sent and receipt tables of process $P_1$ are as shown in Tables 4 and 5 respectively. After taking checkpoint c+1, process $P_1$ sends message $m_3$ to process $P_2$ and receives message $m_4$ from process $P_2$. At time $t_3$, process $P_2$ takes checkpoint c+1, after which its sent and receipt tables are as given in Tables 6 and 7 respectively. At time $t_4$, the sent table and receipt table of process $P_1$ are as given in Table 8 and Table 9. After time $t_4$, a fault occurs in the application processor hosting process $P_1$. Process $P_2$ now rolls back to its local checkpoint c+1, and process $P_1$ is recovered by the first-level recovery scheme. After rollback, the sent table and receipt table of process $P_2$ are the same as Table 6 and Table 7 respectively, and those of process $P_1$ are the same as Table 4 and Table 5 respectively. When process $P_1$ now resends message $m_3$ to process $P_2$, process $P_2$ looks it up in Table 7 and finds a match. Therefore process $P_2$ discards the message as redundant, thereby eliminating the orphan message problem. Process $P_2$ replays message $m_4$, as there is a mismatch between the ack status bits of sent table 6 and receipt table 9 with respect to message $m_4$. Thus the lost message problem is also eliminated. Hence the proposed partial message logging scheme handles messages effectively and reliably.
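The recovery step of this example can be sketched as follows. The table contents are hypothetical stand-ins for Tables 5, 6 and 7 (which are not reproduced here), sequence numbers 3 and 4 stand for $m_3$ and $m_4$, and the helper names are illustrative.

```python
def deliver(receipt_table, source, seq):
    """Accept a message unless it is already logged; return False for duplicates."""
    if (source, seq) in receipt_table:
        return False                     # redundant message, discard it
    receipt_table[(source, seq)] = 1     # log it with ack state 1
    return True


def messages_to_replay(sent_table):
    """After rollback, resend every sent-table entry with ack state 2."""
    return sorted(key for key, state in sent_table.items() if state == 2)


# Assumed state after rollback: P2 is at its checkpoint c+1, P1 at its checkpoint c+1.
p2_receipt = {("P1", 3): 2}   # m3 was received before P2's checkpoint
p2_sent = {("P1", 4): 2}      # m4 was sent before P2's checkpoint
p1_receipt = {}               # P1 had not yet received m4 at its checkpoint

m3_accepted = deliver(p2_receipt, "P1", 3)      # P1 re-executes and resends m3
replayed = messages_to_replay(p2_sent)          # P2 replays m4
m4_accepted = all(deliver(p1_receipt, "P2", seq) for _, seq in replayed)
```

The duplicate `m3` is rejected by the lookup in the receipt table of $P_2$, while the replayed `m4` is accepted by $P_1$ because its restored receipt table has no matching entry.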
PROOF OF CORRECTNESS
This section presents eight theorems to establish the correctness of the proposed scheme.
Theorem 1: The logged messages cannot be more than two checkpoints old.
Proof : This theorem is proved by contradiction. Assume that a logged message can be more than two checkpoints old. This means that some processor p which is in interval c+i holds a message that was logged during interval c, where i > 1. Now suppose that processor p is in interval c and has logged the message m. At the time of taking checkpoint c, the message is marked as old by incrementing its status to 2. At the time of taking checkpoint c+1, all messages marked as old (status 2) are deleted, and the messages that were transmitted during that interval are marked as old. Hence, message m is deleted before the processor enters interval c+2. This contradicts the assumption made, and therefore the logged messages cannot be older than two checkpoint intervals.
Theorem 2: No message can cross more than one LoC (Line of Consistency)
Proof : Assume that processor p sends message m to p', and m crosses more than one LoC. This means that processor p sends message m in checkpoint interval c, and processor p' receives message m in checkpoint interval c+i, where i > 1. This contradicts the partial message logging model, in which the difference in checkpoint intervals between any two processors cannot be more than one. Therefore, the assumption is wrong. Hence, the theorem is proved.
Theorem 3:
Messages can cross LoC from either side.
Proof : This theorem is proved by cases.
Case 1: We have to show that a message can cross the LoC from left to right. This means that processor p, which is in interval c, can send a message m to processor p' in interval c+1. By the description of the checkpointing system, the application processors can be in different checkpoint intervals, provided the difference does not exceed one. Suppose both processors p and p' are in interval c. If the primary buffer of p' is exhausted, then p' takes checkpoint c and enters interval c+1. Processor p can still send messages to p' as before. Hence a message can cross the LoC from left to right.
Case 2: We have to show that a message can cross the LoC from right to left. This means that processor p, which is in interval c, can send a message m to processor p' in interval c-1. By the description of the checkpointing system, processors can be in different checkpoint intervals, provided the difference does not exceed one. Suppose both processors p and p' are in interval c-1. If the primary buffer of p is exhausted, then p takes checkpoint c-1, enters interval c, and sends message m to processor p'. Hence a message can cross the LoC from right to left.
Theorem 4:
The difference in local checkpoints among any two processors is not more than one.
Proof : Assume that there is a processor p in interval c+2 while another processor p' is in interval c. Consider one of the previous states s in which both p and p' are in checkpoint interval c and the checkpoint processor (the global consistent checkpoint) is at interval c-1.
Processor p wants to take checkpoint c when its primary buffer reaches its threshold, and it sends a request message to the checkpoint processor for authorization.
The checkpoint processor, upon receiving the request, compares the checkpoint number (c) with the global checkpoint number (c-1). As it allows a difference of one interval, it sends the authorization to p.
Processor p runs well ahead of the others and wants to take another checkpoint, so it again sends a request message to the checkpoint processor, asking for authorization to take checkpoint c+1.
Upon receiving this request, the checkpoint processor sends a message $m_c$ to the rest of the application processors, asking them to take checkpoint c. Once the rest of the processors have taken checkpoint c, it sends the authorization signal to processor p to take checkpoint c+1. Now processor p is in interval c+2 and processor p' is in interval c+1. This contradicts the assumption made. Hence, the theorem is proved.
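The authorization rule used in this proof can be modelled with a small coordinator. The class and method names are hypothetical, and the model follows the variant described in the proof, where lagging processors are asked to take the pending checkpoint before the requester is authorized.

```python
class CheckpointCoordinator:
    """Toy model of the checkpoint processor's authorization logic."""

    def __init__(self, processors):
        # Number of the last checkpoint taken by each application processor.
        self.completed = {p: 0 for p in processors}

    def request(self, processor):
        """Authorize the requester's next checkpoint; return who was forced to catch up."""
        target = self.completed[processor] + 1
        forced = []
        for other, done in self.completed.items():
            if other != processor and done < target - 1:
                self.completed[other] = target - 1   # told to take the pending checkpoint
                forced.append(other)
        self.completed[processor] = target           # authorization granted
        return forced

    def spread(self):
        """Largest difference in checkpoint numbers between any two processors."""
        values = self.completed.values()
        return max(values) - min(values)


coord = CheckpointCoordinator(["p", "q", "r"])
first = coord.request("p")     # nobody lags by more than one, no one is forced
second = coord.request("p")    # q and r must catch up before p proceeds
```

However often a fast processor requests a checkpoint, `spread()` never exceeds one in this model, which is the property Theorem 4 states.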
Theorem 5: Application processor $P_i$ can recover to a consistent state by rolling back to the previous checkpoint.
Proof :
The theorem is proved by cases, and the proof follows from the definition of the system.
Case 1: A fault has occurred in some other part of the system and processor $P_i$ has to roll back to its previous checkpoint. The read-only pages of $P_i$ are kept as they are. The read-write pages are replaced with their corresponding pages from the primary buffer. All these pages now reflect the previous consistent checkpoint state of $P_i$, in accordance with the definition of the recovery mechanism of the system. Hence, $P_i$ rolls back to its previous consistent checkpoint.
Case 2: The fault is in processor $P_i$ itself.
Page k of processor $P_i$ is computed by exclusive-oring the corresponding pages $\mathrm{page}_{k,j}$, where $1 \le j \le i-1$ and $i+1 \le j \le n$ (i.e., page k of all other application processors), with page k of the checkpoint processor (CP):
$$\mathrm{page}_{k,i} = \mathrm{page}_{k,CP} \oplus \bigoplus_{j \ne i} \mathrm{page}_{k,j}$$
From the definition of the system model, we know that the pages of CP are obtained by exclusive-oring the corresponding pages of the consistent local checkpoints of all application processors. Hence, the computed page is exactly the previous consistent checkpointed page of processor $P_i$. Hence, it is proved that $P_i$ rolls back to its previous consistent checkpoint.
Theorem 6:
The checkpoint processor $P_c$ can recover to the previous global consistent checkpoint state after a failure of the system.
Proof :
The theorem follows directly from the definition of the system and is proved by considering the following three cases.
Case 1: The fault has occurred in some application processor.
In this case, the previous consistent global checkpoint, which is available in the backup processor, is copied to the checkpoint processor as it is.
Case 2: The fault has occurred in the backup processor.
In this case, the checkpoint processor may have made some modifications to the checkpoint pages since the previous global consistent checkpoint. First, all the application processors roll back to their local checkpoint number, say C, which is equal to the global checkpoint number C of the checkpoint processor. Then the checkpoint processor recalculates the parity checkpoint by collecting the local checkpoint pages from all the application processors.
Case 3: The fault has occurred in the checkpoint processor itself.
The checkpoint processor recovers from the fault by restoring its state from the backup processor.
Theorem 7:
The proposed algorithm constructs a consistent global checkpoint.
Proof :
The global checkpoint is said to be consistent if it represents the checkpoint status of every individual application processor at any point of time. The theorem is proved by considering the following two cases.
Case 1: At any point of time there is a static global checkpoint.
At the start of execution, all the application processors send a copy of their address space to the checkpoint processor, which calculates the parity global checkpoint. This global checkpoint is then sent to the backup processor. The consistent global checkpoint held by the backup processor is not modified until another consistent global checkpoint has been calculated by the checkpoint processor.
Case 2: The global checkpoint represents all the local checkpoint updates.
Consider the state of the system just after the completion of global checkpoint c, and assume that all application processors are also at their local checkpoint c. Suppose application processor i has used up its primary buffer and is ready to take checkpoint c+1. After getting the authorization, it sends the difference of each modified page to the checkpoint processor.
The checkpoint processor, after receiving $\mathrm{diff}_k$, exclusive-ors it with the corresponding $\mathrm{page}_k$, thus reflecting the local checkpoint of application processor i. Likewise, the checkpoint processor comes to reflect the local checkpoint c+1 of all the application processors.
Meanwhile, if any application processor requests to take local checkpoint c+2, its request is kept pending until the consistent global checkpoint state c+1 has been reached and copied to the backup processor.
Thus at any point of time, there will be a consistent global checkpoint state.
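The diff-based update of Case 2 relies on the fact that XOR-ing the parity page with (old page XOR new page) replaces the old contribution by the new one. A minimal sketch follows; the page contents are hypothetical two-byte values, and the diff is assumed to be the exclusive-or of the old and new page contents.

```python
def xor_bytes(a, b):
    """Byte-wise exclusive-or of two equally sized pages."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_of(pages):
    result = bytes(len(pages[0]))
    for page in pages:
        result = xor_bytes(result, page)
    return result


# Page k of three application processors at checkpoint c.
old_pages = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity_k = parity_of(old_pages)

# Processor with index 1 modified its page and sends only the difference.
new_page = b"\xaa\x55"
diff_k = xor_bytes(old_pages[1], new_page)
updated_parity_k = xor_bytes(parity_k, diff_k)

# Recomputing from scratch gives the same parity page.
recomputed = parity_of([old_pages[0], new_page, old_pages[2]])
```

Sending the diff lets the checkpoint processor update its parity without contacting the other application processors.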
PERFORMANCE EVALUATION
In order to assess the performance of the proposed scheme [17], we have used a distributed simulator testbed.
Distributed Simulator Testbed
The Distributed Simulator testbed helps to demonstrate how system reliability can be enhanced by using a particular checkpointing and recovery scheme. The simulator takes a system specification and a task set to be run, and injects faults to observe the performance of the system in the presence of faults. The two important components of the Distributed Simulator testbed are:
1. Checkpoint Simulator
2. Distributed Virtual Machine
The functionality of Checkpoint Simulator is similar to that of Rapids Simulator [14] .
Distributed Virtual Machine
DVM (Distributed Virtual Machine) is an integrated set of software tools and libraries that emulates a general-purpose, flexible, computing framework on interconnected computers. The overall objective of the DVM system is to enable such a collection of computers to be used cooperatively for concurrent or parallel computation.
The DVM system is composed of two parts. The first part is a daemon that resides on all the computers making up the virtual machine. The second part is a library of DVM interface routines. It contains a functionally complete repertoire of primitives that are needed for cooperation between the tasks of an application. This library contains user-callable routines for message passing, spawning processes, and coordinating tasks. The DVM computing model is based on the notion that an application consists of several tasks, each responsible for a part of the application's computational workload. The Distributed Virtual Machine takes care of fault detection, installation and management, and reconfiguration when a new node is added or an existing node is removed. It automatically takes decisions regarding recovery, using the information in the control file and the available backup processors.
Applications considered for Evaluation
In order to assess the performance of the proposed scheme and other checkpointing and recovery schemes, the following applications [15] are run on the distributed simulator testbed:
1. Merge Sort
2. Matrix Multiplication
3. All Pairs Shortest Path Problem
Merge Sort:
The data elements to be sorted are distributed to all the application processors in the system by the coordinator using a divide-and-conquer approach. The application processors sort the elements and return the result to the coordinator. If 1,000,000 records need to be sorted, the coordinator, which acts as dispatcher, distributes these records to the application processors in chunks of 100,000 records. After completing the sort, each application processor sends its data back to the coordinator, which merges the incoming data in ascending order of their values. This is a computation-intensive problem.
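The dispatch, sort and merge steps can be sketched as follows. This is a single-process illustration under stated assumptions: the function names are hypothetical, the per-chunk sort runs in a loop rather than on separate application processors, and `heapq.merge` from the Python standard library plays the role of the coordinator's merge.

```python
import heapq
import random


def merge_sort(items):
    """Plain recursive merge sort, as each application processor might run it."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    return list(heapq.merge(merge_sort(items[:mid]), merge_sort(items[mid:])))


def chunked(records, chunk_size):
    """Split the input into the chunks the coordinator would dispatch."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def distributed_sort(records, chunk_size):
    sorted_chunks = [merge_sort(chunk) for chunk in chunked(records, chunk_size)]
    return list(heapq.merge(*sorted_chunks))   # coordinator merges in ascending order


random.seed(0)
records = [random.randint(0, 10_000) for _ in range(1_000)]
result = distributed_sort(records, chunk_size=100)
```

The sizes here are scaled down from the 1,000,000 records and 100,000-record chunks used in the evaluation so that the sketch runs instantly.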
Matrix Multiplication:
The multiplication of two square matrices whose elements are floating-point numbers is carried out using Cannon's algorithm. Matrices of order 2629 x 2629 are considered.
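For readers unfamiliar with Cannon's algorithm, the following sketch simulates it sequentially on an n x n processor grid in which each processor holds a single element of each matrix. Real implementations distribute square blocks instead of single elements and shift them over the network; the function name and the one-element-per-processor simplification are assumptions made for brevity.

```python
def cannon_multiply(a, b):
    """Simulate Cannon's algorithm with one matrix element per grid position."""
    n = len(a)
    # Initial alignment: row i of A is shifted left by i, column j of B up by j.
    a_loc = [[a[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b_loc = [[b[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0.0] * n for _ in range(n)]
    for _ in range(n):
        # Every grid position multiplies the pair it currently holds.
        for i in range(n):
            for j in range(n):
                c[i][j] += a_loc[i][j] * b_loc[i][j]
        # Then A moves one step left along each row and B one step up each column.
        a_loc = [[a_loc[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b_loc = [[b_loc[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c


product = cannon_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

After the alignment, position (i, j) holds A[i][k] and B[k][j] for the same k, and each shift advances k by one, so after n steps every product term has been accumulated exactly once.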
All Pairs Shortest Path Problem:
The all pairs shortest path problem computes the shortest distance from each vertex to all other vertices. Here we consider a connected graph of 15 vertices (a 15 x 15 distance matrix). Each application processor is assumed to be a vertex, and the shortest path from it to all other application processors is calculated. This problem is communication-intensive.
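The text does not name the shortest-path algorithm that was used. As one possible illustration, the sketch below runs Dijkstra's algorithm once per source vertex, mirroring the idea that each application processor computes the distances from its own vertex; the graph, its weights, and the function names are hypothetical.

```python
import heapq


def shortest_from(graph, source):
    """Dijkstra's algorithm for non-negative edge weights from one source."""
    dist = {vertex: float("inf") for vertex in graph}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale queue entry
        for v, weight in graph[u].items():
            if d + weight < dist[v]:
                dist[v] = d + weight
                heapq.heappush(heap, (dist[v], v))
    return dist


def all_pairs(graph):
    """One single-source computation per vertex (one per application processor)."""
    return {vertex: shortest_from(graph, vertex) for vertex in graph}


graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2},
    "C": {"A": 4, "B": 2},
}
distances = all_pairs(graph)
```

In this small graph the direct edge from A to C has weight 4, but the route through B costs 3, which is what `distances["A"]["C"]` reports.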
Performance Metrics
Apart from correctness, performance is the most important aspect of a checkpointer. N.H. Vaidya derived equations 1 to 3 for assessing the performance of an application in the presence of checkpointing and recovery. These equations take the checkpoint overhead, checkpoint latency, and recovery time as input. From (1), it can be seen that $T_{opt}$ decreases when the overhead O decreases.
Performance Measurements
Before running the applications on the distributed simulator testbed, we first define the various overheads.
Checkpoint overhead: Checkpoint overhead is the time added to the running time of the target program as a result of checkpointing.
Checkpoint latency: Checkpoint latency is the time that it takes the checkpointer to complete a checkpoint, from start to finish, i.e. the time required to save the checkpoint. In many implementations, checkpoint latency is larger than the checkpoint overhead.
Recovery overhead/time: This is the time that it takes the system to restore a checkpointed state following the detection of a failure.
To calculate the above overheads we have considered 4 checkpoint processors and 4 backup processors, both for N+1 parity and for the proposed scheme. In this scenario, N+1 Parity supports the simultaneous failure of at most four application processors. In order to evaluate the performance of the proposed protocol, the overhead ratio of the different checkpointing schemes and of the proposed scheme is calculated using equations 1, 2 and 3, and the results are summarized in Tables 10, 11 and 12 for merge sort, matrix multiplication and the all pairs shortest path problem respectively. The value of λ is assumed to be $6.301 \times 10^{-6}$. The above results indicate that the proposed scheme yields the smallest overhead ratio 'r'. Therefore, we can conclude that the proposed scheme outperforms the N+1 Parity checkpointing scheme.
CONCLUSIONS
The N+1 Parity scheme proposed by James S. Plank fails to checkpoint applications with high locality of reference. Our proposed scheme introduces the Timeout mechanism to overcome this problem. The proposed message logging technique, partial message logging, allows asynchronous checkpointing at both sender and receiver and does not freeze/halt the state of a process. Partial message logging provides better performance even if the communication channels are unreliable. Garbage collection is simple and reliable, as the deletion of old messages is handled implicitly at the time of checkpointing itself. Correctness of the scheme is established through a set of proofs.
Using a distributed simulator testbed, we have evaluated and compared the performance of the proposed scheme with the N+1 Parity scheme. A few application programs have been developed and executed in this virtual environment. The overhead ratio is computed for both schemes and is found to be smaller for the proposed scheme. Hence we conclude that the proposed scheme outperforms the N+1 Parity scheme.
