Checkpoint-based forward recovery using lookahead execution and rollback validation in parallel and distributed systems

Long, Junsheng

Authors: Junsheng Long

Abstract

This thesis studies a forward recovery strategy that uses checkpointing and optimistic execution in parallel and distributed systems. The approach uses replicated tasks executing on different processors for forward recovery and checkpoint comparison for error detection. To reduce overall redundancy, it employs lower static redundancy in the common error-free case to detect errors than the standard N-Modular Redundancy (NMR) scheme uses to mask errors. In the rare event of an error, it uses some extra redundancy for recovery. To reduce run-time recovery overhead, look-ahead processes advance the computation speculatively, and a rollback process produces a diagnosis that identifies the correct look-ahead processes without rolling back the whole system. Both analytical and experimental evaluations show that this strategy can provide nearly error-free execution time even under faults, with lower average redundancy than NMR.
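The recovery flow described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the thesis's implementation: it assumes two replicas per checkpoint interval, a single rollback validation run on a spare processor, transient faults that fire once, and a surviving look-ahead result that is accepted without a further comparison. The names `execute`, `run_interval`, and the `faults` injection table are hypothetical and do not come from the thesis.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = int
Step = Callable[[State, int], State]   # (state, interval index) -> next state
Fault = Callable[[State], State]       # hypothetical fault-injection hook


def run_interval(step: Step, state: State, k: int,
                 fault: Optional[Fault] = None) -> State:
    """Run one checkpoint interval; `fault`, if given, corrupts the result."""
    result = step(state, k)
    return fault(result) if fault else result


def execute(step: Step, initial: State, intervals: int,
            faults: Optional[Dict[Tuple[int, int], Fault]] = None
            ) -> Tuple[State, List[Tuple[int, str]]]:
    """Simulate duplex execution with look-ahead and rollback validation.

    `faults` maps (interval, replica) to a corruption function, where
    replica 0 and 1 are the two task copies and 2 is the validation run.
    Each fault is transient: it is consumed the first time it fires.
    """
    pending = dict(faults or {})
    committed, k = initial, 0
    log: List[Tuple[int, str]] = []
    while k < intervals:
        # Normal operation: two replicas run the interval from the last
        # committed checkpoint, and their checkpoints are compared.
        a = run_interval(step, committed, k, pending.pop((k, 0), None))
        b = run_interval(step, committed, k, pending.pop((k, 1), None))
        if a == b:
            committed = a
            log.append((k, "commit"))
            k += 1
            continue
        # Mismatch: an error is detected but not yet located.
        # Look-ahead: speculatively run the next interval from both
        # candidate checkpoints instead of stalling.
        has_next = k + 1 < intervals
        ahead = ({c: run_interval(step, c, k + 1) for c in (a, b)}
                 if has_next else {})
        # Rollback validation: re-run interval k from the committed
        # checkpoint to diagnose which candidate is correct.
        v = run_interval(step, committed, k, pending.pop((k, 2), None))
        if v not in (a, b):
            # No diagnosis possible: discard look-ahead work and retry.
            log.append((k, "rollback"))
            continue
        committed = v
        log.append((k, "validated"))
        k += 1
        if has_next:
            # Keep only the look-ahead that started from the valid checkpoint.
            committed = ahead[v]
            log.append((k, "lookahead"))
            k += 1
    return committed, log
```

For example, with `step = lambda s, k: s + k + 1` and a fault injected into replica 0 during interval 1, the run finishes in the same final state as a fault-free run, and the log shows interval 1 as validated and interval 2 as completed by look-ahead rather than re-executed after a system-wide rollback.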

Available Versions

NASA Technical Reports Server

oai:casi.ntrs.nasa.gov:1994002...

Last time updated on 03/08/2016