Causal distributed breakpoints

Abstract

The authors define a causal distributed breakpoint, which is initiated by a sequential breakpoint in one process of a distributed computation and restores each process in the computation to its earliest state that reflects all events that happened before the breakpoint. An algorithm is presented for finding the causal distributed breakpoint, given a sequential breakpoint in one of the processes. Approximately consistent checkpoint sets are used to restore each process efficiently to its state in a causal distributed breakpoint. Causal distributed breakpoints assume deterministic processes that communicate solely by messages. The dependencies that arise from communication between processes are logged. Dependency logging and approximately consistent checkpoint sets are implemented on a network of SUN workstations running the V-System. Dependency logging adds between 1% and 14% overhead to the message-passing primitives. For a 200×200 Gaussian elimination, execution time overhead is less than 4%, and the run generates a dependency log of 288 kbyte.
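The definition above can be illustrated with a small sketch. It is not the authors' algorithm, whose details the abstract does not give; it is a plain backward traversal of the happened-before relation over a logged dependency map. The function name `causal_breakpoint`, the `(process, index)` event encoding, and the shape of the `deps` log are all assumptions made for illustration. The sketch also assumes the breakpoint index is the last event executed on the breakpoint process.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: an event is (process id, local event index), indices start at 1.
Event = Tuple[int, int]


def causal_breakpoint(num_procs: int,
                      deps: Dict[Event, Event],
                      breakpoint: Event) -> List[int]:
    """Return, per process, the index of the last local event that must be
    reflected in the restored state (0 means the initial state).

    deps maps each receive event to the send event it depends on
    (an assumed shape for the dependency log).
    """
    frontier = [0] * num_procs   # last required event on each process
    scanned = [0] * num_procs    # events on each process already examined
    bp_proc, bp_idx = breakpoint
    frontier[bp_proc] = bp_idx
    work = [bp_proc]

    while work:
        p = work.pop()
        lo, hi = scanned[p], frontier[p]
        scanned[p] = hi
        # Examine receive events on p in the newly required range (lo, hi].
        for (rp, ri), (sp, si) in deps.items():
            if rp == p and lo < ri <= hi and si > frontier[sp]:
                # The sender must be restored at least up to its send event.
                frontier[sp] = si
                work.append(sp)
    return frontier


# Example log: P1's event 2 receives P0's event 1, P2's event 2 receives
# P1's event 3, and P0's event 3 receives P2's event 4.
example_deps = {(1, 2): (0, 1), (2, 2): (1, 3), (0, 3): (2, 4)}
print(causal_breakpoint(3, example_deps, (2, 2)))  # [1, 3, 2]
```

With a sequential breakpoint at event 2 of process 2, the sketch reports that process 1 must reach its event 3 and process 0 its event 1; later events on those processes did not happen before the breakpoint and are excluded. Replaying from a checkpoint to these positions relies on the determinism assumption stated in the abstract.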
