4,440 research outputs found

    Distributed Computing Systems and Checkpointing

    Get PDF
    This paper examines the performance of synchronous checkpointing in a distributed computing environment with and without load redistribution. Performance models are developed, and optimum checkpoint intervals are determined. The analysis extends earlier work by allowing for multiple nodes, state dependent checkpoint intervals, and a performance metric which is coupled with failure-free performance and the speedup functions associated with implementation of parallel algorithms. Expressions for the optimum checkpoint intervals for synchronous checkpointing with and without load redistribution are derived and the results are then used to determine when load redistribution is advantageous

    THE PERFORMANCE OF SOFT CHEKPOINTING APPROACH IN MOBILE COMPUTING SYSTEMS

    Get PDF
    Mobile computing raises many new issues such as lack of stable storage, low bandwidth of wireless channel, high mobility, and limited battery life. These new issues make traditional checkpointing algorithms unsuitable. Coordinated checkpointing is an attractive approach for transparently adding fault tolerance to distributed applications since it avoids domino effects and minimizes the stable storage requirement. However, it suffers from high overhead associated with the checkpointing process in mobile computing systems. In literature mostly, two approaches have been used to reduce the overhead: First is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process nonblocking. Since MHs are prone to failure, so they have to transfer a large amount of checkpoint data and control information to its local MSS which increases bandwidth overhead. In this paper, we introduce the concept of 201C;Soft checkpoint201D; which is neither a tentative checkpoint nor a permanent checkpoint, to design efficient checkpointing algorithms for mobile computing systems. Soft checkpoints can be saved anywhere, e.g., the main memory or local disk of MHs. Before disconnecting from the MSS, these soft checkpoints are converted to hard checkpoints and are sent to MSSs stable storage. In this way, taking a soft checkpoint avoids the overhead of transferring large amounts of data to the stable storage at MSSs over the wireless network. We have also shown that our soft checkpointing scheme also adapts its behaviour to the characteristics of network

    Efficient checkpointing over local area networks

    Get PDF
    Parallel and distributed computing on clusters of workstations is becoming very popular as it provides a cost effective way for high performance computing. In these systems, the bandwidth of the communication subsystem (Using Ethernet technology) is about an order of magnitude smaller compared to the bandwidth of the storage subsystem. Hence, storing a state in a checkpoint is much more efficient than comparing states over the network. In this paper we present a novel checkpointing approach that enables efficient performance over local area networks. The main idea is that we use two types of checkpoints: compare-checkpoints (comparing the states of the redundant processes to detect faults) and store-checkpoints (where the state is only stored). The store-checkpoints reduce the rollback needed after a fault is detected, without performing many unnecessary comparisons. As a particular example of this approach we analyzed the DMR checkpointing scheme with store-checkpoints. Our main result is that the overhead of the execution time can be significantly reduced when store-checkpoints are introduced. We have implemented a prototype of the new DMR scheme and run it on workstations connected by a LAN. The experimental results we obtained match the analytical results and show that in some cases the overhead of the DMR checkpointing schemes over LAN's can be improved by as much as 20%

    FADI: a fault-tolerant environment for open distributed computing

    Get PDF
    FADI is a complete programming environment that serves the reliable execution of distributed application programs. FADI encompasses all aspects of modern fault-tolerant distributed computing. The built-in user-transparent error detection mechanism covers processor node crashes and hardware transient failures. The mechanism also integrates user-assisted error checks into the system failure model. The nucleus non-blocking checkpointing mechanism combined with a novel selective message logging technique delivers an efficient, low-overhead backup and recovery mechanism for distributed processes. FADI also provides means for remote automatic process allocation on the distributed system nodes

    A Study of Checkpointing in Large Scale Training of Deep Neural Networks

    Full text link
    Deep learning (DL) applications are increasingly being deployed on HPC systems, to leverage the massive parallelism and computing power of those systems for DL model training. While significant effort has been put to facilitate distributed training by DL frameworks, fault tolerance has been largely ignored. In this work, we evaluate checkpoint-restart, a common fault tolerance technique in HPC workloads. We perform experiments with three state-of-the-art DL frameworks common in HPC Chainer, PyTorch, and TensorFlow). We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. Our evaluation shows some critical differences in checkpoint mechanisms and exposes several bottlenecks in existing checkpointing implementations. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC. We also provide takeaway points that framework developers can use to facilitate better checkpointing of DL workloads in HPC

    Improving Performance of Iterative Methods by Lossy Checkponting

    Get PDF
    Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fundamental operations for many modern scientific simulations. When the large-scale iterative methods are running with a large number of ranks in parallel, they have to checkpoint the dynamic variables periodically in case of unavoidable fail-stop errors, requiring fast I/O systems and large storage space. To this end, significantly reducing the checkpointing overhead is critical to improving the overall performance of iterative methods. Our contribution is fourfold. (1) We propose a novel lossy checkpointing scheme that can significantly improve the checkpointing performance of iterative methods by leveraging lossy compressors. (2) We formulate a lossy checkpointing performance model and derive theoretically an upper bound for the extra number of iterations caused by the distortion of data in lossy checkpoints, in order to guarantee the performance improvement under the lossy checkpointing scheme. (3) We analyze the impact of lossy checkpointing (i.e., extra number of iterations caused by lossy checkpointing files) for multiple types of iterative methods. (4)We evaluate the lossy checkpointing scheme with optimal checkpointing intervals on a high-performance computing environment with 2,048 cores, using a well-known scientific computation package PETSc and a state-of-the-art checkpoint/restart toolkit. Experiments show that our optimized lossy checkpointing scheme can significantly reduce the fault tolerance overhead for iterative methods by 23%~70% compared with traditional checkpointing and 20%~58% compared with lossless-compressed checkpointing, in the presence of system failures.Comment: 14 pages, 10 figures, HPDC'1
    • …
    corecore