3 research outputs found

    An Optimizing Java Translation Framework for Automated Checkpointing and Strong Mobility

    Get PDF
    Long-running programs, e.g., in high-performance computing, need to write periodic checkpoints of their execution state to disk to allow them to recover from node failure. Manually adding checkpointing code to an application, however, is very tedious. The mechanisms needed for writing the execution state of a program to disk and restoring it are similar to those needed for migrating a running thread or a mobile object. We have extended a source-to-source translation scheme that allows the migration of mobile Java objects with running threads to make it more general and allow it to be used for automated checkpointing. Our translation scheme allows serializable threads to be written to disk or migrated with a mobile agent to a remote machine. The translator generates code that maintains a serializable run-time stack for each thread as a Java data structure. While this results in significant run-time overhead, it allows the checkpointing code to be generated automatically. We improved the locking mechanism that is needed to protect the run-time stack as well as the translation scheme. Our experimental results demonstrate an speedup of the generated code over the original translator and show that the approach is feasible in practice

    Optimizing Checkpoint Restart with Data Deduplication

    No full text
    The increasing scale, such as the size and complexity, of computer systems brings more frequent occurrences of hardware or software faults; thus fault-tolerant techniques become an essential component in high-performance computing systems. In order to achieve the goal of tolerating runtime faults, checkpoint restart is a typical and widely used method. However, the exploding sizes of checkpoint files that need to be saved to external storage pose a major scalability challenge, necessitating the design of efficient approaches to reducing the amount of checkpointing data. In this paper, we first motivate the need of redundancy elimination with a detailed analysis of checkpoint data from real scenarios. Based on the analysis, we apply inline data deduplication to achieve the objective of reducing checkpoint size. We use DMTCP, an open-source checkpoint restart package, to validate our method. Our experiment shows that, by using our method, single-computer programs can reduce the size of checkpoint file by 20% and distributed programs can reduce the size of checkpoint file by 47%
    corecore