Compiler Supported Interval Optimisation for Communication Induced Checkpointing

Abstract

There are three main approaches to checkpoint-based recovery in distributed systems: coordinated checkpointing, uncoordinated checkpointing and communication induced checkpointing. It can be shown that communication induced checkpointing has the lowest theoretical minimum overhead, but also that its effective overhead depends on the communication behaviour and the resulting forced checkpoints. If the placement of checkpoints and the communication pattern are disadvantageous, the overhead can become arbitrarily large due to a high number of forced checkpoints. We introduce a compiler supported approach to avoid unfavourable combinations of communication behaviour and local checkpoint placement. We analyse the application statically and prepare the placement of voluntary checkpoints. These placement decisions are reviewed at runtime. With this approach we optimise the effective checkpoint intervals of voluntary and forced checkpoints and thus reduce the overhead of communication induced checkpointing.
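The dependence of forced checkpoints on checkpoint placement can be illustrated with a small simulation. The sketch below is not the authors' method: the abstract does not say which protocol they use, so it assumes a generic index-based rule in which every message carries the sender's checkpoint index and a receiver that lags behind takes a forced checkpoint before delivery. The names `Process` and `run` and the two-process ping-pong pattern are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Process:
    """One process under an assumed index-based CIC rule (illustrative only)."""
    index: int = 0      # current checkpoint index
    voluntary: int = 0  # checkpoints taken on the local schedule
    forced: int = 0     # checkpoints induced by incoming messages

    def voluntary_checkpoint(self) -> None:
        self.index += 1
        self.voluntary += 1

    def send(self) -> int:
        # The current checkpoint index is piggybacked on every message.
        return self.index

    def receive(self, piggybacked_index: int) -> None:
        # A message from a later checkpoint interval forces a checkpoint
        # before delivery, so the receiver catches up with the sender.
        if piggybacked_index > self.index:
            self.index = piggybacked_index
            self.forced += 1


def run(rounds: int, aligned: bool) -> tuple[int, int]:
    """Return (voluntary, forced) checkpoint totals for a ping-pong exchange."""
    p, q = Process(), Process()
    for _ in range(rounds):
        p.voluntary_checkpoint()
        if aligned:
            # Both processes checkpoint before they communicate.
            q.voluntary_checkpoint()
            q.receive(p.send())
        else:
            # Q checkpoints between receiving and replying.
            q.receive(p.send())
            q.voluntary_checkpoint()
        p.receive(q.send())
    return p.voluntary + q.voluntary, p.forced + q.forced


if __name__ == "__main__":
    print("aligned:    ", run(3, aligned=True))
    print("interleaved:", run(3, aligned=False))
```

In this toy setting both schedules take the same number of voluntary checkpoints, but the interleaved one also triggers a forced checkpoint on every message, while the aligned one triggers none. Under the stated assumptions, this is the kind of unfavourable combination of communication pattern and placement that the abstract describes.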
