A Utilization Model for Optimization of Checkpoint Intervals in Distributed Stream Processing Systems
State-of-the-art distributed stream processing systems such as Apache Flink
and Storm have recently included checkpointing to provide fault-tolerance for
stateful applications. This is a necessary eventuality as these systems head
into the Exascale regime, and is evidently more efficient than replication as
state size grows. However, current systems use a nominal value for the
checkpoint interval, indicative of assuming roughly 1 failure every 19 days,
that does not take into account the salient aspects of the checkpoint process,
nor the system scale, which can readily lead to inefficient system operation.
To address this shortcoming, we provide a rigorous derivation of utilization --
the fraction of total time available for the system to do useful work -- that
incorporates checkpoint interval, failure rate, checkpoint cost, failure
detection and restart cost, depth of the system topology and message delay. Our
model yields an elegant expression for utilization and provides an optimal
checkpoint interval given these parameters, interestingly showing it to be
dependent only on checkpoint cost and failure rate. We confirm the accuracy and
efficacy of our model through experiments with Apache Flink, where we obtain
improvements in system utilization for every case, especially as the system
size increases. Our model provides a solid theoretical basis for the analysis
and optimization of more elaborate checkpointing approaches.
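
The abstract states that the optimal checkpoint interval depends only on checkpoint cost and failure rate, but it does not give the expression itself. As a rough illustration of that kind of dependence, the sketch below uses Young's classical first-order approximation, T = sqrt(2 * C / lambda). This is a swapped-in, well-known formula and may differ from the expression derived in the paper. The function name, the 10-second checkpoint cost, and the use of the 19-day figure as a mean time between failures are assumptions made for illustration only.

```python
import math


def young_interval(checkpoint_cost_s: float, failure_rate_per_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    T = sqrt(2 * C / lambda), where C is the checkpoint cost in seconds and
    lambda is the failure rate in failures per second. This is NOT the
    paper's expression; it is a classical stand-in with the same inputs.
    """
    if checkpoint_cost_s <= 0 or failure_rate_per_s <= 0:
        raise ValueError("checkpoint cost and failure rate must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s / failure_rate_per_s)


# Illustrative inputs only (assumptions, not measurements from the paper):
# one failure every 19 days, and a hypothetical 10-second checkpoint cost.
mtbf_s = 19 * 24 * 3600
interval_s = young_interval(checkpoint_cost_s=10.0, failure_rate_per_s=1.0 / mtbf_s)
print(f"approximate checkpoint interval: {interval_s:.0f} s")
```

Under this approximation the interval shrinks as the failure rate rises, which is consistent with the abstract's point that a fixed nominal interval becomes less suitable as system scale, and hence the aggregate failure rate, grows.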