An on-line algorithm for checkpoint placement
Checkpointing is a common technique for reducing the time to recover from faults in computer systems. By saving intermediate states of programs in reliable storage, checkpointing reduces the processing time lost to faults. The length of the intervals between checkpoints affects the execution time of programs. Long intervals lead to long re-processing time, while too-frequent checkpointing leads to high checkpointing overhead. In this paper we present an on-line algorithm for placement of checkpoints. The algorithm uses on-line knowledge of the current cost of a checkpoint when it decides whether or not to place a checkpoint. We show how the execution time of a program using this algorithm can be analyzed. The total overhead of the execution time when the proposed algorithm is used is smaller than the overhead when fixed intervals are used. Although the proposed algorithm uses only on-line knowledge about the cost of checkpointing, its behavior is close to that of the off-line optimal algorithm, which uses complete knowledge of checkpointing cost.
An Online Algorithm for Checkpointing Placement
Checkpointing is a common technique for reducing the
time to recover from faults in computer systems. By saving
intermediate states of programs in a reliable storage,
checkpointing reduces the lost processing time caused
by faults. The length of the intervals between checkpoints
affects the execution time of programs. Long intervals lead
to long re-processing time, while too-frequent checkpointing
leads to high checkpointing overhead. In this paper we
present an on-line algorithm for placement of checkpoints.
The algorithm uses on-line knowledge of the current cost
of a checkpoint when it decides whether or not to place a
checkpoint. We show how the execution time of a program
using this algorithm can be analyzed. The total overhead of
the execution time when the proposed algorithm is used is
smaller than the overhead when fixed intervals are used.
Although the proposed algorithm uses only on-line knowledge
about the cost of checkpointing, its behavior is close to that of
the off-line optimal algorithm, which uses complete knowledge
of checkpointing cost.
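The abstract above describes a decision made on-line from the current checkpoint cost but does not give the rule itself. The following is a minimal illustrative sketch only, not the paper's algorithm: it assumes a hypothetical threshold rule that places a checkpoint once the work done since the last checkpoint reaches the Young-style interval sqrt(2 * cost / fault_rate), computed from the cost observed at that moment. The names `should_checkpoint`, `place_checkpoints`, `fault_rate`, and `step` are assumptions introduced for illustration.

```python
import math


def should_checkpoint(elapsed, current_cost, fault_rate):
    """Hypothetical on-line rule (an assumption, not taken from the paper).

    Checkpoint once the work accumulated since the last checkpoint reaches
    the Young-style interval computed from the cost observed right now.
    """
    return elapsed >= math.sqrt(2.0 * current_cost / fault_rate)


def place_checkpoints(costs, fault_rate, step=1.0):
    """Return the step indices at which the rule places a checkpoint.

    costs[i] is the checkpoint cost observed at step i; each step adds
    `step` units of work since the last checkpoint.
    """
    placed = []
    elapsed = 0.0
    for i, cost in enumerate(costs):
        elapsed += step
        if should_checkpoint(elapsed, cost, fault_rate):
            placed.append(i)
            elapsed = 0.0  # work is now protected by the new checkpoint
    return placed


# A temporary drop in checkpoint cost triggers an early checkpoint.
varying = place_checkpoints([8, 8, 2, 8, 8, 8, 8, 8], fault_rate=0.5)
# With a constant cost the rule degenerates to fixed intervals.
constant = place_checkpoints([8] * 12, fault_rate=0.5)
```

Under this assumed rule, a cheap checkpoint opportunity is taken earlier than a fixed-interval schedule would take it, which is the kind of behavior the abstract attributes to using on-line cost knowledge.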
Optimal Checkpointing for Secure Intermittently-Powered IoT Devices
Energy harvesting is a promising solution to power Internet of Things (IoT)
devices. Due to the intermittent nature of these energy sources, one cannot
guarantee forward progress of program execution. Prior work has advocated for
checkpointing the intermediate state to off-chip non-volatile memory (NVM).
Encrypting checkpoints addresses the security concern, but significantly
increases the checkpointing overheads. In this paper, we propose a new online
checkpointing policy that judiciously determines when to checkpoint so as to
minimize application time to completion while guaranteeing security. Compared
to state-of-the-art checkpointing schemes that do not account for the overheads
of encrypted checkpoints, we improve execution time by up to 1.4x. Comment: ICCAD 201
Tolerating Correlated Failures in Massively Parallel Stream Processing Engines
Fault-tolerance techniques for stream processing engines can be categorized
into passive and active approaches. A typical passive approach periodically
checkpoints a processing task's runtime states and can recover a failed task by
restoring its runtime state using its latest checkpoint. On the other hand, an
active approach usually employs backup nodes to run replicated tasks. Upon
failure, the active replica can take over the processing of the failed task
with minimal latency. However, both approaches have their own inadequacies in
Massively Parallel Stream Processing Engines (MPSPE). The passive approach
incurs a long recovery latency especially when a number of correlated nodes
fail simultaneously, while the active approach requires extra replication
resources. In this paper, we propose a new fault-tolerance framework, which is
Passive and Partially Active (PPA). In a PPA scheme, the passive approach is
applied to all tasks while only a selected set of tasks will be actively
replicated. The number of actively replicated tasks depends on the available
resources. If tasks without active replicas fail, tentative outputs will be
generated before the completion of the recovery process. We also propose
effective and efficient algorithms to optimize a partially active replication
plan to maximize the quality of tentative outputs. We implemented PPA on top of
Storm, an open-source MPSPE, and conducted extensive experiments using both real
and synthetic datasets to verify the effectiveness of our approach.
Checkpointing algorithms and fault prediction
This paper deals with the impact of fault prediction techniques on
checkpointing strategies. We extend the classical first-order analysis of Young
and Daly in the presence of a fault prediction system, characterized by its
recall and its precision. In this framework, we provide an optimal algorithm to
decide when to take predictions into account, and we derive the optimal value
of the checkpointing period. These results make it possible to analytically assess the key
parameters that impact the performance of fault predictors at very large scale. Comment: Supported in part by ANR Rescue. Published in Journal of Parallel and
Distributed Computing. arXiv admin note: text overlap with arXiv:1207.693
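For context on the classical first-order analysis that this abstract extends, here is a minimal sketch of Young's well-known approximation without any fault predictor: with checkpoint cost C and mean time between failures mu, the first-order waste C/T + T/(2*mu) is minimized at T = sqrt(2*C*mu). The function names and the example values are assumptions chosen for illustration. The predictor-aware period derived in the paper is not reproduced here.

```python
import math


def young_period(checkpoint_cost, mtbf):
    """Young's first-order optimal checkpointing period, sqrt(2 * C * mu)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period, checkpoint_cost, mtbf):
    """First-order fraction of time wasted with a given period.

    C / T accounts for checkpoint overhead; T / (2 * mu) accounts for the
    expected re-execution after a fault (half a period lost on average).
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


# Illustrative values (assumed): 60 s checkpoints, one fault per day on average.
C, MU = 60.0, 86400.0
best = young_period(C, MU)
waste_at_best = first_order_waste(best, C, MU)
```

Evaluating `first_order_waste` at half or double the returned period gives a larger value, which is a quick sanity check that the period sits at the minimum of the first-order model.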