17 research outputs found

    An on-line algorithm for checkpoint placement

    Checkpointing is a common technique for reducing the time to recover from faults in computer systems. By saving intermediate states of programs in reliable storage, checkpointing reduces the processing time lost to faults. The length of the intervals between checkpoints affects the execution time of programs: long intervals lead to long re-processing times, while too-frequent checkpointing leads to high checkpointing overhead. In this paper we present an on-line algorithm for the placement of checkpoints. The algorithm uses on-line knowledge of the current cost of a checkpoint when deciding whether or not to place one. We show how the execution time of a program using this algorithm can be analyzed. The total execution-time overhead with the proposed algorithm is smaller than the overhead with fixed intervals. Although the proposed algorithm uses only on-line knowledge of the checkpointing cost, its behavior is close to that of the optimal off-line algorithm, which has complete knowledge of the checkpointing cost.
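
    A minimal sketch of the idea in Python (our illustration, not the paper's exact algorithm): with fault rate lam per unit of work, a fault in a segment of length t forces about t/2 of re-processing on average, so the expected loss since the last checkpoint grows like lam*t^2/2; placing a checkpoint becomes worthwhile once that expected loss exceeds the currently observed checkpoint cost. The names observe_cost and step are assumptions made for the sketch.

    def run_with_online_checkpoints(total_work, step, lam, observe_cost):
        """Simulate an execution that decides on-line when to checkpoint.

        observe_cost() returns the current (time-varying) cost of taking a
        checkpoint; lam is the fault rate per unit of work.
        """
        elapsed = 0.0
        since_ckpt = 0.0                   # work at risk since last checkpoint
        done = 0.0
        while done < total_work:
            done += step
            since_ckpt += step
            elapsed += step
            # Expected re-processing lost to a fault before the next checkpoint:
            # fault rate times work at risk, times ~half the segment re-done.
            expected_loss = lam * since_ckpt * since_ckpt / 2.0
            cost_now = observe_cost()
            if expected_loss >= cost_now:  # cheap checkpoint, or much at risk
                elapsed += cost_now
                since_ckpt = 0.0
        return elapsed

    # e.g. run_with_online_checkpoints(1000.0, 1.0, 1e-3, lambda: 5.0)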

    Tight Bounds on Online Checkpointing Algorithms

    The problem of online checkpointing is a classical problem with numerous applications which has been studied in various forms for almost 50 years. In the simplest version of this problem, a user has to maintain k memorized checkpoints during a long computation, where the only allowed operation is to move one of the checkpoints from its old time to the current time, and the goal is to keep the checkpoints as evenly spread out as possible at all times. At ICALP'13, Bringmann et al. studied this problem as a special case of an online/offline optimization problem in which the deviation from uniformity is measured by the natural discrepancy metric of the worst-case ratio between real and ideal segment lengths. They showed this discrepancy is smaller than 1.59 - o(1) for all k, and smaller than ln 4 - o(1) ≈ 1.39 for the sparse subset of k's which are powers of 2. In addition, they obtained upper bounds on the achievable discrepancy for some small values of k. In this paper we solve the main problems left open in the ICALP'13 paper by proving that ln 4 is a tight upper and lower bound on the asymptotic discrepancy for all large k, and by providing tight upper and lower bounds (in the form of provably optimal checkpointing algorithms, some of which are in fact better than those of Bringmann et al.) for all the small values of k ≤ 10.
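
    The discrepancy metric is easy to make concrete; the reading below (each of the k+1 segments measured against the ideal length t/(k+1)) is our paraphrase of the metric, and the function name is ours.

    def discrepancy(checkpoints, t):
        """Worst-case ratio of a real segment length to the ideal t/(k+1)."""
        cuts = [0.0] + sorted(checkpoints) + [float(t)]
        ideal = t / (len(checkpoints) + 1)
        return max((b - a) / ideal for a, b in zip(cuts, cuts[1:]))

    # Evenly spread checkpoints achieve discrepancy 1 at a single instant,
    # but since only one checkpoint may be moved at a time as t grows, no
    # algorithm can hold 1 forever; the paper proves ln 4 is tight for large k.
    print(discrepancy([2.5, 5.0, 7.5], 10.0))   # -> 1.0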

    CPPC: a compiler‐assisted tool for portable checkpointing of message‐passing applications

    This is the peer reviewed version of the following article: Rodríguez, G., Martín, M. J., González, P., Touriño, J. and Doallo, R. (2010), CPPC: a compiler-assisted tool for portable checkpointing of message-passing applications. Concurrency Computat.: Pract. Exper., 22: 749-766. doi:10.1002/cpe.1541, which has been published in final form at https://doi.org/10.1002/cpe.1541. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions. [Abstract] With the evolution of high-performance computing toward heterogeneous, massively parallel systems, parallel applications have developed new checkpoint and restart necessities. Whether due to a failure in the execution or to a migration of the application processes to different machines, checkpointing tools must be able to operate in heterogeneous environments. However, some of the data manipulated by a parallel application are not truly portable. Examples of these include opaque state (e.g. data structures for communications support) or diversity of interfaces for a single feature (e.g. communications, I/O). Directly manipulating the underlying ad hoc representations renders checkpointing tools unable to work on different environments. Portable checkpointers usually work around portability issues at the cost of transparency: the user must provide information such as what data need to be stored, where to store them, or where to checkpoint. CPPC (ComPiler for Portable Checkpointing) is a checkpointing tool designed to feature both portability and transparency. It is made up of a library and a compiler. The CPPC library contains routines for variable-level checkpointing, using portable code and protocols. The CPPC compiler helps to achieve transparency by relieving the user from time-consuming tasks, such as data flow and communications analyses and adding instrumentation code. This paper covers both the operation of the CPPC library and its compiler support. Experimental results using benchmarks and large-scale real applications are included, demonstrating usability, efficiency, and portability. Ministerio de Educación y Ciencia; TIN2007-67537-C03. Xunta de Galicia; 2006/
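
    A hedged illustration of variable-level checkpointing in Python (the concept only; this is not CPPC's actual API, and all names below are invented for the sketch): instead of dumping an opaque process image, the application registers just its restart-relevant variables, which are saved in a portable, text-based format so execution can resume on a different platform.

    import json

    class VariableCheckpointer:
        """Toy variable-level checkpointer with a portable on-disk format."""

        def __init__(self, path):
            self.path = path
            self.vars = {}                 # name -> (getter, setter)

        def register(self, name, getter, setter):
            # Only variables registered here are saved and restored.
            self.vars[name] = (getter, setter)

        def checkpoint(self):
            state = {n: get() for n, (get, _) in self.vars.items()}
            with open(self.path, "w") as f:
                json.dump(state, f)        # architecture-independent encoding

        def restart(self):
            with open(self.path) as f:
                for name, value in json.load(f).items():
                    self.vars[name][1](value)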

    Coping with silent errors in HPC applications

    This report describes a unified framework for the detection and correction of silent errors, which constitute a major threat to scientific applications at extreme scale. We first motivate the problem and explain why checkpointing must be combined with some verification mechanism. Then we introduce a general-purpose technique based upon computational patterns that periodically repeat over time. These patterns interleave verifications and checkpoints, and we show how to determine the pattern minimizing the expected execution time. Then we move to application-specific techniques and review dynamic programming algorithms for linear chains of tasks, as well as ABFT-oriented algorithms for iterative methods in sparse linear algebra.
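
    The simplest such pattern can be sketched with a first-order computation (our illustration of the reasoning, with made-up numbers): if a pattern performs W units of work followed by a verification of cost V and a checkpoint of cost C, and a silent error, caught at the verification, forces re-execution of the whole pattern, then with error rate lam the overhead per unit of work is roughly H(W) = (V + C)/W + lam*W, minimized when the two terms are equal.

    from math import sqrt

    def optimal_pattern(V, C, lam):
        """Work between checkpoints minimizing first-order expected overhead."""
        W = sqrt((V + C) / lam)            # balances setup cost vs. re-execution
        overhead = (V + C) / W + lam * W   # the two terms are equal at W*
        return W, overhead

    W, H = optimal_pattern(V=5.0, C=30.0, lam=1e-4)
    print(W, H)   # ~591.6 time units between checkpoints, ~11.8% overhead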

    Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors

    In this paper, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to address both fail-stop and silent errors. The objective is to minimize either makespan or energy consumption. While DVFS is a popular approach for reducing the energy consumption, using lower speeds/voltages can increase the number of errors, thereby complicating the problem. We consider an application workflow whose dependence graph is a chain of tasks, and we study three execution scenarios: (i) a single speed is used during the whole execution; (ii) a second, possibly higher speed is used for any potential re-execution; (iii) different pairs of speeds can be used throughout the execution. For each scenario, we determine the optimal checkpointing and verification locations (and the optimal speeds for the third scenario) to minimize either objective. The different execution scenarios are then assessed and compared through an extensive set of experiments.
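
    For the chain-of-tasks setting a dynamic program is the natural tool: let E[i] be the best expected time to reach a verified checkpoint right after task i; then E[i] = min over j < i of E[j] + segment_cost(j+1..i). The sketch below (scenario (i), a single speed) uses a simple first-order segment model, lam*w*w expected re-execution for a segment of work w, and is meant to show the shape of the recurrence rather than the paper's exact equations.

    def optimal_checkpoints(work, lam, V, C):
        """work[i] = duration of task i; returns (expected time, ckpt positions)."""
        n = len(work)
        E = [0.0] + [float("inf")] * n     # E[i]: checkpoint taken after task i
        prev = [0] * (n + 1)
        for i in range(1, n + 1):
            w = 0.0
            for j in range(i - 1, -1, -1): # previous checkpoint after task j
                w += work[j]               # segment now covers tasks j+1..i
                cost = E[j] + w + V + C + lam * w * w
                if cost < E[i]:
                    E[i], prev[i] = cost, j
        ckpts, i = [], n                   # walk back to recover the positions
        while i > 0:
            ckpts.append(i)
            i = prev[i]
        return E[n], sorted(ckpts)

    print(optimal_checkpoints([10.0] * 8, lam=1e-3, V=0.5, C=2.0))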

    Assessing general-purpose algorithms to cope with fail-stop and silent errors

    In this paper, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bi-criteria problem involving both time and energy (linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bi-criteria problem. Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via dynamic voltage and frequency scaling (DVFS). In this latter scenario, we determine the optimal checkpointing and verification locations, as well as the optimal speed pairs for each task segment between any two consecutive checkpoints. Finally, we conduct an extensive set of simulations to support the theoretical study, and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario using intermediate verifications and different speeds.
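
    The bi-criteria trade-off can be explored numerically. Under a common cubic power model (power ~ s^3, so dynamic energy for work w at speed s is ~ w*s^2), with a first-order checkpointing overhead and an error rate that grows as the speed drops, one can scan candidate DVFS speeds for the best linear combination of time and energy. Every modeling choice and number below is an illustrative assumption, not the paper's model.

    from math import sqrt

    def objective(s, total_work, lam_at, V, C, alpha, beta):
        """alpha*time + beta*energy at a single speed s (first-order estimates)."""
        lam = lam_at(s)                    # lower voltage -> more silent errors
        W = sqrt((V + C) / lam)            # optimal time between checkpoints
        time = (total_work / s) * (1.0 + (V + C) / W + lam * W)
        energy = total_work * s ** 2 + 0.1 * time   # dynamic + static parts
        return alpha * time + beta * energy

    speeds = [0.6, 0.8, 1.0, 1.2]
    best = min(speeds, key=lambda s: objective(
        s, total_work=1e4, lam_at=lambda s: 1e-4 / s, V=5.0, C=30.0,
        alpha=1.0, beta=0.05))
    print(best)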

    Checkpointing Strategies for Scheduling Computational Workflows

    We study the scheduling of computational workflows on compute resources that experience exponentially distributed failures. When a failure occurs, rollback and recovery is used to resume the execution from the last checkpointed state. The scheduling problem is to minimize the expected execution time by deciding in which order to execute the tasks in the workflow and deciding for each task whether to checkpoint it or not after it completes. We give a polynomial-time optimal algorithm for fork DAGs (Directed Acyclic Graphs) and show that the problem is NP-complete with join DAGs. We also investigate the complexity of the simple case in which no task is checkpointed. Our main result is a polynomial-time algorithm to compute the expected execution time of a workflow, with a given task execution order and specified to-be-checkpointed tasks. Using this algorithm as a basis, we propose several heuristics for solving the scheduling problem. We evaluate these heuristics for representative workflow configurations.
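
    The key building block of such an evaluation algorithm is the expected time of one checkpointed segment under exponentially distributed failures. A standard closed form (assumed here; the paper's exact model may differ) with failure rate lam, recovery cost R, and checkpoint cost C is E(w) = e^(lam*R) * (e^(lam*(w+C)) - 1) / lam; a task is then worth checkpointing exactly when paying C lowers the expected time of the remaining execution.

    from math import exp

    def expected_time(w, lam, C, R):
        """Expected time to finish work w and save a checkpoint of cost C,
        restarting from the previous checkpoint (recovery R) after failures."""
        return exp(lam * R) * (exp(lam * (w + C)) - 1.0) / lam

    # Is it worth checkpointing between two 100-unit tasks?
    lam, C, R = 1e-3, 5.0, 2.0
    two_segments = 2 * expected_time(100.0, lam, C, R)
    one_segment = expected_time(200.0, lam, C, R)
    print(two_segments < one_segment)      # True: the extra checkpoint pays off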

    Composing resilience techniques: ABFT, periodic and incremental checkpointing

    An electrochemical sensor is described for the determination of L-dopa (levodopa; 3,4-dihydroxyphenylalanine). An inkjet-printed carbon nanotube (IJPCNT) electrode was modified with manganese dioxide microspheres by drop-casting. The coating was characterized by field emission scanning electron microscopy, Fourier-transform infrared spectroscopy, and X-ray powder diffraction. The sensor, best operated at a working voltage of 0.3 V, has a linear response in the 0.1 to 10 µM L-dopa concentration range, a 54 nM detection limit, and excellent reproducibility, repeatability, and selectivity. The amperometric approach was applied to the determination of L-dopa in spiked biological fluids and displayed satisfactory accuracy and precision. Graphical abstract: Schematic representation of an amperometric method for the determination of L-dopa, based on the use of an inkjet-printed carbon nanotube electrode (IJPCNT) modified with manganese dioxide (MnO2).