
    Assessing general-purpose algorithms to cope with fail-stop and silent errors

    In this paper, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bi-criteria problem involving both time and energy (a linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bi-criteria problem. Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via dynamic voltage and frequency scaling (DVFS). In this latter scenario, we determine the optimal checkpointing and verification locations, as well as the optimal speed pairs for each task segment between any two consecutive checkpoints. Finally, we conduct an extensive set of simulations to support the theoretical study and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario, using intermediate verifications and different speeds.
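
    As a rough illustration of the first-order reasoning above, the sketch below numerically minimizes a simplified waste model for a pattern that ends with one guaranteed verification and one checkpoint, and compares the result with the classical Young/Daly period. The model terms (half a pattern lost per fail-stop error, a full pattern per silent error) follow the standard first-order approximation, but all parameter values are assumptions chosen for illustration, not taken from the paper.

        import math

        # Assumed illustrative parameters (seconds); not taken from the paper.
        C = 600.0                  # checkpoint cost
        V = 300.0                  # guaranteed verification cost
        LAMBDA_F = 1 / 86_400.0    # fail-stop error rate (one per day, assumption)
        LAMBDA_S = 1 / 43_200.0    # silent error rate (two per day, assumption)

        def first_order_waste(W):
            """First-order waste of a pattern of W seconds of work followed by a
            verification (V) and a checkpoint (C): the overhead V + C is paid once
            per pattern, a fail-stop error loses half a pattern on average, and a
            silent error is only caught at the verification, losing a whole pattern."""
            return (V + C) / W + LAMBDA_F * W / 2 + LAMBDA_S * W

        # Coarse grid search for the period minimizing the waste.
        best_W = min(range(60, 86_400, 60), key=first_order_waste)

        # Classical Young/Daly period, which accounts for fail-stop errors only.
        young_daly = math.sqrt(2 * C / LAMBDA_F)

        print(f"verified period ~ {best_W} s, waste {first_order_waste(best_W):.4f}")
        print(f"Young/Daly (fail-stop only) ~ {young_daly:.0f} s")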

    Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors

    In this paper, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to address both fail-stop and silent errors. The objective is to minimize either makespan or energy consumption. While DVFS is a popular approach for reducing the energy consumption, using lower speeds/voltages can increase the number of errors, thereby complicating the problem. We consider an application workflow whose dependence graph is a chain of tasks, and we study three execution scenarios: (i) a single speed is used during the whole execution; (ii) a second, possibly higher speed is used for any potential re-execution; (iii) different pairs of speeds can be used throughout the execution. For each scenario, we determine the optimal checkpointing and verification locations (and the optimal speeds for the third scenario) to minimize either objective. The different execution scenarios are then assessed and compared through an extensive set of experiments.
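
    To make the speed-pair idea concrete, here is a minimal sketch of how expected time and energy could be evaluated for a single verified task executed at a first speed s1 and re-executed at a second speed s2 if the verification detects a silent error. The cubic power model, the exponential growth of the error rate at lower speeds, and the single-re-execution assumption are simplifications introduced here for illustration; they are not the paper's exact model.

        import math

        # Assumed illustrative parameters, not taken from the paper.
        W = 3_600.0              # task work at unit speed (seconds)
        S_MAX, S_MIN = 1.0, 0.5  # admissible speed range
        LAMBDA_REF = 1e-5        # silent error rate at S_MAX (per second), assumption
        SENSITIVITY = 2.0        # error-rate growth at lower speeds, assumption

        def error_prob(speed, work=W):
            """Probability that a silent error strikes the execution (Poisson model),
            with a rate that grows exponentially as the speed decreases (assumed)."""
            rate = LAMBDA_REF * 10 ** (SENSITIVITY * (S_MAX - speed) / (S_MAX - S_MIN))
            return 1.0 - math.exp(-rate * work / speed)

        def expected_time_energy(s1, s2):
            """Expected time and dynamic energy when the task runs at s1 and is
            re-executed once at s2 if the verification detects a silent error."""
            t1, t2 = W / s1, W / s2
            p = error_prob(s1)
            time = t1 + p * t2
            energy = s1 ** 3 * t1 + p * s2 ** 3 * t2  # power ~ speed^3 (assumption)
            return time, energy

        for s1, s2 in [(1.0, 1.0), (0.7, 1.0), (0.7, 0.7)]:
            t, e = expected_time_energy(s1, s2)
            print(f"s1={s1:.1f}, s2={s2:.1f}: expected time {t:.0f}, energy {e:.0f}")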

    Two-Level Checkpointing and Verifications for Linear Task Graphs

    Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience techniques must accommodate both error sources. To cope with the double challenge, a two-level checkpointing and rollback recovery approach can be used, with additional verifications for silent error detection. A fail-stop error leads to the loss of the whole memory content, hence the obligation to checkpoint on a stable storage (e.g., an external disk). On the contrary, it is possible to use in-memory checkpoints for silent errors, which provide a much smaller checkpointing and recovery overhead. Furthermore, recent detectors offer partial verification mechanisms that are less costly than the guaranteed ones but do not detect all silent errors. In this paper, we show how to combine all of these techniques for HPC applications whose dependency graph forms a linear chain. We present a sophisticated dynamic programming algorithm that returns the optimal solution in polynomial time. Simulation results demonstrate that the combined use of multi-level checkpointing and verifications leads to improved performance compared to the standard single-level checkpointing algorithm.
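
    The dynamic programming flavor can be illustrated with a much simpler, single-level variant: choose after which tasks of the chain to place a verified checkpoint so as to minimize a first-order estimate of the expected execution time. The recurrence structure below is generic; the cost model, the parameter values, and the restriction to a single level of checkpointing are assumptions for illustration, whereas the paper's algorithm additionally interleaves in-memory checkpoints, disk checkpoints, and verifications.

        # Minimal single-level sketch (not the paper's two-level algorithm).
        # opt[j] = first-order expected time to run tasks 0..j-1 and end with a
        #          verified checkpoint placed right after task j-1.

        C, V = 30.0, 10.0       # checkpoint and verification costs (assumed)
        LAMBDA = 1e-4           # error rate per second of computation (assumed)
        tasks = [200.0, 500.0, 100.0, 800.0, 300.0, 400.0]  # task durations (assumed)

        def segment_cost(work):
            """First-order expected cost of a segment executed between two
            checkpoints and ending with a verification and a checkpoint: an error
            loses the segment, which is then re-executed once (assumed model)."""
            return (work + V) * (1 + LAMBDA * work) + C

        n = len(tasks)
        opt = [0.0] + [float("inf")] * n
        choice = [0] * (n + 1)
        for j in range(1, n + 1):
            for i in range(j):  # previous checkpoint sits right after task i-1
                cost = opt[i] + segment_cost(sum(tasks[i:j]))
                if cost < opt[j]:
                    opt[j], choice[j] = cost, i

        positions, j = [], n    # recover the chosen checkpoint positions
        while j > 0:
            positions.append(j - 1)
            j = choice[j]
        print("expected time:", round(opt[n], 1))
        print("checkpoint after tasks:", sorted(positions))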

    Coping with silent errors in HPC applications

    This report describes a unified framework for the detection and correction of silent errors, which constitute a major threat for scientific applications at extreme scale. We first motivate the problem and explain why checkpointing must be combined with some verification mechanism. Then we introduce a general-purpose technique based upon computational patterns that periodically repeat over time. These patterns interleave verifications and checkpoints, and we show how to determine the pattern minimizing expected execution time. Then we move to application-specific techniques and review dynamic programming algorithms for linear chains of tasks, as well as ABFT-oriented algorithms for iterative methods in sparse linear algebra.
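
    For the simplest pattern, with a single guaranteed verification followed by a checkpoint, the first-order analysis can be written compactly. The waste model below (half a pattern lost per fail-stop error, a full pattern per silent error) is the standard first-order approximation and is given here only as an illustration of the kind of formula such frameworks derive:

        \[
          \mathrm{Waste}(W) \;\approx\; \frac{V + C}{W} \;+\; \frac{\lambda_f W}{2} \;+\; \lambda_s W,
          \qquad
          W^{*} \;=\; \sqrt{\frac{V + C}{\lambda_f / 2 + \lambda_s}},
        \]

    where C is the checkpoint cost, V the verification cost, and λ_f, λ_s the fail-stop and silent error rates; setting λ_s = 0 and V = 0 recovers the classical Young/Daly period W* = √(2C/λ_f).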

    Two-level checkpointing and partial verifications for linear task graphs

    Fail-stop and silent errors are unavoidable on large-scale platforms. Efficient resilience techniques must accommodate both error sources. A traditional checkpointing and rollback recovery approach can be used, with added verifications to detect silent errors. A fail-stop error leads to the loss of the whole memory content, hence the obligation to checkpoint on a stable storage (e.g., an external disk). On the contrary, it is possible to use in-memory checkpoints for silent errors, which provide a much smaller checkpoint and recovery overhead. Furthermore, recent detectors offer partial verification mechanisms, which are less costly than guaranteed verifications but do not detect all silent errors. In this paper, we show how to combine all these techniques for HPC applications whose dependence graph is a chain of tasks, and provide a sophisticated dynamic programming algorithm returning the optimal solution in polynomial time. Simulations demonstrate that the combined use of multi-level checkpointing and partial verifications further improves performance.
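
    A small simulation makes the appeal of partial verifications tangible: a cheap detector with imperfect recall placed in the middle of a pattern can catch many silent errors early and shorten the rollback. Everything below (the pattern structure, the costs, the error rate, and the recall) is an assumption chosen for illustration, not the setting evaluated in the paper.

        import math
        import random

        random.seed(0)

        # Assumed illustrative parameters, not taken from the paper.
        W = 4_000.0              # work per pattern (s)
        C = 60.0                 # in-memory checkpoint cost (s)
        V_STAR, V = 30.0, 3.0    # guaranteed vs. partial verification cost (s)
        RECALL = 0.8             # fraction of silent errors the partial detector catches
        LAMBDA_S = 1e-4          # silent error rate (per second of computation)

        def hit(work):
            """True if at least one silent error strikes `work` seconds of computation."""
            return random.random() < 1.0 - math.exp(-LAMBDA_S * work)

        def pattern_time(use_partial):
            """Simulate one pattern until it passes the final guaranteed verification."""
            t = 0.0
            while True:
                corrupted = hit(W / 2)
                if use_partial:
                    t += W / 2 + V
                    if corrupted and random.random() < RECALL:
                        continue             # caught early: restart the pattern
                    corrupted = corrupted or hit(W / 2)
                    t += W / 2 + V_STAR
                else:
                    corrupted = corrupted or hit(W / 2)
                    t += W + V_STAR
                if not corrupted:
                    return t + C             # validated: checkpoint and move on

        for use_partial in (False, True):
            runs = [pattern_time(use_partial) for _ in range(20_000)]
            label = "with a partial verification" if use_partial else "guaranteed verification only"
            print(f"{label}: {sum(runs) / len(runs) / W:.3f} time units per unit of work")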

    The Young/Daly checkpointing period is not optimal for executing task graphs

    This paper revisits checkpointing strategies when workflows composed of multiple tasks execute on a parallel platform. The objective is to minimize the expectation of the total execution time. For a single task, the Young/Daly formula provides the optimal checkpointing period. However, when many tasks execute simultaneously, the risk that one of them is severely delayed increases with the number of tasks. To mitigate this risk, a possibility is to checkpoint each task more often than with the Young/Daly strategy. But is it worth slowing each task down with extra checkpoints? Does the extra checkpointing make a difference globally? This paper answers these questions. On the theoretical side, we prove several negative results for keeping the Young/Daly period when many tasks execute concurrently, and we design novel checkpointing strategies that guarantee an efficient execution with high probability. On the practical side, we report comprehensive experiments that demonstrate the need to go beyond the Young/Daly period and to checkpoint more often, for a wide range of application/platform settings.
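
    The core risk argument can be quantified in a few lines: under independent exponential failures, the probability that at least one of n concurrently running tasks is struck during a window of length T is 1 - e^(-nT/MTBF), which approaches 1 quickly as n grows. The MTBF and window length below are assumptions chosen for illustration.

        import math

        MTBF = 86_400.0   # per-task mean time between failures (s), assumption
        T = 3_600.0       # window of interest (s), assumption

        for n in (1, 10, 100, 1_000):
            p = 1.0 - math.exp(-n * T / MTBF)
            print(f"n={n:>5}: P(at least one failure within {T/3600:.0f} h) = {p:.3f}")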

    Checkpointing strategies to protect parallel jobs against errors with general distributions

    This paper studies checkpointing strategies for parallel jobs subject to fail-stop errors. The optimal strategy is well known when failure inter-arrival times obey an Exponential law, but it is unknown for non-memoryless failure distributions. We explain why the latter fact is misunderstood in recent literature. We propose a general strategy that maximizes the expected efficiency until the next failure, and we show that this strategy is asymptotically optimal for very long jobs. Through extensive simulations, we show that the new strategy is always at least as good as the Young/Daly strategy for various failure distributions. For distributions with a high infant mortality (such as LogNormal 2.51 or Weibull 0.5), the execution time is divided by a factor 1.9 on average, and up to a factor 4.2 for recently deployed platforms.
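
    One plausible way to instantiate "expected efficiency until the next failure" is sketched below: with checkpoints taken every W seconds at cost C, the useful work saved by the time a failure strikes at time t is floor(t / (W + C)) * W, and the efficiency of a period is the ratio of expected saved work to expected elapsed time. The Weibull parameters, the candidate periods, and this particular instantiation of the criterion are assumptions for illustration, not the paper's exact definitions.

        import random

        random.seed(0)

        C = 300.0                      # checkpoint cost (s), assumption
        SHAPE, SCALE = 0.5, 86_400.0   # infant-mortality Weibull (shape < 1), assumption
        # random.weibullvariate takes the scale first, then the shape.
        samples = [random.weibullvariate(SCALE, SHAPE) for _ in range(100_000)]

        def efficiency(W):
            saved = sum((t // (W + C)) * W for t in samples)
            return saved / sum(samples)

        candidates = [900, 1_800, 3_600, 7_200, 14_400, 28_800]
        best = max(candidates, key=efficiency)
        for W in candidates:
            mark = "  <-- best candidate" if W == best else ""
            print(f"period {W:>6} s: efficiency {efficiency(W):.3f}{mark}")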

    Checkpointing of parallel applications in a Grid environment

    The Grid environment is generic, heterogeneous, and dynamic, with many unreliable resources, which makes it highly exposed to failures. The environment is unreliable because it is geographically dispersed, involves multiple autonomous administrative domains, and is composed of a large number of components. Examples of failures in the Grid environment include application crashes, Grid node crashes, network failures, and Grid system component failures. These failures can affect the execution of parallel/distributed applications in the Grid environment, so protection against them is crucial. It is therefore essential to develop efficient fault-tolerance mechanisms that allow users to execute Grid applications successfully. One of the research challenges in Grid computing is to develop a fault-tolerant solution that ensures Grid applications are executed reliably with minimum overhead. While checkpointing is the most common method to achieve fault tolerance, there is still much work to be done to improve the efficiency of the mechanism. This thesis provides an in-depth description of a novel solution for checkpointing parallel applications executed on a Grid. The checkpointing mechanism implemented makes it possible to checkpoint an application at regions where no interprocess communication is involved, thereby reducing the checkpointing overhead and the checkpoint size.
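
    The idea of checkpointing only at communication-free regions can be sketched in a few lines: a checkpoint is taken only at iteration boundaries, after every outstanding exchange has completed, so that no in-flight message has to be captured and only quiescent local state is saved. The loop below is a schematic stand-in for a real parallel application; the state layout, the checkpoint format, and the exchange_halos placeholder are hypothetical.

        import pickle

        def exchange_halos(state):
            """Placeholder for the interprocess communication phase of one iteration.
            In a real run this would post and complete all sends/receives, so that no
            message is in flight once it returns (hypothetical stand-in)."""
            return state

        def checkpoint(state, step, path="ckpt.pkl"):
            # Only local, quiescent state is saved: no communication channels to capture.
            with open(path, "wb") as f:
                pickle.dump({"step": step, "state": state}, f)

        state, CKPT_EVERY = {"x": [0.0] * 8}, 5
        for step in range(1, 21):
            state = exchange_halos(state)               # communication phase
            state["x"] = [v + 1.0 for v in state["x"]]  # local computation phase
            if step % CKPT_EVERY == 0:
                checkpoint(state, step)                 # communication-free region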