32 research outputs found

    Multi-criteria checkpointing strategies: response-time versus resource utilization

    Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its own limits at such scales. One of the reasons explaining this unnerving situation is the focus that has been given to per-application completion time, rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting for recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, the per-application completion time of uncoordinated checkpointing is unchanged, while it delivers near-perfect platform efficiency.
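
    The key quantity in this abstract is platform efficiency: the fraction of processor-hours spent on useful work rather than sitting idle or redoing lost computation. As a rough illustration of why backfilling the recovery idle time with an independent job matters, here is a minimal sketch under made-up assumptions (the processor count and durations are ours, and this is not the paper's extended model or simulator).

```python
# Illustrative sketch only: a toy platform-efficiency calculation under
# made-up durations, not the extended model or simulator from the paper.

def platform_efficiency(total_time, n_procs, recovery_time, n_recovering,
                        backfill_idle):
    """Fraction of processor-hours spent on useful work.

    During a recovery of length `recovery_time`, `n_recovering` processors
    redo lost work while the remaining processors either sit idle
    (classical uncoordinated rollback recovery) or progress an independent
    job taken from the batch queue (the backfilling strategy).
    """
    total_hours = total_time * n_procs
    wasted_hours = recovery_time * n_recovering           # re-executed work
    idle_hours = recovery_time * (n_procs - n_recovering)
    useful = total_hours - wasted_hours - (0 if backfill_idle else idle_hours)
    return useful / total_hours

if __name__ == "__main__":
    args = dict(total_time=100.0, n_procs=1000, recovery_time=10.0,
                n_recovering=50)
    print("idle time lost  :", platform_efficiency(**args, backfill_idle=False))
    print("idle time reused:", platform_efficiency(**args, backfill_idle=True))
```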

    Revisiting credit distribution algorithms for distributed termination detection

    This paper revisits distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages.

    Comparing Distributed Termination Detection Algorithms for Task-Based Runtime Systems on HPC platforms

    This paper revisits distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages. We then compare the implementation of these algorithms over a task-based runtime system, PaRSEC, and show the advantages and limitations of each approach on a practical implementation.
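
    The credit-distribution idea behind HCDA and CDA can be summarized as follows: the initial task holds a credit of 1, every spawned task receives a share of its parent's credit, each completed task returns its share to a controller, and termination is detected once the controller has collected the full credit back. The sketch below illustrates that generic scheme in a sequential toy form; it is our own simplification, not the paper's optimized CDA variant, and real implementations return credits asynchronously through control messages.

```python
# Toy, sequential illustration of credit-distribution termination detection
# (the general scheme behind HCDA), not the paper's CDA variant.
# Exact rational credits avoid floating-point loss.
from collections import deque
from fractions import Fraction

def run_until_termination(spawn_counts):
    """`spawn_counts` maps a task id to the number of children it spawns."""
    collected = Fraction(0)                  # credit returned to the controller
    ready = deque([("root", Fraction(1))])   # the root task holds credit 1
    next_id = 0
    while ready:
        task, credit = ready.popleft()
        children = spawn_counts.get(task, 0)
        if children:
            share = credit / (children + 1)  # keep one share, give one per child
            credit = share
            for _ in range(children):
                ready.append((f"t{next_id}", share))
                next_id += 1
        collected += credit                  # task completes: return its credit
    assert collected == 1                    # all credit recovered: termination
    return collected

if __name__ == "__main__":
    print(run_until_termination({"root": 3, "t0": 2, "t2": 1}))
```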

    Revisiting Credit Distribution Algorithms for Distributed Termination Detection

    This paper revisits distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages.

    Automatic fault tolerance through checkpointing and rollback in high-performance message-passing systems

    The growing number of components in high-performance architectures raises reliability issues: the mean time between failures is now less than 10 hours. One solution to ensure that a distributed numerical computation progresses despite failures is to periodically save checkpoints. However, the state of each process depends on non-deterministic network events, so a fault tolerance protocol must guarantee that a consistent global state can be restored from a set of checkpoints. Our work studies automatic checkpoint-based fault tolerance mechanisms for message-passing applications using the MPI standard. We present a software environment designed to express fault tolerance algorithms and compare them fairly in a uniform testbed. In this environment we express several fault tolerance protocols (including two original ones): one based on coordinated checkpoints, two based on pessimistic message logging, and three based on causal message logging. We compare them experimentally, identifying a fault frequency beyond which message-logging protocols perform better than coordinated ones. Finally, we describe a model of the pessimistic protocol adapted to very high-speed networks, where the cost of intermediate memory copies is highly penalizing, and we report the performance of an implementation of this protocol.
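
    The pessimistic message-logging protocols mentioned above follow one rule: a message is written to stable storage, together with its delivery order, before it is delivered, so that a restarted process can deterministically replay its execution. The sketch below illustrates that rule in a toy, single-process form; the class names and log format are illustrative assumptions, not the thesis's protocol or MPI implementation, and the synchronous log write is exactly the copy overhead that becomes costly on very high-speed networks.

```python
# Minimal sketch of pessimistic message logging: every message is logged
# synchronously, together with its delivery order, before being delivered.
# A recovering process replays the log to rebuild its state.
# Names and structures here are illustrative assumptions only.

class PessimisticLogger:
    def __init__(self):
        self.stable_log = []          # stands in for stable storage

    def log_and_deliver(self, process, msg):
        # Pessimistic: the log write completes before delivery.
        self.stable_log.append((process.rank, process.next_seq, msg))
        process.deliver(msg)

class Process:
    def __init__(self, rank):
        self.rank = rank
        self.next_seq = 0
        self.state = 0

    def deliver(self, msg):
        self.state += msg             # toy deterministic state update
        self.next_seq += 1

    def recover(self, logger):
        # Restart from the initial state and replay logged messages in order.
        self.state, self.next_seq = 0, 0
        for rank, _, msg in logger.stable_log:
            if rank == self.rank:
                self.deliver(msg)

if __name__ == "__main__":
    logger, p = PessimisticLogger(), Process(rank=0)
    for m in (3, 5, 7):
        logger.log_and_deliver(p, m)
    saved = p.state
    p.recover(logger)                 # simulate a failure and restart
    assert p.state == saved
    print("state after replay:", p.state)
```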

    Coordinated checkpoint versus message log for fault tolerant MPI


    Hybrid preemptive scheduling of Message Passing Interface applications on grids

    Time sharing between cluster resources in a grid is a major issue in cluster and grid integration. Classical grid architecture involves a higher-level scheduler which submits non-overlapping jobs to the independent batch schedulers of each cluster of the grid. The sequentiality induced by this approach does not fit the expected number of users and the job heterogeneity of grids. Time-sharing techniques address this issue by allowing simultaneous executions of many applications on the same resources. Co-scheduling and gang scheduling are the two best-known techniques for time sharing cluster resources. Co-scheduling relies on the operating system of each node to schedule the processes of every application. Gang scheduling ensures that the same application is scheduled on all nodes simultaneously.
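
    The contrast between the two techniques can be illustrated with a toy model: a gang scheduler makes one global decision per time slice, so every node runs a process of the same application and communicating peers are never descheduled, whereas co-scheduling lets each node's operating system pick independently. The sketch below is our own illustration under arbitrary assumptions, not the paper's scheduler.

```python
# Toy contrast between gang scheduling and co-scheduling on a small "grid".
# Purely illustrative assumptions; this is not the paper's scheduler.
import random

NODES, APPS, SLICES = 4, ["A", "B", "C"], 6

def gang_schedule():
    """One global decision per time slice: all nodes run the same application."""
    return [[APPS[t % len(APPS)]] * NODES for t in range(SLICES)]

def co_schedule(seed=0):
    """Each node's OS picks independently in every time slice."""
    rng = random.Random(seed)
    return [[rng.choice(APPS) for _ in range(NODES)] for _ in range(SLICES)]

def aligned_slices(plan):
    """Slices in which all nodes run the same application, i.e. no process
    waits on a communication peer that is currently descheduled."""
    return sum(len(set(slice_)) == 1 for slice_ in plan)

if __name__ == "__main__":
    print("gang:", aligned_slices(gang_schedule()), "/", SLICES, "aligned slices")
    print("co  :", aligned_slices(co_schedule()), "/", SLICES, "aligned slices")
```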