32 research outputs found

    Multi-criteria checkpointing strategies: response-time versus resource utilization

    Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its own limits at such scales. One of the reasons explaining this unnerving situation is the focus that has been given to per-application completion time, rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting for recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, the per-application completion time of uncoordinated checkpointing is unchanged, while it delivers near-perfect platform efficiency.
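
    The key quantity in this abstract is platform efficiency: the fraction of processor-hours spent on useful work rather than sitting idle or redoing lost computation. As a rough illustration of why backfilling the recovery idle time with an independent job matters, here is a minimal sketch under made-up assumptions (the processor count and durations are ours, and this is not the paper's extended model or simulator).

```python
# Illustrative sketch only: a toy platform-efficiency calculation under
# made-up durations, not the extended model or simulator from the paper.

def platform_efficiency(total_time, n_procs, recovery_time, n_recovering,
                        backfill_idle):
    """Fraction of processor-hours spent on useful work.

    During a recovery of length `recovery_time`, `n_recovering` processors
    redo lost work while the remaining processors either sit idle
    (classical uncoordinated rollback recovery) or progress an independent
    job taken from the batch queue (the backfilling strategy).
    """
    total_hours = total_time * n_procs
    wasted_hours = recovery_time * n_recovering           # re-executed work
    idle_hours = recovery_time * (n_procs - n_recovering)
    useful = total_hours - wasted_hours - (0 if backfill_idle else idle_hours)
    return useful / total_hours

if __name__ == "__main__":
    args = dict(total_time=100.0, n_procs=1000, recovery_time=10.0,
                n_recovering=50)
    print("idle time lost  :", platform_efficiency(**args, backfill_idle=False))
    print("idle time reused:", platform_efficiency(**args, backfill_idle=True))
```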

    Revisiting credit distribution algorithms for distributed termination detection

    This paper revisits distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages.

    Comparing Distributed Termination Detection Algorithms for Task-Based Runtime Systems on HPC platforms

    This paper revisits distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages. We then compare the implementation of these algorithms over a task-based runtime system, PaRSEC, and show the advantages and limitations of each approach on a practical implementation.
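
    The credit-distribution idea behind HCDA and CDA can be summarized as follows: the initial task holds a credit of 1, every spawned task receives a share of its parent's credit, each completed task returns its share to a controller, and termination is detected once the controller has collected the full credit back. The sketch below illustrates that generic scheme in a sequential toy form; it is our own simplification, not the paper's optimized CDA variant, and real implementations return credits asynchronously through control messages.

```python
# Toy, sequential illustration of credit-distribution termination detection
# (the general scheme behind HCDA), not the paper's CDA variant.
# Exact rational credits avoid floating-point loss.
from collections import deque
from fractions import Fraction

def run_until_termination(spawn_counts):
    """`spawn_counts` maps a task id to the number of children it spawns."""
    collected = Fraction(0)                  # credit returned to the controller
    ready = deque([("root", Fraction(1))])   # the root task holds credit 1
    next_id = 0
    while ready:
        task, credit = ready.popleft()
        children = spawn_counts.get(task, 0)
        if children:
            share = credit / (children + 1)  # keep one share, give one per child
            credit = share
            for _ in range(children):
                ready.append((f"t{next_id}", share))
                next_id += 1
        collected += credit                  # task completes: return its credit
    assert collected == 1                    # all credit recovered: termination
    return collected

if __name__ == "__main__":
    print(run_until_termination({"root": 3, "t0": 2, "t2": 1}))
```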

    Revisiting Credit Distribution Algorithms for Distributed Termination Detection

    This paper revisits distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages.

    Automatic fault tolerance through checkpointing and rollback in high-performance message-passing systems

    The growing number of components in high-performance architectures raises reliability issues: the mean time between failures is now less than 10 hours. One solution to ensure that a distributed numerical computation progresses despite failures is to periodically save checkpoints. However, the state of each process depends on non-deterministic network events, so a fault tolerance protocol must guarantee that a consistent global state can be restored from a set of checkpoints. Our work studies automatic checkpoint-based fault tolerance mechanisms for message-passing applications using the MPI standard. We present a software environment designed to express fault tolerance algorithms and compare them fairly in a uniform testbed. In this environment we express several fault tolerance protocols (including two original ones): one based on coordinated checkpoints, two based on pessimistic message logging, and three based on causal message logging. We compare them experimentally, identifying a fault frequency beyond which message-logging protocols perform better than coordinated ones. Finally, we describe a model of the pessimistic protocol adapted to very high-speed networks, where the cost of intermediate memory copies is highly penalizing, and we report the performance of an implementation of this protocol.
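
    The pessimistic message-logging protocols mentioned above follow one rule: a message is written to stable storage, together with its delivery order, before it is delivered, so that a restarted process can deterministically replay its execution. The sketch below illustrates that rule in a toy, single-process form; the class names and log format are illustrative assumptions, not the thesis's protocol or MPI implementation, and the synchronous log write is exactly the copy overhead that becomes costly on very high-speed networks.

```python
# Minimal sketch of pessimistic message logging: every message is logged
# synchronously, together with its delivery order, before being delivered.
# A recovering process replays the log to rebuild its state.
# Names and structures here are illustrative assumptions only.

class PessimisticLogger:
    def __init__(self):
        self.stable_log = []          # stands in for stable storage

    def log_and_deliver(self, process, msg):
        # Pessimistic: the log write completes before delivery.
        self.stable_log.append((process.rank, process.next_seq, msg))
        process.deliver(msg)

class Process:
    def __init__(self, rank):
        self.rank = rank
        self.next_seq = 0
        self.state = 0

    def deliver(self, msg):
        self.state += msg             # toy deterministic state update
        self.next_seq += 1

    def recover(self, logger):
        # Restart from the initial state and replay logged messages in order.
        self.state, self.next_seq = 0, 0
        for rank, _, msg in logger.stable_log:
            if rank == self.rank:
                self.deliver(msg)

if __name__ == "__main__":
    logger, p = PessimisticLogger(), Process(rank=0)
    for m in (3, 5, 7):
        logger.log_and_deliver(p, m)
    saved = p.state
    p.recover(logger)                 # simulate a failure and restart
    assert p.state == saved
    print("state after replay:", p.state)
```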

    Coordinated checkpoint versus message log for fault tolerant MPI


    Hybrid preemptive scheduling of Message Passing Interface applications on grids

    Time sharing between cluster resources in a grid is a major issue in cluster and grid integration. Classical grid architecture involves a higher-level scheduler which submits non-overlapping jobs to the independent batch schedulers of each cluster of the grid. The sequentiality induced by this approach does not fit the expected number of users and the job heterogeneity of grids. Time-sharing techniques address this issue by allowing simultaneous executions of many applications on the same resources. Co-scheduling and gang scheduling are the two best-known techniques for time sharing cluster resources. Co-scheduling relies on the operating system of each node to schedule the processes of every application. Gang scheduling ensures that the same application is scheduled on all nodes simultaneously.
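
    The contrast between the two techniques can be illustrated with a toy model: a gang scheduler makes one global decision per time slice, so every node runs a process of the same application and communicating peers are never descheduled, whereas co-scheduling lets each node's operating system pick independently. The sketch below is our own illustration under arbitrary assumptions, not the paper's scheduler.

```python
# Toy contrast between gang scheduling and co-scheduling on a small "grid".
# Purely illustrative assumptions; this is not the paper's scheduler.
import random

NODES, APPS, SLICES = 4, ["A", "B", "C"], 6

def gang_schedule():
    """One global decision per time slice: all nodes run the same application."""
    return [[APPS[t % len(APPS)]] * NODES for t in range(SLICES)]

def co_schedule(seed=0):
    """Each node's OS picks independently in every time slice."""
    rng = random.Random(seed)
    return [[rng.choice(APPS) for _ in range(NODES)] for _ in range(SLICES)]

def aligned_slices(plan):
    """Slices in which all nodes run the same application, i.e. no process
    waits on a communication peer that is currently descheduled."""
    return sum(len(set(slice_)) == 1 for slice_ in plan)

if __name__ == "__main__":
    print("gang:", aligned_slices(gang_schedule()), "/", SLICES, "aligned slices")
    print("co  :", aligned_slices(co_schedule()), "/", SLICES, "aligned slices")
```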