
    Checkpointing algorithms and fault prediction

    This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical first-order analysis of Young and Daly to account for a fault prediction system, characterized by its recall and its precision. In this framework, we provide an optimal algorithm to decide when to take predictions into account, and we derive the optimal value of the checkpointing period. These results allow us to analytically assess the key parameters that impact the performance of fault predictors at very large scale. (Supported in part by ANR Rescue. Published in the Journal of Parallel and Distributed Computing. arXiv admin note: text overlap with arXiv:1207.693.)
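
    A quick reference for the baseline being extended may help: the classical Young/Daly first-order analysis gives the optimal checkpointing period from the platform MTBF and the checkpoint cost. The predictor-aware variant below, which lengthens the period because only unpredicted faults are left to periodic checkpoints, is an indicative first-order sketch, not the paper's exact expression.

    from math import sqrt

    def young_daly_period(mtbf_s, checkpoint_cost_s):
        """Classical first-order optimal period: T = sqrt(2 * mu * C)."""
        return sqrt(2.0 * mtbf_s * checkpoint_cost_s)

    def predicted_period_sketch(mtbf_s, checkpoint_cost_s, recall):
        """Rough sketch only: if a fraction `recall` of faults is predicted
        (and handled proactively), periodic checkpoints cover the remaining
        (1 - recall) faults, which lengthens the optimal period."""
        return young_daly_period(mtbf_s, checkpoint_cost_s) / sqrt(1.0 - recall)

    print(young_daly_period(mtbf_s=86_400, checkpoint_cost_s=600))   # ~10182 s
    print(predicted_period_sketch(86_400, 600, recall=0.8))          # ~22768 s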

    Sizing and Partitioning Strategies for Burst-Buffers to Reduce IO Contention

    Burst-Buffers are high-throughput, small-capacity storage devices used as intermediate storage between the Parallel File System (PFS) and the compute nodes of modern HPC systems. They can reduce contention on the PFS, a shared resource whose read and write performance increases more slowly than the processing power of HPC systems. A second usage is to accelerate data transfers and to hide the latency of the PFS. In this paper, we concentrate on the first usage. We propose a model for Burst-Buffers and application transfers. We consider the problem of dimensioning and sharing the Burst-Buffers between several applications. This dimensioning can be done either dynamically or statically. The dynamic allocation considers that any application can use any available portion of the Burst-Buffers. The static allocation considers that when a new application enters the system, it is assigned some portion of the Burst-Buffers, which cannot be used by the other applications until that application leaves the system and its data is purged. We show that the general sharing problem of guaranteeing fair performance for all applications is NP-complete. We propose a polynomial-time algorithm for the special case of finding the optimal buffer size such that no application is slowed down due to PFS contention, both in the static and dynamic cases. Finally, we provide evaluations of our algorithms in realistic settings, and use them to discuss how to minimize the overhead of the static allocation of buffers compared to the dynamic allocation.
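
    As a toy illustration of the dimensioning question (the fluid burst model and all numbers below are assumptions for the example, not the paper's model), one can ask how much buffer an application needs so that it never waits on the PFS:

    def buffer_to_avoid_slowdown(burst_rate, burst_duration, pfs_share):
        """Smallest buffer (GB) absorbing the excess of an I/O burst over the
        application's PFS bandwidth share (fluid model, illustration only)."""
        return max(0.0, (burst_rate - pfs_share) * burst_duration)

    apps = [(10.0, 30.0), (4.0, 60.0), (8.0, 20.0)]   # (burst GB/s, duration s)
    pfs_bandwidth = 12.0                              # GB/s, split equally here
    share = pfs_bandwidth / len(apps)

    # Static allocation: each application is provisioned for its own worst case.
    static_sizes = [buffer_to_avoid_slowdown(b, d, share) for b, d in apps]
    print(static_sizes, sum(static_sizes))            # per-app sizes and total

    Under dynamic sharing, the concurrent demand rather than the per-application worst cases drives the total size, which is why the two allocation policies can lead to different buffer dimensions.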

    Modeling High-throughput Applications for in situ Analytics

    With the goal of performing exascale computing, I/O management becomes more and more critical to maintaining system performance. While the computing capacities of machines keep increasing, the I/O capabilities of systems do not increase as fast. We are able to generate more data but unable to manage it efficiently due to the variability of I/O performance. Limiting the requests to the Parallel File System (PFS) becomes necessary. To address this issue, new strategies are being developed, such as online in situ analysis. The idea is to overcome the limitations of basic post-mortem data analysis, where the data have to be stored on the PFS first and processed later. Several software solutions allow users to dedicate specific nodes to data analysis and to distribute the computation tasks over different sets of nodes. Thus far, they rely on a manual partitioning and allocation of resources to tasks (simulations, analysis) by the user. In this work, we propose a memory-constrained model for in situ analysis. We use this model to derive scheduling policies that determine both the number of resources that should be dedicated to analysis functions and an efficient schedule for these functions. We evaluate them and show the importance of considering memory constraints in the model. Finally, we discuss the different challenges that have to be addressed in order to build automatic tools for in situ analytics.
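
    As a small illustration of the kind of memory-constrained decision involved (the first-fit-decreasing heuristic and the numbers are assumptions for the example, not the paper's policies), packing analysis functions onto dedicated nodes might look like:

    def first_fit_decreasing(footprints_gb, node_memory_gb):
        """Pack analysis functions, by memory footprint, onto few dedicated
        analysis nodes using the first-fit-decreasing heuristic."""
        nodes = []                                   # remaining memory per node
        for need in sorted(footprints_gb, reverse=True):
            if need > node_memory_gb:
                raise ValueError("function does not fit on any node")
            for i, free in enumerate(nodes):
                if free >= need:
                    nodes[i] = free - need
                    break
            else:
                nodes.append(node_memory_gb - need)  # open a new analysis node
        return len(nodes)

    print(first_fit_decreasing([12, 7, 5, 5, 3], node_memory_gb=16))   # 3 nodes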

    Dimensionnement de Burst-Buffers pour réduire la contention Entrées-Sorties (Sizing Burst-Buffers to Reduce I/O Contention)

    (English abstract as in the preceding entry.) We study the use of Burst-Buffers as an intermediate storage space between the compute nodes and the Parallel File System (PFS). This sizing can be static (set when an application enters the system) or dynamic (adjusted according to the I/O demands). We show that the general problem of sharing the buffers fairly between applications is NP-complete. We show that the special case of minimizing the total buffer size so that no application is slowed down can be solved in polynomial time, and we propose a linear program to solve it. Finally, we present evaluations at fixed buffer size to show the performance of some common naive algorithms.

    Improving resilience of scientific software through a domain-specific approach

    In this paper we present research on improving the resilience of the execution of scientific software, an increasingly important concern in High Performance Computing (HPC). We build on an existing high-level abstraction framework, the Oxford Parallel library for Structured meshes (OPS), developed for the solution of multi-block structured mesh-based applications, and implement an algorithm in the library to carry out checkpointing automatically, without the intervention of the user. The target applications are a hydrodynamics benchmark application from the Mantevo Suite, CloverLeaf 3D, the sparse linear solver proxy application TeaLeaf, and the OpenSBLI compressible Navier–Stokes direct numerical simulation (DNS) solver. We present (1) the basic algorithm that OPS relies on to determine the optimal checkpoint in terms of size and location, (2) improvements that supply additional information to improve the decision, (3) techniques that reduce the cost of writing the checkpoints to non-volatile storage, and (4) a performance analysis of the developed techniques on a single workstation and on several supercomputers, including ORNL's Titan. Our results demonstrate the utility of the high-level abstractions approach in automating the checkpointing process and show that performance is comparable to, or better than, the reference in all cases.
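
    As an illustrative sketch of the placement question only (the candidate points, names, and the size callback below are hypothetical and not the OPS API), choosing a checkpoint location by the amount of live data it would have to save could look like:

    def best_checkpoint_point(candidates, live_bytes):
        """Among candidate execution points, pick the one whose checkpoint
        would be smallest; live_bytes(p) returns the bytes to save at p."""
        return min(candidates, key=live_bytes)

    sizes = {"after_halo_exchange": 3_200_000_000,
             "after_reduction":     1_100_000_000,
             "end_of_timestep":     1_800_000_000}
    print(best_checkpoint_point(sizes, sizes.get))   # -> after_reduction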