Reliability-oriented resource management for High-Performance Computing

Authors: Giovanni Agosta
Alessandro Campi
Sebastian Ciesielski
William Fornaciari
Michal Kulczewski
Giuseppe Massari
Miriam Peta
Wojciech Piatek
Federico Reghenzani
Federico Terraneo
Publication date: 1 January 2023
Publisher: Elsevier BV

Abstract

Reliability is an increasingly pressing issue for High-Performance Computing systems, because failures threaten large-scale applications, for which even a single run may incur significant energy and billing costs. Currently, application developers must address reliability explicitly by integrating application-specific checkpoint/restore mechanisms. However, the application alone cannot exploit system knowledge, whereas a system-wide resource management system can. In this paper, we propose a reliability-oriented policy that can significantly increase component reliability by combining the exploitation of checkpoint/restore mechanisms with proactive resource management policies.

Available Versions

Archivio istituzionale della ricerca - Politecnico di Milano

oai:re.public.polimi.it:11311/...

Last time updated on 07/06/2023