Cost-aware prediction of uncorrected DRAM errors in the field

Ayguadé Parra, Eduard; Bartolomé Rodríguez, Javier; Boixaderas Coderch, Isaac; Carpenter, Paul Matthew; Casas Guix, Marc; Moré Codina, Sergi; Radojković, Petar; Vicente Dorca, David; Živanovič, Darko

Publication date: 1 January 2020
Publisher: 'Institute of Electrical and Electronics Engineers (IEEE)'

Abstract

This paper presents and evaluates a method to predict uncorrected DRAM errors, a leading cause of hardware failures in large-scale HPC clusters. The method uses a random forest classifier, which was trained and evaluated using error logs from two years of production of the MareNostrum 3 supercomputer. By enabling the system to take measures to mitigate node failures, our method reduces lost compute time by up to 57%, a net saving of 21,000 node–hours per year. We release all source code as open source. We also discuss and clarify aspects of methodology that are essential for a DRAM error prediction method to be useful in practice. We explain why standard evaluation metrics, such as precision and recall, are insufficient, and base the evaluation on a cost–benefit analysis. This methodology can help ensure that any DRAM error predictor is free from training bias and has a clear cost–benefit calculation.

This work was supported by the Spanish Ministry of Science and Technology (project PID2019-107255GB), Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and the European Union’s Horizon 2020 research and innovation programme and EuroEXA project (grant agreement No 754337). Paul Carpenter and Marc Casas hold the Ramón y Cajal fellowship under contracts RYC2018-025628-I and RYC2017-23269, respectively, of the Ministry of Economy and Competitiveness of Spain.

Peer Reviewed

Postprint (author's final draft)
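As an illustration of why the abstract argues that precision and recall alone are insufficient, the following is a minimal sketch of a cost–benefit calculation. It is not the paper's actual cost model: the `Outcome` type, the function names, the assumption that every correctly predicted error avoids its full loss, and all numbers are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Hypothetical confusion counts for an uncorrected-error predictor."""
    true_positives: int   # errors that were predicted in time
    false_positives: int  # mitigations triggered with no error following
    false_negatives: int  # errors that the predictor missed


def precision(o: Outcome) -> float:
    return o.true_positives / (o.true_positives + o.false_positives)


def recall(o: Outcome) -> float:
    return o.true_positives / (o.true_positives + o.false_negatives)


def net_saving_node_hours(o: Outcome,
                          loss_per_error: float,
                          mitigation_cost: float) -> float:
    """Node-hours saved minus node-hours spent on mitigation.

    Assumption (not from the paper): each true positive avoids the full
    loss_per_error, and every positive prediction, right or wrong,
    costs mitigation_cost node-hours.
    """
    saved = o.true_positives * loss_per_error
    spent = (o.true_positives + o.false_positives) * mitigation_cost
    return saved - spent


if __name__ == "__main__":
    o = Outcome(true_positives=40, false_positives=160, false_negatives=60)
    print(precision(o), recall(o))                 # 0.2 0.4
    print(net_saving_node_hours(o, 100.0, 2.0))    # 3600.0
    print(net_saving_node_hours(o, 100.0, 30.0))   # -2000.0
```

With identical precision (0.2) and recall (0.4), the same predictor yields a net saving when mitigation is cheap and a net loss when it is expensive, so the two metrics cannot decide on their own whether a predictor is worth deploying.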

Available Versions

UPCommons. Portal del coneixement obert de la UPC

oai:upcommons.upc.edu:2117/341...
