Search CORE

4 research outputs found

Detecting thread clusters in high-performance computing applications

Author: Boixaderas Coderch Isaac
Publication venue: Universitat Politècnica de Catalunya
Publication date: 25/05/2015

This project proposes a way to detect whether significant differences exist among the threads involved in the execution of a high-performance computing (HPC) application, as well as an efficient algorithm for clustering the threads based on those differences.

UPCommons. Portal del coneixement obert de la UPC

Detecting thread clusters in high-performance computing applications

Author: Boixaderas Coderch Isaac
Publication venue: Universitat Politècnica de Catalunya
Publication date: 25/05/2015

Detecting thread clusters in high-performance computing applications

Author: Boixaderas Coderch Isaac
Publication venue: Universitat Politècnica de Catalunya
Publication date

RECERCAT

Cost-aware prediction of uncorrected DRAM errors in the field

Author: Ayguadé Parra Eduard
Bartolomé Rodríguez Javier
Boixaderas Coderch Isaac
Carpenter Paul Matthew
Casas Guix Marc
Moré Codina Sergi
Radojković Petar
Vicente Dorca David
Živanovič Darko
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2020

This paper presents and evaluates a method to predict DRAM uncorrected errors, a leading cause of hardware failures in large-scale HPC clusters. The method uses a random forest classifier, which was trained and evaluated using error logs from two years of production of the MareNostrum 3 supercomputer. By enabling the system to take measures to mitigate node failures, our method reduces lost compute time by up to 57%, a net saving of 21,000 node-hours per year. We release all source code as open source. We also discuss and clarify aspects of methodology that are essential for a DRAM prediction method to be useful in practice. We explain why standard evaluation metrics, such as precision and recall, are insufficient, and base the evaluation on a cost-benefit analysis. This methodology can help ensure that any DRAM error predictor is free from training bias and has a clear cost-benefit calculation.

This work was supported by the Spanish Ministry of Science and Technology (project PID2019-107255GB), Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and the European Union's Horizon 2020 research and innovation programme and EuroEXA project (grant agreement No 754337). Paul Carpenter and Marc Casas hold the Ramón y Cajal fellowship under contracts RYC2018-025628-I and RYC2017-23269, respectively, of the Ministry of Economy and Competitiveness of Spain.

Peer Reviewed. Postprint (author's final draft)

UPCommons. Portal del coneixement obert de la UPC