4 research outputs found

    Detecting thread clusters in high-performance computing applications

    This project proposes a way of detecting whether significant differences exist among the threads involved in the execution of a high-performance computing (HPC) application, as well as an efficient algorithm for clustering the threads based on those differences.

    Cost-aware prediction of uncorrected DRAM errors in the field

    This paper presents and evaluates a method to predict uncorrected DRAM errors, a leading cause of hardware failures in large-scale HPC clusters. The method uses a random forest classifier, which was trained and evaluated using error logs from two years of production of the MareNostrum 3 supercomputer. By enabling the system to take measures to mitigate node failures, our method reduces lost compute time by up to 57%, a net saving of 21,000 node-hours per year. We release all source code as open source. We also discuss and clarify aspects of methodology that are essential for a DRAM error prediction method to be useful in practice. We explain why standard evaluation metrics, such as precision and recall, are insufficient, and base the evaluation on a cost-benefit analysis. This methodology can help ensure that any DRAM error predictor is free from training bias and has a clear cost-benefit calculation.

    This work was supported by the Spanish Ministry of Science and Technology (project PID2019-107255GB), Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and the European Union's Horizon 2020 research and innovation programme and EuroEXA project (grant agreement No 754337). Paul Carpenter and Marc Casas hold the Ramon y Cajal fellowship under contracts RYC2018-025628-I and RYC2017-23269, respectively, of the Ministry of Economy and Competitiveness of Spain.

    Peer reviewed. Postprint (author's final draft).