    Towards resilient EU HPC systems: A blueprint

    This document aims to spearhead a Europe-wide discussion on HPC system resilience and to help the European HPC community define best practices for resilience. We analyse a wide range of state-of-the-art resilience mechanisms and recommend the most effective approaches to employ in large-scale HPC systems. Our guidelines will be useful in allocating available resources and in guiding researchers and research funding towards enhancing the resilience approaches with the highest priority and utility. Although our work is focused on the needs of next-generation HPC systems in Europe, the principles and evaluations are applicable globally.

    Reliable power and time-constraints-aware predictive management of heterogeneous exascale systems

    The transition to Exascale computing is going to be characterised by an increased range of application classes. In addition to traditional massively parallel "number crunching" applications, new classes are emerging, such as real-time HPC and data-intensive scalable computing. Furthermore, Exascale computing is characterised by a "democratisation" of HPC: to fully exploit the capabilities of Exascale-level facilities, HPC is moving towards opening its resources to a wider range of new players, including SMEs, through cloud-based approaches [1]. Finally, the need for much higher energy efficiency is pushing towards deep heterogeneity, widening the range of options for acceleration: from the traditional CPU-only organization, to the CPU plus GPU organization that currently dominates the Green500, to more complex options including programmable accelerators and even (reconfigurable) hardware accelerators [2].

    Challenges in deeply heterogeneous high performance systems

    RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims at introducing a hierarchical runtime resource management infrastructure to optimize energy efficiency and minimize the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computations that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project, which span run-time management, heterogeneous computing architectures, HPC memory/interconnection infrastructures, thermal modelling, reliability, programming models, and timing analysis. For each of these areas, the paper describes the relevant state of the art as well as the specific actions that the project will take to effectively address the identified technological challenges.

    The RECIPE approach to challenges in deeply heterogeneous high performance systems

    RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims at introducing a hierarchical runtime resource management infrastructure to optimize energy efficiency and minimize the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computations that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project. In particular, the need for predictive reliability approaches that maximize hardware lifetime and guarantee application performance is identified as the key concern for RECIPE. We address it through hierarchical resource management of the heterogeneous architectural components of the system, driven by estimates of application latency and hardware reliability. These estimates are obtained, respectively, through timing analysis and through modeling of the thermal properties and mean time to failure of subsystems. We show the impact of prediction accuracy on the overheads imposed by the checkpointing policy, as well as a possible application to a weather forecasting use case.
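
The link between mean-time-to-failure prediction accuracy and checkpointing overhead mentioned in the abstract above can be illustrated with a minimal sketch. It assumes Young's classical first-order approximation for the checkpoint interval, which is a textbook model chosen here for illustration and is not stated to be the policy used in RECIPE. All function names and the numeric values (60 s checkpoint cost, 24 h MTTF) are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mttf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    Assumption: an illustrative textbook model, not the RECIPE policy.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mttf_s)


def expected_overhead(interval_s: float, checkpoint_cost_s: float, mttf_s: float) -> float:
    """First-order fraction of time lost.

    The first term is the cost of writing checkpoints; the second is the
    expected rework after a failure (half an interval on average).
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mttf_s)


def overhead_with_predicted_mttf(
    checkpoint_cost_s: float, true_mttf_s: float, predicted_mttf_s: float
) -> float:
    """Overhead actually paid when the interval is tuned to a predicted MTTF."""
    interval = young_interval(checkpoint_cost_s, predicted_mttf_s)
    return expected_overhead(interval, checkpoint_cost_s, true_mttf_s)


if __name__ == "__main__":
    cost, true_mttf = 60.0, 24 * 3600.0  # hypothetical values
    for factor in (0.25, 0.5, 1.0, 2.0, 4.0):
        overhead = overhead_with_predicted_mttf(cost, true_mttf, factor * true_mttf)
        print(f"predicted/true MTTF = {factor:4.2f} -> overhead = {overhead:.4f}")
```

Under this simplified model the overhead is smallest when the predicted MTTF equals the true one, and it grows symmetrically (in the ratio) when the prediction over- or underestimates it, which is one way to read the sensitivity to prediction accuracy that the abstract refers to.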