6 research outputs found

    Towards resilient EU HPC systems: A blueprint

    This document aims to spearhead a Europe-wide discussion on HPC system resilience and to help the European HPC community define best practices for resilience. We analyse a wide range of state-of-the-art resilience mechanisms and recommend the most effective approaches to employ in large-scale HPC systems. Our guidelines will be useful in the allocation of available resources, as well as guiding researchers and research funding towards the enhancement of resilience approaches with the highest priority and utility. Although our work is focused on the needs of next generation HPC systems in Europe, the principles and evaluations are applicable globally. This work has received funding from the European Union’s Horizon 2020 research and innovation programme under the projects ECOSCALE (grant agreement No 671632), EPI (grant agreement No 826647), EuroEXA (grant agreement No 754337), Eurolab4HPC (grant agreement No 800962), EVOLVE (grant agreement No 825061), EXA2PRO (grant agreement No 801015), ExaNest (grant agreement No 671553), ExaNoDe (grant agreement No 671578), EXDCI-2 (grant agreement No 800957), LEGaTO (grant agreement No 780681), MB2020 (grant agreement No 779877), RECIPE (grant agreement No 801137) and SDK4ED (grant agreement No 780572). The work was also supported by the European Commission’s Seventh Framework Programme under the projects CLERECO (grant agreement No 611404), the NCSA-Inria-ANL-BSC-JSC-Riken-UTK Joint-Laboratory for Extreme Scale Computing – JLESC (https://jlesc.github.io/), OMPI-X project (No ECP-2.3.1.17) and the Spanish Government through Severo Ochoa programme (SEV-2015-0493). This work was sponsored in part by the U.S. Department of Energy's Office of Advanced Scientific Computing Research, program managers Robinson Pino and Lucy Nowell. This manuscript has been authored by UT-Battelle, LLC under Contract No DE-AC05-00OR22725 with the U.S. Department of Energy. Preprin

    Reliable power and time-constraints-aware predictive management of heterogeneous exascale systems

    The transition to Exascale computing is going to be characterised by an increased range of application classes. In addition to traditional massively parallel "number crunching" applications, new classes are emerging such as real-time HPC and data-intensive scalable computing. Furthermore, Exascale computing is characterised by a "democratisation" of HPC: to fully exploit the capabilities of Exascale-level facilities, HPC is moving towards enabling access to its resources to a wider range of new players, including SMEs, through cloud-based approaches [1]. Finally, the need for much higher energy efficiency is pushing towards deep heterogeneity, widening the range of options for acceleration, moving from the traditional CPU-only organization, to the CPU plus GPU which currently dominates the Green500, to more complex options including programmable accelerators and even (reconfigurable) hardware accelerators [2].

    Challenges in deeply heterogeneous high performance systems

    © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
    RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims at introducing a hierarchical runtime resource management infrastructure to optimize energy efficiency and minimize the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computations that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project, which span run-time management, heterogeneous computing architectures, HPC memory/interconnection infrastructures, thermal modelling, reliability, programming models, and timing analysis. For each of these areas, the paper describes the relevant state of the art as well as the specific actions that the project will take to effectively address the identified technological challenges. Peer Reviewe

    The RECIPE approach to challenges in deeply heterogeneous high performance systems

    RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims at introducing a hierarchical runtime resource management infrastructure to optimize energy efficiency and minimize the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computations that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project. In particular, the need for predictive reliability approaches to maximize hardware lifetime and guarantee application performance is identified as the key concern for RECIPE. We address it through hierarchical resource management of the heterogeneous architectural components of the system, driven by estimates of the application latency and hardware reliability obtained, respectively, through timing analysis and through modeling of the thermal properties and mean-time-to-failure of subsystems. We show the impact of prediction accuracy on the overheads imposed by the checkpointing policy, as well as a possible application to a weather forecasting use case.
    The activities described in this article received funding from the European Union's Horizon 2020 research and innovation programme under the FETHPC grant agreement no. 801137 RECIPE: REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems.
    Agosta, G.; Fornaciari, W.; Atienza, D.; Canal, R.; Cilardo, A.; Flich Cardo, J.; Hernández Luz, C.... (2020). The RECIPE approach to challenges in deeply heterogeneous high performance systems. Microprocessors and Microsystems. 77:1-13. https://doi.org/10.1016/j.micpro.2020.103185