11 research outputs found

    Towards resilient EU HPC systems: A blueprint

    Get PDF
    This document aims to spearhead a Europe-wide discussion on HPC system resilience and to help the European HPC community define best practices for resilience. We analyse a wide range of state-of-the-art resilience mechanisms and recommend the most effective approaches to employ in large-scale HPC systems. Our guidelines will be useful in the allocation of available resources, as well as guiding researchers and research funding towards the enhancement of resilience approaches with the highest priority and utility. Although our work is focused on the needs of next generation HPC systems in Europe, the principles and evaluations are applicable globally.This work has received funding from the European Union’s Horizon 2020 research and innovation programme under the projects ECOSCALE (grant agreement No 671632), EPI (grant agreement No 826647), EuroEXA (grant agreement No 754337), Eurolab4HPC (grant agreement No 800962), EVOLVE (grant agreement No 825061), EXA2PRO (grant agreement No 801015), ExaNest (grant agreement No 671553), ExaNoDe (grant agreement No 671578), EXDCI-2 (grant agreement No 800957), LEGaTO (grant agreement No 780681), MB2020 (grant agreement No 779877), RECIPE (grant agreement No 801137) and SDK4ED (grant agreement No 780572). The work was also supported by the European Commission’s Seventh Framework Programme under the projects CLERECO (grant agreement No 611404), the NCSA-Inria-ANL-BSC-JSCRiken-UTK Joint-Laboratory for Extreme Scale Computing – JLESC (https://jlesc.github.io/), OMPI-X project (No ECP-2.3.1.17) and the Spanish Government through Severo Ochoa programme (SEV-2015-0493). This work was sponsored in part by the U.S. Department of Energy's Office of Advanced Scientific Computing Research, program managers Robinson Pino and Lucy Nowell. This manuscript has been authored by UT-Battelle, LLC under Contract No DE-AC05-00OR22725 with the U.S. Department of Energy.Preprin

    Cache Vulnerability Equations for Protecting Data in Embedded Processor Caches from Soft Errors

    No full text
    Continuous technology scaling has brought us to a point, where transistors have become extremely susceptible to cosmic radiation strikes, or soft errors. Inside the processor, caches are most vulnerable to soft errors, and techniques at various levels of design abstraction, e.g., fabrication, gate design, circuit design, and microarchitecture-level, have been developed to protect data in caches. However, no work has been done to investigate the effect of code transformations on the vulnerability of data in caches. Data is vulnerable to soft errors in the cache only if it will be read by the processor, and not if it will be overwritten. Since code transformations can change the read-write pattern of program variables, they significantly effect the soft error vulnerability of program variables in the cache. We observe that often opportunity exists to significantly reduce the soft error vulnerability of cache data by trading-off a little performance. However, even if one wanted to exploit this trade-off, it is difficult, since there are no efficient techniques to estimate vulnerability of data in caches. To this end, this paper develops efficient static analysis method to estimate program vulnerability in caches, which enables the compiler to exploit the performance-vulnerability trade-offs in applications. Finally, as compared to simulation based estimation, static analysis techniques provide the insights into vulnerability calculations that provide some simple schemes to reduce program vulnerability

    Cache Vulnerability Equations for Protecting Data in Embedded Processor Caches from Soft Errors

    No full text
    Continuous technology scaling has brought us to a point, where transistors have become extremely susceptible to cosmic radiation strikes, or soft errors. Inside the processor, caches are most vulnerable to soft errors, and techniques at various levels of design abstraction, e. g., fabrication, gate design, circuit design, and microarchitecture-level, have been developed to protect data in caches. However, no work has been done to investigate the effect of code transformations on the vulnerability of data in caches. Data is vulnerable to soft errors in the cache only if it will be read by the processor, and not if it will be overwritten. Since code transformations can change the read-write pattern of program variables, they significantly effect the soft error vulnerability of program variables in the cache. We observe that often opportunity exists to significantly reduce the soft error vulnerability of cache data by trading-off a little performance. However, even if one wanted to exploit this trade-off, it is difficult, since there are no efficient techniques to estimate vulnerability of data in caches. To this end, this paper develops efficient static analysis method to estimate program vulnerability in caches, which enables the compiler to exploit the performance-vulnerability trade-offs in applications. Finally, as compared to simulation based estimation, static analysis techniques provide the insights into vulnerability calculations that provide some simple schemes to reduce program vulnerability.close3

    Systematic Methodology for the Quantitative Analysis of Pipeline-Register Reliability

    No full text

    SPKM: A Novel Graph Drawing based Algorithm for Application Mapping onto Coarse-Grained Reconfigurable Architectures

    No full text
    Abstract — Recently coarse-grained reconfigurable architectures (CGRAs) have drawn increasing attention due to their efficiency and flexibility. While many CGRAs have demonstrated impressive performance improvements, the effectiveness of CGRA platforms ultimately hinges on the compiler. Existing CGRA compilers do not model the details of the CGRA architecture, due to which they are, i) unable to map applications, even though a mapping exists, and ii) use too many PEs to map an application. In this paper, we model several CGRA details in our compiler and develop a graph mapping based approach (SPKM) for mapping applications onto CGRAs. On randomly generated graphs our technique can map on average 4.5X more applications than the previous approaches, while using fewer CGRA rows 62 % times, without any penalty in mapping time. We observe similar results on a suite of benchmarks collected from Livermore Loops, Multimedia and DSPStone benchmarks. I

    Revisiting Symptom-Based Fault Tolerant Techniques against Soft Errors

    No full text
    Aggressive technology scaling and near-threshold computing have made soft error reliability one of the leading design considerations in modern embedded microprocessors. Although traditional hardware/software redundancy-based schemes can provide a high level of protection, they incur significant overheads in terms of performance and hardware resources. The considerable overheads from such full redundancy-based techniques has motivated researchers to propose low-cost soft error protection schemes, such as symptom-based error protection schemes. The main idea behind a symptom-based error protection scheme is that soft errors in the system will quickly generate some symptoms, such as exceptions, branch mispredictions, cache or TLB misses, or unpredictable variable values. Therefore, monitoring such infrequent symptoms makes it possible to cover the manifestation of failures caused by soft errors. Symptom-based protection schemes have been suggested as shortcuts to achieve acceptable reliability with comparable overheads. Since the symptom-based protection schemes seem attractive due to their generality and simplicity, even state-of-the-art protection schemes exploit them as the baseline protections. However, our detailed analysis of the fault coverage and performance overheads of such schemes reveals that the user-visible failure coverage, particularly of ReStore, is limited (29% on average). By contrast, the runtime overheads are significant (40% on average) because the majority of the fault injection experiments, which were considered as detected/recovered failures by low-level symptoms, are actually benign faults by program-level masking effects

    Towards Resilient EU HPC Systems: A Blueprint

    Get PDF
    This document aims to spearhead a Europe-wide discussion on HPC system resilience and to help the European HPC community define best practices for resilience. It analyses a wide range of state-of-the-art resilience mechanisms and recommend the most effective approaches to employ in large-scale HPC systems. These guidelines will be useful in the allocation of available resources, as well as guiding researchers and research funding towards the enhancement of resilience approaches with the highest priority and utility. Although it is focused on the needs of next generation HPC systems in Europe, the principles and evaluations are applicable globally. This document is the first output of the ongoing European HPC resilience initiative and it covers individual nodes in HPC systems, encompassing CPU, memory, intra-node interconnect and emerging FPGA-based hardware accelerators. With community support and feedback on this initial document, we will update the analysis and expand the scope to include other types of accelerators, as well as networks and storage. The need for resilience features is analysed based on three guiding principles: 1. The resilience features implemented in HPC systems should assure that the failure rate of the system is below an acceptable threshold, representative of the technology, system size and target application. 2. Given the high cost incurred by uncorrected error propagation, hardware errors should be detected and corrected frequently and at low overhead, which is likely only possible in hardware. 3. Overheating is one of the main causes of unreliable device behaviour. Production HPC systems should prevent overheating while balancing power/energy and performance. Based on these principles, the main outcome of this document is that the following features should be given priority during the design, implementation and operation of any large-scale HPC system: ‱ ECC in main memory ‱ Memory demand and patrol scrub ‱ Memory address parity protection ‱ Error detection in CPU caches and registers ‱ Error detection in the intra-node interconnect ‱ Packet retry in the intra-node interconnect ‱ Reporting corrected errors to the BIOS or OS (system software requirement) ‱ Memory thermal throttling ‱ Dynamic voltage and frequency scaling for CPUs, FPGAs and ASICs ‱ Over-temperature shutdown mechanism for FPGAs ‱ ECC in FPGA on-chip data memories as well as in configuration memories The remaining state-of-the-art resilience features surveyed in this document should only be developed and implemented after a more detailed and specific cost–benefit analysis
    corecore