
    Resiliency of high-performance computing systems: A fault-injection-based characterization of the high-speed network in the Blue Waters testbed

    Supercomputers have played an essential role in the progress of science and engineering research. As the high-performance computing (HPC) community moves towards the next generation of HPC systems, it faces several challenges, one of which is the reliability of those systems. Error rates are expected to increase significantly on exascale systems, to the point where traditional application-level checkpointing may no longer be a viable fault tolerance mechanism. This has serious ramifications for a system's ability to guarantee the reliability and availability of its resources. It is becoming increasingly important to understand fault-to-failure propagation and to identify key areas of instrumentation in HPC systems for the avoidance, detection, diagnosis, mitigation, and recovery of faults. This thesis presents HPCArrow, a software-implemented, prototype-based fault injection tool, together with a fault injection methodology for investigating and evaluating HPC application and system resiliency. We demonstrate HPCArrow's capabilities through four fault injection campaigns on a Cray XE/XK hybrid testbed, covering single injections, time-varying or delayed injections, and injections during recovery. These injections emulate failures of network and compute components. The results of these campaigns provide insight into application-level and system-level resiliency. Across various HPC application frameworks, there are notable deficiencies in fault tolerance. Our experiments also revealed a failure phenomenon that had not previously been observed in field data: application hangs, in which forward progress is not made but jobs are not terminated until the maximum allowed time has elapsed. At the system level, failover procedures prove highly robust on small-scale systems and are able to handle both single and multiple faults in the network.
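
    The abstract does not include HPCArrow's interface, so the following is only a rough sketch of the kind of campaign loop and hang classification it describes; every name in it (inject_fault, job_status, the component list) is hypothetical and not the tool's real API.

```python
import random
import time

# Hypothetical campaign driver in the spirit of HPCArrow: inject a fault
# into a randomly chosen component while a job is running, then classify
# the outcome, flagging "hangs" where the job stays alive but stops making
# forward progress until the walltime limit is reached.

COMPONENTS = ["network_link", "network_router", "compute_blade"]

def run_campaign(jobs, inject_fault, job_status, walltime, poll=30, stall=300):
    """jobs: job handles; inject_fault/job_status: harness-specific callbacks."""
    results = {}
    for job in jobs:
        inject_fault(random.choice(COMPONENTS))   # emulate a component failure
        start = last_progress = time.time()
        while True:
            status = job_status(job)              # e.g. {"state": "RUNNING", "advanced": True}
            if status["state"] in ("COMPLETED", "FAILED"):
                results[job] = status["state"]
                break
            if status["advanced"]:
                last_progress = time.time()
            if time.time() - start > walltime:
                # Never terminated on its own: walltime kill. Call it a hang
                # if no forward progress was seen for a long stall period.
                results[job] = "HANG" if time.time() - last_progress > stall else "WALLTIME"
                break
            time.sleep(poll)
    return results
```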

    Advanced Simulation and Computing FY12-13 Implementation Plan, Volume 2, Revision 0.5


    Investigating an API for resilient exascale computing.

    Increased HPC capability comes with increased complexity, part counts, and fault occurrences. Increasing the resilience of systems and applications to faults is a critical requirement facing the viability of exascale systems, as the overhead of traditional checkpoint/restart is projected to outweigh its benefits due to fault rates outpacing I/O bandwidths. As faults occur and propagate throughout hardware and software layers, pervasive notification and handling mechanisms are necessary. This report describes an initial investigation of fault types and programming interfaces to mitigate them. Proof-of-concept APIs are presented for the frequent and important cases of memory errors and node failures, and a strategy is proposed for filesystem failures. These involve changes to the operating system, runtime, I/O library, and application layers. While a single, system-wide API for fault handling across hardware, OS, and application layers remains elusive, the effort increased our understanding of both the mountainous challenges and the promising trailheads.
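
    The report's proof-of-concept APIs are not reproduced in the abstract; purely as an illustration of the idea of pervasive fault notification across layers, a minimal registry-style sketch might look as follows (all names and fault classes here are invented for this example).

```python
from enum import Enum, auto
from typing import Callable, Dict, List

# Hypothetical fault-notification interface: applications register handlers
# for specific fault classes and the runtime invokes them as faults
# propagate upward from hardware and OS layers.

class FaultClass(Enum):
    MEMORY_ERROR = auto()        # e.g. uncorrectable ECC error in a page
    NODE_FAILURE = auto()        # e.g. heartbeat loss reported by the RAS system
    FILESYSTEM_FAILURE = auto()

class FaultRegistry:
    def __init__(self) -> None:
        self._handlers: Dict[FaultClass, List[Callable[[dict], None]]] = {}

    def register(self, fault: FaultClass, handler: Callable[[dict], None]) -> None:
        self._handlers.setdefault(fault, []).append(handler)

    def notify(self, fault: FaultClass, info: dict) -> None:
        # Called by the runtime/OS layer when a fault is detected.
        for handler in self._handlers.get(fault, []):
            handler(info)

# Example application-level use: react to a memory error by recomputing data.
registry = FaultRegistry()
registry.register(FaultClass.MEMORY_ERROR,
                  lambda info: print(f"recomputing block at {info['address']:#x}"))
registry.notify(FaultClass.MEMORY_ERROR, {"address": 0x7f00dead0000})
```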

    Exploiting task-based programming models for resilience

    Hardware errors become more common as silicon technologies shrink and become more vulnerable, especially in memory cells, which are the most exposed to errors. Permanent and intermittent faults are caused by manufacturing variability and circuit ageing. While these can be mitigated once they are identified, their continuous rate of appearance throughout the lifetime of memory devices will always cause unexpected errors. In addition, transient faults are caused by effects such as radiation or small voltage/frequency margins, and there is no efficient way to shield against these events. Other constraints related to the diminishing sizes of transistors, such as power consumption and memory latency, have pushed the microprocessor industry towards increasingly complex processor architectures. To cope with the difficulties of programming such architectures, programming models have emerged that rely on runtime systems. These systems form a new intermediate layer in the hardware-software abstraction stack that performs tasks such as distributing work across computing resources: processor cores, accelerators, etc. These runtime systems have access to a great deal of information from both the hardware and the applications, and thus offer many opportunities for optimisation. This thesis proposes solutions to the increasing fault rates in memory across multiple resilience disciplines, from algorithm-based fault tolerance to hardware error-correcting codes, through OS reliability strategies. These solutions rely for their efficiency on the opportunities presented by runtime systems. The first contribution of this thesis is an algorithm-based resilience technique for tolerating detected errors in memory. The technique recovers lost data by performing computations that rely on simple redundancy relations identified in the program. The recovery is demonstrated for a family of iterative solvers, the Krylov subspace methods, and evaluated for the conjugate gradient solver. The runtime can transparently overlap the recovery with the computations of the algorithm, which masks the already low overheads of this technique. The second part of this thesis proposes a metric to characterise the impact of faults in memory, which outperforms state-of-the-art metrics in precision and in the assurances it gives on the error rate. This metric reveals a key insight about data that is not relevant to the program, and we propose an OS-level strategy that ignores errors in such data by delaying the reporting of detected errors. This reduces the failure rates of running programs by ignoring errors that have no impact. The architectural-level contribution of this thesis is a dynamically adaptable Error Correcting Code (ECC) scheme that can increase the protection of memory regions where the impact of errors is highest. A runtime methodology is presented to estimate the fault rate at runtime using our metric, through the performance monitoring tools of current commodity processors. Guiding the dynamic ECC scheme online with the methodology's vulnerability estimates decreases the error rates of programs at a fraction of the redundancy cost required for a uniformly stronger ECC, providing a useful and wide range of trade-offs between redundancy and error rates. The work presented in this thesis demonstrates that runtime systems make it possible to exploit the redundancy present in memory to help tackle increasing error rates in DRAM. This redundancy can be an inherent part of algorithms that tolerate higher fault rates, can take the form of dead data stored in memory, or can be added to a program in the form of ECC. In all cases, the runtime decreases failure rates efficiently by diminishing recovery costs, identifying redundant data, or targeting critical data. It is thus a very valuable tool for future computing systems, as it can perform optimisations across different layers of abstraction.
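
    As a concrete illustration of the kind of redundancy relation the thesis exploits in Krylov methods: in exact arithmetic the conjugate gradient residual always satisfies r = b - Ax, so a residual lost to a detected memory error can be rebuilt from the surviving iterate instead of restarting the solver. The sketch below is not the thesis's implementation (which overlaps recovery with the algorithm's tasks via the runtime); it simply demonstrates the relation on a simulated error.

```python
import numpy as np

# Minimal conjugate gradient with recovery of the residual vector from the
# redundancy relation r = b - A @ x after a simulated, detected memory error.

def cg_with_residual_recovery(A, b, x0, tol=1e-8, max_iter=1000, error_at=None):
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r
    for k in range(max_iter):
        if k == error_at:
            r = None                 # simulate a detected, uncorrectable error in r
        if r is None:
            r = b - A @ x            # recovery: rebuild r from the redundancy relation
            rs_old = r @ r
            p = r.copy()             # restart the search direction from the recovered residual
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            return x, k
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x, max_iter

# Example use on a small symmetric positive definite system:
# A = np.array([[4.0, 1.0], [1.0, 3.0]]); b = np.array([1.0, 2.0])
# x, iters = cg_with_residual_recovery(A, b, np.zeros(2), error_at=3)
```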

    Keeping checkpoint/restart viable for exascale systems

    Next-generation exascale systems, those capable of performing a quintillion operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques for ensuring progress across faults, such as checkpoint/restart, the dominant fault tolerance mechanism of the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoints) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches over a wide range of parameters. These results, which cover a number of high-performance computing capability workloads, different failure distributions, hardware mean times to failure, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms.
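
    The paper's own models are not given in the abstract, but the standard first-order Young/Daly approximation illustrates why these two levers matter: the optimal checkpoint interval grows as the square root of the commit time times the mean time between failures, so shrinking the commit time or stretching the effective MTBF (for example through replication) directly improves the useful-work fraction. A minimal sketch with purely illustrative numbers, not the paper's results:

```python
import math

# Standard first-order (Young/Daly) estimate of the optimal checkpoint
# interval and the resulting machine efficiency.

def optimal_interval(delta, mtbf):
    """delta: checkpoint commit time [s]; mtbf: system mean time between failures [s]."""
    return math.sqrt(2.0 * delta * mtbf)

def efficiency(delta, mtbf, restart=0.0):
    tau = optimal_interval(delta, mtbf)
    # Fraction of wall-clock time spent on useful work: checkpoint overhead
    # delta/tau plus expected rework and restart time per failure.
    waste = delta / tau + (tau / 2.0 + restart) / mtbf
    return max(0.0, 1.0 - waste)

# Illustrative only: a 600 s vs. a 60 s commit time on a machine with a 4 h MTBF.
for delta in (600.0, 60.0):
    print(delta, round(optimal_interval(delta, 4 * 3600)), round(efficiency(delta, 4 * 3600), 3))
```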

    Managing Smartphone Testbeds with SmartLab

    The explosive growth in the number of smartphones, with ever-growing sensing and computing capabilities, has brought a paradigm shift to many traditional domains of the computing field. Re-programming smartphones and instrumenting them for application testing and data gathering at scale is currently a tedious and time-consuming process that poses significant logistical challenges. In this paper, we make three major contributions. First, we propose a comprehensive architecture, coined SmartLab, for managing a cluster of both real and virtual smartphones that are either wired to a private cloud or connected over a wireless link. Second, we propose and describe a number of Android management optimizations (e.g., command pipelining, screen-capturing, file management), which can be useful to the community for building similar functionality into their systems. Third, we conduct extensive experiments and microbenchmarks to support our design choices, providing qualitative evidence on the expected performance of each module comprising our architecture. The paper also reviews our experiences of using SmartLab in a research-oriented setting, as well as ongoing and future development efforts.
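
    SmartLab's implementation is not shown in the abstract; as a generic illustration of the kind of Android device management it builds on (here, parallel screen capture over plain adb, which is not SmartLab's actual code), a minimal sketch:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# List the Android devices attached via adb, then capture a screenshot from
# each one in parallel and save it to <serial>.png.

def list_devices():
    out = subprocess.run(["adb", "devices"], capture_output=True, text=True, check=True).stdout
    return [line.split()[0] for line in out.splitlines()[1:] if line.strip().endswith("device")]

def screencap(serial):
    # `exec-out screencap -p` streams a PNG over the adb connection.
    png = subprocess.run(["adb", "-s", serial, "exec-out", "screencap", "-p"],
                         capture_output=True, check=True).stdout
    path = f"{serial}.png"
    with open(path, "wb") as f:
        f.write(png)
    return path

if __name__ == "__main__":
    with ThreadPoolExecutor() as pool:
        print(list(pool.map(screencap, list_devices())))
```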

    Evolution of the subcontinental lithosphere during Mesozoic Tethyan rifting: constraints from the External Ligurian mantle section (Northern Apennine, Italy)

    Our study is focused on mantle bodies from the External Ligurian ophiolites, within the Monte Gavi and Monte Sant'Agostino areas. Here, two distinct pyroxenite-bearing mantle sections were recognized, mainly based on their plagioclase-facies evolution. The Monte Gavi mantle section is nearly undeformed and records reactive melt infiltration under plagioclase-facies conditions. This process involved both peridotites (clinopyroxene-poor lherzolites) and enclosed spinel pyroxenite layers, and occurred at 0.7–0.8 GPa. In the Monte Gavi peridotites and pyroxenites, the spinel-facies clinopyroxene was replaced by Ca-rich plagioclase and new orthopyroxene, typically associated with secondary clinopyroxene. The reactive melt migration caused an increase of TiO2 contents in relict clinopyroxene and spinel, with the latter also recording a Cr2O3 increase. In the Monte Gavi peridotites and pyroxenites, geothermometers based on slowly diffusing elements (REE and Y) record high-temperature conditions (1200–1250 °C) related to the melt infiltration event, followed by subsolidus cooling to ca. 900 °C. The Monte Sant'Agostino mantle section is characterized by widespread ductile shearing with no evidence of melt infiltration. The deformation recorded by the Monte Sant'Agostino peridotites (clinopyroxene-rich lherzolites) occurred at 750–800 °C and 0.3–0.6 GPa, leading to protomylonitic to ultramylonitic textures with extreme grain size reduction (10–50 μm). Compared to the peridotites, the enclosed pyroxenite layers gave higher temperature-pressure estimates for the plagioclase-facies re-equilibration (870–930 °C and 0.8–0.9 GPa). We propose that the earlier plagioclase crystallization in the pyroxenites enhanced strain localization and the formation of mylonite shear zones in the entire mantle section. We subdivide the subcontinental mantle section from the External Ligurian ophiolites into three distinct domains, developed in response to the rifting evolution that ultimately formed a Middle Jurassic ocean-continent transition: (1) a spinel tectonite domain, characterized by subsolidus static formation of plagioclase, i.e. the Suvero mantle section (Hidas et al., 2020), (2) a plagioclase mylonite domain experiencing melt-absent deformation, and (3) a nearly undeformed domain that underwent reactive melt infiltration under plagioclase-facies conditions, exemplified by the Monte Sant'Agostino and the Monte Gavi mantle sections, respectively. We relate mantle domains (1) and (2) to a rifting-driven uplift in the Late Triassic accommodated by large-scale shear zones consisting of anhydrous plagioclase mylonites. Hidas K., Borghini G., Tommasi A., Zanetti A. & Rampone E. 2021. Interplay between melt infiltration and deformation in the deep lithospheric mantle (External Liguride ophiolite, North Italy). Lithos 380-381, 105855

    Impact of Etna’s volcanic emission on major ions and trace elements composition of the atmospheric deposition

    Mt. Etna, on the eastern coast of Sicily (Italy), is one of the most active volcanoes on the planet and is widely recognized as a major source of volcanic gases (e.g., CO2 and SO2), halogens, and numerous trace elements to the atmosphere in the Mediterranean region. Especially during eruptive periods, Etna's emissions can be dispersed over long distances and cover wide areas. A group of trace elements, the technology-critical elements, has recently drawn attention for its possible environmental and human health impacts. The current knowledge of their geochemical cycles is still scarce; nevertheless, recent studies (Brugnone et al., 2020) have evidenced a contribution from volcanic activity for some of them (Te, Tl, and REE). In 2021, in the framework of the research project "Pianeta Dinamico", by INGV, a network of 10 bulk collectors was deployed to collect atmospheric deposition samples monthly. Four of these collectors are located on the flanks of Mt. Etna, two are in the urban area of Catania, and three are in the industrial area of Priolo, all of them downwind of the main craters most of the time. The last one, close to Cesarò (Nebrodi Regional Park), represents the regional background. The research aims to produce a database of the major ion and trace element composition of the bulk deposition; here we report the values of the main physical-chemical parameters and the deposition fluxes of major ions and trace elements from the first year of research. The pH ranged from 3.1 to 7.7, with a mean value of 5.6, in samples from the Etna area, while it ranged between 5.2 and 7.6, with a mean value of 6.4, in samples from the other study areas. The electrical conductivity (EC) ranged from 5 to 1032 μS cm-1, with a mean value of 65 μS cm-1. The most abundant ions were Cl- and SO42- among the anions and Na+ and Ca2+ among the cations, whose mean deposition fluxes, considering all sampling sites, were 16.6, 6.8, 8.4, and 6.0 mg m-2 d-1, respectively. The highest deposition fluxes of volcanic refractory elements, such as Al, Fe, and Ti, were measured at the Etna sites, with mean values of 948, 464, and 34.3 μg m-2 d-1, respectively, higher than those detected at the other sampling sites, further away from the volcanic source (26.2, 12.4, and 0.5 μg m-2 d-1, respectively). The same trend was also observed for volatile elements of prevailing volcanic origin, such as Tl (0.49 μg m-2 d-1), Te (0.07 μg m-2 d-1), As (0.95 μg m-2 d-1), Se (1.92 μg m-2 d-1), and Cd (0.39 μg m-2 d-1). Our preliminary results show that, close to a volcanic area, volcanic emissions must be considered among the major contributors of ions and trace elements to the atmosphere. Their deposition may significantly impact the pedosphere, hydrosphere, and biosphere and, directly or indirectly, human health.
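
    The fluxes above are reported in mass per unit area per day; as a reminder of how such bulk deposition fluxes are conventionally derived from a collected sample (the formula is standard, but the numbers below are invented and are not the study's data), a minimal sketch:

```python
# Bulk deposition flux from the measured concentration in the collected
# sample, the collected volume, the collector's opening area and the
# exposure time.

def deposition_flux(conc_mg_per_l, volume_l, area_m2, days):
    """Return the deposition flux in mg per square metre per day."""
    return conc_mg_per_l * volume_l / (area_m2 * days)

# e.g. 2.1 mg/L of chloride in 0.8 L collected over 30 days with a 0.05 m2 funnel
print(round(deposition_flux(2.1, 0.8, 0.05, 30), 2), "mg m-2 d-1")
```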