    A simulation infrastructure for examining the performance of resilience strategies at scale.

    Fault-tolerance is a major challenge for many current and future extreme-scale systems, with many studies showing it to be the key limiter to application scalability. While there are a number of studies investigating the performance of various resilience mechanisms, these are typically limited to scales orders of magnitude smaller than expected for next-generation systems and to simple benchmark problems. In this paper we show how, with very minor changes, a previously published and validated simulation framework for investigating the application-performance impact of OS noise can be used to simulate the overheads of various resilience mechanisms at scale. Using this framework, we compare the failure-free performance of the simulator against an analytic model to validate its accuracy, and demonstrate its ability to simulate the performance of two popular rollback recovery methods on traces from real applications.
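    The analytic model is not spelled out in this abstract; a standard baseline for failure-free checkpoint/restart overhead is the first-order Young model. The sketch below is a hedged illustration of that kind of model (the paper may well use a different one): it computes Young's optimal checkpoint interval and the resulting failure-free overhead, with the checkpoint cost and MTBF being assumed example values.

        import math

        def young_optimal_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
            # Young's first-order approximation of the optimal checkpoint
            # interval: tau = sqrt(2 * delta * M), where delta is the time
            # to write one checkpoint and M is the mean time between failures.
            return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

        def failure_free_overhead(checkpoint_cost_s: float, interval_s: float) -> float:
            # Fraction of wall-clock time spent checkpointing when no
            # failures occur: delta / (tau + delta).
            return checkpoint_cost_s / (interval_s + checkpoint_cost_s)

        # Example: 5-minute checkpoint writes on a system with a 4-hour MTBF
        # (both values assumed for illustration).
        delta = 5 * 60.0
        mtbf = 4 * 3600.0
        tau = young_optimal_interval(delta, mtbf)
        print(f"optimal interval: {tau / 60:.1f} min; "
              f"failure-free overhead: {failure_free_overhead(delta, tau):.1%}")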

    Deduplication potential of HPC applications' checkpoints

    © 2016 IEEE. HPC systems contain an increasing number of components, decreasing the mean time between failures. Checkpoint mechanisms help long-running applications survive such failures. A viable way to relieve the resulting pressure on the I/O backends is to deduplicate the checkpoints. However, little is known about how much I/O can be saved for HPC applications by applying deduplication within the checkpointing process. In this paper, we perform a broad study of the deduplication behavior of HPC application checkpointing and its impact on system design.
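    The abstract does not say how deduplication potential is measured; a common baseline is fixed-size chunking with cryptographic fingerprints. A minimal sketch under that assumption (the file names and the 4 KiB chunk size are illustrative, not from the paper):

        import hashlib

        def dedup_ratio(checkpoint_paths, chunk_size=4096):
            # Estimate the deduplication potential of a set of checkpoint
            # files with fixed-size chunking: identical chunks (by SHA-256)
            # are stored only once. Returns unique_bytes / total_bytes.
            seen = set()
            total = unique = 0
            for path in checkpoint_paths:
                with open(path, "rb") as f:
                    while chunk := f.read(chunk_size):
                        total += len(chunk)
                        digest = hashlib.sha256(chunk).digest()
                        if digest not in seen:
                            seen.add(digest)
                            unique += len(chunk)
            return unique / total if total else 1.0

        # Hypothetical usage: successive checkpoints of one run often share pages.
        # ratio = dedup_ratio(["ckpt_0001.bin", "ckpt_0002.bin"])
        # print(f"unique data: {ratio:.1%} of total")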

    Using Rollback Avoidance to Mitigate Failures in Next-Generation Extreme-Scale Systems

    High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in many important physical systems. The next major milestone in the development of HPC systems is the construction of the first supercomputer capable of executing more than an exaflop, 10^18 floating point operations per second. On systems of this scale, failures will occur much more frequently than on current systems. As a result, resilience is a key obstacle to building next-generation extreme-scale systems. Coordinated checkpointing is currently the most widely-used mechanism for handling failures on HPC systems. Although coordinated checkpointing remains effective on current systems, increasing the scale of today's systems to build next-generation systems will increase the cost of fault tolerance, as more and more time is taken away from the application to protect against or recover from failure. Rollback avoidance techniques seek to mitigate the cost of checkpoint/restart by allowing an application to continue its execution rather than rolling back to an earlier checkpoint when failures occur. These techniques include failure prediction and preventive migration, replicated computation, fault-tolerant algorithms, and software-based memory fault correction. In this thesis, I examine how rollback avoidance techniques can be used to address failures on extreme-scale systems. Using a combination of analytic modeling and simulation, I evaluate the potential impact of rollback avoidance on these systems. I then present a novel rollback avoidance technique that exploits similarities in application memory. Finally, I examine the feasibility of using this technique to protect against memory faults in kernel memory.
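    The thesis's memory-similarity technique is only summarized above, so the following is a loose, hypothetical illustration of the idea rather than the actual algorithm: when a memory fault destroys a page, substitute the most similar surviving page (ranked by cheap, precomputed fingerprints) so the application can continue forward instead of rolling back to a checkpoint.

        from typing import List

        def repair_lost_page(pages: List[bytes], fingerprints: List[int],
                             lost: int) -> bytes:
            # Rollback avoidance via memory similarity (illustrative only):
            # `fingerprints` are small per-page summaries maintained cheaply
            # before any fault occurs, so the lost page's fingerprint is
            # still available even though its contents are gone.
            best_page, best_dist = None, float("inf")
            for i, page in enumerate(pages):
                if i == lost:
                    continue
                # Hamming distance between fingerprints approximates
                # similarity between the corresponding pages.
                dist = bin(fingerprints[i] ^ fingerprints[lost]).count("1")
                if dist < best_dist:
                    best_page, best_dist = page, dist
            return best_page  # approximate substitute; execution continues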

    Studies in Exascale Computer Architecture: Interconnect, Resiliency, and Checkpointing

    Today's supercomputers are built from state-of-the-art components to extract as much performance as possible to solve the most computationally intensive problems in the world. Building the next generation of exascale supercomputers, however, will require re-architecting many of these components to extract over 50x more performance than the current fastest supercomputer in the United States. To contribute towards this goal, two aspects of the compute node architecture were examined in this thesis: the on-chip interconnect topology and the memory and storage checkpointing platforms. As a first step, a skeleton exascale system was modeled to meet 1 exaflop of performance along with 100 petabytes of main memory. The model revealed that large kilo-core processors would be necessary to meet the exaflop performance goal; existing topologies, however, would not scale to those levels. To address this new challenge, we investigated and proposed asymmetric high-radix topologies that decoupled local and global communications and used different-radix routers for switching network traffic at each level. The proposed topologies scaled more readily to higher numbers of cores with better latency and energy consumption than before. The vast number of components that the model showed would be needed in these exascale systems motivated better fault-tolerance mechanisms. To address this challenge, we showed that local checkpoints within the compute node can be saved to a hybrid DRAM and SSD platform in order to write them faster without wearing out the SSD or consuming a lot of energy. A hybrid checkpointing platform allowed more frequent checkpoints to be made without sacrificing performance. Subsequently, we proposed switching to a DIMM-based SSD in order to perform the fine-grained I/O operations that are integral to interleaving checkpointing and computation while still providing persistence guarantees. Two more techniques that consolidate and overlap checkpointing were designed to better hide the checkpointing latency to the SSD.
    PhD. Computer Science & Engineering. University of Michigan, Horace H. Rackham School of Graduate Studies.
    https://deepblue.lib.umich.edu/bitstream/2027.42/137096/1/sabeyrat_1.pd
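    As a rough sketch of the hybrid idea described above (the class name, path, and queue-based design are assumptions for illustration, not the thesis's actual platform): frequent checkpoints land in a DRAM staging area, and a background thread drains them to the SSD, so the application blocks only for the fast in-memory copy.

        import queue
        import threading

        class HybridCheckpointer:
            # Two-level checkpointing sketch: the application blocks only
            # for a fast in-memory copy; a background thread drains staged
            # checkpoints to slower persistent storage (the SSD),
            # overlapping I/O with computation.

            def __init__(self, ssd_path_prefix: str):
                self._pending: "queue.Queue[tuple[int, bytes]]" = queue.Queue()
                self._prefix = ssd_path_prefix
                threading.Thread(target=self._drain, daemon=True).start()

            def checkpoint(self, step: int, state: bytes) -> None:
                # Fast path: copy into DRAM staging and return immediately.
                self._pending.put((step, bytes(state)))

            def _drain(self) -> None:
                # Slow path: persist staged checkpoints to the SSD in order.
                while True:
                    step, data = self._pending.get()
                    with open(f"{self._prefix}_{step:06d}.ckpt", "wb") as f:
                        f.write(data)
                    self._pending.task_done()

        # Hypothetical usage with an assumed mount point:
        # ckpt = HybridCheckpointer("/mnt/ssd/app")
        # ckpt.checkpoint(1, b"application state bytes")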

    SEDAR: Automatic detection and recovery from transient faults in high-performance computing systems

    Fault handling is a growing concern in HPC; in the future, greater varieties and rates of errors, longer detection intervals, and silent faults are expected. In upcoming exascale systems, errors are projected to occur as often as several times a day and to propagate through large parallel applications, causing anything from process crashes to corrupted results due to undetected faults. This work proposes SEDAR, a methodology that improves the reliability, in the face of transient faults, of a system running message-passing parallel applications. The solution, based on process replication for detection combined with different levels of checkpointing (system-level or application-level checkpoints) for automatic recovery, aims to help users of scientific applications obtain reliable executions with correct results. Detection is achieved by internally replicating each application process in threads and monitoring the contents of messages between the threads before they are sent to another process; in addition, final results are validated to prevent corruption of the local computation. This strategy makes it possible to relaunch the execution as soon as a fault is detected, rather than waiting needlessly for an incorrect completion. For recovery, system-level checkpoints are used; however, since there is no guarantee that a given checkpoint is free of latent silent errors, multiple checkpoints must be stored and maintained, and a mechanism is implemented to retry successive recoveries from earlier checkpoints if the same error is detected again. The last option is to use a single application-level checkpoint that can be verified to ensure its validity as a safe recovery point. SEDAR is therefore structured in three levels: (1) detection only, with safe stop and user notification; (2) recovery based on a chain of system-level checkpoints; and (3) recovery based on a single valid application-level checkpoint. Each variant provides particular coverage but has inherent limitations and implementation costs; the ability to choose among them provides the flexibility to adapt the cost-benefit trade-off to the needs of a particular system. A complete description of the methodology is presented, covering its behavior in the presence of faults and the time overheads of each alternative. A model is described that considers several fault scenarios and their predictable effects on a test application in order to perform a functional verification. In addition, an experimental validation is carried out on a real implementation of the SEDAR tool, using benchmarks with dissimilar communication patterns. Its behavior under faults, injected in a controlled manner at different points of the execution, allows its performance to be evaluated and the overhead associated with its use to be characterized. On this basis, the conditions are also established under which it pays to enable protection and store multiple checkpoints for recovery, rather than simply detecting, stopping the execution, and relaunching. The ability to configure the mode of use, adapting it to a particular system's coverage requirements and maximum allowed overhead, shows that SEDAR is an effective and viable methodology for tolerating transient faults in HPC environments.
    Thesis jointly supervised under a collaboration agreement between the Universidad Nacional de La Plata (UNLP) and the Universidad Autónoma de Barcelona (UAB). Facultad de Informática.
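    A minimal sketch of the detection idea (thread-level replication with message comparison before send), using hypothetical compute and send stand-ins rather than SEDAR's actual interfaces:

        from concurrent.futures import ThreadPoolExecutor

        class SilentErrorDetected(RuntimeError):
            # Raised when replica results diverge, i.e., a transient fault
            # corrupted one of the sibling threads.
            pass

        def replicated_send(compute, payload, send):
            # Run the computation in two sibling threads and compare the
            # outgoing message contents before they leave the process; a
            # mismatch triggers immediate detection instead of letting the
            # run complete with a silently wrong result.
            with ThreadPoolExecutor(max_workers=2) as pool:
                a, b = pool.map(compute, [payload, payload])
            if a != b:
                raise SilentErrorDetected("replica message contents diverge")
            send(a)  # contents validated; safe to transmit

        # Hypothetical usage: prints 42 if the replicas agree.
        replicated_send(lambda x: x * 2, 21, print)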