91 research outputs found

    Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing

    Get PDF
    Abstract—Large applications executing on Grid or cluster architectures consisting of hundreds or thousands of computational nodes create problems with respect to reliability. The source of the problems are node failures and the need for dynamic configuration over extensive runtime. This paper presents two fault-tolerance mechanisms called Theft-Induced Checkpointing and Systematic Event Logging. These are transparent protocols capable of overcoming problems associated with both benign faults, i.e., crash faults, and node or subnet volatility. Specifically, the protocols base the state of the execution on a dataflow graph, allowing for efficient recovery in dynamic heterogeneous systems as well as multithreaded applications. By allowing recovery even under different numbers of processors, the approaches are especially suitable for applications with a need for adaptive or reactionary configuration control. The low-cost protocols offer the capability of controlling or bounding the overhead. A formal cost model is presented, followed by an experimental evaluation. It is shown that the overhead of the protocol is very small, and the maximum work lost by a crashed process is small and bounded. Index Terms—Grid computing, rollback recovery, checkpointing, event logging. Ç

    Optimal Asynchronous Garbage Collection for RDT Checkpointing Protocols

    Get PDF
    Communication-induced checkpointing protocols that ensure rollback-dependency trackability (RDT) guarantee important properties to the recovery system without explicit coordination. However, to the best of our knowledge, there was no garbage collection algorithm for them which did not use some type of process synchronization, like time assumptions or reliable control message exchanges. This paper addresses the problem of garbage collection for RDT checkpointing protocols and presents an optimal solution for the case where coordination is done only by means of timestamps piggybacked in application messages. Our algorithm uses the same timestamps as off-the-shelf RDT protocols and ensures the tight upper bound on the number of uncollected checkpoints for each process during all the system execution

    Enhancing Program Soft Error Resilience through Algorithmic Approaches

    Get PDF
    The rising count and shrinking feature size of transistors within modern computers is making them increasingly vulnerable to various types of soft faults. This problem is especially acute in high-performance computing (HPC) systems used for scientific computing, because these systems include many thousands of compute cores and nodes, all of which may be utilized in a single large-scale run. The increasing vulnerability of HPC applications to errors induced by soft faults is motivating extensive work on techniques to make these applications more resilient to such faults, ranging from generic techniques such as replication or checkpoint/restart to algorithm-specific error detection and tolerance techniques. Effective use of such techniques requires a detailed understanding of how a given application is affected by soft faults to ensure that (i) efforts to improve application resilience are spent in the code regions most vulnerable to faults, (ii) the appropriate resilience techniques is applied to each code region, and (iii) the understanding be obtained in an efficient manner. This thesis presents two tools: FaultTelescope helps application developers view the routine and application vulnerability to soft errors while ErrorSight helps perform modular fault characteristics analysis for more complex applications. This thesis also illustrates how these tools can be used in the context of representative applications and kernels. In addition to providing actionable insights into application behavior, the tools automatically selects the number of fault injection experiments required to efficiently generation error profiles of an application, ensuring that the information is statistically well-grounded without performing unnecessary experiments

    Overcoming Byzantine Failures Using Checkpointing

    Get PDF
    Coordinated Science Laboratory was formerly known as Control Systems LaboratoryDARPA / F30602-00-C-0172 and F30602-02-C-013

    Exploiting task-based programming models for resilience

    Get PDF
    Hardware errors become more common as silicon technologies shrink and become more vulnerable, especially in memory cells, which are the most exposed to errors. Permanent and intermittent faults are caused by manufacturing variability and circuits ageing. While these can be mitigated once they are identified, their continuous rate of appearance throughout the lifetime of memory devices will always cause unexpected errors. In addition, transient faults are caused by effects such as radiation or small voltage/frequency margins, and there is no efficient way to shield against these events. Other constraints related to the diminishing sizes of transistors, such as power consumption and memory latency have caused the microprocessor industry to turn to increasingly complex processor architectures. To solve the difficulties arising from programming such architectures, programming models have emerged that rely on runtime systems. These systems form a new intermediate layer on the hardware-software abstraction stack, that performs tasks such as distributing work across computing resources: processor cores, accelerators, etc. These runtime systems dispose of a lot of information, both from the hardware and the applications, and offer thus many possibilities for optimisations. This thesis proposes solutions to the increasing fault rates in memory, across multiple resilience disciplines, from algorithm-based fault tolerance to hardware error correcting codes, through OS reliability strategies. These solutions rely for their efficiency on the opportunities presented by runtime systems. The first contribution of this thesis is an algorithmic-based resilience technique, allowing to tolerate detected errors in memory. This technique allows to recover data that is lost by performing computations that rely on simple redundancy relations identified in the program. The recovery is demonstrated for a family of iterative solvers, the Krylov subspace methods, and evaluated for the conjugate gradient solver. The runtime can transparently overlap the recovery with the computations of the algorithm, which allows to mask the already low overheads of this technique. The second part of this thesis proposes a metric to characterise the impact of faults in memory, which outperforms state-of-the-art metrics in precision and assurances on the error rate. This metric reveals a key insight into data that is not relevant to the program, and we propose an OS-level strategy to ignore errors in such data, by delaying the reporting of detected errors. This allows to reduce failure rates of running programs, by ignoring errors that have no impact. The architectural-level contribution of this thesis is a dynamically adaptable Error Correcting Code (ECC) scheme, that can increase protection of memory regions where the impact of errors is highest. A runtime methodology is presented to estimate the fault rate at runtime using our metric, through performance monitoring tools of current commodity processors. Guiding the dynamic ECC scheme online using the methodology's vulnerability estimates allows to decrease error rates of programs at a fraction of the redundancy cost required for a uniformly stronger ECC. This provides a useful and wide range of trade-offs between redundancy and error rates. The work presented in this thesis demonstrates that runtime systems allow to make the most of redundancy stored in memory, to help tackle increasing error rates in DRAM. This exploited redundancy can be an inherent part of algorithms that allows to tolerate higher fault rates, or in the form of dead data stored in memory. Redundancy can also be added to a program, in the form of ECC. In all cases, the runtime allows to decrease failure rates efficiently, by diminishing recovery costs, identifying redundant data, or targeting critical data. It is thus a very valuable tool for the future computing systems, as it can perform optimisations across different layers of abstractions.Los errores en memoria se vuelven más comunes a medida que las tecnologías de silicio reducen su tamaño. La variabilidad de fabricación y el envejecimiento de los circuitos causan fallos permanentes e intermitentes. Aunque se pueden mitigar una vez identificados, su continua tasa de aparición siempre causa errores inesperados. Además, la memoria también sufre de fallos transitorios contra los cuales no se puede proteger eficientemente. Estos fallos están causados por efectos como la radiación o los reducidos márgenes de voltaje y frecuencia. Otras restricciones coetáneas, como el consumo de energía y la latencia de la memoria, obligaron a las arquitecturas de computadores a volverse cada vez más complejas. Para programar tales procesadores, se desarrollaron modelos de programación basados en entornos de ejecución. Estos sistemas forman una nueva abstracción entre hardware y software, realizando tareas como la distribución del trabajo entre recursos informáticos: núcleos de procesadores, aceleradores, etc. Estos entornos de ejecución disponen de mucha información tanto sobre el hardware como sobre las aplicaciones, y ofrecen así muchas posibilidades de optimización. Esta tesis propone soluciones a los fallos en memoria entre múltiples disciplinas de resiliencia, desde la tolerancia a fallos basada en algoritmos, hasta los códigos de corrección de errores en hardware, incluyendo estrategias de resiliencia del sistema operativo. La eficiencia de estas soluciones depende de las oportunidades que presentan los entornos de ejecución. La primera contribución de esta tesis es una técnica a nivel algorítmico que permite corregir fallos encontrados mientras el programa su ejecuta. Para corregir fallos se han identificado redundancias simples en los datos del programa para toda una clase de algoritmos, los métodos del subespacio de Krylov (gradiente conjugado, GMRES, etc). La estrategia de recuperación de datos desarrollada permite corregir errores sin tener que reinicializar el algoritmo, y aprovecha el modelo de programación para superponer las computaciones del algoritmo y de la recuperación de datos. La segunda parte de esta tesis propone una métrica para caracterizar el impacto de los fallos en la memoria. Esta métrica supera en precisión a las métricas de vanguardia y permite identificar datos que son menos relevantes para el programa. Se propone una estrategia a nivel del sistema operativo retrasando la notificación de los errores detectados, que permite ignorar fallos en estos datos y reducir la tasa de fracaso del programa. Por último, la contribución a nivel arquitectónico de esta tesis es un esquema de Código de Corrección de Errores (ECC por sus siglas en inglés) adaptable dinámicamente. Este esquema puede aumentar la protección de las regiones de memoria donde el impacto de los errores es mayor. Se presenta una metodología para estimar el riesgo de fallo en tiempo de ejecución utilizando nuestra métrica, a través de las herramientas de monitorización del rendimiento disponibles en los procesadores actuales. El esquema de ECC guiado dinámicamente con estas estimaciones de vulnerabilidad permite disminuir la tasa de fracaso de los programas a una fracción del coste de redundancia requerido para un ECC uniformemente más fuerte. El trabajo presentado en esta tesis demuestra que los entornos de ejecución permiten aprovechar al máximo la redundancia contenida en la memoria, para contener el aumento de los errores en ella. Esta redundancia explotada puede ser una parte inherente de los algoritmos que permite tolerar más fallos, en forma de datos inutilizados almacenados en la memoria, o agregada a la memoria de un programa en forma de ECC. En todos los casos, el entorno de ejecución permite disminuir los efectos de los fallos de manera eficiente, disminuyendo los costes de recuperación, identificando datos redundantes, o focalizando esfuerzos de protección en los datos críticos.Postprint (published version

    Mitigation of failures in high performance computing via runtime techniques

    Get PDF
    As machines increase in scale, it is predicted that failure rates of supercomputers will correspondingly increase. Even though the mean time to failure (MTTF) of individual component is high, the large number of components significantly decreases the system MTTF. Meanwhile, the decreasing size of transistors has been critical to the increase in capacity of supercomputers. The smaller the transistors are, silent data corruptions (SDC) are likely to occur more frequently. SDCs do not inhibit execution, but may silently lead to incorrect results. In this thesis, we leverage runtime system and compiler techniques to mitigate a significant fraction of failures automatically with low overhead. The main goals of various system-level fault tolerance strategies designed in this thesis are: reducing the extra cost added to application execution while improving system reliability; automatically adjusting fault tolerance decisions without user intervention based on environmental changes; protecting applications not only from fail-stop failures but also from silent data corruptions. The main contributions of this thesis are development of a semi-blocking checkpoint protocol that overlaps application execution with fault tolerance operation to reduce the overhead of checkpointing, a runtime system technique for automatic checkpoint and restart without user intervention, a holistic framework (ACR) for automatically detecting and recovering from silent data corruptions and a framework called FlipBack that provides targeted protection against silent data corruption with low cost

    Reliability for exascale computing : system modelling and error mitigation for task-parallel HPC applications

    Get PDF
    As high performance computing (HPC) systems continue to grow, their fault rate increases. Applications running on these systems have to deal with rates on the order of hours or days. Furthermore, some studies for future Exascale systems predict the rates to be on the order of minutes. As a result, efficient fault tolerance solutions are needed to be able to tolerate frequent failures. A fault tolerance solution for future HPC and Exascale systems must be low-cost, efficient and highly scalable. It should have low overhead in fault-free execution and provide fast restart because long-running applications are expected to experience many faults during the execution. Meanwhile task-based dataflow parallel programming models (PM) are becoming a popular paradigm in HPC applications at large scale. For instance, we see the adaptation of task-based dataflow parallelism in OpenMP 4.0, OmpSs PM, Argobots and Intel Threading Building Blocks. In this thesis we propose fault-tolerance solutions for task-parallel dataflow HPC applications. Specifically, first we design and implement a checkpoint/restart and message-logging framework to recover from errors. We then develop performance models to investigate the benefits of our task-level frameworks when integrated with system-wide checkpointing. Moreover, we design and implement selective task replication mechanisms to detect and recover from silent data corruptions in task-parallel dataflow HPC applications. Finally, we introduce a runtime-based coding scheme to detect and recover from memory errors in these applications. Considering the span of all of our schemes, we see that they provide a fairly high failure coverage where both computation and memory is protected against errors.A medida que los Sistemas de Cómputo de Alto rendimiento (HPC por sus siglas en inglés) siguen creciendo, también las tasas de fallos aumentan. Las aplicaciones que se ejecutan en estos sistemas tienen una tasa de fallos que pueden estar en el orden de horas o días. Además, algunos estudios predicen que los fallos estarán en el orden de minutos en los Sistemas Exascale. Por lo tanto, son necesarias soluciones eficientes para la tolerancia a fallos que puedan tolerar fallos frecuentes. Las soluciones para tolerancia a fallos en los Sistemas futuros de HPC y Exascale tienen que ser de bajo costo, eficientes y altamente escalable. El sobrecosto en la ejecución sin fallos debe ser bajo y también se debe proporcionar reinicio rápido, ya que se espera que las aplicaciones de larga duración experimenten muchos fallos durante la ejecución. Por otra parte, los modelos de programación paralelas basados en tareas ordenadas de acuerdo a sus dependencias de datos, se están convirtiendo en un paradigma popular en aplicaciones HPC a gran escala. Por ejemplo, los siguientes modelos de programación paralela incluyen este tipo de modelo de programación OpenMP 4.0, OmpSs, Argobots e Intel Threading Building Blocks. En esta tesis proponemos soluciones de tolerancia a fallos para aplicaciones de HPC programadas en un modelo de programación paralelo basado tareas. Específicamente, en primer lugar, diseñamos e implementamos mecanismos “checkpoint/restart” y “message-logging” para recuperarse de los errores. Para investigar los beneficios de nuestras herramientas a nivel de tarea cuando se integra con los “system-wide checkpointing” se han desarrollado modelos de rendimiento. Por otra parte, diseñamos e implementamos mecanismos de replicación selectiva de tareas que permiten detectar y recuperarse de daños de datos silenciosos en aplicaciones programadas siguiendo el modelo de programación paralela basadas en tareas. Por último, se introduce un esquema de codificación que funciona en tiempo de ejecución para detectar y recuperarse de los errores de la memoria en estas aplicaciones. Todos los esquemas propuestos, en conjunto, proporcionan una cobertura bastante alta a los fallos tanto si estos se producen el cálculo o en la memoria.Postprint (published version

    Static analysis for facilitating secure and reliable software

    Get PDF
    Software security and reliability are aspects of major concern for software development enterprises that wish to deliver dependable software to their customers. Several static analysis-based approaches for facilitating the development of secure and reliable software have been proposed over the years. The purpose of the present thesis is to investigate these approaches and to extend their state of the art by addressing existing open issues that have not been sufficiently addressed yet. To this end, an empirical study was initially conducted with the purpose to investigate the ability of software metrics (e.g., complexity metrics) to discriminate between different types of vulnerabilities, and to examine whether potential interdependencies exist between different vulnerability types. The results of the analysis revealed that software metrics can be used only as weak indicators of specific security issues, while important interdependencies may exist between different types of vulnerabilities. The study also verified the capacity of software metrics (including previously uninvestigated metrics) to indicate the existence of vulnerabilities in general. Subsequently, a hierarchical security assessment model able to quantify the internal security level of software products, based on static analysis alerts and software metrics is proposed. The model is practical, since it is fully-automated and operationalized in the form of individual tools, while it is also sufficiently reliable since it was built based on data and well-accepted sources of information. An extensive evaluation of the model on a large volume of empirical data revealed that it is able to reliably assess software security both at product- and at class-level of granularity, with sufficient discretion power, while it may be also used for vulnerability prediction. The experimental results also provide further support regarding the ability of static analysis alerts and software metrics to indicate the existence of software vulnerabilities. Finally, a mathematical model for calculating the optimum checkpoint interval, i.e., the checkpoint interval that minimizes the execution time of software programs that adopt the application-level checkpoint and restart (ALCR) mechanism was proposed. The optimum checkpoint interval was found to depend on the failure rate of the application, the execution cost for establishing a checkpoint, and the execution cost for restarting a program after failure. Emphasis was given on programs with loops, while the results were illustrated through several numerical examples.Open Acces

    Scalable and Reliable Sparse Data Computation on Emergent High Performance Computing Systems

    Get PDF
    Heterogeneous systems with both CPUs and GPUs have become important system architectures in emergent High Performance Computing (HPC) systems. Heterogeneous systems must address both performance-scalability and power-scalability in the presence of failures. Aggressive power reduction pushes hardware to its operating limit and increases the failure rate. Resilience allows programs to progress when subjected to faults and is an integral component of large-scale systems, but incurs significant time and energy overhead. The future exascale systems are expected to have higher power consumption with higher fault rates. Sparse data computation is the fundamental kernel in many scientific applications. It is suitable for the studies of scalability and resilience on heterogeneous systems due to its computational characteristics. To deliver the promised performance within the given power budget, heterogeneous computing mandates a deep understanding of the interplay between scalability and resilience. Managing scalability and resilience is challenging in heterogeneous systems, due to the heterogeneous compute capability, power consumption, and varying failure rates between CPUs and GPUs. Scalability and resilience have been traditionally studied in isolation, and optimizing one typically detrimentally impacts the other. While prior works have been proved successful in optimizing scalability and resilience on CPU-based homogeneous systems, simply extending current approaches to heterogeneous systems results in suboptimal performance-scalability and/or power-scalability. To address the above multiple research challenges, we propose novel resilience and energy-efficiency technologies to optimize scalability and resilience for sparse data computation on heterogeneous systems with CPUs and GPUs. First, we present generalized analytical and experimental methods to analyze and quantify the time and energy costs of various recovery schemes, and develop and prototype performance optimization and power management strategies to improve scalability for sparse linear solvers. Our results quantitatively reveal that each resilience scheme has its own advantages depending on the fault rate, system size, and power budget, and the forward recovery can further benefit from our performance and power optimizations for large-scale computing. Second, we design a novel resilience technique that relaxes the requirement of synchronization and identicalness for processes, and allows them to run in heterogeneous resources with power reduction. Our results show a significant reduction in energy for unmodified programs in various fault situations compared to exact replication techniques. Third, we propose a novel distributed sparse tensor decomposition that utilizes an asynchronous RDMA-based approach with OpenSHMEM to improve scalability on large-scale systems and prove that our method works well in heterogeneous systems. Our results show our irregularity-aware workload partition and balanced-asynchronous algorithms are scalable and outperform the state-of-the-art distributed implementations. We demonstrate that understanding different bottlenecks for various types of tensors plays critical roles in improving scalability
    corecore