254 research outputs found
HarvOS: Efficient code instrumentation for transiently-powered embedded sensing
We present code instrumentation strategies to allow transiently-powered embedded sensing devices efficiently checkpoint the system's state before energy is exhausted. Our solution, called HarvOS, operates at compile-time with limited developer intervention based on the control-flow graph of a program, while adapting to varying levels of remaining energy and possible program executions at run-time. In addition, the underlying design rationale allows the system to spare the energy-intensive probing of the energy buffer whenever possible. Compared to existing approaches, our evaluation indicates that HarvOS allows transiently-powered devices to complete a given workload with 68% fewer checkpoints, on average. Moreover, our performance in the number of required checkpoints rests only 19% far from that of an "oracle" that represents an ideal solution, yet unfeasible in practice, that knows exactly the last point in time when to checkpoint
CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance
In order to efficiently use the future generations of supercomputers, fault
tolerance and power consumption are two of the prime challenges anticipated by
the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has
been and still is the most widely used technique to deal with hard failures.
Application-level CR is the most effective CR technique in terms of overhead
efficiency but it takes a lot of implementation effort. This work presents the
implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic
Fault Tolerance), which serves two purposes. First, it provides an extendable
library that significantly eases the implementation of application-level
checkpointing. The most basic and frequently used checkpoint data types are
already part of CRAFT and can be directly used out of the box. The library can
be easily extended to add more data types. As means of overhead reduction, the
library offers a build-in asynchronous checkpointing mechanism and also
supports the Scalable Checkpoint/Restart (SCR) library for node level
checkpointing. Second, CRAFT provides an easier interface for User-Level
Failure Mitigation (ULFM) based dynamic process recovery, which significantly
reduces the complexity and effort of failure detection and communication
recovery mechanism. By utilizing both functionalities together, applications
can write application-level checkpoints and recover dynamically from process
failures with very limited programming effort. This work presents the design
and use of our library in detail. The associated overheads are thoroughly
analyzed using several benchmarks
FPGA checkpointing for scientific computing
The use of FPGAs in computational workloads is becoming increasingly popular due to the flexibility of these devices in comparison to ASICs, and their low power consumption compared to GPUs and CPUs. However, scientific applications run for long periods of time and the hardware is always subject to failures due to either soft or hard errors. Thus, it is important to protect these long running jobs with fault tolerance mechanisms. Checkpoint-Restart is a popular technique in high-performance computing that allows large scale applications to cope with frequent failures. In this work we approach the fault tolerance of CPU-FPGA heterogeneous applications from a high level by using OmpSs@FPGA environment and a multi-level checkpointing library. We analyse the performance of several different applications and we understand what kind of overheads we can expect from checkpointing computational workloads running on FPGAs. Our results demonstrate overheads as low as 0.16% and 0.66% when checkpointing very frequently, indicating that this technique is efficient and does not add a significant amount of overhead to the system. In addition, we showcase a proof of concept for checkpointing partial data of the FPGA task itself. This can prove useful for workloads in which most data is offloaded to the FPGA memory at once and do not constantly move all the data between the accelerator and the CPU.This research has received funding from the European Union’s Horizon 2020 research and innovation programme under projects EuroEXA (grant agreement nº 754337) and eProcessor (grant agreement nº 956702).Peer ReviewedPostprint (author's final draft
Analysis of Performance-impacting Factors on Checkpointing Frameworks: The CPPC Case Study
This is a post-peer-review, pre-copyedit version of an article published in The Computer Journal. The final authenticated version is available online at: https://doi.org/10.1093/comjnl/bxr018[Abstract] This paper focuses on the performance evaluation of Compiler for Portable Checkpointing (CPPC), a tool for the checkpointing of parallel message-passing applications. Its performance and the factors that impact it are transparently and rigorously identified and assessed. The tests were performed on a public supercomputing infrastructure, using a large number of very different applications and showing excellent results in terms of performance and effort required for integration into user codes. Statistical analysis techniques have been used to better approximate the performance of the tool. Quantitative and qualitative comparisons with other rollback-recovery approaches to fault tolerance are also included. All these data and comparisons are then discussed in an effort to extract meaningful conclusions about the state-of-the-art and future research trends in the rollback-recovery field.Minsiterio de Ciencia e Innovación; TIN2010-1673
Hibernus++: a self-calibrating and adaptive system for transiently-powered embedded devices
Energy harvesters are being used to power autonomous systems, but their output power is variable and intermittent. To sustain computation, these systems integrate batteries or supercapacitors to smooth out rapid changes in harvester output. Energy storage devices require time for charging and increase the size, mass and cost of systems. The field of transient computing moves away from this approach, by powering the system directly from the harvester output. To prevent an application from having to restart computation after a power outage, approaches such as Hibernus allow these systems to hibernate when supply failure is imminent. When the supply reaches the operating threshold, the last saved state is restored and the operation is continued from the point it was interrupted. This work proposes Hibernus++ to intelligently adapt the hibernate and restore thresholds in response to source dynamics and system load properties. Specifically, capabilities are built into the system to autonomously characterize the hardware platform and its performance during hibernation in order to set the hibernation threshold at a point which minimizes wasted energy and maximizes computation time. Similarly, the system auto-calibrates the restore threshold depending on the balance of energy supply and consumption in order to maximize computation time. Hibernus++ is validated both theoretically and experimentally on microcontroller hardware using both synthesized and real energy harvesters. Results show that Hibernus++ provides an average 16% reduction in energy consumption and an improvement of 17% in application execution time over stateof- the-art approaches
Application-level Fault Tolerance and Resilience in HPC Applications
Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01[Resumo]
As necesidades computacionais das distintas ramas da ciencia medraron enormemente
nos últimos anos, o que provocou un gran crecemento no rendemento proporcionado
polos supercomputadores. Cada vez constrúense sistemas de computación
de altas prestacións de maior tamaño, con máis recursos hardware de distintos tipos,
o que fai que as taxas de fallo destes sistemas tamén medren. Polo tanto, o
estudo de técnicas de tolerancia a fallos eficientes é indispensábel para garantires
que os programas científicos poidan completar a súa execución, evitando ademais
que se dispare o consumo de enerxía. O checkpoint/restart é unha das técnicas máis
populares. Sen embargo, a maioría da investigación levada a cabo nas últimas décadas
céntrase en estratexias stop-and-restart para aplicacións de memoria distribuída
tralo acontecemento dun fallo-parada. Esta tese propón técnicas checkpoint/restart
a nivel de aplicación para os modelos de programación paralela roáis populares en
supercomputación. Implementáronse protocolos de checkpointing para aplicacións
híbridas MPI-OpenMP e aplicacións heteroxéneas baseadas en OpenCL, en ámbolos
dous casos prestando especial coidado á portabilidade e maleabilidade da solución.
En canto a aplicacións de memoria distribuída, proponse unha solución de resiliencia
que pode ser empregada de forma xenérica en aplicacións MPI SPMD, permitindo
detectar e reaccionar a fallos-parada sen abortar a execución. Neste caso, os procesos
fallidos vólvense a lanzar e o estado da aplicación recupérase cunha volta atrás global.
A maiores, esta solución de resiliencia optimizouse implementando unha volta
atrás local, na que só os procesos fallidos volven atrás, empregando un protocolo de
almacenaxe de mensaxes para garantires a consistencia e o progreso da execución.
Por último, propónse a extensión dunha librería de checkpointing para facilitares a implementación de estratexias de recuperación ad hoc ante conupcións de memoria.
En moitas ocasións, estos erros poden ser xestionados a nivel de aplicación, evitando
desencadear un fallo-parada e permitindo unha recuperación máis eficiente.[Resumen]
El rápido aumento de las necesidades de cómputo de distintas ramas de la ciencia
ha provocado un gran crecimiento en el rendimiento ofrecido por los supercomputadores.
Cada vez se construyen sistemas de computación de altas prestaciones mayores,
con más recursos hardware de distintos tipos, lo que hace que las tasas de
fallo del sistema aumenten. Por tanto, el estudio de técnicas de tolerancia a fallos
eficientes resulta indispensable para garantizar que los programas científicos puedan
completar su ejecución, evitando además que se dispare el consumo de energía. La
técnica checkpoint/restart es una de las más populares. Sin embargo, la mayor parte
de la investigación en este campo se ha centrado en estrategias stop-and-restart
para aplicaciones de memoria distribuida tras la ocurrencia de fallos-parada. Esta
tesis propone técnicas checkpoint/restart a nivel de aplicación para los modelos de
programación paralela más populares en supercomputación. Se han implementado
protocolos de checkpointing para aplicaciones híbridas MPI-OpenMP y aplicaciones
heterogéneas basadas en OpenCL, prestando en ambos casos especial atención a la
portabilidad y la maleabilidad de la solución. Con respecto a aplicaciones de memoria
distribuida, se propone una solución de resiliencia que puede ser usada de forma
genérica en aplicaciones MPI SPMD, permitiendo detectar y reaccionar a fallosparada
sin abortar la ejecución. En su lugar, se vuelven a lanzar los procesos fallidos
y se recupera el estado de la aplicación con una vuelta atrás global. A mayores, esta
solución de resiliencia ha sido optimizada implementando una vuelta atrás local, en
la que solo los procesos fallidos vuelven atrás, empleando un protocolo de almacenaje
de mensajes para garantizar la consistencia y el progreso de la ejecución. Por
último, se propone una extensión de una librería de checkpointing para facilitar la
implementación de estrategias de recuperación ad hoc ante corrupciones de memoria.
Muchas veces, este tipo de errores puede gestionarse a nivel de aplicación, evitando
desencadenar un fallo-parada y permitiendo una recuperación más eficiente.[Abstract]
The rapid increase in the computational demands of science has lead to a pronounced
growth in the performance offered by supercomputers. As High Performance
Computing (HPC) systems grow larger, including more hardware components
of different types, the system's failure rate becomes higher. Efficient fault
tolerance techniques are essential not only to ensure the execution completion but
also to save energy. Checkpoint/restart is one of the most popular fault tolerance
techniques. However, most of the research in this field is focused on stop-and-restart
strategies for distributed-memory applications in the event of fail-stop failures. Thís
thesis focuses on the implementation of application-level checkpoint/restart solutions
for the most popular parallel programming models used in HPC. Hence, we
have implemented checkpointing solutions to cope with fail-stop failures in hybrid
MPI-OpenMP applications and OpenCL-based programs. Both strategies maximize
the restart portability and malleability, ie., the recovery can take place on
machines with different CPU / accelerator architectures, and/ or operating systems,
and can be adapted to the available resources (number of cores/accelerators). Regarding
distributed-memory applications, we propose a resilience solution that can
be generally applied to SPMD MPI programs. Resilient applications can detect and
react to failures without aborting their execution upon fail-stop failures. Instead,
failed processes are re-spawned, and the application state is recovered through a
global rollback. Moreover, we have optimized this resilience proposal by implementing
a local rollback protocol, in which only failed processes rollback to a previous
state, while message logging enables global consistency and further progress of the
computation. Finally, we have extended a checkpointing library to facilitate the
implementation of ad hoc recovery strategies in the event of soft errors) caused by
memory corruptions. Many times, these errors can be handled at the software-Ievel,
tIms, avoiding fail-stop failures and enabling a more efficient recovery
- …