310 research outputs found
Application-level Fault Tolerance and Resilience in HPC Applications
Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01[Resumo]
As necesidades computacionais das distintas ramas da ciencia medraron enormemente
nos últimos anos, o que provocou un gran crecemento no rendemento proporcionado
polos supercomputadores. Cada vez constrúense sistemas de computación
de altas prestacións de maior tamaño, con máis recursos hardware de distintos tipos,
o que fai que as taxas de fallo destes sistemas tamén medren. Polo tanto, o
estudo de técnicas de tolerancia a fallos eficientes é indispensábel para garantires
que os programas científicos poidan completar a súa execución, evitando ademais
que se dispare o consumo de enerxía. O checkpoint/restart é unha das técnicas máis
populares. Sen embargo, a maioría da investigación levada a cabo nas últimas décadas
céntrase en estratexias stop-and-restart para aplicacións de memoria distribuída
tralo acontecemento dun fallo-parada. Esta tese propón técnicas checkpoint/restart
a nivel de aplicación para os modelos de programación paralela roáis populares en
supercomputación. Implementáronse protocolos de checkpointing para aplicacións
híbridas MPI-OpenMP e aplicacións heteroxéneas baseadas en OpenCL, en ámbolos
dous casos prestando especial coidado á portabilidade e maleabilidade da solución.
En canto a aplicacións de memoria distribuída, proponse unha solución de resiliencia
que pode ser empregada de forma xenérica en aplicacións MPI SPMD, permitindo
detectar e reaccionar a fallos-parada sen abortar a execución. Neste caso, os procesos
fallidos vólvense a lanzar e o estado da aplicación recupérase cunha volta atrás global.
A maiores, esta solución de resiliencia optimizouse implementando unha volta
atrás local, na que só os procesos fallidos volven atrás, empregando un protocolo de
almacenaxe de mensaxes para garantires a consistencia e o progreso da execución.
Por último, propónse a extensión dunha librería de checkpointing para facilitares a implementación de estratexias de recuperación ad hoc ante conupcións de memoria.
En moitas ocasións, estos erros poden ser xestionados a nivel de aplicación, evitando
desencadear un fallo-parada e permitindo unha recuperación máis eficiente.[Resumen]
El rápido aumento de las necesidades de cómputo de distintas ramas de la ciencia
ha provocado un gran crecimiento en el rendimiento ofrecido por los supercomputadores.
Cada vez se construyen sistemas de computación de altas prestaciones mayores,
con más recursos hardware de distintos tipos, lo que hace que las tasas de
fallo del sistema aumenten. Por tanto, el estudio de técnicas de tolerancia a fallos
eficientes resulta indispensable para garantizar que los programas científicos puedan
completar su ejecución, evitando además que se dispare el consumo de energía. La
técnica checkpoint/restart es una de las más populares. Sin embargo, la mayor parte
de la investigación en este campo se ha centrado en estrategias stop-and-restart
para aplicaciones de memoria distribuida tras la ocurrencia de fallos-parada. Esta
tesis propone técnicas checkpoint/restart a nivel de aplicación para los modelos de
programación paralela más populares en supercomputación. Se han implementado
protocolos de checkpointing para aplicaciones híbridas MPI-OpenMP y aplicaciones
heterogéneas basadas en OpenCL, prestando en ambos casos especial atención a la
portabilidad y la maleabilidad de la solución. Con respecto a aplicaciones de memoria
distribuida, se propone una solución de resiliencia que puede ser usada de forma
genérica en aplicaciones MPI SPMD, permitiendo detectar y reaccionar a fallosparada
sin abortar la ejecución. En su lugar, se vuelven a lanzar los procesos fallidos
y se recupera el estado de la aplicación con una vuelta atrás global. A mayores, esta
solución de resiliencia ha sido optimizada implementando una vuelta atrás local, en
la que solo los procesos fallidos vuelven atrás, empleando un protocolo de almacenaje
de mensajes para garantizar la consistencia y el progreso de la ejecución. Por
último, se propone una extensión de una librería de checkpointing para facilitar la
implementación de estrategias de recuperación ad hoc ante corrupciones de memoria.
Muchas veces, este tipo de errores puede gestionarse a nivel de aplicación, evitando
desencadenar un fallo-parada y permitiendo una recuperación más eficiente.[Abstract]
The rapid increase in the computational demands of science has lead to a pronounced
growth in the performance offered by supercomputers. As High Performance
Computing (HPC) systems grow larger, including more hardware components
of different types, the system's failure rate becomes higher. Efficient fault
tolerance techniques are essential not only to ensure the execution completion but
also to save energy. Checkpoint/restart is one of the most popular fault tolerance
techniques. However, most of the research in this field is focused on stop-and-restart
strategies for distributed-memory applications in the event of fail-stop failures. Thís
thesis focuses on the implementation of application-level checkpoint/restart solutions
for the most popular parallel programming models used in HPC. Hence, we
have implemented checkpointing solutions to cope with fail-stop failures in hybrid
MPI-OpenMP applications and OpenCL-based programs. Both strategies maximize
the restart portability and malleability, ie., the recovery can take place on
machines with different CPU / accelerator architectures, and/ or operating systems,
and can be adapted to the available resources (number of cores/accelerators). Regarding
distributed-memory applications, we propose a resilience solution that can
be generally applied to SPMD MPI programs. Resilient applications can detect and
react to failures without aborting their execution upon fail-stop failures. Instead,
failed processes are re-spawned, and the application state is recovered through a
global rollback. Moreover, we have optimized this resilience proposal by implementing
a local rollback protocol, in which only failed processes rollback to a previous
state, while message logging enables global consistency and further progress of the
computation. Finally, we have extended a checkpointing library to facilitate the
implementation of ad hoc recovery strategies in the event of soft errors) caused by
memory corruptions. Many times, these errors can be handled at the software-Ievel,
tIms, avoiding fail-stop failures and enabling a more efficient recovery
Unified fault-tolerance framework for hybrid task-parallel message-passing applications
We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the task that experienced the error and transparently handles any message passing interface calls inside the task. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks. Secondly, we develop a mathematical model to unify task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme.The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the FI-DGR 2013 scholarship
and the European Community’s Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402 and TIN2015-65316-P.Peer ReviewedPostprint (author's final draft
Reliability for exascale computing : system modelling and error mitigation for task-parallel HPC applications
As high performance computing (HPC) systems continue to grow, their fault rate increases. Applications running on these systems have to deal with rates on the order of hours or days. Furthermore, some studies for future Exascale systems predict the rates to be on the order of minutes. As a result, efficient fault tolerance solutions are needed to be able to tolerate frequent failures.
A fault tolerance solution for future HPC and Exascale systems must be low-cost, efficient and highly scalable. It should have low overhead in fault-free execution and provide fast restart because long-running applications are expected to experience many faults during the execution. Meanwhile task-based dataflow parallel programming models (PM) are becoming a popular paradigm in HPC applications at large scale. For instance, we see the adaptation of task-based dataflow parallelism in OpenMP 4.0, OmpSs PM, Argobots and Intel Threading Building Blocks.
In this thesis we propose fault-tolerance solutions for task-parallel dataflow HPC applications. Specifically, first we design and implement a checkpoint/restart and message-logging framework to recover from errors. We then develop performance models to investigate the benefits of our task-level frameworks when integrated with system-wide checkpointing. Moreover, we design and implement selective task replication mechanisms to detect and recover from silent data corruptions in task-parallel dataflow HPC applications. Finally, we introduce a runtime-based coding scheme to detect and recover from memory errors in these applications.
Considering the span of all of our schemes, we see that they provide a fairly high failure coverage where both computation and memory is protected against errors.A medida que los Sistemas de Cómputo de Alto rendimiento (HPC por sus siglas en inglés) siguen creciendo, también las tasas de fallos aumentan. Las aplicaciones que se ejecutan en estos sistemas tienen una tasa de fallos que pueden estar en el orden de horas o días. Además, algunos estudios predicen que los fallos estarán en el orden de minutos en los Sistemas Exascale. Por lo tanto, son necesarias soluciones eficientes para la tolerancia a fallos que puedan tolerar fallos frecuentes.
Las soluciones para tolerancia a fallos en los Sistemas futuros de HPC y Exascale tienen que ser de bajo costo, eficientes y altamente escalable.
El sobrecosto en la ejecución sin fallos debe ser bajo y también se debe proporcionar reinicio rápido, ya que se espera que las aplicaciones de larga duración experimenten muchos fallos durante la ejecución. Por otra parte, los modelos de programación paralelas basados en tareas ordenadas de acuerdo a sus dependencias de datos, se están convirtiendo en un paradigma popular en aplicaciones HPC a gran escala. Por ejemplo, los siguientes modelos de programación paralela incluyen este tipo de modelo de programación OpenMP 4.0, OmpSs, Argobots e Intel Threading Building Blocks.
En esta tesis proponemos soluciones de tolerancia a fallos para aplicaciones de HPC programadas en un modelo de programación paralelo basado tareas. Específicamente, en primer lugar, diseñamos e implementamos mecanismos “checkpoint/restart” y “message-logging” para recuperarse de los errores.
Para investigar los beneficios de nuestras herramientas a nivel de tarea cuando se integra con los “system-wide checkpointing” se han desarrollado modelos de rendimiento. Por otra parte, diseñamos e implementamos mecanismos de replicación selectiva de tareas que permiten detectar y recuperarse de daños de datos silenciosos en aplicaciones programadas siguiendo el modelo de programación paralela basadas en tareas. Por último, se introduce un esquema de codificación que funciona en tiempo de ejecución para detectar y recuperarse de los errores de la memoria en estas aplicaciones.
Todos los esquemas propuestos, en conjunto, proporcionan una cobertura bastante alta a los fallos tanto si estos se producen el cálculo o en la memoria.Postprint (published version
Checkpoint and run-time adaptation with pluggable parallelisation
Enabling applications for computational Grids requires new approaches to develop applications that can effectively cope with resource volatility. Applications must be resilient to resource faults, adapting the behaviour to available resources. This paper describes an approach to application-level adaptation that efficiently supports application-level checkpointing. The key of this work is the concept of pluggable parallelisation, which localises parallelisation issues into multiple modules that can be (un)plugged to match resource availability. This paper shows how pluggable parallelisation can be extended to effectively support checkpointing and run-time adaptation. We present the developed pluggable mechanism that helps the programmer to include checkpointing in the base (sequential). Based on these mechanisms and on previous work on pluggable parallelisation, our approach is able to automatically add support for checkpointing in parallel execution environments. Moreover, applications can adapt from a sequential execution to a multi-cluster configuration. Adaptation can be performed by checkpointing the application and restarting on a different mode or can be performed during run-time. Pluggable parallelisation intrinsically promotes the separation of software functionality from fault-tolerance and adaptation issues facilitating their analysis and evolution. The work presented in this paper reinforces this idea by showing the feasibility of the approach and performance benefits that can be achieved.(undefined
An aspect-oriented approach to fault-tolerance in grid platforms
Migrating traditional scientific applications to computational Grids requires programming tools that can help programmers to update application behaviour to this kind of platforms. Computational Grids are particularly suited for long running scientific applications, but they are also more prone to faults than desktop machines. The AspectGrid framework aims to develop methodologies and tools that can help to Grid-enable scientific applications, particularly focusing on techniques based on aspect-oriented programming. In this paper we present the aspect-oriented approach taken in the AspectGrid framework to address faults in computational Grids. In the proposed approach, scientific applications are enhanced with fault-tolerance capability by plugging additional modules. The proposed technique is portable across operating systems and minimises the changes required to base applications
AspectGrid: Aspect-Oriented Fault-Tolerance in Grid Platforms
Migrating traditional scientific applications to computational Grids requires programming tools that can help programmers update application behaviour to this kind of platforms. Computational Grids are particularly suited for long running scientific applications, but they are also more prone to faults than desktop machines. The AspectGrid framework aims to develop methodologies and tools that can help Grid-enable scientific applications, particularly focusing on techniques based on aspect-oriented programming. In this paper we present the aspect-oriented approach taken in the AspectGrid framework to address faults in computational Grids. In the proposed approach, scientific applications are enhanced with fault-tolerance capability by plugging additional modules. The proposed technique is portable across operating systems and minimises the changes required to base applications
Implementation-Oblivious Transparent Checkpoint-Restart for MPI
This work presents experience with traditional use cases of checkpointing on
a novel platform. A single codebase (MANA) transparently checkpoints production
workloads for major available MPI implementations: "develop once, run
everywhere". The new platform enables application developers to compile their
application against any of the available standards-compliant MPI
implementations, and test each MPI implementation according to performance or
other features.Comment: 17 pages, 4 figure
AspectGrid: aspect-oriented fault-tolerance in grid platforms
Migrating traditional scientific applications to computational Grids requires programming tools that can help programmers update application behaviour to this kind of platforms. Computational Grids are particularly suited for long running scientific applications, but they are also more prone to faults than desktop machines. The AspectGrid framework aims to develop methodologies and tools that can help Grid-enable scientific applications, particularly focusing on techniques based on aspect-oriented programming. In this paper we present the aspect-oriented approach taken in the AspectGrid framework to address faults in computational Grids. In the proposed approach, scientific applications are enhanced with fault-tolerance capability by plugging additional modules. The proposed technique is portable across operating systems and minimises the changes required to base applications
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016) Timisoara, Romania. February 8-11, 2016.The PhD Symposium was a very good opportunity for the young researchers to share information and knowledge, to
present their current research, and to discuss topics with other students in order to look for synergies and common research
topics. The idea was very successful and the assessment made by the PhD Student was very good. It also helped to
achieve one of the major goals of the NESUS Action: to establish an open European research network targeting sustainable
solutions for ultrascale computing aiming at cross fertilization among HPC, large scale distributed systems, and big
data management, training, contributing to glue disparate researchers working across different areas and provide a meeting
ground for researchers in these separate areas to exchange ideas, to identify synergies, and to pursue common activities in
research topics such as sustainable software solutions (applications and system software stack), data management, energy
efficiency, and resilience.European Cooperation in Science and Technology. COS
Checkpointing as a Service in Heterogeneous Cloud Environments
A non-invasive, cloud-agnostic approach is demonstrated for extending
existing cloud platforms to include checkpoint-restart capability. Most cloud
platforms currently rely on each application to provide its own fault
tolerance. A uniform mechanism within the cloud itself serves two purposes: (a)
direct support for long-running jobs, which would otherwise require a custom
fault-tolerant mechanism for each application; and (b) the administrative
capability to manage an over-subscribed cloud by temporarily swapping out jobs
when higher priority jobs arrive. An advantage of this uniform approach is that
it also supports parallel and distributed computations, over both TCP and
InfiniBand, thus allowing traditional HPC applications to take advantage of an
existing cloud infrastructure. Additionally, an integrated health-monitoring
mechanism detects when long-running jobs either fail or incur exceptionally low
performance, perhaps due to resource starvation, and proactively suspends the
job. The cloud-agnostic feature is demonstrated by applying the implementation
to two very different cloud platforms: Snooze and OpenStack. The use of a
cloud-agnostic architecture also enables, for the first time, migration of
applications from one cloud platform to another.Comment: 20 pages, 11 figures, appears in CCGrid, 201
- …