93 research outputs found
A fault-tolerance protocol for parallel applications with communication imbalance
ArticuloThe predicted failure rates of future supercomputers
loom the groundbreaking research large machines are
expected to foster. Therefore, resilient extreme-scale applications
are an absolute necessity to effectively use the new generation
of supercomputers. Rollback-recovery techniques have been
traditionally used in HPC to provide resilience. Among those
techniques, message logging provides the appealing features of
saving energy, accelerating recovery, and having low performance
penalty. Its increased memory consumption is, however, an
important downside. This paper introduces memory-constrained
message logging (MCML), a general framework for decreasing the
memory footprint of message-logging protocols. In particular, we
demonstrate the effectiveness of MCML in maintaining message
logging feasible for applications with substantial communication
imbalance. This type of applications appear in many scientific
fields. We present experimental results with several parallel codes
running on up to 4,096 cores. Using those results and an analytical
model, we predict MCML can reduce execution time up to 25%
and energy consumption up to 15%, at extreme scale
A Pattern Language for High-Performance Computing Resilience
High-performance computing systems (HPC) provide powerful capabilities for
modeling, simulation, and data analytics for a broad class of computational
problems. They enable extreme performance of the order of quadrillion
floating-point arithmetic calculations per second by aggregating the power of
millions of compute, memory, networking and storage components. With the
rapidly growing scale and complexity of HPC systems for achieving even greater
performance, ensuring their reliable operation in the face of system
degradations and failures is a critical challenge. System fault events often
lead the scientific applications to produce incorrect results, or may even
cause their untimely termination. The sheer number of components in modern
extreme-scale HPC systems and the complex interactions and dependencies among
the hardware and software components, the applications, and the physical
environment makes the design of practical solutions that support fault
resilience a complex undertaking. To manage this complexity, we developed a
methodology for designing HPC resilience solutions using design patterns. We
codified the well-known techniques for handling faults, errors and failures
that have been devised, applied and improved upon over the past three decades
in the form of design patterns. In this paper, we present a pattern language to
enable a structured approach to the development of HPC resilience solutions.
The pattern language reveals the relations among the resilience patterns and
provides the means to explore alternative techniques for handling a specific
fault model that may have different efficiency and complexity characteristics.
Using the pattern language enables the design and implementation of
comprehensive resilience solutions as a set of interconnected resilience
patterns that can be instantiated across layers of the system stack.Comment: Proceedings of the 22nd European Conference on Pattern Languages of
Program
A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud
High Performance Computing (HPC) systems have been widely used by scientists and researchers in both industry and university laboratories to solve advanced computation problems. Most advanced computation problems are either data-intensive or computation-intensive. They may take hours, days or even weeks to complete execution. For example, some of the traditional HPC systems computations run on 100,000 processors for weeks. Consequently traditional HPC systems often require huge capital investments. As a result, scientists and researchers sometimes have to wait in long queues to access shared, expensive HPC systems. Cloud computing, on the other hand, offers new computing paradigms, capacity, and flexible solutions for both business and HPC applications. Some of the computation-intensive applications that are usually executed in traditional HPC systems can now be executed in the cloud. Cloud computing price model eliminates huge capital investments. However, even for cloud-based HPC systems, fault tolerance is still an issue of growing concern. The large number of virtual machines and electronic components, as well as software complexity and overall system reliability, availability and serviceability (RAS), are factors with which HPC systems in the cloud must contend. The reactive fault tolerance approach of checkpoint/restart, which is commonly used in HPC systems, does not scale well in the cloud due to resource sharing and distributed systems networks. Hence, the need for reliable fault tolerant HPC systems is even greater in a cloud environment. In this thesis we present a proactive fault tolerance approach to HPC systems in the cloud to reduce the wall-clock execution time, as well as dollar cost, in the presence of hardware failure. We have developed a generic fault tolerance algorithm for HPC systems in the cloud. We have further developed a cost model for executing computation-intensive applications on HPC systems in the cloud. Our experimental results obtained from a real cloud execution environment show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be considerably reduced compared to checkpoint and redundancy techniques used in traditional HPC systems
Local Rollback for Resilient Mpi Applications With Application-Level Checkpointing and Message Logging
[Abstract]
The resilience approach generally used in high-performance computing (HPC) relies on coordinated checkpoint/restart, a global rollback of all the processes that are running the application. However, in many instances, the failure has a more localized scope and its impact is usually restricted to a subset of the resources being used. Thus, a global rollback would result in unnecessary overhead and energy consumption, since all processes, including those unaffected by the failure, discard their state and roll back to the last checkpoint to repeat computations that were already done. The User Level Failure Mitigation (ULFM) interface – the last proposal for the inclusion of resilience features in the Message Passing Interface (MPI) standard – enables the deployment of more flexible recovery strategies, including localized recovery. This work proposes a local rollback approach that can be generally applied to Single Program, Multiple Data (SPMD) applications by combining ULFM, the ComPiler for Portable Checkpointing (CPPC) tool, and the Open MPI VProtocol system-level message logging component. Only failed processes are recovered from the last checkpoint, while consistency before further progress in the execution is achieved through a two-level message logging process. To further optimize this approach point-to-point communications are logged by the Open MPI VProtocol component, while collective communications are optimally logged at the application level—thereby decoupling the logging protocol from the particular collective implementation. This spatially coordinated protocol applied by CPPC reduces the log size, the log memory requirements and overall the resilience impact on the applications.This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Projects TIN2016-75845-P and the predoctoral grants of Nuria Losada ref. BES-2014-068066 and ref. EEBB-I-17-12005); by EU under the COST Program Action IC1305 Network for Sustainable Ultrascale Computing (NESUS) and a HiPEAC Collaboration Grant and by the Galician Government (Xunta de Galicia) under the Consolidation Program of Competitive Research (ref. ED431C 2017/04). We gratefully thank Galicia Supercomputing Center for providing access to the FinisTerrae-II supercomputer.
This material is also based upon work supported by the US National Science Foundation, Office of Advanced Cyberinfrastructure , under Grants No. #1664142 and #1339763Xunta de Galicia; ED431C 2017/04US National Science Foundation, Office of Advanced Cyberinfrastructure; 1664142US National Science Foundation, Office of Advanced Cyberinfrastructure; 133976
Using migratable objects to enhance fault tolerance schemes in supercomputers
Supercomputers have seen an exponential increase in their size in the last two decades. Such a high growth rate is expected to take us to exascale in the timeframe 2018-2022. But, to bring a productive exascale environment about, it is necessary to focus on several key challenges. One of those challenges is fault tolerance. Machines at extreme scale will experience frequent failures and will require the system to avoid or overcome those failures. Various techniques have recently been developed to tolerate failures. The impact of these techniques and their scalability can be substantially enhanced by a parallel programming model called migratable objects. In this paper, we demonstrate how the migratable-objects model facilitates and improves several fault tolerance approaches. Our experimental results on thousands of cores suggest fault tolerance schemes based on migratable objects have low performance overhead and high scalability. Additionally, we present a performance model that predicts a significant benefit of using migratable objects to provide fault tolerance at extreme scale
Application-level Fault Tolerance and Resilience in HPC Applications
Programa Oficial de Doutoramento en Investigación en TecnoloxÃas da Información. 524V01[Resumo]
As necesidades computacionais das distintas ramas da ciencia medraron enormemente
nos últimos anos, o que provocou un gran crecemento no rendemento proporcionado
polos supercomputadores. Cada vez constrúense sistemas de computación
de altas prestacións de maior tamaño, con máis recursos hardware de distintos tipos,
o que fai que as taxas de fallo destes sistemas tamén medren. Polo tanto, o
estudo de técnicas de tolerancia a fallos eficientes é indispensábel para garantires
que os programas cientÃficos poidan completar a súa execución, evitando ademais
que se dispare o consumo de enerxÃa. O checkpoint/restart é unha das técnicas máis
populares. Sen embargo, a maiorÃa da investigación levada a cabo nas últimas décadas
céntrase en estratexias stop-and-restart para aplicacións de memoria distribuÃda
tralo acontecemento dun fallo-parada. Esta tese propón técnicas checkpoint/restart
a nivel de aplicación para os modelos de programación paralela roáis populares en
supercomputación. Implementáronse protocolos de checkpointing para aplicacións
hÃbridas MPI-OpenMP e aplicacións heteroxéneas baseadas en OpenCL, en ámbolos
dous casos prestando especial coidado á portabilidade e maleabilidade da solución.
En canto a aplicacións de memoria distribuÃda, proponse unha solución de resiliencia
que pode ser empregada de forma xenérica en aplicacións MPI SPMD, permitindo
detectar e reaccionar a fallos-parada sen abortar a execución. Neste caso, os procesos
fallidos vólvense a lanzar e o estado da aplicación recupérase cunha volta atrás global.
A maiores, esta solución de resiliencia optimizouse implementando unha volta
atrás local, na que só os procesos fallidos volven atrás, empregando un protocolo de
almacenaxe de mensaxes para garantires a consistencia e o progreso da execución.
Por último, propónse a extensión dunha librerÃa de checkpointing para facilitares a implementación de estratexias de recuperación ad hoc ante conupcións de memoria.
En moitas ocasións, estos erros poden ser xestionados a nivel de aplicación, evitando
desencadear un fallo-parada e permitindo unha recuperación máis eficiente.[Resumen]
El rápido aumento de las necesidades de cómputo de distintas ramas de la ciencia
ha provocado un gran crecimiento en el rendimiento ofrecido por los supercomputadores.
Cada vez se construyen sistemas de computación de altas prestaciones mayores,
con más recursos hardware de distintos tipos, lo que hace que las tasas de
fallo del sistema aumenten. Por tanto, el estudio de técnicas de tolerancia a fallos
eficientes resulta indispensable para garantizar que los programas cientÃficos puedan
completar su ejecución, evitando además que se dispare el consumo de energÃa. La
técnica checkpoint/restart es una de las más populares. Sin embargo, la mayor parte
de la investigación en este campo se ha centrado en estrategias stop-and-restart
para aplicaciones de memoria distribuida tras la ocurrencia de fallos-parada. Esta
tesis propone técnicas checkpoint/restart a nivel de aplicación para los modelos de
programación paralela más populares en supercomputación. Se han implementado
protocolos de checkpointing para aplicaciones hÃbridas MPI-OpenMP y aplicaciones
heterogéneas basadas en OpenCL, prestando en ambos casos especial atención a la
portabilidad y la maleabilidad de la solución. Con respecto a aplicaciones de memoria
distribuida, se propone una solución de resiliencia que puede ser usada de forma
genérica en aplicaciones MPI SPMD, permitiendo detectar y reaccionar a fallosparada
sin abortar la ejecución. En su lugar, se vuelven a lanzar los procesos fallidos
y se recupera el estado de la aplicación con una vuelta atrás global. A mayores, esta
solución de resiliencia ha sido optimizada implementando una vuelta atrás local, en
la que solo los procesos fallidos vuelven atrás, empleando un protocolo de almacenaje
de mensajes para garantizar la consistencia y el progreso de la ejecución. Por
último, se propone una extensión de una librerÃa de checkpointing para facilitar la
implementación de estrategias de recuperación ad hoc ante corrupciones de memoria.
Muchas veces, este tipo de errores puede gestionarse a nivel de aplicación, evitando
desencadenar un fallo-parada y permitiendo una recuperación más eficiente.[Abstract]
The rapid increase in the computational demands of science has lead to a pronounced
growth in the performance offered by supercomputers. As High Performance
Computing (HPC) systems grow larger, including more hardware components
of different types, the system's failure rate becomes higher. Efficient fault
tolerance techniques are essential not only to ensure the execution completion but
also to save energy. Checkpoint/restart is one of the most popular fault tolerance
techniques. However, most of the research in this field is focused on stop-and-restart
strategies for distributed-memory applications in the event of fail-stop failures. ThÃs
thesis focuses on the implementation of application-level checkpoint/restart solutions
for the most popular parallel programming models used in HPC. Hence, we
have implemented checkpointing solutions to cope with fail-stop failures in hybrid
MPI-OpenMP applications and OpenCL-based programs. Both strategies maximize
the restart portability and malleability, ie., the recovery can take place on
machines with different CPU / accelerator architectures, and/ or operating systems,
and can be adapted to the available resources (number of cores/accelerators). Regarding
distributed-memory applications, we propose a resilience solution that can
be generally applied to SPMD MPI programs. Resilient applications can detect and
react to failures without aborting their execution upon fail-stop failures. Instead,
failed processes are re-spawned, and the application state is recovered through a
global rollback. Moreover, we have optimized this resilience proposal by implementing
a local rollback protocol, in which only failed processes rollback to a previous
state, while message logging enables global consistency and further progress of the
computation. Finally, we have extended a checkpointing library to facilitate the
implementation of ad hoc recovery strategies in the event of soft errors) caused by
memory corruptions. Many times, these errors can be handled at the software-Ievel,
tIms, avoiding fail-stop failures and enabling a more efficient recovery
RADIC II : a fault tolerant architecture with flexible dynamic redundancy
The demand for computational power has been leading the improvement of the High Performance Computing (HPC) area, generally represented by the use of distributed systems like clusters of computers running parallel applications. In this area, fault tolerance plays an important role in order to provide high availability isolating the application from the faults effects. Performance and availability form an undissociable binomial for some kind of applications. Therefore, the fault tolerant solutions must take into consideration these two constraints when it has been designed. In this dissertation, we present a few side-effects that some fault tolerant solutions may presents when recovering a failed process. These effects may causes degradation of the system, affecting mainly the overall performance and availability. We introduce RADIC-II, a fault tolerant architecture for message passing based on RADIC (Redundant Array of Distributed Independent Fault Tolerance Controllers) architecture. RADIC-II keeps as maximum as possible the RADIC features of transparency, decentralization, flexibility and scalability, incorporating a flexible dynamic redundancy feature, allowing to mitigate or to avoid some recovery side-effects.La demanda de computadores más veloces ha provocado el incremento del área de computación de altas prestaciones, generalmente representado por el uso de sistemas distribuidos como los clusters de computadores ejecutando aplicaciones paralelas. En esta área, la tolerancia a fallos juega un papel muy importante a la hora de proveer alta disponibilidad, aislando los efectos causados por los fallos. Prestaciones y disponibilidad componen un binomio indisociable para algunos tipos de aplicaciones. Por eso, las soluciones de tolerancia a fallos deben tener en consideración estas dos restricciones desde el momento de su diseño. En esta disertación, presentamos algunos efectos colaterales que se puede presentar en ciertas soluciones tolerantes a fallos cuando recuperan un proceso fallado. Estos efectos pueden causar una degradación del sistema, afectando las prestaciones y disponibilidad finales. Presentamos RADIC-II, una arquitectura tolerante a fallos para paso de mensajes basada en la arquitectura RADIC (Redundant Array of Distributed Independent Fault Tolerance Controllers). RADIC-II mantiene al máximo posible las caracterÃsticas de transparencia, descentralización, flexibilidad y escalabilidad existentes en RADIC, e incorpora una flexible funcionalidad de redundancia dinámica, que permite mitigar o evitar algunos efectos colaterales en la recuperación
Unified fault-tolerance framework for hybrid task-parallel message-passing applications
We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the task that experienced the error and transparently handles any message passing interface calls inside the task. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks. Secondly, we develop a mathematical model to unify task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme.The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the FI-DGR 2013 scholarship
and the European Community’s Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402 and TIN2015-65316-P.Peer ReviewedPostprint (author's final draft
- …