Scalable group-based checkpoint/restart for large-scale message-passing systems
The ever-increasing number of processors used in parallel computers makes fault tolerance support in large-scale parallel systems more and more important. We discuss the inadequacies of existing system-level checkpointing solutions for message-passing applications as the system scales up. We analyze the coordination cost and blocking behavior of two current MPI implementations with checkpointing support. A group-based solution combining coordinated checkpointing and message logging is then proposed. Experimental results demonstrate better performance and scalability than LAM/MPI and MPICH-VCL. To assist group formation, a method for analyzing the communication behavior of the application is proposed. ©2008 IEEE.
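To make the group-based protocol concrete, a minimal sketch follows. This is our own illustration, not the paper's implementation: processes within a group checkpoint in coordination, while messages that cross group boundaries are logged so that groups never need to coordinate with one another. All class and function names are invented.

```python
# Minimal sketch of group-based checkpointing: coordinated checkpoints
# *within* a group, message logging *between* groups. Illustrative only.
import pickle

class Process:
    def __init__(self, rank, group_id):
        self.rank = rank
        self.group_id = group_id
        self.state = {"step": 0}
        self.inter_group_log = []   # messages received from other groups

    def receive(self, sender, payload):
        if sender.group_id != self.group_id:
            # Cross-group message: log it so this group can replay it
            # after a rollback without involving the sender's group.
            self.inter_group_log.append((sender.rank, payload))
        self.state["step"] += 1

def checkpoint_group(processes, group_id, path):
    # Coordinated checkpoint: all members of the group save together,
    # so no in-flight intra-group message crosses the recovery line.
    members = [p for p in processes if p.group_id == group_id]
    snapshot = [(p.rank, p.state, p.inter_group_log) for p in members]
    with open(path, "wb") as f:
        pickle.dump(snapshot, f)

procs = [Process(r, r // 2) for r in range(4)]   # two groups of two
procs[0].receive(procs[3], "halo")               # cross-group: logged
checkpoint_group(procs, group_id=0, path="group0.ckpt")
```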
A Portable and Adaptable Fault Tolerance Solution for Heterogeneous Applications
[Abstract] Heterogeneous systems have become increasingly popular in recent years due to the high performance and reduced energy consumption offered by devices such as GPUs or Xeon Phi accelerators. This paper proposes a checkpoint-based fault tolerance solution for heterogeneous applications, allowing them to survive fail-stop failures in the host CPU or in any of the accelerators used. In addition, applications can be restarted changing the host CPU and/or the accelerator device architecture, adapting the computation to the number of devices available during recovery. The proposed solution combines CPPC (ComPiler for Portable Checkpointing), an application-level checkpointing tool, and HPL (Heterogeneous Programming Library), a library that facilitates the development of OpenCL-based applications. Experimental results show the low overhead introduced by the proposal and prove its portability and adaptability benefits. This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Projects TIN2013-42148-P, TIN2016-75845-P and the predoctoral grant of Nuria Losada, Ref. BES-2014-068066), by the EU under COST Action IC1305, Network for Sustainable Ultrascale Computing (NESUS), and by the Galician Government (Xunta de Galicia) and FEDER funds of the EU under the Consolidation Program of Competitive Research (Ref. GRC2013/055).
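The portability and malleability claims rest on keeping checkpoint data device-agnostic. The sketch below illustrates that general idea only; it does not use the actual CPPC or HPL APIs, and every name in it is hypothetical.

```python
# Illustrative sketch of device-agnostic checkpointing for accelerator
# codes: device buffers are mirrored into portable host arrays before
# saving, so a restart can target a different architecture or a
# different number of devices. Not the CPPC/HPL implementation.
import json, os

def checkpoint(host_arrays, step, path="app.ckpt"):
    # Save only portable host-side data (no device handles or pointers).
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "arrays": host_arrays}, f)
    os.replace(tmp, path)   # atomic publish: never leaves a torn file

def restart(path, n_devices):
    with open(path) as f:
        ckpt = json.load(f)
    data = ckpt["arrays"]["grid"]
    # Re-partition the same logical data across the devices that are
    # actually available at recovery time (malleable restart).
    chunk = (len(data) + n_devices - 1) // n_devices
    parts = [data[i * chunk:(i + 1) * chunk] for i in range(n_devices)]
    return ckpt["step"], parts

checkpoint({"grid": list(range(12))}, step=100)
step, parts = restart("app.ckpt", n_devices=3)   # e.g. fewer GPUs than before
```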
A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud
High Performance Computing (HPC) systems have been widely used by scientists and researchers in both industry and university laboratories to solve advanced computation problems. Most advanced computation problems are either data-intensive or computation-intensive, and may take hours, days or even weeks to complete execution. For example, some traditional HPC computations run on 100,000 processors for weeks. Consequently, traditional HPC systems often require huge capital investments, and scientists and researchers sometimes have to wait in long queues to access these shared, expensive systems. Cloud computing, on the other hand, offers new computing paradigms, capacity, and flexible solutions for both business and HPC applications. Some of the computation-intensive applications that are usually executed on traditional HPC systems can now be executed in the cloud, and the cloud pricing model eliminates huge up-front capital investments. However, even for cloud-based HPC systems, fault tolerance remains a growing concern. The large number of virtual machines and electronic components, as well as software complexity and overall system reliability, availability and serviceability (RAS), are factors with which HPC systems in the cloud must contend. The reactive fault tolerance approach of checkpoint/restart, which is commonly used in HPC systems, does not scale well in the cloud due to resource sharing and distributed system networks. Hence, the need for reliable, fault-tolerant HPC systems is even greater in a cloud environment. In this thesis, we present a proactive fault tolerance approach for HPC systems in the cloud that reduces the wall-clock execution time, as well as the dollar cost, in the presence of hardware failures. We have developed a generic fault tolerance algorithm for HPC systems in the cloud, and a cost model for executing computation-intensive applications on HPC systems in the cloud. Our experimental results, obtained from a real cloud execution environment, show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be considerably reduced compared to the checkpoint and redundancy techniques used in traditional HPC systems.
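As a rough illustration of the proactive idea (monitor node health and migrate work away before a predicted failure, rather than paying for periodic checkpoints), consider the following sketch. The health metrics, thresholds, and names are invented for illustration; real systems would act on monitored telemetry.

```python
# Hedged sketch of generic proactive fault tolerance in a cloud:
# poll a health signal and migrate work off a node predicted to fail.
def predicted_to_fail(node):
    # Stand-in health predictor (e.g. ECC error counts, temperature,
    # hypervisor heartbeats). Thresholds here are made up.
    return node["ecc_errors"] > 10 or node["temp_c"] > 90

def proactive_step(nodes, work):
    for node in list(work):
        if predicted_to_fail(nodes[node]):
            spare = next(n for n in nodes if n not in work)
            work[spare] = work.pop(node)   # migrate before the failure
    return work

nodes = {"vm1": {"ecc_errors": 14, "temp_c": 70},
         "vm2": {"ecc_errors": 0,  "temp_c": 60},
         "vm3": {"ecc_errors": 0,  "temp_c": 55}}
work = {"vm1": "task-A", "vm2": "task-B"}
print(proactive_step(nodes, work))   # task-A migrates to vm3
```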
Application-level Fault Tolerance and Resilience in HPC Applications
Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01
[Abstract]
The rapid increase in the computational demands of science has led to a pronounced
growth in the performance offered by supercomputers. As High Performance
Computing (HPC) systems grow larger, including more hardware components
of different types, the system's failure rate becomes higher. Efficient fault
tolerance techniques are essential not only to ensure that executions complete but
also to save energy. Checkpoint/restart is one of the most popular fault tolerance
techniques. However, most of the research in this field is focused on stop-and-restart
strategies for distributed-memory applications in the event of fail-stop failures. This
thesis focuses on the implementation of application-level checkpoint/restart solutions
for the most popular parallel programming models used in HPC. Hence, we
have implemented checkpointing solutions to cope with fail-stop failures in hybrid
MPI-OpenMP applications and OpenCL-based programs. Both strategies maximize
the restart portability and malleability, i.e., the recovery can take place on
machines with different CPU/accelerator architectures and/or operating systems,
and can be adapted to the available resources (number of cores/accelerators). Regarding
distributed-memory applications, we propose a resilience solution that can
be generally applied to SPMD MPI programs. Resilient applications can detect and
react to fail-stop failures without aborting their execution. Instead,
failed processes are re-spawned, and the application state is recovered through a
global rollback. Moreover, we have optimized this resilience proposal by implementing
a local rollback protocol, in which only failed processes roll back to a previous
state, while message logging enables global consistency and further progress of the
computation. Finally, we have extended a checkpointing library to facilitate the
implementation of ad hoc recovery strategies in the event of soft errors caused by
memory corruptions. Often, these errors can be handled at the software level,
thus avoiding fail-stop failures and enabling a more efficient recovery.
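The local-rollback idea described above can be sketched in a few lines. This is a toy model under our own assumptions, not the thesis implementation: only the failed process restores its checkpoint, and survivors replay, from their logs, the messages they had sent to it (deterministic re-execution of the failed process's own work is elided).

```python
# Toy sketch of local rollback with sender-based message logging.
import copy

class Proc:
    def __init__(self, rank):
        self.rank, self.state, self.ckpt = rank, 0, None
        self.send_log = []        # messages sent since the last checkpoint

    def checkpoint(self):
        self.ckpt = copy.deepcopy(self.state)
        self.send_log.clear()

    def send(self, dest, value):
        self.send_log.append((dest.rank, value))
        dest.state += value

def local_rollback(failed, procs):
    failed.state = copy.deepcopy(failed.ckpt)   # only the failed one rolls back
    for p in procs:                             # survivors replay their logs
        if p is not failed:
            for dest_rank, value in p.send_log:
                if dest_rank == failed.rank:
                    failed.state += value       # deterministic replay

p0, p1 = Proc(0), Proc(1)
p0.checkpoint(); p1.checkpoint()
p1.send(p0, 5)                  # logged at the sender
local_rollback(p0, [p0, p1])
print(p0.state)                 # 5: checkpoint plus the replayed message
```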
Distributed real-time fault tolerance in a virtualized separation kernel
Computers are increasingly being placed in scenarios where a computer error
could result in the loss of human life or significant financial loss. Fault
tolerance techniques must be employed to prevent an error from escalating into a
failure that causes such losses. Two types of errors that are common in real-time and
embedded systems are soft errors, i.e. data bit corruption, and timing errors,
such as missed deadlines. Purely software-based techniques to address these
types of errors have the advantage of not requiring specialized hardware, and are
able to use more readily available commercial off-the-shelf hardware. Timing
errors are addressed using Adaptive Mixed-Criticality, a scheduling technique
where higher-criticality tasks are given precedence over those of lower
criticality when it is impossible to guarantee the schedulability of all tasks.
While mixed-criticality scheduling has gained attention in recent years, most
approaches assume a periodic task model and a single system-wide
criticality level that dictates the budget available to all tasks. In practice
these assumptions do not hold: different types of tasks are better served by
different scheduling approaches, and only a subset of high-criticality tasks might
require additional capacity to meet deadlines. The latter occurs
when a process has experienced a fault and requires additional capacity to
perform the recovery.

In this thesis, soft errors are addressed using a novel real-time fault
tolerance method based on a virtualized separation kernel. Instead of executing
redundant copies of an application on separate machines, the applications are
consolidated onto one multi-core processor and partitioned using hardware
virtualization extensions. This allows new recovery schemes to
be explored. In addition, the maximum recovery time is bounded tightly enough to
ensure recovery occurs in a timely manner without affecting the normal execution
of the application. A virtualized separation kernel in combination with
Adaptive Mixed-Criticality techniques creates a fault-tolerant system that
predictably detects and recovers from timing and soft errors.
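For readers unfamiliar with Adaptive Mixed-Criticality, a toy sketch of the mode-switch logic it relies on is given below. The task set, budgets, and names are invented, and real AMC schedulers are driven by response-time analysis rather than this simple check.

```python
# Toy AMC mode switch: while every high-criticality (HI) task stays
# within its low-mode budget, all tasks run; once a HI task overruns,
# the system enters HI mode and sheds low-criticality (LO) tasks.
tasks = [
    {"name": "flight-ctrl", "crit": "HI", "c_lo": 2, "c_hi": 5},
    {"name": "telemetry",   "crit": "LO", "c_lo": 3, "c_hi": 3},
]

def admit(tasks, mode, observed_exec):
    if mode == "LO" and any(
            observed_exec[t["name"]] > t["c_lo"]
            for t in tasks if t["crit"] == "HI"):
        mode = "HI"                   # a HI task overran its LO budget
    if mode == "HI":
        tasks = [t for t in tasks if t["crit"] == "HI"]   # drop LO tasks
    return mode, tasks

mode, ready = admit(tasks, "LO", {"flight-ctrl": 3, "telemetry": 1})
print(mode, [t["name"] for t in ready])   # HI ['flight-ctrl']
```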
Transfer of tilted sample information in transmission electron microscopy
When a transmission electron microscope is used in imaging mode, information carried by
the sample function is transformed by the optics of the instrument during the imaging process.
A mathematical description of this physical process (the so-called imaging function)
is required for the accurate analysis and interpretation of experimental electron
microscopy data. When the sample is not imaged in tilted geometry (no defocus gradient
is present across its extent), the imaging function has a well-known and extensively
studied form: the Contrast Transfer Function (CTF) (Reimer, 1997). Several
electron microscopy techniques, however, require the sample to be tilted to fully
explore its 3-dimensional structure. Only recently has a rigorous mathematical
description of the imaging process under these conditions, derived from physical
first principles, become available: the Tilted Contrast Imaging Function
(TCIF) (Philippsen et al., 2006).
The present work discusses in depth the nature and characteristics of the TCIF model,
expanding it to include astigmatism. A robust and efficient software implementation is
presented, developed within the context of the IPLT software development framework
(Philippsen et al., 2007). Computer simulations of images of tilted samples are then
used to qualitatively and quantitatively analyze features of experimental images.
No computationally feasible analytical method for inverting the TCIF model
is currently available, and its effects on experimental images are usually corrected using
a number of heuristic methods that involve approximations of the imaging parameters.
Using computer simulations of tilted images, this work estimates the errors introduced
by these approximations and suggests optimal correction strategies for electron tomography
and crystallography imaging conditions. Furthermore, this work describes possible approaches
for determining the imaging parameters through the analysis of experimental images,
and for a non-analytical inversion of the effects of the TCIF model, showing preliminary
results of their implementation applied to computer-simulated images.
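For reference, the untilted imaging function mentioned above has the following well-known form in one common sign convention (conventions vary across the literature), with spatial frequency k, electron wavelength λ, defocus Δf, and spherical-aberration coefficient Cs:

```latex
% Phase aberration and untilted CTF, one common sign convention:
\chi(k) = \pi \lambda \,\Delta f\, k^{2} - \frac{\pi}{2}\, C_{s}\, \lambda^{3} k^{4},
\qquad
\mathrm{CTF}(k) = \sin\!\bigl(\chi(k)\bigr)
```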
References:
Reimer, L. (1997). Transmission Electron Microscopy: Physics of Image Formation
and Microanalysis. Springer-Verlag GmbH, 4th edition.
Philippsen, A., Engel, H. and Engel, A. (2006). The contrast-imaging function
for tilted specimens. Ultramicroscopy, 107(2-3):202–12.
Philippsen, A., Schenk, A. D., Signorell, G. A., Mariani, V., Berneche, S., et al.
(2007). Collaborative EM image processing with the IPLT image processing
library and toolbox. Journal of Structural Biology, 157(1):28–37.
Smart Fridge / Dumb Grid? Demand Dispatch for the Power Grid of 2020
In discussions at the 2015 HICSS meeting, it was argued that loads can
provide most of the ancillary services required today and in the future.
Through load-level and grid-level control design, high-quality ancillary
service for the grid is obtained without impacting quality of service delivered
to the consumer. This approach to grid regulation is called demand dispatch:
loads are providing service continuously and automatically, without consumer
interference.
In this paper we ask: what intelligence is required at the grid level? In
particular, does the grid operator require more than one-way communication to
the loads? Our main conclusion: risk is not great in lower frequency ranges,
e.g., PJM's RegA or BPA's balancing reserves. In particular, ancillary services
from refrigerators and pool pumps can be obtained successfully with only
one-way communication. This requires intelligence at the loads, and much less
intelligence at the grid level.
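A minimal sketch of what one-way demand dispatch can look like at a single load (our illustration, not the authors' algorithm): the operator broadcasts one scalar regulation signal, each load combines it with purely local state and a coin flip, and hard quality-of-service limits always take priority.

```python
# Toy one-way demand dispatch: a broadcast scalar signal zeta nudges
# each fridge's randomized switching decision; no load-to-grid channel.
import random

def load_decision(temp_c, zeta, lo=2.0, hi=6.0):
    # Hard quality-of-service limits win (a fridge must stay cold).
    if temp_c >= hi:
        return True                    # must turn on
    if temp_c <= lo:
        return False                   # must turn off
    # Otherwise, bias the switching probability with the broadcast
    # signal zeta in [-1, 1]: positive zeta asks for more consumption.
    p_on = min(1.0, max(0.0, 0.5 + 0.4 * zeta))
    return random.random() < p_on

fleet = [{"on": False, "temp": random.uniform(2, 6)} for _ in range(1000)]
zeta = -0.8                            # operator asks the fleet to back off
for fridge in fleet:
    fridge["on"] = load_decision(fridge["temp"], zeta)
print(sum(f["on"] for f in fleet), "of 1000 fridges running")
```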