10 research outputs found

H-RADIC: The Fault Tolerance Framework for Virtual Clusters on Multi-Cloud Environments

Author: Castro-León Marcela
Luque Fadón Emilio
Rexachs del Rosario Dolores
Royo Ambrosio
Villamayor Jorge
Publication date: 01/06/2018

Even though cloud platforms promise to be reliable, several availability incidents prove that they are not. How can we be sure that a parallel application finishes its execution even if a site is affected by a failure? This paper presents H-RADIC, an approach based on the RADIC architecture that executes a parallel application in at least 3 different virtual clusters or sites. The execution state of each site is saved periodically in another site and is recovered from there in case of failure. The paper details the configuration of the architecture and the experimental results using 3 virtual clusters running NAS parallel applications protected with DMTCP, a well-known distributed multi-threaded checkpointing tool. Our experiments show that the execution time increased by 5% to 36% without failures and by 27% to 66% when failures occurred.

Facultad de Informática
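
The abstract above says that each site's execution state is saved periodically in another site and recovered from there after a failure. A minimal sketch of that idea follows, assuming a ring assignment in which site i stores its checkpoint on site (i + 1) mod n. The ring layout, the `Site` class, and the integer `step` standing in for application state are all hypothetical illustrations, not the H-RADIC implementation or the DMTCP API.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """One virtual cluster (hypothetical model, not the paper's code)."""
    name: str
    step: int = 0  # stand-in for the application's progress
    stored: dict = field(default_factory=dict)  # checkpoints held for other sites


def protector_of(sites, i):
    """Assumed ring assignment: site i is protected by site (i + 1) mod n."""
    return sites[(i + 1) % len(sites)]


def checkpoint_all(sites):
    """Save each site's current state on its protector site."""
    for i, site in enumerate(sites):
        protector_of(sites, i).stored[site.name] = site.step


def recover(sites, i):
    """Roll site i back to the last checkpoint held by its protector."""
    failed = sites[i]
    failed.step = protector_of(sites, i).stored[failed.name]
    return failed.step


sites = [Site("A"), Site("B"), Site("C")]
for s in sites:
    s.step = 10
checkpoint_all(sites)          # every site now has a remote checkpoint at step 10
sites[0].step = 14             # site A makes progress after the checkpoint
restored = recover(sites, 0)   # site A fails and restarts from site B's copy
```

With three sites, every checkpoint lives on a different site from the one it protects, so losing one site never loses that site's own saved state.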

Servicio de Difusión de la Creación Intelectual

H-RADIC: The Fault Tolerance Framework for Virtual Clusters on Multi-Cloud Environments

Author: Castro-León Marcela
Luque Fadón Emilio
Rexachs del Rosario Dolores
Royo Ambrosio
Villamayor Jorge
Publication date: 04/10/2018

H-RADIC: una solución de tolerancia a fallos para clústeres virtuales en ambientes multi-nube

Author: Castro-León Marcela
Luque Fadón Emilio
Rexachs del Rosario Dolores
Royo Ambrosio
Villamayor Jorge
Publication date: 01/12/2018

Even though cloud platforms promise to be reliable, several availability incidents prove that they are not. How can we be sure that a parallel application finishes its execution even if a site is affected by a failure? This paper presents H-RADIC, an approach based on the RADIC architecture that executes parallel applications in at least 3 different virtual clusters or sites, all protected by RADIC. The execution state of each site is saved periodically in another site and is recovered from there in case of failure. The paper details the configuration of the architecture and the experimental results using 3 clusters running NAS parallel applications protected with DMTCP, a well-known distributed multi-threaded checkpointing tool. Our experiments show that adding a cluster protector makes it possible to implement the next level in the RADIC hierarchy, where the first level works as an observer at site level. In addition, the experiments show that the protector implementation is off the critical path of the application and that its cost depends only on the resources used.

Facultad de Informática

H-RADIC: una solución de tolerancia a fallos para clústeres virtuales en ambientes multi-nube

Author: Castro-León Marcela
Luque Fadón Emilio
Rexachs del Rosario Dolores
Royo Ambrosio
Villamayor Jorge
Publication date: 01/12/2018

Servicio de Difusión de la Creación Intelectual

H-RADIC: una solución de tolerancia a fallos para clústeres virtuales en ambientes multi-nube

Author: Castro-León Marcela
Luque Fadón Emilio
Rexachs del Rosario Dolores
Royo Ambrosio
Villamayor Jorge
Publication date: 21/12/2018

Outcomes of the fault tolerance configuration

Author: Duarte Angelo
Fialho Leonardo
Luque Fadón Emilio
Rexachs del Rosario Dolores
Publication date: 13/09/2012

This paper presents the influence of the fault tolerance configuration on different applications using performance metrics. Two configuration parameters are analysed: the heartbeat/watchdog interval and the checkpoint interval. In addition, even though message logging is mandatory, an analysis of its overhead on different applications is presented. The impact of message logging has been analysed according to the nature of the communication primitives used in each application. This analysis shows why message logging introduces different overhead for different applications.

Presented at the IX Workshop Procesamiento Distribuido y Paralelo (WPDP). Red de Universidades con Carreras en Informática (RedUNCI)
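
The two parameters named in the abstract above can be illustrated with a toy model. This is a sketch under stated assumptions, not the configuration studied in the paper: the watchdog declares a failure when no heartbeat has arrived within a timeout, and the work lost at a failure is the time elapsed since the latest checkpoint. The function names and the time units are hypothetical.

```python
def failure_suspected(heartbeat_times, watchdog_timeout, now):
    """Return True if the newest heartbeat is older than the watchdog timeout."""
    last = max(heartbeat_times) if heartbeat_times else float("-inf")
    return now - last > watchdog_timeout


def lost_work(failure_time, checkpoint_interval):
    """Time since the latest checkpoint, assuming checkpoints at 0, T, 2T, ..."""
    return failure_time % checkpoint_interval


# Heartbeats arrive at t = 0, 2, 4 and then stop; the watchdog timeout is 3.
beats = [0, 2, 4]
early = failure_suspected(beats, watchdog_timeout=3, now=6)   # gap of 2
late = failure_suspected(beats, watchdog_timeout=3, now=8)    # gap of 4

# A failure at t = 47 with a checkpoint every 20 time units.
redo = lost_work(47, 20)
```

In this model a shorter heartbeat interval or watchdog timeout detects failures sooner at the cost of more monitoring traffic, and a shorter checkpoint interval reduces lost work at the cost of more checkpoint overhead.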

Servicio de Difusión de la Creación Intelectual

High availability for parallel computers

Author: Luque Fadón Emilio
Rexachs del Rosario Dolores
Publication date: 01/10/2010

Fault tolerance has become an important issue for parallel applications in the last few years. Users of parallel systems want them to be reliable along two main dimensions: availability and data consistency. Availability can be provided with solutions such as RADIC, a fault-tolerant architecture with different protection levels that offers high availability with transparency, decentralization, flexibility and scalability for message-passing systems. Transient faults may cause an application running in a computer system to be removed from execution; however, their biggest risk is undetected data corruption that changes the final result of the application without anyone knowing. To evaluate the effects of transient faults on the robustness of applications and to validate new fault detection mechanisms and strategies, we have developed a full-system simulation fault injection environment.

Facultad de Informática

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Servicio de Difusión de la Creación Intelectual

Fault tolerance at system level based on RADIC architecture

Author: Castro León Marcela
Luque Emilio
Meyer Hugo Daniel
Rexachs del Rosario Dolores Isabel
Publication venue: Elsevier BV
Publication date: 01/01/2015

The increasing failure rate in High Performance Computing encourages the investigation of fault tolerance mechanisms that guarantee the execution of an application in spite of node faults. This paper presents an automatic and scalable fault-tolerant model designed to be transparent to applications and to message passing libraries. The model detects failures in the communication socket caused by a faulty node. In those cases, the affected processes are recovered on a healthy node and the connections are re-established without losing data. The Redundant Array of Distributed Independent Controllers (RADIC) architecture proposes a decentralized model for all the tasks required in a fault tolerance system: protection, detection, recovery and masking. Decentralized algorithms allow the application to scale, which is a key property for current HPC systems. Three different rollback recovery protocols are defined and discussed with the aim of offering alternatives that reduce overhead when multicore systems are used. A prototype has been implemented to carry out an exhaustive experimental evaluation with the Master/Worker and Single Program Multiple Data execution models. Multiple workloads and an increasing number of processes have been taken into account to compare these protocols. The executions take place in two multicore Linux clusters with different socket communication libraries.

Elsevier - Publisher Connector

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Diposit Digital de Documents de la UAB

Challenges and Issues of the Integration of RADIC into Open MPI

Author: E. Gabriel
G. Santos
W. Gropp
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2009

Crossref

Libro de Actas JCC&BD 2018 : VI Jornadas de Cloud Computing & Big Data

Author: De Giusti Armando Eduardo
Publication venue: Facultad de Informática (UNLP)
Publication date: 01/01/2018

This volume collects the papers presented at the VI Jornadas de Cloud Computing & Big Data (JCC&BD), held from 25 to 29 June 2018 at the Facultad de Informática of the Universidad Nacional de La Plata.

Universidad Nacional de La Plata (UNLP) - Facultad de Informática

Servicio de Difusión de la Creación Intelectual