10 research outputs found

    H-RADIC: The Fault Tolerance Framework for Virtual Clusters on Multi-Cloud Environments

    Even though cloud platforms promise to be reliable, several availability incidents prove that they are not. How can we be sure that a parallel application finishes its execution even if a site is affected by a failure? This paper presents H-RADIC, an approach based on the RADIC architecture that executes a parallel application in at least 3 different virtual clusters or sites. The execution state of each site is saved periodically in another site and is recovered in case of failure. The paper details the configuration of the architecture and the experimental results obtained using 3 virtual clusters running NAS parallel applications protected with DMTCP, a well-known distributed multi-threaded checkpointing tool. Our experiments show that execution time increased by 5% to 36% without failures and by 27% to 66% when failures occurred.
    Facultad de Informática
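    The cross-site protection described in this abstract can be illustrated with a minimal sketch. It assumes a ring assignment in which each site stores its checkpoint on the next site; the abstract only states that the state is saved "in another site", so the ring, the `Site` class and the function names are hypothetical, and an integer stands in for the checkpoint image that a tool such as DMTCP would produce.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Site:
    name: str
    state: Optional[int] = 0  # stand-in for the application's execution state
    stored: Dict[str, int] = field(default_factory=dict)  # checkpoints kept for other sites


def protector_of(index: int, n_sites: int) -> int:
    """Assumed ring assignment: site i saves its state on site (i + 1) mod n."""
    return (index + 1) % n_sites


def checkpoint_all(sites: List[Site]) -> None:
    """Save every site's current state on its protector site."""
    for i, site in enumerate(sites):
        protector = sites[protector_of(i, len(sites))]
        protector.stored[site.name] = site.state


def recover(sites: List[Site], failed_index: int) -> int:
    """Restore a failed site from the last checkpoint held by its protector."""
    failed = sites[failed_index]
    protector = sites[protector_of(failed_index, len(sites))]
    failed.state = protector.stored[failed.name]
    return failed.state


sites = [Site("site-A"), Site("site-B"), Site("site-C")]
for s in sites:
    s.state = 10           # progress made before the periodic checkpoint
checkpoint_all(sites)

sites[0].state = 17        # progress made after the checkpoint
sites[0].state = None      # site-A fails and loses its in-memory state
restored = recover(sites, 0)
```

    In this sketch the work done between the last checkpoint and the failure is lost and has to be re-executed, which is one plausible reason for the higher overhead reported when failures occur.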

    H-RADIC: una solución de tolerancia a fallos para clústeres virtuales en ambientes multi-nube

    Even though cloud platforms promise to be reliable, several availability incidents prove that they are not. How can we be sure that a parallel application finishes its execution even if a site is affected by a failure? This paper presents H-RADIC, an approach based on the RADIC architecture that executes parallel applications protected by RADIC in at least 3 different virtual clusters or sites. The execution state of each site is saved periodically in another site and is recovered from there in case of failure. The paper details the configuration of the architecture and the experimental results obtained using 3 clusters running NAS parallel applications protected with DMTCP, a well-known distributed multi-threaded checkpointing tool. Our experiments show that adding a cluster protector makes it possible to implement the next level of the hierarchy, in which the first level of the RADIC hierarchy works as an observer at the site level. In addition, the experiments show that the protection implementation is outside the critical path of the application and depends only on the resources used.
    Facultad de Informática

    Outcomes of the fault tolerance configuration

    This paper presents the influence of the fault tolerance configuration on different applications, measured with performance metrics. Two configuration parameters are analysed: the heartbeat/watchdog interval and the checkpoint interval. In addition, even though message logging is mandatory, an analysis of its overhead on different applications is presented. The impact of message logging has been analysed according to the nature of the communication primitives used in each application. This analysis shows why message logging introduces a different overhead for different applications.
    Presented at the IX Workshop Procesamiento Distribuido y Paralelo (WPDP). Red de Universidades con Carreras en Informática (RedUNCI)

    High availability for parallel computers

    Fault tolerance has become an important issue for parallel applications in the last few years. Users of parallel systems want them to be reliable in two main dimensions: availability and data consistency. Availability can be provided with solutions such as RADIC, a fault-tolerant architecture with different protection levels that offers high availability with transparency, decentralization, flexibility and scalability for message-passing systems. Transient faults may cause an application running in a computer system to be removed from execution. However, the biggest risk of transient faults is that they provoke undetected data corruption that changes the final result of the application without anyone knowing. To evaluate the effects of transient faults on the robustness of applications and to validate new fault detection mechanisms and strategies, we have developed a full-system simulation fault injection environment.
    Facultad de Informática

    Fault tolerance at system level based on RADIC architecture

    The increasing failure rate in High Performance Computing encourages the investigation of fault tolerance mechanisms that guarantee the execution of an application in spite of node faults. This paper presents an automatic and scalable fault-tolerant model designed to be transparent to applications and to message-passing libraries. The model detects failures in the communication socket caused by a faulty node. In those cases, the affected processes are recovered on a healthy node and the connections are re-established without losing data. The Redundant Array of Distributed Independent Controllers (RADIC) architecture proposes a decentralized model for all the tasks required in a fault tolerance system: protection, detection, recovery and masking. Decentralized algorithms allow the application to scale, which is a key property for current HPC systems. Three different rollback recovery protocols are defined and discussed with the aim of offering alternatives that reduce overhead when multicore systems are used. A prototype has been implemented to carry out an exhaustive experimental evaluation with the Master/Worker and Single Program Multiple Data execution models. Multiple workloads and an increasing number of processes have been taken into account to compare these protocols. The executions take place on two multicore Linux clusters with different socket communication libraries.

    Challenges and Issues of the Integration of RADIC into Open MPI

    No full text

    Libro de Actas JCC&BD 2018 : VI Jornadas de Cloud Computing & Big Data

    This volume collects the papers presented at the VI Jornadas de Cloud Computing & Big Data (JCC&BD), held from 25 to 29 June 2018 at the Facultad de Informática of the Universidad Nacional de La Plata.
    Universidad Nacional de La Plata (UNLP) - Facultad de Informática