437 research outputs found

    Hybrid Message Logging. Combining advantages of Sender-based and Receiver-based Approaches

    With the growing scale of High Performance Computing applications comes an increase in the number of interruptions as a consequence of hardware failures. As the tendency is to scale parallel executions to hundreds of thousands of processes, fault tolerance is becoming an important matter. Uncoordinated fault tolerance protocols, such as message logging, seem to be the best option, since coordinated protocols might compromise application scalability. Considering that most of the overhead during failure-free executions is caused by message logging approaches, in this paper we propose a Hybrid Message Logging protocol. It focuses on combining the fast recovery feature of pessimistic receiver-based message logging with the low protection overhead introduced by pessimistic sender-based message logging. Hybrid Message Logging aims to reduce the overhead introduced by pessimistic receiver-based approaches by allowing applications to continue normally before a received message is properly saved. In order to guarantee that no message is lost, a pessimistic sender-based log is used to temporarily save messages while the receiver finishes saving them. Experiments have shown that we can achieve up to 43% overhead reduction compared to a pessimistic receiver-based logging approach.
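    The protocol just described can be made concrete with a small sketch: the sender keeps a temporary copy of each outgoing message (the sender-based pessimistic log) until the receiver confirms that the message has reached the receiver-based log, at which point the copy can be discarded. The MPI sketch below is a minimal illustration under simplifying assumptions: it serializes steps the real protocol overlaps (the receiver would persist asynchronously while the application continues, and the sender would not block on the acknowledgement), and the function names, tags and on-disk format are illustrative, not the paper's implementation.

```c
/* Sketch of the hybrid logging idea: sender-side temporary log plus
 * receiver-side stable log. Build with mpicc, run with two ranks.
 * All names here are illustrative, not the paper's API. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TAG_DATA 1
#define TAG_ACK  2

/* Sender: keep a volatile copy of the payload (sender-based pessimistic log)
 * until the receiver confirms it has persisted the message. */
static void log_send(char *buf, int len, int dest, MPI_Comm comm)
{
    char *copy = malloc(len);          /* temporary sender-side log entry */
    memcpy(copy, buf, len);

    MPI_Send(buf, len, MPI_CHAR, dest, TAG_DATA, comm);

    /* The real protocol keeps computing here; this sketch blocks
     * until the receiver's acknowledgement arrives. */
    int ack;
    MPI_Recv(&ack, 1, MPI_INT, dest, TAG_ACK, comm, MPI_STATUS_IGNORE);
    free(copy);                        /* safe to discard: receiver log is stable */
}

/* Receiver: deliver the message to the application and persist it
 * (receiver-based log); acknowledge only after the write completes. */
static void log_recv(char *buf, int len, int src, MPI_Comm comm)
{
    MPI_Recv(buf, len, MPI_CHAR, src, TAG_DATA, comm, MPI_STATUS_IGNORE);

    FILE *log = fopen("recv_log.bin", "ab");   /* stand-in for stable storage */
    fwrite(buf, 1, len, log);
    fclose(log);

    int ack = 1;
    MPI_Send(&ack, 1, MPI_INT, src, TAG_ACK, comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char msg[64] = "application payload";
    if (rank == 0)      log_send(msg, sizeof msg, 1, MPI_COMM_WORLD);
    else if (rank == 1) log_recv(msg, sizeof msg, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```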

    Hybrid Message Pessimistic Logging: improving current pessimistic message logging protocols

    With the growing scale of HPC applications, there has been an increase in the number of interruptions as a consequence of hardware failures. The remarkable decrease of the Mean Time Between Failures (MTBF) in current systems encourages research into suitable fault tolerance solutions. Message logging combined with uncoordinated checkpointing constitutes a scalable rollback-recovery solution. However, message logging techniques are usually responsible for most of the overhead during failure-free executions. Taking this into consideration, this paper proposes the Hybrid Message Pessimistic Logging (HMPL), which focuses on combining the fast recovery feature of pessimistic receiver-based message logging with the low failure-free overhead introduced by pessimistic sender-based message logging. The HMPL manages messages using a distributed controller and storage to avoid harming the system's scalability. Experiments show that the HMPL is able to reduce overhead by 34% during failure-free executions and by 20% in faulty executions when compared with a pessimistic receiver-based message logging approach.
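    The failure-free side of the HMPL is mirrored by its recovery side: because every delivered message eventually reaches the receiver's own log, a restarted process can replay most of its messages locally instead of forcing its senders to roll back. The fragment below is a minimal sketch of that replay step; it assumes the fixed-size record format of the previous sketch rather than the HMPL's actual storage layout, and messages that were still only in the senders' temporary logs are handled only in a comment.

```c
/* Illustrative recovery path: replay messages already persisted in the
 * local receiver-based log. Record format and function names are
 * assumptions, not the HMPL's actual layout. */
#include <stdio.h>

#define MSG_SIZE 64

/* Replay every fixed-size record from the local log; returns the count. */
static int replay_local_log(const char *path, void (*deliver)(const char *msg))
{
    FILE *log = fopen(path, "rb");
    if (!log) return 0;                     /* nothing persisted yet */

    char msg[MSG_SIZE];
    int replayed = 0;
    while (fread(msg, 1, MSG_SIZE, log) == MSG_SIZE) {
        deliver(msg);                       /* feed the message back to the app */
        replayed++;
    }
    fclose(log);
    return replayed;
}

static void deliver(const char *msg) { printf("replayed: %.63s\n", msg); }

int main(void)
{
    int n = replay_local_log("recv_log.bin", deliver);
    /* Messages received after the last persisted record never reached the
     * receiver log; in the hybrid scheme they would now be requested from
     * the senders' temporary logs (not shown here). */
    printf("%d message(s) replayed from the local log\n", n);
    return 0;
}
```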

    Managing receiver-based message logging overheads in parallel applications

    Using rollback-recovery based fault tolerance (FT) techniques in applications executed on multicore clusters is still a challenge, because the overheads added depend on the applications' behavior and resource utilization. Many FT mechanisms have been developed in recent years, but analysis is lacking concerning how parallel applications are affected when such mechanisms are applied. In this work we address the combination of process mapping and FT task mapping on multicore environments. Our main goal is to determine the configuration of a pessimistic receiver-based message logging approach which generates the least disturbance to the parallel application. We propose to characterize the parallel application in combination with the message logging approach in order to determine the most significant aspects of the application, such as the computation-communication ratio, and then, according to the values obtained, we suggest a configuration that can minimize the added overhead for each specific scenario. In this work we show that in some situations it is better to reserve some resources for the FT tasks in order to lower the disturbance in parallel executions and also to save memory for these FT tasks. Initial results have demonstrated that when reserving resources for the FT tasks we can achieve a 25% overhead reduction when using a pessimistic message logging approach as FT support.
    WPDP - XIII Workshop Procesamiento Distribuido y Paralelo. Red de Universidades con Carreras en Informática (RedUNCI).
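    As a rough illustration of the kind of decision rule such a characterization feeds, the toy function below reserves a core for the logging tasks when the application spends more time communicating than computing. The profile fields and the threshold of 1.0 are assumptions made for the example, not values taken from this work.

```c
/* Toy mapping decision: communication-heavy applications tend to benefit
 * from a core dedicated to the message-logging tasks. The threshold is an
 * assumption for illustration only. */
#include <stdio.h>

typedef struct {
    double compute_seconds;   /* time spent in pure computation */
    double comm_seconds;      /* time spent in communication */
} app_profile;

/* Returns 1 if a core should be reserved for the message-logging tasks. */
static int reserve_core_for_logging(app_profile p)
{
    if (p.comm_seconds <= 0.0)
        return 0;                             /* no communication: share cores */
    double ratio = p.compute_seconds / p.comm_seconds;
    return ratio < 1.0;   /* communication-bound: logging competes for bandwidth */
}

int main(void)
{
    app_profile p = { .compute_seconds = 40.0, .comm_seconds = 60.0 };
    if (reserve_core_for_logging(p))
        printf("map logging tasks to a dedicated core\n");
    else
        printf("share cores between application and logging tasks\n");
    return 0;
}
```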

    Coordinated Fault-Tolerance for High-Performance Computing Final Project Report


    A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud

    High Performance Computing (HPC) systems have been widely used by scientists and researchers in both industry and university laboratories to solve advanced computation problems. Most advanced computation problems are either data-intensive or computation-intensive. They may take hours, days or even weeks to complete execution. For example, some traditional HPC computations run on 100,000 processors for weeks. Consequently, traditional HPC systems often require huge capital investments. As a result, scientists and researchers sometimes have to wait in long queues to access shared, expensive HPC systems. Cloud computing, on the other hand, offers new computing paradigms, capacity, and flexible solutions for both business and HPC applications. Some of the computation-intensive applications that are usually executed in traditional HPC systems can now be executed in the cloud. The cloud computing pricing model eliminates huge capital investments. However, even for cloud-based HPC systems, fault tolerance is still an issue of growing concern. The large number of virtual machines and electronic components, as well as software complexity and overall system reliability, availability and serviceability (RAS), are factors with which HPC systems in the cloud must contend. The reactive fault tolerance approach of checkpoint/restart, which is commonly used in HPC systems, does not scale well in the cloud due to resource sharing and distributed systems networks. Hence, the need for reliable fault-tolerant HPC systems is even greater in a cloud environment. In this thesis we present a proactive fault tolerance approach for HPC systems in the cloud that reduces the wall-clock execution time, as well as dollar cost, in the presence of hardware failure. We have developed a generic fault tolerance algorithm for HPC systems in the cloud. We have further developed a cost model for executing computation-intensive applications on HPC systems in the cloud. Our experimental results, obtained from a real cloud execution environment, show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be considerably reduced compared to the checkpoint and redundancy techniques used in traditional HPC systems.
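    In its simplest form, the dollar-cost side of such a comparison multiplies the number of rented instances by their hourly price and the wall-clock time of the run, so any technique that shortens re-execution after failures also cuts cost. The sketch below illustrates only that arithmetic; the prices, run times and formula are made-up inputs, not the cost model developed in the thesis.

```c
/* Illustrative cloud cost comparison between a reactive (checkpoint/restart)
 * and a proactive (migration) run. All figures are assumptions. */
#include <stdio.h>

/* Dollar cost of a run: instances * hourly price * wall-clock hours. */
static double run_cost(int instances, double price_per_hour, double hours)
{
    return instances * price_per_hour * hours;
}

int main(void)
{
    double reactive_hours  = 12.0;   /* assumed wall-clock time with restarts */
    double proactive_hours = 10.0;   /* assumed wall-clock time with migration */
    int    vms   = 32;               /* assumed number of rented instances */
    double price = 0.50;             /* assumed $/hour per instance */

    printf("reactive  (checkpoint/restart): $%.2f\n",
           run_cost(vms, price, reactive_hours));
    printf("proactive (migration)         : $%.2f\n",
           run_cost(vms, price, proactive_hours));
    return 0;
}
```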