Collective Vector Clocks: Low-Overhead Transparent Checkpointing for MPI

Author: Cooperman Gene
Xu Yao
Publication venue
Publication date: 06/10/2023

Taking snapshots of the state of a distributed computation is useful for off-line analysis of the computational state, for later restarting from the saved snapshot, for cloning a copy of the computation, and for migration to a new cluster. The problem becomes more difficult when supporting collective operations across processes, such as barrier, reduce, scatter, and gather. Some processes may have reached the barrier or other collective operation, while other processes take a long time to reach that same barrier or collective operation. At least two solutions are well known in the literature: (i) draining in-flight network messages and then freezing the network at checkpoint time; and (ii) adding a barrier prior to the collective operation, and either completing the operation or aborting the barrier if not all processes are present. Both solutions suffer from important drawbacks. The code in the first solution must be updated whenever one ports to a newer network. The second solution implies additional barrier-related network traffic prior to each collective operation. This work presents a third solution that avoids both drawbacks. There is no additional barrier-related traffic, and the solution is implemented entirely above the network layer. The work is demonstrated in the context of transparent checkpointing of MPI libraries for parallel computation, where each of the first two solutions has already been used in prior systems and then abandoned due to the aforementioned drawbacks. Experiments demonstrate the low runtime overhead of this new, network-agnostic approach. The approach is also extended to non-blocking collective operations in order to handle overlapping of computation and communication.

Comment: 16 pages, 6 figures

arXiv.org e-Print Archive

High Performance Pipelined Process Migration with RDMA

Author: Besseron Xavier
Ouyang Xiangyong
Panda Dhabaleswar K.
Rajachandrasekar Raghunath
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/05/2011

Coordinated Fault Tolerance for High-Performance Computing

Author: Bosilca George
Dongarra Jack
et al.
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date: 08/04/2013

Our work toward end-to-end fault tolerance has focused on two areas: (1) improving fault tolerance in software that is currently available and widely used throughout the HEC domain; and (2) using fault information exchange and coordination to achieve holistic, systemwide fault tolerance, and understanding how to design and implement interfaces that integrate fault tolerance features across multiple layers of the software stack, from the application, math libraries, and programming language runtime to other common system software such as job schedulers, resource managers, and monitoring tools.

UNT Digital Library

A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud

Author: Egwutuoha Ifeanyi Paulinus
Publication venue: Faculty of Engineering and Information Technologies, School of Electrical and Information Engineering
Publication date: 01/01/2014

High Performance Computing (HPC) systems have been widely used by scientists and researchers in both industry and university laboratories to solve advanced computation problems. Most advanced computation problems are either data-intensive or computation-intensive. They may take hours, days, or even weeks to complete execution. For example, some traditional HPC computations run on 100,000 processors for weeks. Consequently, traditional HPC systems often require huge capital investments. As a result, scientists and researchers sometimes have to wait in long queues to access shared, expensive HPC systems. Cloud computing, on the other hand, offers new computing paradigms, capacity, and flexible solutions for both business and HPC applications. Some of the computation-intensive applications that are usually executed on traditional HPC systems can now be executed in the cloud. The cloud computing pricing model eliminates huge capital investments. However, even for cloud-based HPC systems, fault tolerance is still an issue of growing concern. The large number of virtual machines and electronic components, as well as software complexity and overall system reliability, availability and serviceability (RAS), are factors with which HPC systems in the cloud must contend. The reactive fault tolerance approach of checkpoint/restart, which is commonly used in HPC systems, does not scale well in the cloud due to resource sharing and distributed systems networks. Hence, the need for reliable fault-tolerant HPC systems is even greater in a cloud environment. In this thesis we present a proactive fault tolerance approach to HPC systems in the cloud to reduce the wall-clock execution time, as well as dollar cost, in the presence of hardware failure. We have developed a generic fault tolerance algorithm for HPC systems in the cloud. We have further developed a cost model for executing computation-intensive applications on HPC systems in the cloud. Our experimental results, obtained from a real cloud execution environment, show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be considerably reduced compared to the checkpoint and redundancy techniques used in traditional HPC systems.

The MIG Framework: Enabling Transparent Process Migration in Open MPI

Author: Fornaciari William
Libutti Simone
Massari Giuseppe
Pozzi Gianmario
Reghenzani Federico
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2016

This paper introduces the mig framework: an Open MPI extension that transparently supports the migration of application processes across different nodes of a distributed High-Performance Computing (HPC) system. The framework provides mechanisms on top of which suitable resource managers can implement policies to react to hardware faults, address performance variability, improve resource utilization, and perform fine-grained load balancing and power and thermal management. Compared to other state-of-the-art approaches, the mig framework does not require changes in the application code. Moreover, it is highly maintainable, since it is mainly a self-contained solution that has required very few changes in other existing Open MPI frameworks. Experimental results have shown that the proposed extension does not introduce significant overhead in the application execution, while the penalty due to performing a migration can be properly taken into account by a resource manager.

Archivio istituzionale della ricerca - Politecnico di Milano

Coordinated Fault-Tolerance for High-Performance Computing Final Project Report

Author
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date