871 research outputs found

    A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud

    Get PDF
    High Performance Computing (HPC) systems have been widely used by scientists and researchers in both industry and university laboratories to solve advanced computation problems. Most advanced computation problems are either data-intensive or computation-intensive. They may take hours, days or even weeks to complete execution. For example, some of the traditional HPC systems computations run on 100,000 processors for weeks. Consequently traditional HPC systems often require huge capital investments. As a result, scientists and researchers sometimes have to wait in long queues to access shared, expensive HPC systems. Cloud computing, on the other hand, offers new computing paradigms, capacity, and flexible solutions for both business and HPC applications. Some of the computation-intensive applications that are usually executed in traditional HPC systems can now be executed in the cloud. Cloud computing price model eliminates huge capital investments. However, even for cloud-based HPC systems, fault tolerance is still an issue of growing concern. The large number of virtual machines and electronic components, as well as software complexity and overall system reliability, availability and serviceability (RAS), are factors with which HPC systems in the cloud must contend. The reactive fault tolerance approach of checkpoint/restart, which is commonly used in HPC systems, does not scale well in the cloud due to resource sharing and distributed systems networks. Hence, the need for reliable fault tolerant HPC systems is even greater in a cloud environment. In this thesis we present a proactive fault tolerance approach to HPC systems in the cloud to reduce the wall-clock execution time, as well as dollar cost, in the presence of hardware failure. We have developed a generic fault tolerance algorithm for HPC systems in the cloud. We have further developed a cost model for executing computation-intensive applications on HPC systems in the cloud. Our experimental results obtained from a real cloud execution environment show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be considerably reduced compared to checkpoint and redundancy techniques used in traditional HPC systems

    Using migratable objects to enhance fault tolerance schemes in supercomputers

    Get PDF
    Supercomputers have seen an exponential increase in their size in the last two decades. Such a high growth rate is expected to take us to exascale in the timeframe 2018-2022. But, to bring a productive exascale environment about, it is necessary to focus on several key challenges. One of those challenges is fault tolerance. Machines at extreme scale will experience frequent failures and will require the system to avoid or overcome those failures. Various techniques have recently been developed to tolerate failures. The impact of these techniques and their scalability can be substantially enhanced by a parallel programming model called migratable objects. In this paper, we demonstrate how the migratable-objects model facilitates and improves several fault tolerance approaches. Our experimental results on thousands of cores suggest fault tolerance schemes based on migratable objects have low performance overhead and high scalability. Additionally, we present a performance model that predicts a significant benefit of using migratable objects to provide fault tolerance at extreme scale

    Reducing the overhead of an MPI application-level migration approach

    Get PDF
    [Abstract] Process migration provides many benefits for parallel environments including dynamic load balance, data access locality, or fault tolerance. This work proposes a solution that reduces the memory and I/O overhead in an application-level checkpoint-based migration approach. The proposal splits the checkpoint files in order to overlap the writing of the state in the terminating processes with the read and restarting operation in the newly spawned processes. It has been tested using the MPI NAS Parallel Benchmarks, showing encouraging results, both in terms of memory consumption and I/O migration times.Ministerio de Economía y Competitividad; TIN2013-42148-PGalicia. Consellería de Cultura, Educación e Ordenación Universitaria; GRC2013/05

    Zeroing memory deallocator to reduce checkpoint sizes in virtualized HPC environments

    Get PDF
    Virtualization has become an indispensable tool in data centers and cloud environments to flexibly assign virtual machines (VMs) to resources. Virtualization also becomes more and more attractive for high-performance computing (HPC). This is mainly due to the strong isolation of VMs which enables: (1) the sharing of cluster nodes and optimization of the system’s overall utilization; (2) load balancing by means of migrations due to the reduction of residual dependencies; and (3) the creation of system-level checkpoints increasing the fault tolerance in an application-transparent way. On the downside, the additional virtualization layer conceals information that is only available on the process level. This information has a direct influence on the checkpoint size which should be kept as small as possible. In this paper, we propose a novel technique for checkpoint size reduction in virtualized environments. We exploit the fact that the hypervisor detects zero pages which are omitted when capturing a checkpoint. Moreover, compression techniques are applied for a further reduction of the checkpoint size. We therefore fill freed memory regions with zeros supporting both the zero-page detection and the compression. We evaluate our approach by taking the example of HPC applications. The results reveal a reduction of the checkpoint size by up to 9% when compression is disabled in the hypervisor and up to 49% with compression enabled. Furthermore, memory zeroing is able to reduce VM migration time by up to 10% when compression is disabled and by up to 60% when compression is enabled

    Checkpointing as a Service in Heterogeneous Cloud Environments

    Get PDF
    A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.Comment: 20 pages, 11 figures, appears in CCGrid, 201

    MOLAR: Modular Linux and Adaptive Runtime Support for HEC OS/R Research

    Full text link

    The MIG Framework: Enabling Transparent Process Migration in Open MPI

    Get PDF
    This paper introduces the mig framework: an Open MPI extension to transparently support the migration of application processes, over different nodes of a distributed High-Performance Computing (HPC) system. The framework provides mechanism on top of which suitable resource managers can implement policies to react to hardware faults, address performance variability, improve resource utilization, perform a fine-grained load balancing and power thermal management. Compared to other state-of-the-art approaches, the mig framework does not require changes in the application code. Moreover, it is highly maintainable, since it is mainly a self-contained solution that has required a very few changes in other already existing Open MPI frameworks. Experimental results have shown that the proposed extension does not introduce significant overhead in the application execution, while the penalty due to performing a migration can be properly taken into account by a resource manager

    Evaluating the Performance Impact of Xen on MPI and Process Execution For HPC Systems

    Full text link
    corecore